Posted by Ben Clark
This week it's my turn to write weeknotes, so I'm going to touch on some ongoing work we're doing evaluating speaker identification from audio - figuring out who is speaking using machine learning. A lot of the work on the data team revolves around experimenting with, and evaluating, algorithms that extract information from audio and video. This doesn't seem much like conventional software development, but there is one similarity: finding a good way to test the system is critical.
Above: Members of IRFS check out the new offices ahead of the big move
Whether you're developing an API or a speaker ID system, once you've got a good test, the problem is half-solved. You know what "good behaviour" looks like, and it's a simple (or not-so-simple) matter of making changes to the system until you pass the test.
With speaker ID the first challenge was to develop the test. The system had been trained on our own dataset and seemed to perform well by some measures, but running the system on actual footage from BBC programmes gave very bad results. Not what we expected.
Why was this? Firstly, the accuracy measured during training (on a held-out subset of the data) didn't match the performance of the system end-to-end, where upstream components such as voice activity detection and speaker change detection can introduce errors.
Secondly, the classifier assumed a closed world where every voice in the dataset was known. In reality we want to run any TV or radio programme on the BBC through this system, and we will only be able to classify a tiny fraction of the voices, so we need the system to label voices as "unknown" when they don't match a known person. If the algorithm confidently identifies every voice as someone, generating a lot of false positives, the output will be useless for just about any practical application of the technology.
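As a sketch of what this open-set decision rule might look like - the function name, score format, and threshold value here are illustrative, not the team's actual code:

```python
# Hypothetical sketch: reject low-confidence matches as "unknown"
# instead of forcing every voice onto a known speaker.

def identify(scores, threshold=0.7):
    """Map per-speaker match scores to a label, or "unknown".

    scores: dict of speaker name -> similarity score in [0, 1].
    Returns the best-matching speaker only if its score clears
    the threshold; otherwise "unknown".
    """
    if not scores:
        return "unknown"
    best_speaker = max(scores, key=scores.get)
    if scores[best_speaker] >= threshold:
        return best_speaker
    return "unknown"

print(identify({"alice": 0.92, "bob": 0.40}))  # confident match -> "alice"
print(identify({"alice": 0.55, "bob": 0.40}))  # rejected -> "unknown"
```

Raising the threshold trades recall for precision: fewer known voices are identified, but those that are identified are more likely to be correct.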
Finally, the metrics we used to evaluate the system didn't match our use case. We used the excellent pyannote.metrics package, but the overall identification error rate (defined here) didn't tell us whether our system was good enough. We want to correctly classify known voices, but we will accept them being classified as unknown over misclassifying them as other famous people. Users should have high confidence that any audio clip returned is correct, so we need a system that favours precision over recall.
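One way to capture that asymmetry is to score "unknown" predictions on known voices against recall but not against precision. This is an illustrative sketch under that assumption, not the team's actual evaluation code:

```python
# Illustrative precision/recall for speaker ID where labelling a known
# voice "unknown" lowers recall but does not count as a precision error.

def precision_recall(pairs, unknown="unknown"):
    """pairs: list of (true_label, predicted_label), one per speech segment."""
    # Precision: of the segments where we claimed a named speaker,
    # how many were right?
    claimed = [(t, p) for t, p in pairs if p != unknown]
    correct_claims = sum(1 for t, p in claimed if t == p)
    precision = correct_claims / len(claimed) if claimed else 1.0

    # Recall: of the segments with a known speaker, how many did we name
    # correctly? "unknown" predictions count as misses here.
    known = [(t, p) for t, p in pairs if t != unknown]
    correct_known = sum(1 for t, p in known if t == p)
    recall = correct_known / len(known) if known else 1.0
    return precision, recall

pairs = [("alice", "alice"), ("bob", "unknown"), ("carol", "alice")]
print(precision_recall(pairs))  # (0.5, 0.3333333333333333)
```

Under this scoring, a cautious system that says "unknown" often can still achieve the high precision the use case demands.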
By taking a step back and thinking about what our users want from speaker ID, we are able to make changes to our speaker ID system and explain the consequences of those changes more easily. Now that we can do that, we're moving on to the other half of the problem: making it work well.
In the rest of the Data Team, Matt and Misa have been working on the processing pipeline for the Content Analysis Toolkit demo. They have a system that processes incoming media through the COMMA platform to extract speech-to-text transcripts, entity recognition based on that text, recognized faces and recognized speakers. The data is collated and hosted online for use in the front end being built by Mathieu. Denise has finished retraining our speaker ID system and is now researching fusing face recognition and speaker ID to improve the overall accuracy.
The Discovery Team have been supporting a trial of the Quote Database by the BBC's Reality Check team. Chris and Andrew have been scoping an iCase studentship with Dr André Freitas from the School of Computer Science at Manchester University. The studentship will explore novel uses of Natural Language Processing in the Newsroom. Meanwhile David has been preparing a Smart Survey questionnaire and other arrangements for the user testing of the Radio Clips and Mixtape prototypes. Finally, Chris has added readability and sentiment API methods to our Vox text classification system.
Emma has launched a user survey to find out the range of emotional responses people have while using voice technologies. The survey will run until the end of the month. She also wrote a blog post to promote the survey.
Oscar has completed the first round of user testing. He asked users to try three different methods of pairing smart speakers with a smartphone, and then feed back their thoughts regarding the experience.
Talking with Machines
The team have been continuing to work on the Next Episode prototype ahead of their mid-February deadline. The aim of the prototype is to test sending and receiving image and voice messages across two devices whilst a linear story is taking place - so you can listen to a drama on a smart speaker whilst interacting with its characters via your phone. Oscar and Henry have been looking at how to send and receive voice, text and picture messages over a mobile app. Anthony has been using Orator to plot out the recorded audio, and Nicky and Andrew have been planning some qualitative research with 18-31 year olds.
Tristan and Mathieu briefed the News app team on the results of the newnews project - formats and audience behaviours. Tristan also had a session with News Labs and NTB, the Norwegian news agency, about structured and modular news.
Chris has continued work on the use case and requirements document for media-timed events, looking specifically at MPEG Common Media Application Format, the HTML5 DataCue API, and the time marches on algorithm in HTML. He has also been organising future activities for the W3C Media & Entertainment Interest Group, including standards requirements for subtitles in 360-degree video and VR experiences.
We've been preparing to move to The Lighthouse from our offices in Centre House in a few weeks.
Interview with Tristan about how we work in R&D
Excellent paper on evaluating Speaker Identification "What Makes a Speaker Recognizable in TV Broadcast? Going Beyond Speaker Identification Error Rate"