Posted by BBC Research and Development
This week I'm writing weeknotes, so I'm going to talk about some of the work the data team has been doing to expand the vocabulary of our Kaldi-based speech to text system.
Speech to text systems translate audio (speech) into words (text). They all have a list of words which they recognise. Most also use a list of pronunciations for each word, written in a phonetic alphabet, which the system uses to map between the way the word sounds and the way the word is written. We're interested in expanding the vocabulary of our speech to text system to make it more useful, particularly for journalists and programme makers. Language changes all the time and new words become important as the focus of journalists changes, politicians are appointed and step down, and so on. When a word is not in the dictionary, the output is incorrect (for example "Brexit" might become "breaks it").
We've broken this problem down into different stages: finding the (important) missing words, finding pronunciations for those missing words, making a new speech to text model, and then testing it. At each stage we're aiming to automate this process so that we can repeat it in the future.
To start with, we needed to find the words which are missing from our dictionary. We looked at three sources: subtitles from several years of BBC programmes, news articles published online by the BBC, and lists of article titles from the English version of Wikipedia. By parsing these sources, discarding the known words and counting the remainder, we created three lists of words and their frequencies.
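As a rough illustration of the parse, discard and count step, here is a minimal Python sketch. It is not our production code: the tokenising rule, the function name `count_missing_words` and the tiny example vocabulary are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenising rule: lowercase runs of letters, allowing internal apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def count_missing_words(texts, known_words):
    """Count how often each word that is absent from `known_words` appears in `texts`."""
    counts = Counter()
    for text in texts:
        for token in TOKEN_RE.findall(text.lower()):
            if token not in known_words:
                counts[token] += 1
    return counts


# Hypothetical data standing in for the existing dictionary and one text source.
known = {"the", "vote", "on", "was", "close", "talks", "continue"}
subtitles = [
    "The vote on Brexit was close",
    "Brexit talks continue",
    "Corbyn on Brexit",
]

missing = count_missing_words(subtitles, known)
print(missing.most_common())
```

Running the same function over each source separately would give one frequency list per source, which can then be sorted to pick out the most important missing words.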
The next stage was to try to find pronunciations for the most important of these missing words. Pronunciations for some words, especially acronyms and numbers, are easy to generate automatically. Others can be sourced from pronunciation dictionaries. We have been working with the dictionary from the BBC Pronunciation Unit, which provides pronunciations for use by journalists and presenters. If a word exists in their dictionary we can convert it into the same phonetic alphabet which Kaldi uses.
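To give a flavour of the automatic case, the sketch below spells out an acronym letter by letter. The phone symbols are an assumption: they are ARPAbet-style letter names without stress markers, and the phone set in a real Kaldi lexicon depends on the model. The helper names `spell_out` and `lexicon_line` are hypothetical. The output line follows the usual "word followed by its phones" layout of a Kaldi lexicon file.

```python
# Assumed ARPAbet-style phones for English letter names (British "zed" for Z).
LETTER_PHONES = {
    "A": "EY", "B": "B IY", "C": "S IY", "D": "D IY", "E": "IY",
    "F": "EH F", "G": "JH IY", "H": "EY CH", "I": "AY", "J": "JH EY",
    "K": "K EY", "L": "EH L", "M": "EH M", "N": "EH N", "O": "OW",
    "P": "P IY", "Q": "K Y UW", "R": "AA R", "S": "EH S", "T": "T IY",
    "U": "Y UW", "V": "V IY", "W": "D AH B AH L Y UW", "X": "EH K S",
    "Y": "W AY", "Z": "Z EH D",
}


def spell_out(acronym):
    """Return a space-separated phone string for an acronym read letter by letter."""
    phones = []
    for letter in acronym.upper():
        if letter not in LETTER_PHONES:
            raise ValueError(f"cannot spell out {acronym!r}: unexpected character {letter!r}")
        phones.append(LETTER_PHONES[letter])
    return " ".join(phones)


def lexicon_line(word, phones):
    """Format one lexicon entry as the word followed by its phones."""
    return f"{word} {phones}"


print(lexicon_line("BBC", spell_out("BBC")))  # BBC B IY B IY S IY
```

This only covers acronyms that are read as individual letters. Acronyms pronounced as words, and words taken from a pronunciation dictionary, need a symbol-by-symbol mapping between phonetic alphabets instead.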
We retrained the Kaldi model and tested it in two ways: first to check that the overall word error rate had not deteriorated compared to the old model, and secondly to see if the new model was able to recognise the new words we've added. Unfortunately the new words were not in our existing test set, so we needed to make new test data which included a subset of the new words. We did this by searching the BBC archive for programmes containing those words, snipping short relevant sections of the programme, aligning the audio with the subtitles, and using this data as our test material. We could calculate the overall word error rate for this test material, as well as the word error rate for the specific words we've added.
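The two measurements can be sketched in Python as follows. This is an illustrative implementation under our own assumptions rather than the scoring tool we used: it aligns the reference (subtitle) words with the recognised words using a standard edit-distance alignment, then counts errors over all reference words and over a chosen set of target words. The function names are hypothetical.

```python
def align(ref, hyp):
    """Edit-distance align two word lists; return (ref_word, hyp_word) pairs.

    A deletion is (ref_word, None) and an insertion is (None, hyp_word).
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def word_error_rate(ref, hyp):
    """Substitutions, deletions and insertions divided by the number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    errors = sum(1 for r, h in align(ref, hyp) if r != h)
    return errors / len(ref)


def target_word_error_rate(ref, hyp, targets):
    """Fraction of reference occurrences of `targets` that were not recognised correctly."""
    relevant = [(r, h) for r, h in align(ref, hyp) if r in targets]
    if not relevant:
        return None  # none of the target words occur in this reference
    return sum(1 for r, h in relevant if r != h) / len(relevant)


reference = "the brexit vote was close".split()
old_output = "the breaks it vote was close".split()
print(word_error_rate(reference, old_output))                   # 0.4
print(target_word_error_rate(reference, old_output, {"brexit"}))  # 1.0
```

In the example, the old model's "breaks it" costs one substitution and one insertion against five reference words, and the single occurrence of the target word is counted as wrong.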
By doing this we've shown that we can now recognise new words (like Brexit, Corbyn, Rohingya and millennials). We also detected where our pronunciation mappings had errors and needed to be corrected. Most importantly, we've got a process in place to continue to find and add new words to our system as the words spoken on TV and radio change, keeping it relevant.
Over in the Disco Team they have been following up our Mental Health hackday, and exploring the emotional impact of News services. They have also been trying to understand how the consumption of online news contributes to political knowledge.
Meanwhile, Jakub and David have been putting the final touches to the "Radio Clips" prototype and demonstrating both "Radio Clips" and "Mixtape". Jakub has also been modelling sequences of music tracks, in their order of appearance in radio programmes, and doing a statistical analysis to see if there are any interesting patterns.
The Experiences Team
Emma Pratt Richens and Chris hosted a meeting of the W3C Accessibility Conformance Testing Rules Community Group in Broadcast Centre. Chris is drafting a W3C use case and requirements document for media-timed events, which will lead to development of a Web API for timed metadata events.
Alicia and Libby have been getting Tellybox ready for its Taster launch, figuring out what analytics to use and testing to make sure it works on all the target devices.
Talking with Machines
Andrew and Joanna have been testing The Unfortunates, and Nicky, Ant and Henry have been making some tweaks to pass Amazon’s certification this week. They have also been demoing their work to the Voice + AI engineers, Senior Leadership team and the Editorial team. Nicky talked about our voice work at the International Radio Festival.
They also kicked off our formal collaboration with the National Film and Television School at The Next Episode workshop.
This post is part of the Internet Research and Future Services section