One of the biggest challenges for the BBC Archive is how to open up our enormous collection of radio programmes. As we’ve been broadcasting since 1922 we’ve got an archive of almost 100 years of audio recordings, representing a unique cultural and historical resource.
But the big problem is how to make it searchable. Many of the programmes have little or no meta-data, and the whole collection is far too large to process through human efforts alone.
Help is at hand. Over the last five years or so, technologies such as automated speech recognition, speaker identification and automated tagging have reached a level of accuracy where we can start to get impressive results for the right type of audio. By automatically analysing sound files and making informed decisions about the content and speakers, these tools can help to fill in the gaps in our archive’s meta-data.
BBC R&D decided to develop these automatic meta-data extraction technologies in a way that would allow large-scale audio processing. Building them into a cloud-based platform (more on this later) allows us to work through very large archives quickly, cheaply and many times over.
The name of the platform is COMMA and it’s currently being built by engineers from R&D’s Internet Research & Future Services team alongside colleagues at Kite and Somethin’ Else. The technology and thinking behind it are currently driving a number of BBC initiatives, such as the BBC World Service Prototype and Window on the Newsroom prototype, but as COMMA won’t be available to the general public until April 2015 we thought it would be helpful to give you a sneaky peek now.
The Kiwi set of speech processing algorithms
COMMA is built on a set of speech processing algorithms called Kiwi. Back in 2011, BBC R&D were given access to a very large speech radio archive, the BBC World Service archive, which at the time had very little meta-data. In order to build our prototype around this archive we developed a number of speech processing algorithms, reusing open-source building blocks where possible. We then built the following workflow out of these algorithms:
- Speaker segmentation, identification and gender detection (using the LIUM diarization toolkit, diarize-jruby and ruby-lsh). This process is also known as diarisation. Essentially an audio file is automatically divided into segments according to the identity of the speaker. The algorithm can show us who is speaking and at what point in the sound clip.
- Speech-to-text for the detected speech segments (using CMU Sphinx). At this point the spoken audio is transcribed as accurately as possible into readable text. This algorithm uses models built from a wide range of BBC data.
- Automated tagging with DBpedia identifiers. DBpedia is a large database holding structured data extracted from Wikipedia. The automatic tagging process creates the searchable meta-data that ultimately allows us to access the archives much more easily. This process uses a tool we developed called 'Mango'.
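The three steps above can be sketched as a simple chain in which each stage feeds the next. This is only an illustrative outline, not the Kiwi code itself: the names `Segment`, `ProgrammeMetadata` and `run_workflow` are hypothetical, and the three `fake_*` functions are toy stand-ins for the real diarisation, speech-to-text and tagging tools.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Segment:
    """One stretch of audio attributed to a single speaker (hypothetical type)."""
    speaker: str
    start: float  # seconds from the start of the file
    end: float
    text: str = ""


@dataclass
class ProgrammeMetadata:
    """The meta-data extracted for one programme (hypothetical type)."""
    segments: List[Segment]
    tags: List[str] = field(default_factory=list)


def run_workflow(
    audio_path: str,
    diarise: Callable[[str], List[Segment]],
    transcribe: Callable[[str, Segment], str],
    tag: Callable[[str], List[str]],
) -> ProgrammeMetadata:
    # Step 1: split the audio into speaker segments.
    segments = diarise(audio_path)
    # Step 2: transcribe each detected speech segment.
    for segment in segments:
        segment.text = transcribe(audio_path, segment)
    # Step 3: derive tags from the full transcript.
    transcript = " ".join(segment.text for segment in segments)
    return ProgrammeMetadata(segments=segments, tags=tag(transcript))


# Toy stand-ins so the sketch runs without any speech tools installed.
def fake_diarise(audio_path: str) -> List[Segment]:
    return [Segment("S0", 0.0, 12.5), Segment("S1", 12.5, 30.0)]


def fake_transcribe(audio_path: str, segment: Segment) -> str:
    lines = {"S0": "welcome to the programme", "S1": "today we discuss radio"}
    return lines[segment.speaker]


def fake_tag(transcript: str) -> List[str]:
    # A real tagger would return DBpedia identifiers for topics it detects.
    return ["http://dbpedia.org/resource/Radio"] if "radio" in transcript else []


metadata = run_workflow("programme.wav", fake_diarise, fake_transcribe, fake_tag)
for segment in metadata.segments:
    print(segment.speaker, segment.start, segment.end, segment.text)
print(metadata.tags)
```

Passing the stages in as functions keeps the chain itself independent of any particular toolkit, so a stage could be swapped without touching the others.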
When we were looking to apply such a workflow to the entire World Service archive there were two things that needed to be taken into account. One was that the approach had to be as efficient as possible, as we were working with a very large store of media. The other was that, as programmes within the archive could potentially be about any subject, our algorithm needed to be able to operate within a large, unbounded vocabulary. With the resulting data we were able to make the archive searchable, and you can read about this process in more detail here.
From this experience we realised that the algorithms and the associated infrastructure could be run in the cloud for a very low cost, and that this would be very useful to other large media datasets, both inside and outside the BBC.
The COMMA platform
So we built these algorithms into a scalable platform that we called COMMA. The nature of COMMA is essentially that of a large batch processing system. We ingest collections of media and process them to extract useful meta-data.
This sort of workflow lends itself very well to a modern cloud infrastructure. Essentially the 'cloud' allows us to remotely access a network of computing power. Whereas traditionally we might be limited by the number of computers we have to hand on which to run the processing, with a cloud infrastructure we can bring in as many machines as a job needs. One great advantage of this is that resources can be purchased on an hourly basis and can be scaled massively to meet usage demands.
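To make the hourly pricing point concrete, here is a rough back-of-the-envelope sketch. Every figure in it is invented for illustration (the archive size, the processing speed and the price are not real COMMA or AWS numbers), and it assumes the work splits perfectly across machines and that each machine is billed in whole hours.

```python
import math


def estimate_batch(audio_hours, realtime_factor, instances, price_per_instance_hour):
    """Return (wall_clock_hours, total_cost) for a batch job.

    realtime_factor is the compute time needed per hour of audio,
    e.g. 1.0 means one machine processes audio in real time.
    """
    compute_hours = audio_hours * realtime_factor
    # Assumption: the work divides evenly across all machines.
    wall_clock_hours = compute_hours / instances
    # Assumption: each machine is billed per started hour.
    billed_hours = math.ceil(wall_clock_hours) * instances
    return wall_clock_hours, billed_hours * price_per_instance_hour


# Hypothetical archive: 1,000 hours of audio at a made-up 0.50 per machine-hour.
print(estimate_batch(1000, 1.0, instances=10, price_per_instance_hour=0.5))
print(estimate_batch(1000, 1.0, instances=100, price_per_instance_hour=0.5))
```

Under these assumptions the two runs cost the same, but the second finishes in a tenth of the time, which is why this kind of batch work suits machines rented by the hour.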
We have chosen to use Amazon's AWS platform as our primary resource whilst keeping the core of the system separate. This gives us the ability to deploy processing machines on internal networks too, such as within the BBC News Room, should we need to.
The supporting cloud infrastructure remains the same for whatever type of job we are running – loading files, running an algorithm, monitoring progress and harvesting results. In the first instance we are processing media using Kiwi, but by keeping our meta-data extraction process separate, we can easily deploy new and arbitrary algorithms in the future.
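The separation between the job infrastructure and the algorithm can be sketched as a runner that knows nothing about what it is running. This is a minimal illustration rather than COMMA's actual design: `run_batch` and `toy_algorithm` are hypothetical names, and a local thread pool stands in for a fleet of cloud machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(media_files, algorithm, max_workers=4, on_progress=None):
    """Run `algorithm` over every file and collect results and failures.

    `algorithm` is any callable taking a file name and returning meta-data,
    so a new extraction process can be plugged in without changing the runner.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Load: submit one job per media file.
        futures = {pool.submit(algorithm, name): name for name in media_files}
        # Monitor and harvest: handle jobs as they finish, in any order.
        for done, future in enumerate(as_completed(futures), start=1):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One bad file should not stop the rest of the batch.
                failures[name] = str(exc)
            if on_progress:
                on_progress(done, len(futures))
    return results, failures


def toy_algorithm(name):
    # Stand-in for a real extraction step; rejects anything that is not a .wav file.
    if not name.endswith(".wav"):
        raise ValueError("unsupported format")
    return {"characters_in_name": len(name)}


results, failures = run_batch(
    ["a.wav", "bb.wav", "notes.txt"],
    toy_algorithm,
    on_progress=lambda done, total: print(f"{done}/{total} jobs finished"),
)
print(results)
print(failures)
```

Because failures are recorded per file, a large batch can be re-run for just the files that went wrong.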
Using a cloud platform also enables us to easily integrate practices such as continuous deployment into our development process. The inherently temporary nature of the Amazon virtual machines, along with the considerable processing power the platform provides, means that we can easily build new versions of the COMMA infrastructure using software such as Puppet and Packer. We are able to automatically produce a suite of disk images that can be deployed as a stack of many virtual instances, ensuring we can scale quickly and reliably.
COMMA in action
Currently COMMA is being used in the Window on the Newsroom prototype developed by BBC News Labs. Much like the extensive World Service archive, the news room's database of footage is enormous and constantly growing. Much of this footage may lack the kind of useful meta-data that makes it easy to find and use, especially as news events unfold. COMMA is able to automatically take the audio from the video clip, identify the different speakers and provide a transcript of the news footage. Again, the Mango tool then adds the important meta-data to the media within the database. COMMA is helping to make the workflow in the news room as efficient as possible by bringing the most relevant material to journalists as quickly as possible.
COMMA is due to launch some time in April 2015. If you’d like to be kept informed of our progress you can sign up for occasional email updates here. We’re also looking for early adopters to test the platform, so please contact us if you’re a cultural institution, media company or business that has a large audio data-set you want to make searchable.