Putting the World Service radio archive online with machine-generated and crowd-sourced metadata.
What we've done
We built a prototype website containing the whole of the BBC World Service English-language radio archive. To do this, we developed algorithms that listen to the radio programmes and automatically create new descriptive metadata, and we then gave people the ability to correct or add to this data. The video above shows the prototype in action.
Why it matters
We want to make it easier to catalogue and cross-reference large video and audio collections such as the BBC's archive, and therefore create enjoyable and useful ways to explore our wealth of programmes and discover hidden gems when the archives are made public. To do this we need metadata about these programmes, and often it doesn't exist in a useful form.
Tagging programmes with metadata by hand is expensive and time-consuming, so we are researching algorithms and machine-learning techniques that can do it automatically. Where these methods aren't good enough, we want to use crowd-sourcing to improve the metadata.
Our goals
- To develop automated methods to create metadata for audio-visual archives where none, or not much, exists
- To develop features that encourage people to add to this automated metadata, and to understand if this leads to increased accuracy
- To determine if it is acceptable to launch an archive where the metadata hasn't been comprehensively checked by hand
- To explore the features required to make such an archive proposition work
- To understand what kind of metadata and tags are good and useful
How it works
Our starting point is the massive audio archive of the World Service in English, dating back six decades and covering over 70,000 radio programmes, or more than three years' worth of continuous audio. Metadata for this archive is currently sparse or non-existent.
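As a rough sanity check on these figures, the sketch below converts three years of continuous audio into hours and divides by the programme count. Both inputs are the rounded figures quoted above ("over 70,000" programmes, "more than three years" of audio), so the result is only an order-of-magnitude estimate of average programme length, not a number taken from the archive itself.

```python
# Back-of-the-envelope estimate from the rounded figures in the text.
HOURS_PER_YEAR = 365 * 24      # 8,760 hours, ignoring leap years

programmes = 70_000            # "over 70,000" in the text
years_of_audio = 3             # "more than three years" in the text

total_hours = years_of_audio * HOURS_PER_YEAR
average_minutes = total_hours * 60 / programmes

print(f"Total audio: about {total_hours:,} hours")
print(f"Implied average programme length: about {average_minutes:.1f} minutes")
```

On these assumptions the archive holds roughly 26,000 hours of audio, with programmes averaging somewhere in the region of 20 to 25 minutes.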
To address this, we first use speech-to-text technology to create transcripts, albeit "noisy" ones that contain recognition errors. We then built a "semantic tagger" called KiWi, designed specifically to work on these noisy transcripts, that automatically assigns topics to the radio programmes. The topics are drawn from DBpedia, a store of structured data derived from Wikipedia.
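To make the idea of tagging a noisy transcript concrete, here is a minimal sketch. It is not the KiWi algorithm, which this page does not describe in detail. It assumes a small hand-written lookup table from surface labels to DBpedia resource URIs (the hypothetical `LABELS` dictionary) and a hypothetical `min_count` threshold, on the reasoning that a term heard only once in an error-prone transcript may be a recognition mistake.

```python
import re
from collections import Counter

# Hypothetical lookup table: lower-case label -> DBpedia resource URI.
# A real tagger would draw on a far larger set of DBpedia labels.
LABELS = {
    "nelson mandela": "http://dbpedia.org/resource/Nelson_Mandela",
    "south africa": "http://dbpedia.org/resource/South_Africa",
    "apartheid": "http://dbpedia.org/resource/Apartheid",
    "cricket": "http://dbpedia.org/resource/Cricket",
}


def tag_transcript(transcript, labels=LABELS, min_count=2):
    """Return (uri, count) pairs for labels heard at least min_count times."""
    # Normalise: lower-case, drop punctuation, collapse whitespace.
    text = " ".join(re.findall(r"[a-z0-9']+", transcript.lower()))
    counts = Counter()
    for label, uri in labels.items():
        hits = len(re.findall(r"\b" + re.escape(label) + r"\b", text))
        if hits:
            counts[uri] += hits
    # Keep only topics mentioned often enough to trust, most frequent first.
    return [(uri, n) for uri, n in counts.most_common() if n >= min_count]


noisy = ("nelson mandela was released today in south africa the crowd "
         "cheered nelson mandela as apartheid um south africa cricket apartheid")
topics = tag_transcript(noisy)
for uri, n in topics:
    print(n, uri)
```

In this example "cricket" appears only once, so it falls below the threshold and is not assigned as a topic. Raising or lowering `min_count` trades precision against recall.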
From this data we have built a prototype website that lets people explore the archive. While doing so, they can approve, correct, or add to the machine-generated metadata, improving it for everyone. You can read more on our blog about how we designed and built the site.
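One way to picture the approve, correct, and add actions is as votes on tags. The sketch below is an illustration only, not the prototype's actual data model: the `Tag` and `Programme` classes, the one-vote-per-user rule, and the score threshold are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    uri: str
    source: str                              # "machine" or "user"
    up: set = field(default_factory=set)     # ids of users who approved
    down: set = field(default_factory=set)   # ids of users who rejected

    def vote(self, user_id, approve):
        # One vote per user: a later vote replaces an earlier one.
        self.up.discard(user_id)
        self.down.discard(user_id)
        (self.up if approve else self.down).add(user_id)

    @property
    def score(self):
        return len(self.up) - len(self.down)


class Programme:
    def __init__(self, title, machine_tags=()):
        self.title = title
        self.tags = {uri: Tag(uri, "machine") for uri in machine_tags}

    def add_tag(self, uri, user_id):
        # Adding a tag that already exists simply counts as an approval.
        tag = self.tags.setdefault(uri, Tag(uri, "user"))
        tag.vote(user_id, True)
        return tag

    def visible_tags(self, threshold=0):
        # Unreviewed machine tags score 0 and stay visible by default.
        return sorted(t.uri for t in self.tags.values() if t.score >= threshold)


programme = Programme("Example programme", ["Apartheid", "Cricket"])
programme.tags["Cricket"].vote("listener-1", approve=False)   # correct a bad tag
programme.add_tag("South_Africa", "listener-2")               # add a missing tag
print(programme.visible_tags())
```

With a threshold of zero, machine-generated tags are shown until someone rejects them. This is one possible answer to the question of launching an archive whose metadata has not been checked by hand, and a stricter threshold would hide tags until a listener confirms them.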
This project started as part of the ABC-IP work stream and is a follow-up to KiWi, a project aimed at using Amazon Web Services to process the large amount of audio in the World Service archive. Some components of this have been made available on GitHub: Ruby parsers for Wikipedia's "On this day" and "In the news" boxes, and a Ruby gem for SPARQL endpoints.
As well as giving us the chance to explore KiWi and cloud processing further, this project resulted in a prototype for the World Service audio archive.
Following this project, BBC World Service worked to transfer many of the programmes into iPlayer, resulting in over 20,000 additional archive programmes available to the public.
This project is part of the ABC-IP work stream
People & Partners
Senior R&D Engineer
Senior Research Engineer
Interaction & User Experience Designer
Creative Director UX
User Interface Developer
Principal Software Engineer
Project R&D Engineer