Crowdsourcing the World Service Radio Archive: an experiment from BBC R&D
The BBC World Service Archive prototype allows you to search, browse and listen to over 36,000 radio programmes from the BBC World Service archive spanning the past 45 years. For a limited time you can explore this archive and help us improve it by validating and adding topic tags that describe the programmes. You just need an email address to register for the prototype.
I work for the Internet Research and Future Services team in BBC R&D and for the last few months we have been running an experiment on how to put a large media archive online using a combination of algorithms and people. With your help we aim to comprehensively and accurately tag this collection of BBC programmes. This video explains how:
A guide to using BBC R&D's World Service Archive prototype.
The BBC, and many other organisations, have massive archives of TV and radio but it is expensive to put them online in a navigable, findable way so we are researching ways to make it cheaper and easier. R&D is pioneering ways of generating metadata for programmes automatically using innovative algorithms that can "listen to" and tag programmes with topics and speakers.
The World Service Archive prototype is an experiment to apply this research to a real-world archive, and then to improve the results using crowdsourcing. We want to learn about how good the algorithms are, whether and how people tag, and how to combine algorithms with people.
The archive problem
Digitising programmes in the archive
Between 2005 and 2008 the BBC World Service digitised the contents of its recorded programme library. This included over 50,000 programmes archived from the English-language radio services over the past 45 years, of which 36,000 factual programmes were available to put online (see footnote). They cover a wide range of subjects, from weekly African news reporting the civil war in Sierra Leone as it happened to interviews with Steven Spielberg.
The digitisation project was a great success but the metadata was of limited quality and quantity. Metadata is the data that describes digital media items, and without it content is hard to find and navigate. So although we might have had a programme title and broadcast date, we couldn't really know what each programme was about without listening to every one, nor did we know the shape and contents of the whole archive.
In 2012 the World Service and engineers from BBC Research & Development joined together to demonstrate a way to put massive media archives online using a combination of computers and people. We thought we could use advanced algorithms to listen to all the programmes in the archive, automatically generate metadata, use this data to put the archive online and then ask listeners to validate and improve it.
We ran the audio through an automated speech-to-text process and this generated fairly noisy transcripts with lots of errors. So we used robust algorithms that we had developed for this purpose to extract key topics from each programme, using Linked Data to ensure each topic is unambiguous and linked to the web. In total we created around 1 million topics, about 20 per programme. You can read more about the technology we used on the R&D blog.
Although the results were fairly good, the automatically generated topic tags for programmes were often wrong - the computers aren't really listening to and understanding the programmes. But we thought that these automatically generated tags, together with the original metadata, were good enough to design and build a browsable and searchable website for the archive. Listeners could use this online prototype and help improve it by validating the automatically generated data and adding their own - "crowdsourcing" the final part of the problem.
Registered users of the experiment can now search, browse and listen to the programmes in the archive, vote on whether automatically generated tags are correct, add their own tags and even correct errors in the programme titles and synopses. You can read more about the latest crowdsourcing features on the R&D blog.
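To give a feel for how votes on tags could be combined, here is a minimal sketch in Python. It is an illustration only, not the prototype's actual code: the function names, the (programme, tag, vote) format and the acceptance threshold are all assumptions made for this example.

```python
from collections import defaultdict


def tally_votes(votes):
    """Sum listener votes per (programme, tag) pair.

    `votes` is an iterable of (programme_id, tag, is_upvote) tuples.
    This format is hypothetical and chosen only for the example.
    """
    scores = defaultdict(int)
    for programme_id, tag, is_upvote in votes:
        # An up-vote counts +1 and a down-vote counts -1.
        scores[(programme_id, tag)] += 1 if is_upvote else -1
    return dict(scores)


def accepted_tags(scores, programme_id, threshold=1):
    """Return the tags for one programme whose net score meets the threshold.

    The threshold of 1 is an assumed value, not one used by the prototype.
    """
    return sorted(
        tag
        for (pid, tag), score in scores.items()
        if pid == programme_id and score >= threshold
    )


# Example votes on automatically generated tags (made-up data).
votes = [
    ("p1", "Sierra Leone", True),
    ("p1", "Sierra Leone", True),
    ("p1", "Steven Spielberg", False),
    ("p1", "Freetown", True),
    ("p2", "Freetown", False),
]

scores = tally_votes(votes)
print(accepted_tags(scores, "p1"))
```

A simple net score like this treats every voter equally. A real system would probably also want to weigh the algorithm's own confidence in a tag and how reliable each voter has been in the past.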
Progress so far
Programmes tagged over time
So far, users of the prototype have listened to around 12,000 of the 36,000 programmes that are available and tagged or edited about 7,000 of these. This has generated over 70,000 individual metadata "edits" (votes, new tags etc). We've even had some dedicated listeners send us recordings of programmes that were missing from the archive. We are currently analysing the data collected so far to see how good the tags are, by comparing the tags produced by professional archivists, listeners and our algorithms. We're also interested in what the most common tags are, what kinds of tags people add (are they more often people, events or places?) and which kinds of programmes are most popular.
We want your help!
Could you help us do more? The World Service archive is being made available online for a limited time while we conduct this experiment and we want to get as much data as possible. Try finding the oldest programme we've got, look for old episodes of your favourite programmes, see how we can identify individual journalists speaking on From Our Own Correspondent or just explore the archive.
Footnote: Some of the audio in this experiment is unavailable due to rights considerations. This mainly affects programmes with drama, readings, comedy performances, music or sport. Although you cannot listen to these programmes they all retain a page in the prototype that describes them. The original digitised archive only contained pre-recorded programmes, so there are no news bulletins present.
Tristan Ferne is Executive Producer, IRFS, BBC R&D