ABC-IP and work on audio archiving research
We're a few months into a new collaboration with MetaBroadcast, looking at how to unlock archive content by making more effective use of metadata. The Automatic Broadcast Content Interlinking Project (ABC-IP for short) is researching and developing advanced text processing techniques to link together different sources of metadata around large video and audio collections. The project is part-funded by the Technology Strategy Board (TSB) under its 'Metadata: increasing the value of digital content' competition, which was awarded earlier this year. The idea is that by cross-matching the various sources of data we have access to - many of which are incomplete - we will be able to build a range of new services that help audiences find their way around content from the BBC and beyond.
Our starting point is the English component of the massive World Service audio archive. The World Service has been broadcasting since 1932, so deriving tags from this content gives us a hugely rich dataset of places, people, subjects and organisations across a wide range of genres, all mapped against the times they were broadcast.
The distribution of programme genres in the World Service radio archive
One of the early innovations on the project has been to improve the way topic tags can be derived from automated speech-to-text transcriptions, which gives us a whole new set of metadata to work with for comparatively little effort. We've optimised various algorithms to cope with the high word error rates typical of speech recognition output, and the results so far have been quite impressive.
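To give a flavour of the problem, here is a minimal sketch in Python of one simple way to pull topic tags out of a noisy transcript. It is not the algorithm used on the project: the `derive_tags` function, the `phrase_to_topic` lookup and the `min_mentions` threshold are all hypothetical names made up for illustration. The assumption is that requiring a topic to be mentioned more than once dampens the effect of individual misrecognised words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def derive_tags(transcript, phrase_to_topic, min_mentions=2, max_len=3):
    """Count topic mentions in a transcript, keeping only repeated topics.

    phrase_to_topic is a hypothetical lookup from lowercase phrases to
    topic labels. At each position the longest matching phrase wins, so
    "nelson mandela" is not also counted as a separate "mandela".
    """
    tokens = tokenize(transcript)
    counts = Counter()
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            topic = phrase_to_topic.get(" ".join(tokens[i:i + n]))
            if topic is not None:
                counts[topic] += 1
                i += n
                matched = True
                break
        if not matched:
            i += 1
    # A single mention may just be a recognition error, so drop it.
    return {t: c for t, c in counts.items() if c >= min_mentions}


# Illustrative data only: "apart hide" stands in for a misrecognised
# "apartheid", and "today" is an ordinary word that also happens to be
# a topic label.
phrase_to_topic = {
    "nelson mandela": "Nelson Mandela",
    "mandela": "Nelson Mandela",
    "south africa": "South Africa",
    "apartheid": "Apartheid",
    "today": "Today",
}
transcript = (
    "nelson mandela spoke in south africa today mandela said "
    "south africa has changed and apart hide is over"
)
print(derive_tags(transcript, phrase_to_topic))
# {'Nelson Mandela': 2, 'South Africa': 2}
```

In this toy example the threshold filters out the stray "today", but the misrecognised "apart hide" is lost altogether, which is exactly the sort of case that makes transcripts with a high word error rate hard to work with.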
We're also drawing on everything from BBC Programmes, including the topics manually added by BBC producers, and everything from BBC Redux, an internal playable archive of almost everything broadcast since mid-2007. In later stages of the project we'll add data about what people watch and listen to. Blending all this together provides many different views of BBC programmes and related content including, for example, topics over time or mappings of where people and topics intersect. The end result is a far richer set of metadata for each unique programme than would be possible with either automatic or manual metadata generation alone.
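As a rough illustration of the blending idea, the sketch below merges manual and automatic tags for each programme while remembering where each tag came from, then counts tagged programmes per year to give a simple "topics over time" view. The record layout (`year`, `manual_tags`, `automatic_tags`) and the function names are assumptions made up for this example, not the project's actual data model.

```python
from collections import Counter, defaultdict


def merge_tags(manual_tags, automatic_tags):
    """Combine both tag sources, recording the provenance of each tag."""
    merged = defaultdict(set)
    for tag in manual_tags:
        merged[tag].add("manual")
    for tag in automatic_tags:
        merged[tag].add("automatic")
    return dict(merged)


def topics_over_time(programmes):
    """Count, for each year, how many programmes carry each topic.

    Each programme is a dict with hypothetical keys "year",
    "manual_tags" and "automatic_tags".
    """
    by_year = defaultdict(Counter)
    for programme in programmes:
        tags = merge_tags(programme["manual_tags"], programme["automatic_tags"])
        for tag in tags:
            # A tag found by both sources still counts once per programme.
            by_year[programme["year"]][tag] += 1
    return {year: dict(counts) for year, counts in by_year.items()}
```

Keeping the provenance alongside each tag means a later stage could, for instance, trust a tag more when both a producer and the automatic tagger agree on it.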
Based on the work so far, our project partners MetaBroadcast have built the first user-facing prototype for the project, called Tellytopic, which lets users navigate between programmes using each of the new tags available. You can find more on MetaBroadcast's own blog.
The plan is that the work we're doing will eventually complement projects in other parts of the BBC, such as Audiopedia, which was announced by Mark Thompson last week. We'll talk more about other ways we're going to use the data on this blog over the coming months.