Rapid search of media archives using subtitle keyword search and other techniques.
Project from - present
What we're doing
Finding TV shows is easy, but finding content within those shows is hard.
BBC Snippets allows users to quickly search the content of programmes using precise keyword searches. Results can be filtered by genre, date and other facets. When you've found the programme you want, you can move around it using 'clickable transcripts' (click on a word to jump to that point in the video) and a range of visual navigation tools.
Why it matters
Opening up the BBC's TV and Radio archives is one of the biggest challenges we face.
Even when digitisation is complete and rights have been cleared, the problem still remains: how do you find what you're looking for amongst hundreds of thousands of hours of programmes?
This is our challenge. We believe that the BBC Archive will only achieve its full potential when the task of finding what you want has been reduced to a few seconds' effort.
Our first goal is to explore and develop automated methods for producing high-quality metadata. We're focusing our effort on seven key data types:
- Programme details (Title, description, transmission dates, genre, format etc.)
- Full Transcript (time aligned to the second, with full speaker diarisation)
- Intelligent Chapters (with key terms and concepts extracted and linked in to useful resources like Wikipedia)
- Actor/Presenter/Contributor segmentation down to the level of the individual scene
- Geolocation of scenes
- Meaningful objects tagged or identifiable via search
- Background music identified
Our second goal is to develop easy-to-use tools to navigate this metadata. Because the output of automated metadata generation is often inaccurate (think speech-to-text), the tools must be designed with this inaccuracy in mind.
How it works
Snippets is built on the BBC Redux video archive, which contains over 300,000 BBC TV and radio programmes available in a variety of formats. Because Redux captures raw Freeview transport streams, we get the subtitle data for each programme. This is extracted using OCR software and indexed in a Solr database.
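To give a flavour of what a keyword search against a Solr subtitle index might look like, here is a minimal sketch in Python. It only builds the request parameters for Solr's standard `/select` handler; the field names (`subtitle_text`, `genre`, `broadcast_date`), the core name and the base URL are assumptions made for illustration and are not taken from the Snippets schema.

```python
from urllib.parse import urlencode


def subtitle_query_params(phrase, genre=None, date_from=None, date_to=None, rows=20):
    """Build Solr /select parameters for an exact-phrase subtitle search.

    Field names are hypothetical; a real index would use its own schema.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'subtitle_text:"{escaped}"'),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    # Each facet filter goes in its own fq (filter query) parameter.
    if genre:
        params.append(("fq", f'genre:"{genre}"'))
    if date_from or date_to:
        lower = date_from or "*"
        upper = date_to or "*"
        params.append(("fq", f"broadcast_date:[{lower} TO {upper}]"))
    return params


def subtitle_query_url(base_url, **kwargs):
    """Return the full request URL for a Solr core at base_url (assumed)."""
    return f"{base_url}/select?{urlencode(subtitle_query_params(**kwargs))}"


if __name__ == "__main__":
    # Hypothetical local Solr core, for illustration only.
    print(subtitle_query_url(
        "http://localhost:8983/solr/snippets",
        phrase="climate change",
        genre="News",
        date_from="2012-01-01T00:00:00Z",
    ))
```

Keeping the genre and date restrictions in `fq` rather than in `q` means they narrow the result set without affecting relevance scoring, which suits the "filter by facet" behaviour described earlier.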
Snippets then matches the Redux programme to its equivalent broadcast in the BBC's PIPS database. This cross-matching gives us additional metadata such as genre, format and cast info.
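The matching rule itself is not described here, so the following Python sketch is only one plausible approach: pair a recording with the schedule entry on the same channel whose start time is closest, and reject the match if the gap exceeds a tolerance. The `Broadcast` record, its fields, the five-minute tolerance and the identifiers in the example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Broadcast:
    """A hypothetical schedule record; not the real PIPS data model."""
    pid: str
    channel: str
    start: datetime
    genre: str


def match_broadcast(
    channel: str,
    start: datetime,
    candidates: Sequence[Broadcast],
    tolerance: timedelta = timedelta(minutes=5),
) -> Optional[Broadcast]:
    """Return the same-channel broadcast nearest to `start`, or None.

    None is returned when no candidate on that channel starts within
    `tolerance` of the recording's start time.
    """
    same_channel = [b for b in candidates if b.channel == channel]
    if not same_channel:
        return None
    best = min(same_channel, key=lambda b: abs(b.start - start))
    return best if abs(best.start - start) <= tolerance else None


if __name__ == "__main__":
    schedule = [
        Broadcast("p001", "bbcone", datetime(2012, 5, 1, 19, 0), "News"),
        Broadcast("p002", "bbcone", datetime(2012, 5, 1, 19, 30), "Drama"),
    ]
    # A recording that started 90 seconds after the scheduled time.
    print(match_broadcast("bbcone", datetime(2012, 5, 1, 19, 1, 30), schedule))
```

Once a match is found, fields such as genre can be copied from the schedule record onto the recording's search entry.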
Programmes are then run through FFmpeg to produce a series of 480 x 270 pixel screengrabs at a granularity of 1 per second. A variety of visual scanning tools use these screengrabs to help users navigate within a programme. As we have over 500,000,000 images containing a large number of faces, objects and landmarks, these grabs provide a rich dataset for computer vision research.
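As an illustration of the screengrab step, the sketch below builds an FFmpeg command that samples one frame per second and scales it to 480 x 270. The `fps` and `scale` video filters and the numbered output pattern are standard FFmpeg features, but the exact options used by Snippets are not stated, so the file names, the JPEG format and the eight-digit numbering are assumptions.

```python
import subprocess
from pathlib import Path


def screengrab_command(source, out_dir, width=480, height=270, fps=1):
    """Return an ffmpeg argument list that writes numbered JPEG screengrabs."""
    pattern = str(Path(out_dir) / "%08d.jpg")  # 00000001.jpg, 00000002.jpg, ...
    return [
        "ffmpeg",
        "-i", str(source),
        # Sample `fps` frames per second, then resize each sampled frame.
        "-vf", f"fps={fps},scale={width}:{height}",
        pattern,
    ]


def frame_offset_seconds(frame_number, fps=1):
    """Approximate time offset of a grab, given ffmpeg's 1-based numbering."""
    return (frame_number - 1) / fps


def extract_screengrabs(source, out_dir):
    """Run ffmpeg; requires the ffmpeg binary to be available on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(screengrab_command(source, out_dir), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(screengrab_command("programme.ts", "grabs")))
```

At one grab per second, the frame number maps directly to a playback position, which is what lets a visual navigation tool jump from a thumbnail to the matching point in the video.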
These three elements are the basic building blocks of the Snippets web app. We've also developed tools for snipping, sharing and transcoding and built APIs to many of these functions.
Snippets is currently being trialled by a number of production teams in BBC Vision and BBC News. In addition, it's being looked at by BBC Complaints Handling and Media Monitoring services authorised by the BBC Trust.
It's also being considered as one of the components of the BBC's Research & Education Space initiative. In addition we're working with a number of universities on image recognition projects using the Snippets Framestore. We welcome any enquiries from academics.