BBC R&D, the BBC Archive and digital public space: an overview of our work on the archive from preservation to multimedia classifications
The BBC has about a million hours of video and audio content, plus a wealth of documents, including the original scripts. Most of this content is still on magnetic tape, film, records or paper, so it needs to be digitised and made searchable before it can be contributed to the Digital Public Space, which was the subject of a recent technology podcast from the Guardian. BBC R&D has a long track record of developing innovative technology for the BBC’s archives, including the Ingex digitisation process for D3 videotape [BBC R&D White Paper WHP 155], and Reverse Standards Conversion, which reverse engineers the processes applied by the pioneering standards converters of the 1960s to programmes of that era that were provided to broadcasters abroad and lost from our own archives. Research is continuing to extend the digitisation process to other types of videotape and to develop automated methods for detecting picture and sound faults, both to ensure good quality digitisation and to assist restoration.
Finding interesting content in such a large archive is a key research challenge. If you already know the title, then it is simple, but there is likely to be interesting content that you don’t know about, perhaps programmes you’ve forgotten about or that were broadcast before you were born. The BBC has a formal catalogue of its archive that has been developed for professional media archivists to find content for professional programme makers. This catalogue uses the LONCLASS classification system, which is very powerful, allowing complex searches for specific items of content. However, this depends on the content having been manually catalogued in detail, which is not always the case for the early content in the archive.
In our Multimedia Classification project, BBC R&D is researching automated techniques that can allow content to be searched by the general public. Most people will be familiar with the concept of genre: comedy, drama, news, sport, etc. This would be a good starting point for searching the archive, but these classifications are very broad and so could still produce too many suggestions. They also wouldn’t help with more unusual searches, such as finding light-hearted drama or the more exciting football matches. The approach we are taking is to analyse the content directly for video and audio features that could indicate its emotional mood. For example, detecting scenes with head-and-shoulders shots of two people in a brightly lit room suggests that this could be a current affairs or news programme. Adding speech recognition could identify the subject of discussion, and detecting any non-voiced sounds may give a clue as to the type of location. Performing this analysis on a scene-by-scene basis enables searches for specific items, such as extracting the exciting parts of a football match.
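To make the scene-by-scene idea concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual method: the `Scene` fields, the two feature names, the equal weighting and the threshold are all assumptions, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start_s: float       # scene start time in seconds
    end_s: float         # scene end time in seconds
    audio_energy: float  # hypothetical loudness feature, normalised to 0..1
    motion: float        # hypothetical picture-motion feature, normalised to 0..1


def excitement(scene: Scene) -> float:
    """Toy score: equal-weight average of the two features (an assumption)."""
    return 0.5 * scene.audio_energy + 0.5 * scene.motion


def exciting_scenes(scenes, threshold=0.7):
    """Return the scenes whose toy excitement score reaches the threshold."""
    return [s for s in scenes if excitement(s) >= threshold]


# Invented feature values for four scenes of an imaginary football match.
scenes = [
    Scene(0, 60, 0.2, 0.3),
    Scene(60, 95, 0.9, 0.8),
    Scene(95, 200, 0.4, 0.5),
    Scene(200, 230, 0.8, 0.7),
]

highlights = exciting_scenes(scenes)
for s in highlights:
    print(f"{s.start_s}s to {s.end_s}s, score {excitement(s):.2f}")
```

Because each scene carries its own score, a search can return just the matching time ranges rather than a whole programme. A real system would need features and weights learned from data rather than chosen by hand.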
Analysing music from content is proving a powerful way to find something you like that fits with your mood. Musical Moods is a project with Salford University and the British Science Association (BSA), launched in March 2011 during the BSA’s National Science and Engineering Week. One of the key aims of this project was to identify what types of mood or emotion people identify with particular theme tunes. Participants were asked to rate the theme tunes on a range of different scales to see if all programmes in the same genre have the same overall mood; scales used included Happy/Sad, Dramatic/Calm, Masculine/Feminine, etc. The data obtained is now being used to train computers to identify the mood of a programme from its theme tune. Sam Davies from my team will go into more detail about our Multimedia Classification project in a post that will be published here tomorrow.
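As a rough illustration of how ratings like these could become training data, the sketch below averages listener ratings per tune and then labels an unseen tune with the moods of its nearest neighbour. Everything here is assumed for the example: the -2 to +2 rating range, the tune names, the two audio features and the nearest-neighbour method are stand-ins, not the approach used in Musical Moods.

```python
import math
from statistics import mean

# Invented ratings: (tune, scale, value), where -2 is the first pole of the
# scale and +2 the second (e.g. "sad_happy": -2 = sad, +2 = happy).
ratings = [
    ("tune_a", "sad_happy", 2), ("tune_a", "sad_happy", 1),
    ("tune_a", "calm_dramatic", -1), ("tune_a", "calm_dramatic", -2),
    ("tune_b", "sad_happy", -1), ("tune_b", "sad_happy", -2),
    ("tune_b", "calm_dramatic", 2), ("tune_b", "calm_dramatic", 2),
]


def mean_moods(ratings):
    """Average all listener ratings for each tune on each scale."""
    grouped = {}
    for tune, scale, value in ratings:
        grouped.setdefault(tune, {}).setdefault(scale, []).append(value)
    return {
        tune: {scale: mean(values) for scale, values in scales.items()}
        for tune, scales in grouped.items()
    }


# Hypothetical audio features per tune: (tempo, brightness), both scaled 0..1.
features = {"tune_a": (0.8, 0.9), "tune_b": (0.3, 0.2)}


def predict_mood(new_features, features, moods):
    """Label an unseen tune with the mean moods of its nearest rated tune."""
    nearest = min(features, key=lambda t: math.dist(features[t], new_features))
    return moods[nearest]


moods = mean_moods(ratings)
predicted = predict_mood((0.7, 0.8), features, moods)
print(predicted)
```

The averaged ratings play the role of ground truth, and the audio features are what the computer can measure for itself. With only two rated tunes the prediction is trivial, but the same shape scales to many tunes and richer features.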
Storing content as digital files in large, petabyte-scale storage systems, with the aim of producing a permanent archive for long-term preservation, brings many new problems to a media archive. Film, records and tape can last many years on shelves in an environmentally controlled facility. Digital storage systems, however, need to be actively managed to avoid losing files through storage system failures, accidental deletion or overwriting, and the rapid obsolescence of digital storage media. We are researching the best way to store content together with all the other information associated with it, i.e. its metadata, to create packages that can be managed in the digital archive and exchanged with other archives.
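One common building block for this kind of management is a checksum stored alongside the file, so that silent corruption or overwriting can be detected later. The sketch below pairs a media file with a small JSON manifest holding its SHA-256 digest and its metadata. The manifest layout, file names and function names are assumptions made for this example, not a format used by the BBC or PrestoPrime.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large media files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_package(media_path, metadata):
    """Write a manifest next to the media file (hypothetical layout)."""
    manifest = {
        "file": media_path.name,
        "sha256": sha256_of(media_path),
        "metadata": metadata,
    }
    manifest_path = media_path.parent / (media_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


def verify_package(manifest_path):
    """Return True only if the media file exists and still matches its digest."""
    manifest = json.loads(manifest_path.read_text())
    media_path = manifest_path.parent / manifest["file"]
    return media_path.exists() and sha256_of(media_path) == manifest["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    media = Path(tmp) / "programme.bin"
    media.write_bytes(b"stand-in for the real audio or video data")
    manifest = write_package(media, {"title": "Example programme"})

    intact = verify_package(manifest)                # file unchanged
    media.write_bytes(b"accidentally overwritten")   # simulate damage
    damaged_detected = not verify_package(manifest)

print(intact, damaged_detected)
```

Running the verification periodically, and keeping more than one copy to restore from, is what turns a checksum into protection. The checksum alone only tells you that something has gone wrong.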
BBC R&D is working together with several major European archives in the PrestoPrime European collaborative project that is developing techniques to manage media files in the digital world. The experiences and recommendations from the project are made publicly available through the PrestoCentre.
The BBC’s ambition to digitise the whole archive is an important enabling goal for the Digital Public Space. It is important to produce digital files at the right quality to ensure that future generations can see the content in pristine condition, so R&D is working on how best to apply video and audio coding to archive content. We are also working on the automatic analysis of content to detect faults, such as tape dropouts, and on techniques for restoration. This is a challenging research topic that is expected to take several years; however, it has many similarities with the technologies used for automatically searching content.