Posted by BBC Research and Development

So, imagine you're in the business of making and broadcasting content. You've been doing this for a while now and you're getting pretty good at it. People like your material, they get attached to it. They'd prefer that you didn't record over it. As productions get more expensive and the relative cost of recording media drops, you start to hang on to them. Fast forward a few decades, those tapes are really piling up now, and you're still making content, and now you're sitting on top of something that can be reasonably called 'An Archive'. At the same time powerful computers are becoming cheaper and more ubiquitous, no longer just lab curios.

Accordingly, some bright spark insists that you should store content as a precise digital copy, exactly as broadcast. You can do a lot with 1s and 0s. However, give it a few more years and those 1s and 0s are beginning to pile up too. Plus, you now have a copy of every broadcast rather than just every programme, repeats and all. Shows may have gone out with different trails or, worse, minor changes in the edit for obscure reasons. Worse yet, schedules are capricious beasts which aren't always adhered to, so you might not have recorded the start or end of the show. Anyone who had a VCR remembers this problem.

On top of that, your only information about the programme comes from the broadcast stream itself. This is seriously minimal, the same stuff your Freeview box gets: broadcast time, channel, title and a brief description. If you need to look something up, and you don't know exactly when it went out, you might be trawling through a lot of content. Pity those who are just looking for a clip: a snippet of a documentary, something you need to reuse from a radio broadcast, a news bite that has suddenly become relevant.

The logical extension is that you start to think about digitising all your piles of tapes too - you (probably) don't run the risk of chopping the end off your programmes, but now you have no automatic metadata at all.

Ok, busted. What we're actually talking about here is metadata. Sorry. Don't worry, it's going to be OK.

The Audience Experience research section has been thinking about different types and sources of metadata and why some might be more useful than others. In the past, we investigated unobtrusively embedding metadata, making it possible to embed images and text data inside MP3 files without harming compatibility with the MP3 specification and players.

[Image: MP3 of the Mark Kermode radio show with embedded chapters, text and images.]

We have also investigated the possibility of inferring metadata from the content itself. Sticking with the audio example, traditionally an audio waveform is used to help navigate around audio content. We've turned that audio data into insane striped multi-coloured visualisations, suspecting that seeing patterns could tell us something about the content.

[Image: Visualisation of the Jeremy Vine show.]

It turned out that the colours provided some useful signposts. Speech separated clearly from music, we could often distinguish between male and female speakers, and telephone interviews showed up as low-bandwidth dark spots. We used these visualisations as a bootstrap for a human-controlled music annotation tool. While we couldn't always use the visualisations to tell what was happening in the audio at any given point, we could use them to tell when something different was afoot. We could look at all sorts of other audio features, possibly finding other novel ways to navigate through sound.
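To make "telling when something different is afoot" a little more concrete, here is a minimal sketch. It is not the method behind our visualisations. It assumes two very simple features per window of audio (loudness as RMS, and zero-crossing rate as a rough stand-in for pitch and brightness) and flags any window that differs sharply from the one before it. The function names, the synthetic test signal and the threshold are all invented for illustration.

```python
import math


def tone(frequency_hz, amplitude, seconds, sample_rate):
    """Generate a plain sine tone as a list of float samples (test signal only)."""
    count = int(seconds * sample_rate)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
            for n in range(count)]


def window_features(samples, window_size):
    """Return (rms, zero_crossing_rate) for each non-overlapping window."""
    features = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        rms = math.sqrt(sum(s * s for s in window) / window_size)
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))
        features.append((rms, crossings / (window_size - 1)))
    return features


def change_points(features, threshold):
    """Indices of windows whose features jump by more than `threshold`
    (Euclidean distance) compared with the previous window."""
    return [i for i in range(1, len(features))
            if math.dist(features[i - 1], features[i]) > threshold]


# Hypothetical input: one second of a quiet low tone, then one second of a
# loud high tone, standing in for "speech, then music".
SAMPLE_RATE = 8000
signal = tone(200, 0.2, 1.0, SAMPLE_RATE) + tone(2000, 0.8, 1.0, SAMPLE_RATE)

features = window_features(signal, window_size=1000)
changes = change_points(features, threshold=0.2)
```

With 1000-sample windows the join between the two tones falls at the start of window 8, which is the only index the sketch reports. Real broadcast audio would need richer features and a less naive threshold, but the principle is the same: you don't need to know what changed to know that something did.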

Metadata can tell you what your content is. Metadata can tell you what the ingredients of your content are. If we drill down to a single unit of content, let's say a song, it can tell you what that is made of too. All of this seems irrelevant until you consider the problems that it can solve: the thousands of hours of digital content, and the person searching for that specific clip.

When we apply this to our giant digital Archive Of Content, two things become clear:

  • It would be useful to add metadata against the timeline of the content (thing X happened at time Y),
  • It would be a mammoth task to add this information from scratch, even if you knew at the outset exactly what kind of information is useful and what isn't.
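As a rough illustration of what "thing X happened at time Y" could look like as data, here is a minimal sketch. The class names, fields and example values are hypothetical and do not describe how any real archive system stores its metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    offset_seconds: float     # time Y, measured from the start of the recording
    label: str                # thing X
    source: str = "unknown"   # hypothetical provenance, e.g. "production" or "user"


@dataclass
class Recording:
    title: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, offset_seconds, label, source="unknown"):
        """Attach a label to a point on the timeline, keeping time order."""
        self.annotations.append(Annotation(offset_seconds, label, source))
        self.annotations.sort(key=lambda a: a.offset_seconds)

    def find(self, text):
        """Return annotations whose label contains `text` (case-insensitive)."""
        needle = text.lower()
        return [a for a in self.annotations if needle in a.label.lower()]


# Invented example data.
show = Recording("Example documentary")
show.annotate(754.0, "Interview with the director starts", source="user")
show.annotate(12.5, "Opening titles", source="production")

hits = show.find("interview")
```

Here `hits` holds the single annotation at 754 seconds, so someone hunting for the interview can jump straight to it rather than scrubbing through the whole programme.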

Luckily we don't have to create metadata from scratch if we can get it from somewhere else, especially from elsewhere in the production chain. We could collect different kinds of metadata, from information about locations and sets to full cast and crew lists: all sorts of useful stuff that could help you dig up that piece of content again.

There's no reason to stop there either. Once someone has retrieved the content from your archive, they could also add their own annotations. For example, a production assistant looking for a particular clip might leave some metadata behind, making it easier to find again. The next person seeking that clip would find a trail of breadcrumbs leading right to it. The archive should also apply any metadata belonging to a piece of content to its repeats. However, repeats aren't always identical to the original broadcast, so there's the possibility that the metadata for them is wrong.

Every metadata timestamp might be out by a few seconds if the programme was edited slightly, different trails accompanied it, or the channel was running late. Rather than just cursing "the system", the user could actually provide an effective fix with a few clicks, thus improving the quality of the archive overall.
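A few clicks really could be enough in the simplest case. The sketch below assumes the repeat differs from the original only by a constant shift (say, the channel was running late), so that a single user correction fixes every timestamp at once. The function names and example values are made up for illustration, and an edit in the middle of a programme would need more than one correction.

```python
def offset_from_correction(original_time, corrected_time):
    """How far the repeat has drifted, based on one user-corrected timestamp."""
    return corrected_time - original_time


def apply_offset(annotations, offset_seconds):
    """Shift every (seconds, label) pair by the same offset.

    Times are clamped at zero so nothing lands before the recording starts.
    """
    return [(max(0.0, seconds + offset_seconds), label)
            for seconds, label in annotations]


# Annotations inherited from the original broadcast (invented values).
original = [(12.5, "Opening titles"), (754.0, "Interview starts")]

# A user notices that in this repeat the interview actually starts at 761.5.
offset = offset_from_correction(754.0, 761.5)
repeat = apply_offset(original, offset)
```

After the correction, `repeat` places the opening titles at 20.0 seconds and the interview at 761.5, and every later visitor to that repeat benefits from one person's fix.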

It's all about taking advantage of the resources that you have. You need to make intelligent and innovative use of the metadata itself, but the trick is also to think laterally and capture it whenever you can. In addition to the now-accepted ways of filtering, searching and digging using metadata, you need to provide clear, easy and quick ways to allow people to improve it. Even something as seemingly mundane as a few clicks to mark the start of an interview in a programme enriches the value of the recording itself.

The best part? It's not anything that you wouldn't be doing anyway, if you were researching in the archive. You'd still need to make notes about where the things you were looking for start and end. Harnessing the power of a large network of users means that the more the system is used, the more usable it becomes.