Automatically tagging the World Service archive
A couple of months ago, Dominic blogged about ABC-IP, a collaborative project with MetaBroadcast looking at unlocking archive content by interlinking it with further data sources. In this post we describe some work we have been doing on automatic tagging of speech audio with DBpedia identifiers.
The World Service archive
One dataset we are looking at within this project is the World Service archive. This archive is isolated from other programme data sources at the BBC, like BBC Programmes or the Genome Project, and the associated programme data within it is very sparse. It would therefore benefit greatly from being automatically interlinked with further data sources, which makes it a particularly interesting use case. The archive is also very large: it covers many decades and consists of about two and a half years of high-quality continuous audio content.
Automated semantic tagging of speech audio
One way of dealing with such a large programme archive with patchy metadata but high-quality content is to use the content itself to find links with related data sources. For example, if a programme mentions 'London', 'Olympics' and '1948' a lot, then there is a high chance it is talking about the 1948 Summer Olympics. Using the structured data available in Wikipedia, we can then draw a link between a recent programme on the 2012 London Olympics and that archive programme, and use that link to provide further historical context.
When developing such an algorithm, we need to take into account two desirable properties: it needs to be efficient enough to apply to a large archive, and it needs to use an unbounded target vocabulary, as programmes within an archive can be about virtually anything.
We therefore built such a 'semantic tagger', which automatically assigns tags drawn from DBpedia (which publishes structured data extracted from Wikipedia) to speech radio programmes.
We start by automatically transcribing the audio, using the CMU Sphinx tools. The resulting transcripts are very noisy - there are lots of different accents in the archive, and it covers a lot of genres and topics. They also lack punctuation and capitalisation, on which most existing Named Entity Extraction tools rely heavily.
We then build a dictionary of terms from a list of labels extracted from DBpedia, and look for those terms in the automated transcripts. In order to disambiguate and rank candidate terms, we use an approach inspired by the Enhanced Topic-based Vector Space Model proposed by D. Kuropka.
We consider the subject classification in DBpedia, derived from Wikipedia categories and encoded as a SKOS model. We start by constructing a vector space for those categories that captures the hierarchical relationships between them. Two categories that are siblings will have a high cosine similarity. Two categories that do not share any ancestor will have a cosine similarity of zero. The further away the common ancestor of two categories is, the lower the cosine similarity between those two categories will be. We published an implementation of such a vector space model in our GitHub repository and describe it in more detail in our upcoming LDOW paper.
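To make the properties of that category vector space concrete, here is a minimal sketch in Python. It is not the implementation from our repository: the toy hierarchy, the `decay` parameter and the weighting scheme (each category gets one dimension per ancestor, with a weight that shrinks with distance) are all assumptions made for illustration. It only aims to reproduce the ordering described above: siblings score higher than categories whose common ancestor is further away, and categories with no shared ancestor score zero.

```python
import math

# Hypothetical toy hierarchy: category -> broader categories (as in skos:broader).
BROADER = {
    "Summer Olympics": ["Olympic Games"],
    "Winter Olympics": ["Olympic Games"],
    "Olympic Games": ["Sport"],
    "Football": ["Sport"],
    "Vaccines": ["Medicine"],
}

def category_vector(category, decay=0.5):
    """Sparse vector with one dimension per ancestor (and the category itself).

    The weight is decay ** distance, an assumed weighting scheme.
    """
    vector = {}
    frontier = [(category, 1.0)]
    while frontier:
        node, weight = frontier.pop()
        # Keep the largest weight seen; this also stops loops in the hierarchy.
        if weight > vector.get(node, 0.0):
            vector[node] = weight
            for parent in BROADER.get(node, []):
                frontier.append((parent, weight * decay))
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

summer = category_vector("Summer Olympics")
siblings = cosine(summer, category_vector("Winter Olympics"))
distant = cosine(summer, category_vector("Football"))
unrelated = cosine(summer, category_vector("Vaccines"))
print(siblings, distant, unrelated)
```

In this toy setup the two Olympics categories share a parent, 'Football' only shares the more distant 'Sport' ancestor, and 'Vaccines' shares nothing, so the three similarities come out in decreasing order with the last one equal to zero.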
We consider a vector in that space for each DBpedia web identifier, corresponding to a weighted sum of all the categories attached to it. We then construct a vector modelling the whole programme, by summing the vectors of all possible DBpedia web identifiers for all candidate terms. DBpedia identifiers corresponding to wrong interpretations of specific terms will account for very little in the resulting vector, while web identifiers related to the main topics of the programme will overlap and add up. For each ambiguous term, we pick the corresponding DBpedia web identifier that is closest to that programme vector. We then rank the resulting web identifiers by considering their TF-IDF score (taking into account how often the corresponding term is mentioned in the programme and how specific to the programme that term is) and their distance to the programme vector.
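The disambiguation and ranking steps can be sketched as follows. This is an illustration under stated assumptions rather than our actual code: the candidate identifiers, their category vectors and the TF-IDF values are made-up toy data, and multiplying the TF-IDF score by the cosine similarity is just one plausible way of combining the two signals.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def programme_vector(candidates):
    """Sum the vectors of every candidate identifier of every term."""
    total = {}
    for options in candidates.values():
        for vector in options.values():
            for key, weight in vector.items():
                total[key] = total.get(key, 0.0) + weight
    return total

def disambiguate(candidates):
    """For each term, keep the identifier closest to the programme vector."""
    programme = programme_vector(candidates)
    chosen = {}
    for term, options in candidates.items():
        best = max(options, key=lambda ident: cosine(options[ident], programme))
        chosen[term] = (best, cosine(options[best], programme))
    return chosen

def rank(chosen, tfidf):
    """Rank identifiers by TF-IDF times similarity (an assumed combination)."""
    scored = [(ident, tfidf[term] * sim) for term, (ident, sim) in chosen.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: term -> {candidate identifier: category vector}. All hypothetical.
candidates = {
    "london": {
        "London": {"Capitals in Europe": 1.0, "Olympic host cities": 1.0},
        "London, Ontario": {"Cities in Ontario": 1.0},
    },
    "olympics": {
        "Olympic Games": {"Olympic Games": 1.0, "Sport": 1.0},
    },
    "1948": {
        "1948 Summer Olympics": {"Olympic Games": 1.0, "Olympic host cities": 1.0},
    },
}
tfidf = {"london": 1.2, "olympics": 2.5, "1948": 3.0}  # made-up scores

chosen = disambiguate(candidates)
ranked = rank(chosen, tfidf)
print(ranked)
```

Here the 'London, Ontario' reading of 'london' shares no categories with the other candidates, so it contributes little to the programme vector and loses out to 'London', whose categories overlap with the Olympics-related identifiers.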
We end up with a ranked list of DBpedia identifiers for each programme. For example, a 1970 profile of the composer Gustav Holst gets tagged with Benjamin Britten, music and Gustav Holst, and a 1983 episode of the Medical Programme gets tagged with Hepatitis, Vaccine and medical research.
We evaluated the results against 150 programmes that have been manually tagged in BBC Programmes and found that the results, although by no means perfect, are good enough to efficiently bootstrap the tagging of a large collection of programmes. The results of our evaluation will be published as part of our LDOW paper.
Processing the World Service archive
Applying such an algorithm to a very large archive is a challenge. Even though the tagging step is quite fast, the transcription step is slightly slower than real-time on commodity hardware. However, all steps apart from the final IDF step can be parallelised. We can therefore throw a lot of machines at the problem to process an archive relatively quickly. We developed a message queue-based system to distribute computations across a large pool of EC2 machines and an API aggregating all the data generated at each step of the processing workflow. A screenshot of that API 'statistics' page as of today is available below.
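The split between parallel and global steps can be illustrated with a small local sketch. A thread pool stands in for our message queue and pool of EC2 machines, and whitespace-separated tokens stand in for the dictionary terms found in each transcript; both are simplifying assumptions, as are the function names. The per-programme counting runs in parallel, while the IDF step has to wait until every programme has been counted.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def term_counts(transcript):
    """Per-programme step: independent of all other programmes."""
    return Counter(transcript.split())

def tfidf_for_archive(transcripts, workers=4):
    """Count terms in parallel, then apply the archive-wide IDF step."""
    ids = list(transcripts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(term_counts, (transcripts[i] for i in ids)))

    # Global step: document frequencies need the counts of every programme.
    document_frequency = Counter()
    for programme_counts in counts:
        document_frequency.update(programme_counts.keys())

    total = len(ids)
    return {
        programme_id: {
            term: tf * math.log(total / document_frequency[term])
            for term, tf in programme_counts.items()
        }
        for programme_id, programme_counts in zip(ids, counts)
    }

# Toy transcripts, made up for illustration.
transcripts = {
    "p1": "olympics london olympics",
    "p2": "vaccine london",
}
scores = tfidf_for_archive(transcripts)
print(scores)
```

A term that appears in every programme, like 'london' in this toy archive, ends up with a score of zero, which is why the final ranking can only be set once the whole archive has been processed.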
Each EC2 'Compute Unit' has relatively predictable performance (for example, we transcribe 60 minutes of audio in 80 minutes on one Compute Unit), which means that the price for processing an entire archive can be estimated prior to running the computation. Also, that price won't depend on the time it takes to process the entire archive: throwing 100 machines at the problem will get results quickly and for the same price as running 10 machines for 10 times longer. The only bottleneck is the bandwidth at which we can send audio to Amazon servers, which means we can only process about 20,000 programmes per week.
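That estimate is simple arithmetic, sketched below. The 80 minutes of compute per 60 minutes of audio comes from the figure above and the archive size from the 'about two and a half years' mentioned earlier; the price per Compute Unit hour is a placeholder rather than a real EC2 price, and the sketch assumes one Compute Unit per machine and ignores transfer and storage costs.

```python
def estimate(audio_hours, machines, price_per_unit_hour, realtime_factor=80 / 60):
    """Estimate transcription effort for an archive.

    realtime_factor is compute time per unit of audio (80 min per 60 min).
    price_per_unit_hour is a hypothetical price, not a quoted one.
    """
    compute_unit_hours = audio_hours * realtime_factor
    return {
        "compute_unit_hours": compute_unit_hours,
        "wall_clock_hours": compute_unit_hours / machines,
        "cost": compute_unit_hours * price_per_unit_hour,
    }

# About two and a half years of continuous audio.
archive_hours = 2.5 * 365 * 24

small_pool = estimate(archive_hours, machines=10, price_per_unit_hour=0.10)
large_pool = estimate(archive_hours, machines=100, price_per_unit_hour=0.10)
print(small_pool)
print(large_pool)
```

Both calls give the same cost, since the number of machines only divides the wall-clock time and never enters the cost calculation.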
Then at regular intervals a script sets the final ranking of the resulting tags and pushes the resulting data over to MetaBroadcast's systems. An example of those automated tags showing in their Tellytopic prototype is available below.
The underlying algorithm is described in more detail in a paper accepted at LDOW, part of the WWW'12 conference. The application of that algorithm to the entire World Service archive is described in more detail in an upcoming WWW'12 paper accepted in the demo track. We will post pointers to the papers as soon as they get published.
There is quite a lot more we could do to make that automated tagging algorithm work better. One of the first things we could improve is the quality of the speech recognition. We use off-the-shelf acoustic and language models and we could probably get better results by using models trained on similar data. We are also looking at automated segmentation. Most programmes deal with a few different topics and it would be interesting to isolate topic-specific segments. We also recently started some work on automatically identifying contributors in programmes.