In this project we have investigated the possibility of automatically assigning topics to large programme archives in a reasonable time.

What we've done

Kiwi is a framework aimed at automatically identifying topics in speech radio programmes, with topic identifiers being drawn from Linked Open Data sources such as DBpedia.

In order to generate such topics in a reasonable time for large programme archives, we built a processing infrastructure distributing computations on cloud resources (e.g. Amazon EC2). We used this infrastructure to automatically tag the entire BBC World Service archive (70,000 programmes) in around two weeks.

Why it matters

The BBC manually tags recent programmes on its website. Editors draw these tags from open datasets made available within the Linked Data cloud, but assigning them is a time-consuming process. While recent programmes are tagged, the BBC also holds a very large radio archive that is currently untagged.

Tags enable a wide variety of use cases, such as the dynamic building of topical aggregations, retrieval through topic-based search, or cross-domain navigation. Automatic tagging of archive content would ensure archive programmes are as findable as recent programmes. It would mean that topic-based collections of archive content can easily be built, for example to find archive content that relates to current news events. Kiwi provides an algorithm and an infrastructure to automatically tag very large programme archives in a cost-effective and scalable manner.

How it works

We built an automated tagging algorithm using speech audio as an input.

We use the open source CMU Sphinx-3 software, with the HUB4 acoustic model and a language model extracted from the Gigaword corpus. The resulting transcripts are very noisy and have no punctuation or capitalisation, which means off-the-shelf concept tagging tools perform badly on them. We therefore designed an alternative concept tagging algorithm.

We start by generating a list of web identifiers used by BBC editors to tag programmes. Those web identifiers identify people, places, subjects and organisations within DBpedia. This list of identifiers constitutes our target vocabulary. For each of those identifiers, we aggregate a number of textual labels and look for them in the automated transcripts. The output of this process is a list of candidate terms found in the transcripts and a list of possible corresponding DBpedia web identifiers for each of them. For example, if ‘paris’ was found in the transcripts it could correspond to at least two possible DBpedia identifiers: "Paris" and "Paris, Texas".
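The lookup step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the label table is a toy stand-in for the aggregated DBpedia labels, and the matching is a simple n-gram scan over the lowercase transcript.

```python
# Toy label table: surface form -> possible DBpedia identifiers.
# In the real system these labels are aggregated from DBpedia.
label_to_ids = {
    "paris": {"dbpedia:Paris", "dbpedia:Paris,_Texas"},
    "eiffel tower": {"dbpedia:Eiffel_Tower"},
    "france": {"dbpedia:France"},
}

def find_candidates(transcript, label_to_ids, max_label_words=3):
    """Scan a noisy, lowercase transcript for n-grams matching known labels."""
    tokens = transcript.split()
    candidates = {}
    for n in range(1, max_label_words + 1):
        for i in range(len(tokens) - n + 1):
            term = " ".join(tokens[i:i + n])
            if term in label_to_ids:
                candidates.setdefault(term, set()).update(label_to_ids[term])
    return candidates

cands = find_candidates("we flew from paris to the south of france", label_to_ids)
# 'paris' remains ambiguous at this stage: both identifiers are kept
# as candidates until the disambiguation step described below.
```

Note that the output deliberately keeps every possible identifier for an ambiguous term; choosing between them is the job of the next step.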

In order to disambiguate and rank candidate terms, we consider the subject classification in DBpedia, derived from Wikipedia categories and encoded as a SKOS hierarchy. We start by constructing a vector space for those SKOS categories, capturing hierarchical relationships between them. Two categories that are siblings will have a high cosine similarity. Two categories that do not share any ancestor will have a null cosine similarity. The further away a common ancestor between two categories is, the lower the cosine similarity between those two categories will be. We implemented such a vector space model within our RDF-Sim project. We consider a vector in the same space for each DBpedia web identifier, corresponding to a weighted sum of all the categories attached to it. We then construct a vector modelling the whole programme, by summing the vectors of all possible corresponding DBpedia web identifiers for all candidate terms. Web identifiers corresponding to wrong disambiguations of specific terms will account for very little in the resulting vector, while web identifiers related to the main topics of the programme will overlap and add up. For each ambiguous term, we pick the corresponding DBpedia web identifier that is the closest to that programme vector.
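The disambiguation step can be sketched with sparse vectors, as below. This is a hand-made illustration under simplifying assumptions: in the real system the category vectors come from the SKOS hierarchy via RDF-Sim, whereas here the dimensions and weights are toy values chosen so that each identifier's vector spreads over its categories.

```python
import math

# Toy identifier vectors: each maps SKOS category dimensions to weights.
id_vectors = {
    "dbpedia:Paris": {"Cities_in_France": 1.0, "France": 0.5},
    "dbpedia:Paris,_Texas": {"Cities_in_Texas": 1.0, "Texas": 0.5},
    "dbpedia:France": {"France": 1.0, "Countries_in_Europe": 0.5},
    "dbpedia:Eiffel_Tower": {"Buildings_in_Paris": 1.0, "France": 0.5},
}

def add(vec, other):
    """Accumulate a sparse vector into vec in place."""
    for k, w in other.items():
        vec[k] = vec.get(k, 0.0) + w

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Programme vector: sum of the vectors of every candidate identifier.
programme = {}
for vec in id_vectors.values():
    add(programme, vec)

# Disambiguate 'paris' by picking the candidate closest to the programme vector.
candidates = ["dbpedia:Paris", "dbpedia:Paris,_Texas"]
best = max(candidates, key=lambda i: cosine(id_vectors[i], programme))
```

Because the France-related dimensions overlap across several candidates, the programme vector points towards France, and "dbpedia:Paris" wins over "dbpedia:Paris,_Texas".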

For example, if the automated transcripts mention ‘paris’, ‘france’ and ‘tour eiffel’ a lot, the resulting programme vector will point towards France-related categories, e.g. "France" or "Cities in France". The right disambiguation of ‘paris’ will be the one that is closest to that programme vector, hence d:Paris. We then rank the resulting web identifiers by considering their TF-IDF score and their distance to the programme vector. We end up with a ranked list of DBpedia web identifiers for each programme.
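The final ranking step can be sketched as below. This is a hypothetical combination rule, assumed for illustration only: the source says both TF-IDF and distance to the programme vector are considered, but not how they are combined, so the sketch simply multiplies the two scores, and all numbers are made up.

```python
import math

def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    """Standard TF-IDF: term frequency times inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(total_docs / (1 + docs_with_term))
    return tf * idf

def rank(identifiers):
    """identifiers: list of (id, tf_idf_score, similarity_to_programme).
    Assumed combination: product of the two scores, highest first."""
    return sorted(identifiers, key=lambda x: x[1] * x[2], reverse=True)

# Made-up scores for three disambiguated identifiers.
ranked = rank([
    ("dbpedia:Paris", 0.12, 0.9),
    ("dbpedia:France", 0.20, 0.8),
    ("dbpedia:Eiffel_Tower", 0.05, 0.7),
])
```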

We separated each step of this workflow into independent self-contained applications, or "workers". Each worker takes as input the results of the previous step of the workflow, and produces output to be passed to the next step. We also configured a message-queueing system using RabbitMQ to allow workers to pick up new tasks and assign tasks to one another. We built an HTTP interface centralising all intermediate and final results, as well as keeping track of the status of each worker. We deployed a number of workers on a cloud infrastructure in order to process as much data as possible in parallel.
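The worker pipeline can be sketched in-process as follows. In production the queues are RabbitMQ queues and each worker is a separate application; here Python's standard `queue.Queue` stands in for the broker, and the two processing stages are stubs, so the sketch is self-contained and only illustrates the hand-off pattern.

```python
import queue

def run_worker(process, inbox, outbox):
    """Drain one queue, apply the worker's processing step, publish results
    to the next queue. A real worker would block on the broker instead."""
    while True:
        try:
            task = inbox.get_nowait()
        except queue.Empty:
            break
        outbox.put(process(task))

# One queue per hand-off between workflow steps.
transcribe_q, tag_q, done_q = queue.Queue(), queue.Queue(), queue.Queue()
transcribe_q.put("programme-001.wav")

# Stage 1: speech-to-text worker (stubbed).
run_worker(lambda audio: f"transcript of {audio}", transcribe_q, tag_q)
# Stage 2: tagging worker (stubbed).
run_worker(lambda text: ("programme-001", ["dbpedia:France"]), tag_q, done_q)

result = done_q.get()
```

Because each worker only talks to its inbox and outbox queues, any number of identical workers can be deployed against the same queues to scale a slow stage independently.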


We used this algorithm and infrastructure to automatically tag the entire BBC World Service archive (around 70,000 programmes, or three years of continuous audio), for which we have very few annotations. The resulting tags were used to unlock this archive and make it available through the World Service Archive prototype. Some of the tools created have been made available as open source software on GitHub: a library for generating vector spaces from large hierarchies and an automated audio tagging evaluation framework.

People and partners

Project team