Muddy Boots

Jonathan Austin | 12:00 UK time, Wednesday, 10 December 2008

We've been experimenting with the Semantic Web using a prototype called Muddy Boots. The question that we're trying to answer is: Can a computer reliably identify the people and organisations in news stories? This is still work in progress, but we have a prototype and an API that you're welcome to explore.

When a journalist refers to someone in a news story they usually give the person's full name and enough information so the reader can understand who they are talking about. If the full name is ambiguous they may have to add a title or give an explanation about who the person is. But sometimes, especially for household names, the reader is expected to infer the identity of the person from the context of the story and by applying a reasonable level of background knowledge.

Whilst a human reader takes for granted their abilities to pick up journalists' cues and understand context, a computer has to be programmed explicitly. It is difficult to design a system that can identify people from text and disambiguate them. It is even harder to build a system that meets editorial standards of accuracy. However, in theory, it should be possible. So we've been experimenting to develop an approach that could lead to a system that reliably identifies people (and organisations) in stories and marks up their textual names with semantic information. There are four key challenges:

  • Build working prototypes
  • Write tests for the prototypes that express editorial standards
  • Refine the prototypes to reach defined levels of reliability
  • Express the information usefully through semantic mark-up

Prototypes

The prototypes are ready to share with you. They were built for us by Rattle Research, a Sheffield-based company that was a successful participant in the BBC Innovation Labs.

There are two systems available. Both are based on DBpedia (the structured version of Wikipedia), which provides the controlled vocabulary of people and organisations. In these prototypes, therefore, each person in a news story is described by their Wikipedia entry. Wikipedia is potentially a good source of controlled vocabulary for news because it has wide scope and is open and dynamic. It is certainly useful for prototyping.

  • The first method is called "Muddy". It works by extracting proper names from the story text and then matching them to entries in DBpedia. If a term is ambiguous, the system uses various strategies based on Wikipedia's disambiguation pages and the structure of DBpedia to resolve the conflict. More information can be found on Rattle's website here.
  • The second method is called "conText". It was initially proposed by Chris Sizemore and is described in detail in his blog post here. This method uses search technology (Google and Lucene) to enhance the results further.
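
To give a feel for the steps described for the first method (extract proper names, match them to a vocabulary, resolve ambiguity), here is a rough sketch in Python. It is an illustration only and not Rattle's implementation: the small `VOCABULARY` dictionary is a hypothetical stand-in for DBpedia, the entry ids and keywords are invented, the regular expression is a crude substitute for real name extraction, and scoring by keyword overlap is an assumed disambiguation strategy.

```python
import re

# Hypothetical stand-in for DBpedia: each name maps to candidate entries,
# each with a few context keywords (ids and keywords invented for illustration).
VOCABULARY = {
    "Gordon Brown": [
        {"id": "Gordon_Brown", "keywords": {"prime", "minister", "labour"}},
        {"id": "Gordon_Brown_(rugby_union)", "keywords": {"rugby", "scotland"}},
    ],
    "BBC": [
        {"id": "BBC", "keywords": {"broadcaster", "television", "radio"}},
    ],
}

# Crude proper-name pattern: runs of capitalised words or all-caps acronyms.
WORD = r"(?:[A-Z][a-z]+|[A-Z]{2,})"
NAME_PATTERN = re.compile(rf"\b{WORD}(?:\s+{WORD})*\b")


def extract_runs(text):
    """Return runs of capitalised words, e.g. 'Prime Minister Gordon Brown'."""
    return [match.group(0) for match in NAME_PATTERN.finditer(text)]


def disambiguate(candidates, text):
    """Pick the candidate whose keywords overlap most with the story text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(candidates, key=lambda c: len(c["keywords"] & words))


def identify(text):
    """Map each vocabulary name found in the text to one entry id."""
    results = {}
    for run in extract_runs(text):
        words = run.split()
        # A run may contain a title before the name, so try every sub-span.
        for start in range(len(words)):
            for end in range(len(words), start, -1):
                name = " ".join(words[start:end])
                if name in VOCABULARY:
                    results[name] = disambiguate(VOCABULARY[name], text)["id"]
    return results
```

Calling `identify()` on a story that mentions "Gordon Brown" alongside words such as "minister" or "rugby" shows how the surrounding text steers the choice between candidate entries.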

The good news for anyone who is not an expert in term or knowledge extraction is that Rattle implemented both methods behind a common abstract API. In effect we can treat both methods like black boxes. We don't need to know how they work to use them and evaluate their ability to identify people.
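
The black-box idea can be sketched as a shared interface with interchangeable implementations. The sketch below is not Rattle's API: `EntityIdentifier`, its `identify` method, and the two toy classes are hypothetical names chosen for illustration. They only show how code that evaluates the systems need not know how either one works.

```python
from typing import Protocol


class EntityIdentifier(Protocol):
    """Hypothetical common interface: story text in, entry ids out."""

    def identify(self, text: str) -> set[str]:
        ...


class FullNameIdentifier:
    """Toy method: finds an entry only when its full name appears."""

    def __init__(self, known: dict[str, str]):
        self.known = known

    def identify(self, text: str) -> set[str]:
        return {entry for name, entry in self.known.items() if name in text}


class SurnameIdentifier:
    """Toy method: finds an entry when the last word of its name appears."""

    def __init__(self, known: dict[str, str]):
        self.known = known

    def identify(self, text: str) -> set[str]:
        return {
            entry
            for name, entry in self.known.items()
            if name.split()[-1] in text
        }


def run_all(identifiers: dict[str, EntityIdentifier], text: str) -> dict[str, set[str]]:
    """Run every system on the same story without knowing its internals."""
    return {label: system.identify(text) for label, system in identifiers.items()}
```

Because `run_all` depends only on the interface, any further system that offers the same `identify` method could be added to the comparison without changing the evaluation code.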

In addition, Rattle implemented some visualisations so that we can get a feel for how the systems work. Below are some sample stories that have had people and organisations identified. You can also submit additional stories by following the final link.

Testing

It doesn't take long to see that neither prototype is perfect. Sometimes they miss people and sometimes they get them wrong. But that is the point of the research. How good are they really, and can they be improved? Our next step is to measure them against our editorial standards.

So we are currently working with another Innovation Labs entrant, ThinkTankMaths, to develop some tests. We're going to compare the performance of both systems (and any system that implements Rattle's API) with the performance of human beings. We will also be proposing measures that evaluate the systems from an editorial point of view. For example, is it editorially more acceptable for the system to miss the name of a cat owner whose cat gets stuck in a tree than to miss the Prime Minister? And what should the system do when the name of that cat owner is Gordon Brown?
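
One way to picture such a comparison is to score a system's output against a human annotator's, giving each person an editorial weight. The sketch below is an assumption for illustration, not the measure we will propose: the function name `score`, the entry ids and the weight values are all invented. It computes weighted precision (how much of what the system found was right) and weighted recall (how much of what the human found the system also found).

```python
def score(predicted, human, weights=None):
    """Return (precision, recall) for one story, weighted by importance.

    predicted and human are sets of entry ids. weights maps an entry id to
    an editorial importance; entries without a weight count as 1.0.
    """
    weights = weights or {}

    def total(entries):
        return sum(weights.get(entry, 1.0) for entry in entries)

    correct = predicted & human
    # By convention here, an empty set scores 1.0 rather than dividing by zero.
    precision = total(correct) / total(predicted) if predicted else 1.0
    recall = total(correct) / total(human) if human else 1.0
    return precision, recall


# Invented example: the human annotator found two people in the story.
human = {"Prime_Minister", "Cat_Owner"}
weights = {"Prime_Minister": 5.0}  # assumed weight, purely illustrative

missed_pm = score({"Cat_Owner"}, human, weights)
missed_owner = score({"Prime_Minister"}, human, weights)
```

With these assumed weights, a system that misses the Prime Minister gets a much lower recall than one that misses the cat owner, even though each missed exactly one person.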

We will post more about this and our initial findings in the New Year. In the meantime, feel free to have a look at the API and the prototypes; we'd like to hear your thoughts.
