
Muddy Boots

Jonathan Austin | 12:00 UK time, Wednesday, 10 December 2008

We've been experimenting with the Semantic Web using a prototype called Muddy Boots. The question that we're trying to answer is: Can a computer reliably identify the people and organisations in news stories? This is still work in progress, but we have a prototype and an API that you're welcome to explore.

When a journalist refers to someone in a news story they usually give the person's full name and enough information so the reader can understand who they are talking about. If the full name is ambiguous they may have to add a title or give an explanation about who the person is. But sometimes, especially for household names, the reader is expected to infer the identity of the person from the context of the story and by applying a reasonable level of background knowledge.

Whilst a human reader takes for granted their abilities to pick up journalists' cues and understand context, a computer has to be programmed explicitly. It is difficult to design a system that can identify people from text and disambiguate them. It is even harder to build a system that meets editorial standards of accuracy. However, in theory, it should be possible. So we've been experimenting to develop an approach that could lead to a system that reliably identifies people (and organisations) in stories and marks up their textual names with semantic information. There are four key challenges:

  • Build working prototypes
  • Write tests for the prototypes that express editorial standards
  • Refine the prototypes to reach defined levels of reliability
  • Express the information usefully through semantic mark-up

Prototypes

The prototypes are ready to share with you. They were built for us by Rattle Research, a company based in Sheffield that was a successful participant in the BBC Innovation Labs.

There are two systems available. Both are based on DBpedia (the structured version of Wikipedia), which provides the controlled vocabulary of people and organisations. In these prototypes, therefore, each person in a news story is described by their Wikipedia entry. Wikipedia is potentially a good controlled vocabulary source for news because it has wide scope and is open and dynamic. It is certainly useful for prototyping.

  • The first method is called "Muddy". It works by extracting proper names from the story text and then matching them to entries in DBpedia. If a term is ambiguous, the system uses various strategies based on Wikipedia's disambiguation pages and the structure of DBpedia to resolve the conflict. More information can be found on Rattle's website.
  • The second method is called "conText". It was initially proposed by Chris Sizemore and is described in detail in his blog post. This method uses search technology (Google and Lucene) to enhance the results further.
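
To give a feel for the first approach, here is a toy sketch in Python. It is not Rattle's code: the tiny VOCABULARY table stands in for DBpedia, the "dbpedia:..." identifiers and context words are hand-written placeholders, and disambiguation by counting shared context words is an assumption made purely for illustration.

```python
import re

# Hypothetical stand-in for DBpedia: surface name -> candidate entries.
# Identifiers and context words are illustrative placeholders only.
VOCABULARY = {
    "Gordon Brown": [
        {"id": "dbpedia:Gordon_Brown",
         "context": {"prime", "minister", "labour", "chancellor"}},
        {"id": "dbpedia:Gordon_Brown_(rugby_union)",
         "context": {"rugby", "scotland", "lions"}},
    ],
    "BBC": [
        {"id": "dbpedia:BBC",
         "context": {"broadcaster", "television", "radio"}},
    ],
}

# Runs of capitalised words, e.g. "The Prime Minister Gordon Brown" or "BBC".
CAPITALISED_RUN = re.compile(r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b")


def candidate_spans(run):
    """Yield every contiguous sub-span of a run, longest first."""
    words = run.split()
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            yield " ".join(words[start:start + length])


def resolve(name, text):
    """Pick the candidate whose context words overlap the story most."""
    story_words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(VOCABULARY[name], key=lambda c: len(c["context"] & story_words))
    return best["id"]


def identify(text):
    """Map each recognised name in the text to a vocabulary identifier."""
    found = {}
    for match in CAPITALISED_RUN.finditer(text):
        for span in candidate_spans(match.group(0)):
            if span in VOCABULARY:
                found[span] = resolve(span, text)
                break  # keep only the longest match in this run
    return found
```

Calling identify() on a sentence about the Prime Minister and on one about a rugby match returns different entries for the same name "Gordon Brown", which is the disambiguation problem in miniature.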

The good news for anyone who is not an expert in term or knowledge extraction is that Rattle implemented both methods behind a common abstract API. In effect we can treat both methods like black boxes. We don't need to know how they work to use them and evaluate their ability to identify people.
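
The black-box idea can be sketched as follows. This is an assumed shape, not Rattle's actual API: the names Entity, EntityIdentifier, identify and names_found are hypothetical, and the two implementations are deliberately trivial stand-ins for the real methods.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str        # the name as listed in the vocabulary
    identifier: str  # vocabulary entry, e.g. a DBpedia resource


class EntityIdentifier(ABC):
    """Hypothetical common interface that every method implements."""

    @abstractmethod
    def identify(self, text: str) -> list[Entity]:
        ...


class ExactIdentifier(EntityIdentifier):
    """Toy method A: exact, case-sensitive lookup in a fixed table."""

    def __init__(self, table: dict[str, str]):
        self._table = table

    def identify(self, text: str) -> list[Entity]:
        return [Entity(n, i) for n, i in self._table.items() if n in text]


class CaseInsensitiveIdentifier(EntityIdentifier):
    """Toy method B: same table, but ignores letter case."""

    def __init__(self, table: dict[str, str]):
        self._table = table

    def identify(self, text: str) -> list[Entity]:
        lowered = text.lower()
        return [Entity(n, i) for n, i in self._table.items()
                if n.lower() in lowered]


def names_found(system: EntityIdentifier, text: str) -> set[str]:
    """Evaluation code depends only on the interface, not the internals."""
    return {entity.name for entity in system.identify(text)}
```

Because names_found() only sees the interface, the same evaluation code can be pointed at either method, or at any later system that implements it.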

In addition, Rattle implemented some visualisations so that we can get a feel for how the systems work. Below are some sample stories that have had people and organisations identified. You can also submit additional stories by following the final link.

Testing

It doesn't take long to see that neither prototype is perfect. Sometimes they miss people and sometimes they get them wrong. But that is the point of the research. How good are they really, and can they be improved? Our next step is to measure them against our editorial standards.

So we are currently working with another Innovation Labs entrant, ThinkTankMaths, to develop some tests. We're going to compare the performance of both systems (and any system that implements Rattle's API) to the performance of human beings. We will also be proposing measures that evaluate the systems from an editorial point of view. For example, is it editorially more acceptable for the system to fail to spot the name of a cat owner whose cat gets stuck in a tree than the name of the Prime Minister? And what should the system do when the name of that cat owner is Gordon Brown?
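
One way such a comparison could be scored is sketched below. These are not the measures being developed with ThinkTankMaths: precision and recall against a human annotation are standard, while the "editorial weight" attached to each person is an assumption introduced here to show how missing the Prime Minister might count for more than missing a cat owner.

```python
def precision_recall(predicted, human):
    """Compare a system's entities with a human annotator's entities."""
    predicted, human = set(predicted), set(human)
    hits = predicted & human
    precision = len(hits) / len(predicted) if predicted else 1.0
    recall = len(hits) / len(human) if human else 1.0
    return precision, recall


def weighted_recall(predicted, human_weights):
    """Recall where each human-marked entity carries an editorial weight.

    human_weights maps entity -> weight (hypothetical values chosen by
    editors); a heavier entity costs more when the system misses it.
    """
    predicted = set(predicted)
    total = sum(human_weights.values())
    if total == 0:
        return 1.0
    found = sum(w for entity, w in human_weights.items() if entity in predicted)
    return found / total
```

With illustrative weights of 10 for the Prime Minister and 1 for the cat owner, a system that finds only the Prime Minister and one that finds only the cat owner both have a plain recall of 0.5, but their weighted recall is 10/11 and 1/11 respectively.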

We will post more about this and our initial findings in the New Year. In the meantime, we'd like to hear your thoughts, so feel free to have a look at the API and the prototypes.

Comments

  • 1. At 3:00pm on 10 Dec 2008, Briantist wrote:

    There are two more options, of course:

    1) Actually have BBC hacks sorting out their own links manually like Wikipedia.

    2) Given that BBC News pages are read by 100,000 people, use the community to suggest the necessary links (AJAX in realtime, perhaps) with BBC editorial approval.

  • 2. At 02:01am on 13 Dec 2008, gladioolers wrote:

    I think it's a good method.
    so helpful for journalist.


    ---

    Busby SEO Test

  • 3. At 3:27pm on 19 Jan 2009, corriedog69 wrote:

    This is very interesting. I've heard about Google using Latent Semantic Indexing (LSI) but didn't realize that it could be taken even further. Unfortunately it's a bit too over my head to blog about it.

  • 4. At 06:22am on 21 Apr 2009, U13912239 wrote:

    That the BBC's Bowen could be criticised for saying "Zionism's innate instinct to push out the frontier" and that Israel showed a "defiance of everyone's interpretation of international law except its own" and that its generals felt that they were dealing with "unfinished business", is hard to believe.

    Those are facts, very concisely expressed.

    If anything, the BBC could amplify them.

    There is no valid "other side" to those facts that could be offered.

    What could be offered would be a discussion of today's Zionism and the psyche of Israel that explains its dominant attitude and the actions of its government.
    More sympathy could be expressed for Israel if the psychological "why" were better understood. That, however, would be an opinion piece.

    My view is that Bowen has done a very good job.

    Here, on the Journalism Labs, one may discuss this.
