
Wikipedia + Lucene's MoreLikeThis = useful bits about the bits?

Chris Sizemore | 14:49 UK time, Friday, 13 June 2008


'bits about the bits' -- those bits that describe the narrative...

My colleague Michael recently posted about Nicholas Negroponte's prescient 1995 musings on the info-glut challenges that traditional TV and radio broadcasters are now feeling as a result of going digital.

Negroponte: "...we need those bits that describe the narrative with key words... these will be inserted by humans aided by machines... the[se] bits about the bits change broadcasting totally... they give [audiences] a handle by which to grab what interests [them], and [they] provide the [broadcaster] with a means to ship [its programmes] into any nook or cranny that wants them..."

I've been working for some years now on methods of providing audiences with access to BBC Radio and TV programmes based on genre, topic, and subject. In other words, I, and many of my colleagues, have been concentrating on the "bits about the bits" part of the chain.

Recently, I managed to hack a promising little "bits about the bits" prototype together, something that attempts to address in particular Negroponte's notion of "...bits that describe the narrative with key words..." My approach begins by treating Wikipedia and its articles as a Web-scale collaborative taxonomy or controlled vocabulary. Yes, for these purposes, suspend disbelief and assume Wikipedia is useful fodder for semi-automated categorisation -- whether or not it's a trustworthy or authoritative journalistic resource is an interesting debate, but isn't relevant for the job we want to do here.

My proof-of-concept is based on vacuuming every Wikipedia article into the Lucene open source search engine to build a text categorisation tool prototype. You may find this approach useful in your own "bits about the bits" endeavours.

a bits about the bits recipe...

So, here's the basic recipe I've been using to create a bare bones, proof-of-concept, semi-automated text categoriser -- which I call "conText":

  1. Download the entire contents of the English language Wikipedia (careful, that's a large 4GB+ compressed XML file!)
  2. Parse that compressed XML file into individual text files, one per Wikipedia article (and this makes things much bigger, to the tune of 20GB+, so make sure you've got the hard drive space cleared)
  3. Use a Lucene indexer to create a searchable collection (inc. term vectors) of your new local Wikipedia text files, one Lucene document per Wikipedia article
  4. Use Lucene's MoreLikeThis to compare the similarity of a chunk of your own text content to the Wikipedia documents in your new collection
  5. Treat the ranked Wikipedia articles returned as suggested categories for your text
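Step 2 is where the gigabytes bite. As a rough illustration (in Python rather than the messy Perl I actually used, so treat every name here as hypothetical -- and note that the dump's XML namespace version varies by date, so check it against your own copy), you can stream the dump with an incremental parser so the whole file never has to sit in memory:

```python
import bz2
import os
import re
import xml.etree.ElementTree as ET

# MediaWiki export namespace -- the version number changes between dump
# dates, so check the root element of the copy you downloaded.
NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def extract_articles(dump_path, out_dir):
    """Stream the dump and write one plain-text file per article."""
    os.makedirs(out_dir, exist_ok=True)
    opener = bz2.open if dump_path.endswith(".bz2") else open
    count = 0
    with opener(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title") or ""
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                if title and text:
                    # Keep filenames filesystem-safe
                    safe = re.sub(r"[^\w .-]", "_", title)
                    path = os.path.join(out_dir, safe + ".txt")
                    with open(path, "w", encoding="utf-8") as out:
                        out.write(text)
                    count += 1
                elem.clear()  # free the parsed element so memory stays flat
    return count
```

A real run would also want to skip redirects and non-article namespaces, but that's the shape of it.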

So, that's the very high level recipe I've been working with. You can have a look at the type of results this approach gives at my conText proof-of-concept page.
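To make steps 3-5 concrete without a full Lucene install, here's a toy stand-in (illustrative Python; all names are mine, not Lucene's): rank "articles" against a chunk of text by TF-IDF cosine similarity -- which is essentially what MoreLikeThis does over term vectors, minus its analyzers, stop words, and frequency cut-offs -- and read the best-matching titles as suggested categories.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude analyzer stand-in; Lucene would also handle stop words, stemming etc.
    return re.findall(r"[a-z]+", text.lower())

def build_index(articles):
    """articles: {title: text}. Precompute term counts and document frequencies."""
    counts = {title: Counter(tokenize(body)) for title, body in articles.items()}
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    return counts, df, len(articles)

def _tfidf(term_counts, df, n_docs):
    # Weight each term by frequency x rarity across the collection
    return {t: tf * math.log(1 + n_docs / df.get(t, 1))
            for t, tf in term_counts.items()}

def _cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_categories(text, index, top_n=5):
    """Steps 4-5: treat the best-matching article titles as categories."""
    counts, df, n = index
    query = _tfidf(Counter(tokenize(text)), df, n)
    scored = sorted(((_cosine(query, _tfidf(c, df, n)), title)
                     for title, c in counts.items()), reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]
```

Given a handful of stub articles, a description mentioning Stravinsky's riotous premiere will rank a "Classical music riot" stub top even though that phrase never appears in the submitted text -- the same effect the real results show.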

(If you do want further brutal details, let me know in the comments area below -- but for now, I think I'll leave it to another day to delve into the scary, hacky details of the XML and Lucene search collection document structures, the fields and weightings, the tokenisers used and discarded, the messy Perl, Rails, and Java code -- yes, nastily, all three are involved -- the confidence rankings, the result filtering heuristics, etc.)

the results of the recipe...

The main point of all this is to explore whether the results this recipe coughs up are useful for content categorisation, and whether they might help audiences navigate to and around content. So, let me give you a sense of what this Wikipedia + Lucene approach spits out. Take, for instance, the following descriptive text from the episode 'A Riot' of the Radio 4 programme Making of Music:

A Riot. James Naughtie presents a series chronicling the historical influences that affected the course of classical music. 10/30. A Riot Igor Stravinsky's The Rite of Spring caused a riot in Paris on its premiere in May 1913. Why did it create such an uproar and how did it become an iconic piece? Read by Simon Russell Beale.

Paste that text into the form on the conText proof-of-concept page and hit "Get Wikipedia URIs".

conText text submit page

The results are pretty interesting:

conText results page

The top six results are:

Those first three results seem like very useful categorisations for this episode of Making of Music, don't you agree? The result Classical music riot, in particular, thrills me and shows that the conText categorisation recipe offers something that, say, Yahoo! Term Extraction and Open Calais don't (at least not yet ;-) ). "Classical music riot" is a pretty darn accurate and (dare I say?) insightful category for this content, yet that phrase wasn't actually present in our submitted text. It turns out our content shared enough with the text of the Classical music riot article to come up as a match. Language and words and computers and people are interesting like that.

could these bits about the bits be useful?

How can these results help with the "bits about the bits" challenge, though? Well, for instance, we could use this list of results to create category tags on the episode's page, which, when clicked, take users to a page pulling together all BBC programmes sharing that category (a little like this) -- and which offer an onward journey to a BBC /topics page featuring links to a complete collection of content on that subject. There's plenty of other relevant BBC and Web-wide content available on, say, The Rite of Spring. Let audiences know!

Here's a picture showing the kind of "related Topics badge" that could appear with content, based on categorisation techniques like conText:

Top Gear episode page with a "related Topics badge"

The idea is that a horizontal navigation badge like this can help audiences find more of what they are interested in.

getting even better bits about the bits...

Of course, the results above bring up some pretty obvious questions and potential limitations, too. Why hasn't the system picked out, say, "James Naughtie" and "Simon Russell Beale"? What could we do to improve the result ranking and filtering heuristics so that we minimise what some might call "noisy" results like "Concerto_for_Piano_and_Wind_Instruments_(Stravinsky)"?

And so on -- this is the kind of proof-of-concept that keeps asking questions even while it answers a few. If you have suggestions about all this, let me know in the comments below.

Now, I do have a possible tweak in mind to help address the "proper names" question above, i.e. how to get the system to send back James Naughtie's Wikipedia page as a category even when there's not much supporting text specifically about him.

Imagine running entity extraction on the content text first, then using the entities extracted as "narrow-the-scope guides" for the conText approach. For instance, pop the episode description text we used before into the Open Calais entity extraction service. Then, go back to the conText submit page, use the text as before, but this time, in the box that says "Use the text above to disambiguate a specific concept:", put in "James Naughtie" (imagine the entity extractor supplying it automatically) and submit the form.

conText term disambiguation

This time, the top two results are:

And, the top result's "confidence rank" of 1.6976633 makes it likely that http://en.wikipedia.org/wiki/James_Naughtie is the best page for the specific "James Naughtie" our text refers to.

So, what's the trick/heuristic here? Just throw the extracted text (in this case "James Naughtie") at a Google search constrained to the Wikipedia domain (I should really use Google's or Yahoo!'s search APIs, but haven't bothered yet). This gives you the page-ranked set of Wikipedia pages the Web uses to link to "James Naughtie"-ness. Use this subset of Wikipedia to constrain the conText MoreLikeThis similarity comparison, so that only a page from this set can be a result. More often than not, you can disambiguate very accurately with this approach.
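Sketched in a few lines (illustrative Python again -- in the real pipeline the candidate list comes from that domain-restricted web search, in page-rank order, and the scoring is MoreLikeThis over the restricted Lucene subset; here plain term overlap stands in):

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate(context_text, candidates):
    """Pick the best Wikipedia URL for an extracted entity.

    candidates: {wikipedia_url: article_text} -- the short, page-ranked set a
    web search restricted to en.wikipedia.org for the entity name returned.
    Only pages from this set can win; the surrounding context text breaks
    the tie. Term overlap here is a crude stand-in for MoreLikeThis scoring.
    """
    ctx = tokens(context_text)
    return max(candidates, key=lambda url: len(ctx & tokens(candidates[url])))
```

For the Making of Music episode text, a candidate set containing the broadcaster's article and a namesake's would resolve to the James Naughtie page the text actually means.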

better the bits you know...

And the best part? This disambiguation process gives you back a URL for the category, not just a text label. This URL can be used as a globally unique identifier for the category -- anywhere, on any system. You can use this URL to communicate the category unambiguously -- other people (and software) can simply use the URL to look up a webpage explaining it, to find out what it means and why it's relevant as a category. And this URL-as-category will help the overall user experience of the Web become more and more coherent and relevant, while still allowing the Web to play to its strengths as the crazy, amazing, useful mess we all know and love and which probably none of us could live without anymore. But that's another story.

finally, the bit at the end...

Mr Negroponte's insight into the importance of "the bits about the bits" -- i.e. the context around a TV or radio programme: who made it, who's in it, what it's about, who else has listened to, commented on, blogged about, or emailed it to a friend -- well, reading that today, in my line of work, sends shivers of recognition up and down my geeky spine. And he could see it so clearly, way back in the halcyon days of '95 -- back when I was still trying to decide whether email was a good way to send Beat poems to a potential love interest or just a passing fad (email, that is, not Beat poetry).

The insights are many years old by now, and while the name's the same, do recent phenomena like Wikipedia change the game? Go ahead and have a play with the conText prototype: do you agree with me that Wikipedia + Lucene's MoreLikeThis = useful bits about the bits? Let me know what you make of all this -- do leave a comment on your way home. And if you decide to give this recipe a try in your own app, please get in touch and let me know how it goes. I'd love to hear if this thing's got legs.

Comments

  • Comment number 1.

This is a little bit similar to the Muddy Boots BBC labs commission a friend of mine worked on, which you may or may not be aware of. http://muddyboots.rattleresearch.com/

I'm interested in the gory details, and in particular, given the returned Wikipedia articles, how would you extrapolate the category tags to be used on the episode pages? Yahoo Term Extraction on the Wikipedia title and first paragraph of the article? Or just the whole title? I'd be interested to see how you would tackle that last hurdle.

    But indeed, congratulations, it seems a very promising start!

  • Comment number 2.

thanks for the comment. yes, smuggyuk, the work described here has fed into the Muddy Boots commission -- we've asked them to, among other things, improve my system to the point where, hopefully, it can be run as a realtime, Yahoo! Term Extraction or Open Calais-like API... as you can see, my prototype runs painfully slowly... things look very promising in the Muddy Boots work, so watch this space...

    your question: "how would you extrapolate the category tags to be used on the episode pages?" -- the Wikipedia articles' titles are all unique by definition, so we can use those as "tag" link text on the content page badge. the Wikipedia/dBpedia URI is a key for a bunch of useful info: Category Title, précis, links to other Wikipedia and Web pages, etc... does that answer what you were wondering?
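turning the Wikipedia URI back into link text is mechanical, by the way -- decode the slug and swap underscores for spaces (illustrative Python):

```python
from urllib.parse import unquote

def title_from_uri(uri):
    # e.g. http://en.wikipedia.org/wiki/Classical_music_riot -> "Classical music riot"
    return unquote(uri.rsplit("/", 1)[-1]).replace("_", " ")
```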

  • Comment number 3.

Yeah it does, confirms my thoughts, and glad you've fed into the Muddy Boots guys!

  • Comment number 4.

I am using Lucene for my product search engine (http://www.octoprice.co.uk) and I am very interested in the more like this feature. Has anybody found a simple way to implement it?

    Thanks...


  • Comment number 6.

Interesting stuff... we've done similar experiments, for example classifying 100 million webpages into categories using Wikipedia as a rough taxonomy, and building a system that tells you which movie a movie review is talking about.

    We've used a different framework though: Xapian (www.xapian.org) - open source like Lucene, but C/C++ with bindings to other languages (including Ruby, Perl etc.). Xapian also does more-like-this and we suspect would perform better in this situation.

    The MySociety guys are using Xapian quite a bit, as are Mydeco (which octoprice might be interested in).

  • Comment number 7.

I tried a few terms (about a Wikipedia article I had just been editing), so I knew the context in advance (that is, I had deep knowledge of it).

    The system came up with a very closely related set of terms. Very neat!

    How much does this rely on the Wikipedia "network of links"? And how much on the "bits of bits"? Or are they the same thing?

My next task is to try to get some spurious URIs!

 

BBC © 2014