
The World Cup and a call to action around Linked Data

John O'Donovan | 13:37 UK time, Friday, 9 July 2010

Underneath the surface of the BBC World Cup web site there is a revolution going on in the technology and workflow being used to manage and publish our content. To some extent we have been doing this in stealth mode as we figure out a lot of challenges, but as we approach the World Cup Final we'd like to explain more about what we have changed and why this is an important engagement for us in the development of the Semantic Web and support for the use of Linked Data.

For some time, we have been working on utilising Metadata and Linked Data to organise and manage the site dynamically, culminating in the World Cup 2010 site which uses Linked Data to manage how content is published. We have also had some News Linked Data discussions with other news organisations thinking about how to bring a critical mass to the development of the Semantic Web and what benefits it can bring.

The World Cup site is our first major statement on how we think this can work for mass market media and a showcase for the benefits it brings.

First some background on the World Cup site.

The World Cup site is a large site with over 700 aggregation pages (called index pages) designed to lead you on to the thousands of story pages and content which make up the whole site. Examples of index pages range from the Groups and Fixtures page through to detailed pages for each team or player.

Normally, managing this many index pages would not be possible, as each one needs an editor to curate it, either by setting up automation rules or by keeping it up to date with the latest stories and information. To put the scale of this task in perspective, the World Cup site has more index pages than the rest of the Sport site!

So how is this possible? Clearly some form of automation is required, but search technologies and previous methods for doing this have proven to be inaccurate and there is no point in having all these pages if the quality of them is perceived to be low. You don't want to get content mixed up between different players with the same surname, for example.

The key change is that we are using advanced methods for analysing content and tagging it with precise metadata linked to uniquely identified concepts (a concept usually being a person, place or thing). In the case of the World Cup we are interested in players, teams, matches and so on, but the principle can easily be applied to anything. To do this we are using technology from IBM (Languageware) and Ontotext (BigOWLIM). A high level view of the process is shown in Fig 1, and we will be following up this post with more details about how this all works.
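To make the idea of tagging against uniquely identified concepts a little more concrete, here is a minimal sketch in plain Python. It is not the production system: the URIs, player names, team names and predicate names are all hypothetical, and a simple list of tuples stands in for a real triple store. It only illustrates why linking a story to a concept identifier, rather than to a surname, keeps two players called Smith apart.

```python
# Hypothetical identifiers for illustration only; the real concept
# URIs and predicates used on the World Cup site are not given here.
BASE = "http://example.org/concept/"

# Each entry is a (subject, predicate, object) triple, as in RDF.
triples = [
    (BASE + "player/jo-smith", "label", "Smith"),
    (BASE + "player/jo-smith", "playsFor", BASE + "team/reds"),
    (BASE + "player/sam-smith", "label", "Smith"),
    (BASE + "player/sam-smith", "playsFor", BASE + "team/blues"),
    ("http://example.org/story/1", "about", BASE + "player/jo-smith"),
    ("http://example.org/story/2", "about", BASE + "player/sam-smith"),
]


def concepts_with_label(label, store):
    """Return every concept URI that carries the given text label."""
    return sorted(s for (s, p, o) in store if p == "label" and o == label)


def stories_about(concept_uri, store):
    """Return the stories tagged with exactly this concept URI."""
    return [s for (s, p, o) in store if p == "about" and o == concept_uri]


# The surname alone is ambiguous: it matches two distinct concepts.
ambiguous = concepts_with_label("Smith", triples)

# The concept URI is not: each player gets only their own stories.
jo_stories = stories_about(BASE + "player/jo-smith", triples)
```

A text search for "Smith" would return both stories, whereas a lookup by concept identifier returns only the one tagged with that player.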


Pushing the Boundaries

Though there are lots of dynamically published sites on the internet, the difference here is in the use of RDF and Linked Data to build and manage the site. This is incredibly flexible and we are only just starting to explore the possibilities of how this allows us to present and share content. Though we have been using RDF and linked data on some other sites (such as BBC Programmes, BBC Wildlife finder, Winter Olympics) we believe this is the first large scale, mass media site to be using concept extraction, RDF and a Triple store to deliver content.

Another way to think about all this is that we are not publishing pages, but publishing content as assets which are then organised by the metadata dynamically into pages, and which could be re-organised into any format we want much more easily than we could before.
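As a rough illustration of "assets organised by metadata into pages", the sketch below groups a handful of content assets by their tags, so that each tag yields one index page. The asset IDs, tag names and the `build_index_pages` helper are assumptions made up for this example, not the BBC's actual data model or code.

```python
from collections import defaultdict

# Hypothetical assets: each is published once and carries concept tags.
assets = [
    {"id": "story-1", "tags": ["team/reds", "match/final"]},
    {"id": "story-2", "tags": ["team/blues", "match/final"]},
    {"id": "story-3", "tags": ["team/reds"]},
]


def build_index_pages(items):
    """Group asset IDs by tag; each tag becomes one index page."""
    pages = defaultdict(list)
    for item in items:
        for tag in item["tags"]:
            pages[tag].append(item["id"])
    return dict(pages)


index_pages = build_index_pages(assets)
```

Nobody curates the "team/reds" page by hand here: it exists because two assets carry that tag, and it updates as soon as another tagged asset is published. Re-organising the same assets into a different format would only mean grouping on different metadata.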


So why is this important?

The principles behind this are the ones at the foundation of the next phase of the internet, sometimes called the Semantic Web, sometimes called Web 3.0. The goal is to be able to more easily and accurately aggregate content, find it and share it across many sources. From these simple relationships and building blocks you can dynamically build up incredibly rich sites and navigation on any platform.

There is also a change in editorial workflow for creating content and managing the site. The workflow changes from one where you publish stories and index pages to one where you publish content and check that the suggested tags are correct. The index pages are then published automatically. This checking step is what assures us of the highest quality output, while still saving large amounts of time in managing the site and making it possible for us to run so many pages for the World Cup efficiently.

To make all this possible there has been fantastic support from the Sport team, engaging with new tools and workflows. We are all looking forward to the London Olympics, where there will be over 12,000 athletes and index pages to manage. Without this type of technology, we would not be able to showcase and maximise all the content we have.

A call to action

We'd like to engage further in the development of Linked Data and feel we have a role to play in supporting this important new view of how content is published and shared. The methods talked about here will become the basis for more and more of our content publishing and we fully appreciate the work many people are doing in this area to make this possible.

There is a vision for the future here with more time spent on creating and sharing content and less on managing it. However we have had to overcome many problems in getting this far and many of these issues are related to organising and cleaning up data. Due to all the technical and data challenges we have not yet been able to expose all our data as RDF, for example, though we will start doing this soon.

As more content has Linked Data principles applied to it (as outlined here), these problems will become less significant and the vision of a Semantic Web moves closer. Importantly, what we have been able to show with the World Cup is that the technology behind this is ready to deliver large scale products.

This is more than just a technical exercise. We have delivered real benefits back to the business, as well as establishing a future model for more dynamic publishing which we think will allow us to make best use of our content. It will also let us use Linked Data to share this content more accurately and link out to other sites and content, a key goal for the BBC.

We look forward to seeing the use of Linked Data grow as we move towards a more Semantic Web.

John O'Donovan is Chief Technical Architect, Journalism and Knowledge, BBC Future Media & Technology. Read the follow up post, BBC World Cup 2010 dynamic semantic publishing on the Internet blog.


  • Comment number 1.

    Great summary John and congratulations to all concerned; this is an excellent effort.


  • Comment number 2.

    Thank you BBC. You are pioneers for making new tech mass market. I think RSS, Podcasts, Online Video, Backstage and much more. This move to support the semantic web and RDFa is massive.

    "publishing content as assets which are then organised by the metadata dynamically into pages"

    I would love to see more posts and detail about how this has helped the BBC.

  • Comment number 3.

    What about microformats?

    hCard for teams/ players and (with Geo) venues
    hCalendar for fixtures

  • Comment number 4.

    John, thanks for describing the content-publishing technology and workflow. This could make for a great presentation at a conference I'm organizing... if you can make it to New York. The Web site doesn't formally launch until Monday, but check it out: https://smartcontentconference.com . Drop me a note or submit a speaking proposal if you'd be interested in speaking.

    Seth, [Personal details removed by Moderator]

  • Comment number 5.

    This article is great, it's good to see the BBC embracing metadata so well. Have you looked at sharing the data via an open content database like Freebase (https://freebase.com)?

    My only complaint is your perpetuation of the myth of a "versioned" Internet. Please never say "Web 3.0" again ;)

  • Comment number 6.

    I appreciate BBC's efforts to use the Semantic web technologies. There is a need for players like BBC to leverage this excellent technology. It is helping the creation of the next web.
    Though it is actually helping organisations with large amount of data to benefit from it in the back end, the end users are still not having any perceivable benefits yet. It will be great if some applications are created which will clearly demonstrate the power of the Semantic Technologies to the end user. Which will be a turning point. I hope BBC team can play a big role in that.
    Cheers and best wishes for the efforts.

  • Comment number 7.

    Samsethi: There are more details being published today on how this works

    Andy Mabbett: We are more focused on RDFa at the moment though would consider microformats if there was enough demand. We still have a lot of work to do to expose the data.

    Gareth Adams: I won't mention Web 3.0 again. Oh, whoops...

  • Comment number 8.

    There's a follow up post that's just been added on the blog BBC World Cup 2010 dynamic semantic publishing that you may also be interested in.

  • Comment number 9.

    Great, pioneering work from your teams! Congratulations for showing what is possible with semantics. It shows that you can gain business benefits from applying these technologies and paradigms. Would love to talk more with the team. We are doing similar work with publishers, media companies, corporates etc...

  • Comment number 10.

    I'm not sure I fully understand the topic being discussed, I have some knowledge about microformats and I can understand the use of creating a semantically described web, but I followed through to this page via a BBC blog link which talked about essentially doing away with urls, but I don't see how that is really possible.

    In any case, the semantic web is going to be the next general evolution and having sites like the BBC getting behind the effort will ensure that process is driven forward faster.

