The World Cup and a call to action around Linked Data
Underneath the surface of the BBC World Cup web site there is a revolution going on in the technology and workflow being used to manage and publish our content. To some extent we have been doing this in stealth mode as we figure out a lot of challenges, but as we approach the World Cup Final we'd like to explain some more about what we have changed and why this is an important engagement for us in the development of the Semantic Web and support for the use of Linked Data.
For some time, we have been working on utilising Metadata and Linked Data to organise and manage the site dynamically, culminating in the World Cup 2010 site which uses Linked Data to manage how content is published. We have also had some News Linked Data discussions with other news organisations thinking about how to bring a critical mass to the development of the Semantic Web and what benefits it can bring.
The World Cup site is our first major statement on how we think this can work for mass market media and a showcase for the benefits it brings.
First some background on the World Cup site.
The World Cup site is a large site with over 700 aggregation pages (called index pages) designed to lead you on to the thousands of story pages and content which make up the whole site. Examples of index pages range from the Groups and Fixtures page through to detailed pages for each team or player.
Normally, managing all these index pages for the World Cup would not be possible, as each one needs to be curated by an editor who sets up automation rules or keeps it up to date with the latest stories and information. To put the scale of this task in perspective, the World Cup site has more index pages than the rest of the Sport site!
So how is this possible? Clearly some form of automation is required, but search technologies and previous methods for doing this have proven to be inaccurate and there is no point in having all these pages if the quality of them is perceived to be low. You don't want to get content mixed up between different players with the same surname, for example.
The key change is that we are using advanced methods for analysing content and deciding how to tag it with precise metadata linked to uniquely identified concepts (a concept usually being a person, place or thing). In the case of the World Cup we are interested in players, teams, matches and so on, but the principle can easily be applied to anything. To do this we are using technology from IBM (Languageware) and Ontotext (BigOWLIM). A high level view of the process is shown in Fig 1, and we will be following up this post with more details about how this all works.
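To make the idea of tagging against uniquely identified concepts concrete, here is a minimal sketch in Python. It is not how Languageware or BigOWLIM work; the concept URIs, player names, team names and the `suggest_tags` helper are all invented for illustration. It only shows why linking a story to a concept identifier, rather than to a surname, keeps two players called Smith apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A uniquely identified thing (hypothetical URIs, not real BBC identifiers)."""
    uri: str
    label: str
    team: str


# Two invented players who share a surname but have different identifiers.
CONCEPTS = [
    Concept("http://example.org/concept/player-1", "A. Smith", "Team A"),
    Concept("http://example.org/concept/player-2", "B. Smith", "Team B"),
]


def suggest_tags(text, concepts):
    """Suggest concept URIs for a story.

    Deliberately naive rule: the player's surname and the player's team
    must both appear in the text. Real concept extraction is far richer.
    """
    tags = []
    for concept in concepts:
        surname = concept.label.split()[-1]
        if surname in text and concept.team in text:
            tags.append(concept.uri)
    return tags


story_uri = "http://example.org/story/1"  # hypothetical story identifier
story_text = "Smith scored twice as Team B won their opening match."

tags = suggest_tags(story_text, CONCEPTS)

# Each tag becomes a simple (subject, predicate, object) statement.
triples = [(story_uri, "about", concept_uri) for concept_uri in tags]
```

Here the story is linked only to the second player, because the team mentioned in the text rules out the first one. An editor would still check the suggestion before it is accepted.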
Pushing the Boundaries
Though there are lots of dynamically published sites on the internet, the difference here is in the use of RDF and Linked Data to build and manage the site. This is incredibly flexible and we are only just starting to explore the possibilities of how this allows us to present and share content. Though we have been using RDF and Linked Data on some other sites (such as BBC Programmes, BBC Wildlife Finder and the Winter Olympics), we believe this is the first large scale, mass media site to be using concept extraction, RDF and a triple store to deliver content.
Another way to think about all this is that we are not publishing pages, but publishing content as assets. These assets are organised dynamically by the metadata into pages, and could be re-organised into any format we want much more easily than we could before.
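As a rough illustration of assets being organised into pages by their metadata, the sketch below groups a handful of invented assets by the concepts they are tagged with. The asset names, the `concept:` identifiers and the `build_index_pages` function are assumptions made for this example; a real system would query a triple store rather than a Python list.

```python
from collections import defaultdict

# Hypothetical (asset, predicate, concept) statements standing in for a triple store.
TRIPLES = [
    ("story-1", "about", "concept:team-a"),
    ("story-1", "about", "concept:player-1"),
    ("story-2", "about", "concept:team-a"),
    ("video-1", "about", "concept:player-1"),
]


def build_index_pages(triples):
    """Group assets by the concept they are about, one index page per concept."""
    pages = defaultdict(list)
    for asset, predicate, concept in triples:
        if predicate == "about":
            pages[concept].append(asset)
    return dict(pages)


pages = build_index_pages(TRIPLES)

for concept, assets in pages.items():
    print(concept, assets)
```

No page is written by hand here: adding one more tagged asset to the list changes the relevant index pages the next time they are built, and the same statements could just as easily be grouped in another way for a different format.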
So why is this important?
The principles behind this are the ones at the foundation of the next phase of the internet, sometimes called the Semantic Web, sometimes called Web 3.0. The goal is to be able to more easily and accurately aggregate content, find it and share it across many sources. From these simple relationships and building blocks you can dynamically build up incredibly rich sites and navigation on any platform.
There is also a change in the editorial workflow for creating content and managing the site. This changes from one where you publish stories and index pages to one where you publish content and check that the suggested tags are correct. The index pages are published automatically. This process is what assures us of the highest quality output, while still saving large amounts of time in managing the site and making it possible for us to run so many pages for the World Cup efficiently.
To make all this possible there has been fantastic support from the Sport team, engaging with new tools and workflows. We are all looking forward to the London Olympics, where there will be over 12,000 athletes and index pages to manage. Without this type of technology, we would not be able to showcase and maximise all the content we have.
A call to action
We'd like to engage further in the development of Linked Data and feel we have a role to play in supporting this important new view of how content is published and shared. The methods talked about here will become the basis for more and more of our content publishing and we fully appreciate the work many people are doing in this area to make this possible.
There is a vision for the future here with more time spent on creating and sharing content and less on managing it. However, we have had to overcome many problems in getting this far, and many of these issues are related to organising and cleaning up data. Due to all the technical and data challenges we have not yet been able to expose all our data as RDF, for example, though we will start doing this soon.
As more content has Linked Data principles applied to it (as outlined here), these problems will become less significant and the vision of a Semantic Web moves closer. Importantly, what we have been able to show with the World Cup is that the technology behind this is ready to deliver large scale products.
This is more than just a technical exercise - we have delivered real benefits back to the business as well as establishing a future model for more dynamic publishing. We think this model will allow us to make best use of our content, and also to use Linked Data to share this content more accurately and link out to other sites and content, a key goal for the BBC.
We look forward to seeing the use of Linked Data grow as we move towards a more Semantic Web.
John O'Donovan is Chief Technical Architect, Journalism and Knowledge, BBC Future Media & Technology. Read the follow up post, BBC World Cup 2010 dynamic semantic publishing on the Internet blog.