BBC Topics: How It Works
The point of the Topic Pages is that they bring together content from all around bbc.co.uk. Obviously, many different systems produce all that content, and in general they don't tend to share content very well. Our challenge was to build a platform that could make sense of the different interfaces to those systems to make sharing that content easier.
The first thing to note is that the Topic Pages themselves are dynamic, unlike the vast majority of pages on bbc.co.uk. Essentially, this means that the HTML of the page isn't stored as a physical file on a hard disk, but is instead built when the page is requested.
This is done by the "Page Assembly Layer" or "PAL", a brand new component written in the PHP programming language. In the future, the intention is that most pages on bbc.co.uk will be produced dynamically using the PHP layer, and the Topic Pages system has blazed a trail, being the first released on this new platform.
The PAL itself does a fairly simple job, in principle. First, it receives a request for a Topic Page; it then looks up which modules (i.e. the different blocks of content on the page, such as BBC News, Programmes and Weather) it needs to build that page; it grabs all those modules, which originate on various different systems; and finally it assembles them before returning the page to the user.
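The four steps above can be sketched in a few lines. This is only an illustration, written in Python rather than the PHP that the PAL actually uses; the names PAGE_MODULES, assemble_page and fake_fetch, and the list of modules per topic, are all made up for the example.

```python
from typing import Callable, Dict, List

# Hypothetical configuration: which modules make up each Topic Page.
PAGE_MODULES: Dict[str, List[str]] = {
    "fashion": ["news", "programmes"],
    "france": ["news", "programmes", "weather"],
}


def assemble_page(topic: str, fetch_module: Callable[[str, str], str]) -> str:
    """Build the HTML for one Topic Page at request time (step 1 is the call itself)."""
    # Step 2: look up which modules the page needs.
    module_names = PAGE_MODULES[topic]
    # Step 3: grab each module from wherever it originates.
    fragments = [fetch_module(name, topic) for name in module_names]
    # Step 4: assemble the fragments into a single page.
    body = "\n".join(fragments)
    return f"<html><body>\n{body}\n</body></html>"


def fake_fetch(name: str, topic: str) -> str:
    """Stand-in for a real module request; returns a placeholder fragment."""
    return f'<div class="{name}">{name} for {topic}</div>'


page = assemble_page("france", fake_fetch)
print(page)
```

Passing the fetch function in as an argument mirrors the point that the page builder holds no content of its own: it only knows which modules to ask for.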
The really important part here is that the PAL is grabbing all the useful content dynamically, and not storing any content itself (apart from a bit of caching, to help smooth out any spikes in load). This means that the PAL is a really generic system that can be used for building other sorts of pages as well.
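The "bit of caching" could look something like the short-lived cache below. It is a sketch under assumptions, not the PAL's real caching code: the TTLCache class, the 30-second lifetime and the cache key are invented for illustration, and the clock is injectable only so the example can be run without waiting.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Keep fetched modules for a short time so load spikes hit the cache."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], str]) -> str:
        now = self.clock()
        entry = self._store.get(key)
        # Serve the stored copy while it is younger than the TTL.
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        # Otherwise go back to the source system and remember the result.
        value = fetch()
        self._store[key] = (now, value)
        return value


calls: List[int] = []


def slow_fetch() -> str:
    """Stand-in for a request to an underlying system; counts its calls."""
    calls.append(1)
    return "<div>news</div>"


fake_now = [0.0]  # hypothetical clock we can move forward by hand
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.get_or_fetch("news/fashion", slow_fetch)  # fetched from the source
cache.get_or_fetch("news/fashion", slow_fetch)  # served from the cache
fake_now[0] = 31.0
cache.get_or_fetch("news/fashion", slow_fetch)  # expired, fetched again
```

Because entries expire quickly, the page still reflects fresh content, while a burst of requests inside the lifetime costs only one call to the source system.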
The PAL actually requests the various content modules from another component, which itself routes these requests on to the underlying systems. This Module Routing system is implemented in Apache Cocoon, an open source framework, released by the Apache Software Foundation. This way, the PAL can access content through a simple, uniform interface (which is based on REST principles, for those who are into such things), rather than having to deal directly with many complex interfaces to multiple systems.
This gives us two big benefits. First, if we want to change how a particular module is implemented, we can just reconfigure the Module Routing, and don't have to alter the PAL (which we don't want to tinker with too much, as it is busy serving pages); second, it makes it easier for any other system, or other page on bbc.co.uk, to reuse the modules. This model also makes it easier to add new content modules to the PAL, as most of the logic about how to access a new module can be handled in the Module Routing layer.
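The idea of a routing layer behind one uniform interface can be shown with a simple lookup table. The real Module Routing is configured in Apache Cocoon, so treat this Python version as a stand-in: the path layout "modules/<module>/<topic>", the ROUTES table and the two backend functions are assumptions made for the example.

```python
from typing import Callable, Dict


def search_backend(module: str, topic: str) -> str:
    """Hypothetical backend: a module built from the Search indexes."""
    return f"search-index result: {module}/{topic}"


def weather_backend(module: str, topic: str) -> str:
    """Hypothetical backend: a module served by a separate weather system."""
    return f"weather-system result: {topic}"


# The routing table is the only place that knows which system serves a module.
ROUTES: Dict[str, Callable[[str, str], str]] = {
    "news": search_backend,
    "programmes": search_backend,
    "weather": weather_backend,
}


def get_module(path: str) -> str:
    """Resolve a uniform path such as 'modules/news/fashion' to a backend call."""
    prefix, module, topic = path.strip("/").split("/")
    if prefix != "modules" or module not in ROUTES:
        raise KeyError(f"no route for {path!r}")
    return ROUTES[module](module, topic)


print(get_module("/modules/news/fashion"))
print(get_module("/modules/weather/france"))
```

Changing how a module is implemented means editing one entry in ROUTES; callers of get_module, such as a page builder, are unaffected.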
As Matt said, "Topics are automatically updated web pages, each one covering a different person, country or subject." We wanted to present content from across the various departments within the BBC. The Search Engines are the one place where all that content from bbc.co.uk (well, most of that content) is brought together (nearly 2 million documents from www.bbc.co.uk and the News and Sport sites). So it makes a lot of sense to create many of the modules from the Search indexes (the News module and the Programmes module, for example, are both created like this). This is done using elaborate search queries, which are a bit fancier than the one- or two-keyword queries which the Search Engines normally receive - the query for the Fashion topic, for example, contains 364 terms! These queries are built using a variety of techniques by the Search Editorial Team.
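To give a feel for what a many-term query looks like, here is a small sketch that joins a list of terms into one query string. The OR-and-quotes syntax, the function name build_topic_query and the sample terms are assumptions for illustration; the real query syntax and the techniques the editorial team uses are not shown here.

```python
from typing import Iterable, List


def build_topic_query(terms: Iterable[str]) -> str:
    """Join editorial terms into a single OR query, quoting multi-word phrases."""
    parts: List[str] = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # ignore blank entries in the term list
        parts.append(f'"{term}"' if " " in term else term)
    return " OR ".join(parts)


# A made-up three-term list; a real topic may use hundreds of terms.
query = build_topic_query(["fashion", "catwalk", "london fashion week"])
print(query)
```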
We also use content from sources other than the Search indexes - for example, the Weather modules, and the information boxes about countries, which come from the BBC News website. In order to make it as easy as possible to share and reuse these content modules, we are attempting to create some standards for the formats in which the data is passed around (rather than just allowing every system to specify its own format).
Having looked around a bit, we decided to base our data structure on the Atom format, which was originally created to describe feeds from Blogs. Atom is a rich XML format, and one of its most attractive features is that you can add your own data elements. This means we can have a standard container for our data, but also include extra elements where appropriate (such as in the weather modules, to denote temperatures, for example).
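As an illustration of a standard Atom container carrying extra elements, the sketch below builds an Atom entry with a temperature element in a separate namespace. It is not the actual BBC Module XML: the extension namespace URI, the element names (temperature, max, min), the id scheme and the placeholder timestamp are all invented for the example. Only the Atom namespace itself is the real one.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"      # the standard Atom namespace
EXT_NS = "http://example.org/ns/module-ext"  # hypothetical extension namespace

# Choose readable prefixes for the serialised output.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("ext", EXT_NS)


def weather_entry(place: str, max_c: int, min_c: int) -> ET.Element:
    """Build an Atom entry that also carries made-up temperature elements."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    # Standard Atom elements: the common container every module shares.
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = f"Weather for {place}"
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = f"urn:example:weather:{place.lower()}"
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = "2008-01-01T00:00:00Z"  # placeholder
    # Extension elements: module-specific data in its own namespace.
    temp = ET.SubElement(entry, f"{{{EXT_NS}}}temperature", {"unit": "celsius"})
    ET.SubElement(temp, f"{{{EXT_NS}}}max").text = str(max_c)
    ET.SubElement(temp, f"{{{EXT_NS}}}min").text = str(min_c)
    return entry


entry = weather_entry("Paris", 18, 9)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

A consumer that only understands Atom can still read the title, id and updated fields and ignore the extension elements, which is what makes the shared container useful.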
This approach is similar to Google's GData, which is also based on Atom. We have a prototype of this XML format - called BBC Module XML - in use in the Topic Pages, and in the next few months we will be looking at refining and improving it, and hopefully making it useful for other systems across bbc.co.uk. We use XSLT to transform the XML into the nice XHTML modules which are placed on the page.
The only parts that I haven't touched upon so far are the admin systems for creating and maintaining the pages and also the modules that go on them. At the moment, these admin systems aren't as joined up as they should be (and are implemented in different programming languages in some cases). This is a headache for the editorial team that maintain the Topic Pages. Our next priority is figuring out how we can bring those systems closer together, and generally improve the workflow for the editorial staff, so that they can easily add more Topic Pages.
Once we've made their lives a little easier, we will look at more feature enhancements. These include providing RSS feeds of the Topic Pages so that people can more easily stay up to date with their favourite topics. Additionally, we want to improve our systems for sharing metadata, so that it will be easier to automatically link to relevant Topic Pages from other pages on bbc.co.uk. And we will also add more types of content modules, to increase the range of content on the Topic Pages.
As you hopefully can see, the Topic Page project was pretty complex, and involved creating many new systems. Wherever possible, we have developed those systems to be generic and extensible, to provide not just Topic Pages, but also a platform for sharing and reusing content, and building other products in the future. This has all been possible thanks to some fantastic, far-sighted and occasionally frenetic work from the BBC Search Team and colleagues in FM&T Journalism and FM&T Internet - thanks to all of them.
N.B. More information on /topics can be found here.
Stephen Betts is Search Technical Team Leader, BBC Future Media & Technology