Genome: behind the scenes

Andy Armstrong

Internet Engineer

In October 2011 Helen Papadopoulos wrote about the Genome project - a mammoth effort to digitise an issue of the Radio Times from every week between 1923 and 2009 and make searchable programme listings available online.

Helen expected there to be between 3 and 3.5 million programme entries. Since then the number has grown to 4,423,653 programmes from 4,469 issues. You can now browse and search all of them at http://genome.ch.bbc.co.uk/

Back in 2011 the process of digitising the scanned magazines was well advanced and our thoughts were turning to how to present the archive online. It's taken three years and a few prototypes to get us to our first public release.

The first edition of the Radio Times

Jake Berger and Hilary Bishop have written on the About the BBC blog about the Genesis of Genome but I know some of you share our fascination with the technicalities of projects such as this, so I'm going to give you some of the gritty behind-the-scenes details.

The web site is hosted on Linux servers. We have database servers running MySQL and Sphinx Search, and application servers that run Apache and a web application written in Perl using Dancer and Template Toolkit. In front of the application servers, a layer of Varnish cache servers helps to handle the load.

Along the way we've used a lot of other Open Source software to index, catalogue and transform data, scale images and automate development processes. As is often the case with projects undertaken at the BBC, we couldn't have done it without the work of countless Open Source developers and, as ever, we are extremely grateful to them for making our work possible.

The web application was designed from the outset to allow speedy browsing and searching of over four million records so we always had an eye on performance when making technical choices. Even so the sheer size of the data was more of a problem than we expected. Until we switched our development servers to using solid state disks (SSD) it took over a week just to load the database into MySQL. Even after the switch to SSDs a complete database load would take more than twenty-four hours. We quickly learnt that "we'll just load up another copy of the database for testing" was not an option.

Once the data is loaded it's mostly unchanging, which helps a little with caching. However the data set is so large that we can't realistically cache all of it. You may notice that some pages of listings take longer to load than others. If that's the case (and assuming it's not too frustrating for you) you can congratulate yourself for finding a bit of the archive that nobody else has looked at recently - if they had it would be 'hot' in the cache and served to you quickly.
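To make the 'hot' and 'cold' idea concrete, here is a minimal sketch of a least-recently-used page cache in Python. It is an illustration only, not how Genome's Varnish layer is configured: the class name, the `render_listing` function and the URL paths are all made up for the example, and the capacity of two pages is absurdly small so that eviction is easy to see.

```python
from collections import OrderedDict


class LRUPageCache:
    """Tiny least-recently-used cache; a stand-in for a real HTTP cache."""

    def __init__(self, capacity, render):
        self.capacity = capacity      # maximum number of pages kept in memory
        self.render = render          # slow function that builds a page
        self.pages = OrderedDict()    # path -> rendered page, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, path):
        if path in self.pages:
            self.pages.move_to_end(path)    # mark as most recently used
            self.hits += 1
            return self.pages[path]
        self.misses += 1
        page = self.render(path)            # cold: pay the full cost
        self.pages[path] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page


def render_listing(path):
    # Hypothetical placeholder for the database query and templating step.
    return f"<html>listings for {path}</html>"


cache = LRUPageCache(capacity=2, render=render_listing)
cache.get("/schedules/1963-11-23")   # miss: nobody has asked for it yet
cache.get("/schedules/1963-11-23")   # hit: the page is now hot
cache.get("/schedules/1923-09-28")   # miss
cache.get("/schedules/1977-06-07")   # miss, and the 1963 page is evicted
cache.get("/schedules/1963-11-23")   # miss again: the page has gone cold
```

With four million records and a cache that can hold only a fraction of them, the rarely visited corners of the archive behave like the last request above.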

One of the most common ways to use the archive will be the search box; we expect a lot of searching and, naturally, the search queries are unpredictable (although we guess that a lot of people are going to search for "Doctor Who"). The unpredictability of the search terms means that we can't cache search results at all. We've done quite a bit of performance testing, but one of the reasons for the "Beta" badge that the site is currently wearing is that we couldn't really tell how well it would perform until it went live.

Radio Times covers through the years

As I said the data is mostly unchanging. However we know the data contains errors and we're asking for your help to find and fix them. It's possible to submit changes to any programme listing and add any additional comments you may have in the "Tell Us More" box.

When an edit is submitted it shows up in our admin interface within 10 seconds. From there, if approved, it is applied to the live site. In an extreme case it could be only 30 seconds between submitting an edit and having it accepted and visible. Normally though, the editors who process the incoming changes will batch them up, maybe working through the queue a couple of times a day (they have other things to do too).

There's a synchronisation protocol that is responsible for collecting incoming user edits from the front-end web servers and relaying them back to the admin system. When changes have been approved the admin system sends them back to the front-end servers, each of which updates its own copy of the data. To preserve the integrity of the data each change is logged in the database and can, if necessary, be undone.
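For the curious, the "log every change so it can be undone" idea can be sketched in a few lines. This is an assumption-laden illustration rather than the Genome schema: it uses Python and an in-memory SQLite database in place of MySQL, and the table names, column names and the `apply_edit` and `undo_edit` functions are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE programme (id INTEGER PRIMARY KEY, synopsis TEXT);
    CREATE TABLE change_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        programme_id INTEGER NOT NULL,
        old_text TEXT NOT NULL,
        new_text TEXT NOT NULL,
        undone INTEGER NOT NULL DEFAULT 0
    );
""")
# A listing with an OCR-style error in it (made-up sample data).
db.execute("INSERT INTO programme VALUES (1, 'Dcotor Who: An Unearthly Child')")


def apply_edit(programme_id, new_text):
    """Apply an approved edit and log old and new text in one transaction."""
    with db:  # commits on success, rolls back if anything raises
        (old_text,) = db.execute(
            "SELECT synopsis FROM programme WHERE id = ?", (programme_id,)
        ).fetchone()
        db.execute(
            "UPDATE programme SET synopsis = ? WHERE id = ?",
            (new_text, programme_id),
        )
        cur = db.execute(
            "INSERT INTO change_log (programme_id, old_text, new_text) "
            "VALUES (?, ?, ?)",
            (programme_id, old_text, new_text),
        )
        return cur.lastrowid


def undo_edit(change_id):
    """Restore the text recorded before a change and mark the change undone."""
    with db:
        row = db.execute(
            "SELECT programme_id, old_text FROM change_log "
            "WHERE id = ? AND undone = 0",
            (change_id,),
        ).fetchone()
        if row is None:
            raise ValueError(f"no active change with id {change_id}")
        programme_id, old_text = row
        db.execute(
            "UPDATE programme SET synopsis = ? WHERE id = ?",
            (old_text, programme_id),
        )
        db.execute("UPDATE change_log SET undone = 1 WHERE id = ?", (change_id,))


change_id = apply_edit(1, "Doctor Who: An Unearthly Child")
undo_edit(change_id)  # the listing is back to its original text
```

Keeping the old text alongside the new text is what makes the undo possible without consulting a backup. This sketch only handles undoing the most recent change to a listing safely; a real system would also need to decide what happens when later edits depend on an earlier one.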

One of the things we hope to be able to do in future releases is to make it possible to edit more than just the programme text. We know, for example, that some programmes are currently showing up on the wrong days due to OCR errors on the dates.

We also have work to do to make the site even more accessible to screen readers and other assistive technologies. And smartphone users may notice certain, er, challenges with the current version of the site (it's better in landscape than in portrait). Sorry. It's on the list. We're working on it.

It may seem odd, but until this data was digitised, the BBC had no comprehensive record of day-to-day broadcast history in a searchable, machine-readable form. One of the most exciting things about the data is the possibility of linking it to other archives that the BBC and other organisations have. It would, for example, be fascinating to see the covers of contemporaneous newspapers alongside the programme listings, or trade links with related Wikipedia articles.

In addition to helping us by editing erroneous programme pages and, of course, testing the site in its Beta form, we would also love to hear from anyone with interesting ideas for matching and combining this data with other databases and resources. Some of us have been staring at this data for years; you're bound to be able to think of something exciting to do with it that we've lost sight of.

And if you find bugs please accept our apologies and let us know about them too.

I must mention some of the people without whom this web version of the BBC's broadcast history would not have been possible: Nick Clement and Richard Sullivan created the visual design; Sam Urquhart built the web front end; Mo McRoberts built the first prototype of the web site and all of the infrastructure that the site now runs on. Ian Pouncey gave us invaluable insight and practical advice on accessibility. Any remaining shortcomings are there because we haven't yet had time to implement all his recommendations.

We all hope you enjoy using Genome to explore the BBC's history.

Andy Armstrong is Internet Engineer, Archive Development