BBC Search refresh: in-depth view
Hi. I'd like to follow up Matt's post about our new Search pages with a little more explanation of what's going on behind the scenes. This is my first post here - and I'll admit a tendency to witter on about stuff - so please bear with me!
The Search & Navigation team has been responsible for the systems that drive iPlayer's search results page since iPlayer launched, but for the new TV and Radio search feature we wanted to include data about all programmes, not just those that are currently on iPlayer.
Full details of all programmes have been available on the Programmes site for a while, but they've never been searchable before. To enable this new feature, we're supplementing the existing iPlayer data feeds with a new feed from the Programme Information database, PIPs.
PIPs is only available on the BBC's internal network, so we have to replicate all the information we need for Search to our servers in the public-facing "content network".
We poll the "changes" list on PIPs 24 hours a day using a chunky piece of XSLT 2.0 code and push the data into our "Media" Autonomy server farm.
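For readers who like to see the shape of this kind of replication, here is a minimal sketch of applying a list of changes to a local copy. It is written in Python purely for illustration (our real code is XSLT 2.0), and the `Change` tuple, its `seq` cursor and the dictionary standing in for the search index are all assumptions, not the actual PIPs feed format.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    """One entry from a hypothetical changes list."""
    seq: int                 # assumed: monotonically increasing change number
    pid: str                 # programme identifier
    record: Optional[dict]   # new data, or None if the item was deleted


def apply_changes(index: Dict[str, dict],
                  changes: Iterable[Change],
                  cursor: int) -> int:
    """Apply changes newer than `cursor` to `index`; return the new cursor."""
    for change in sorted(changes, key=lambda c: c.seq):
        if change.seq <= cursor:
            continue  # already replicated on an earlier poll
        if change.record is None:
            index.pop(change.pid, None)   # deletion
        else:
            index[change.pid] = change.record  # insert or update
        cursor = change.seq
    return cursor
```

Each poll would pass in the cursor returned by the previous one, so a change that is seen twice is applied only once.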
These data are in a hierarchical structure based on entities including masterbrands, services, brands, series, episodes, versions, broadcasts and on-demands. The episode data we're interested in are spread across all of them, so identifying the official "first broadcast" date and titles for an episode is harder than it first appears! (After ten years in the corporate IT world it's been rather refreshing to work with interesting data for the year I've been here!)
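To give a flavour of why the date takes some digging, here is a simplified Python sketch that walks from an episode down through its versions to their broadcasts and picks the earliest start time. The class and field names are made up for the example, and "earliest broadcast of any version" is only an assumed rule; the real logic has more cases to deal with.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Broadcast:
    service: str
    start: datetime


@dataclass
class Version:
    pid: str
    broadcasts: List[Broadcast] = field(default_factory=list)


@dataclass
class Episode:
    pid: str
    title: str
    versions: List[Version] = field(default_factory=list)


def first_broadcast(episode: Episode) -> Optional[datetime]:
    """Earliest broadcast start across every version of the episode.

    Returns None when no version has been broadcast yet.
    """
    starts = [b.start for v in episode.versions for b in v.broadcasts]
    return min(starts) if starts else None
```

The point is that the answer lives two levels below the episode itself, so every episode record we index has to be assembled from several entities.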
At query time, the search engine looks for both iPlayer and PIPs records in its database, giving precedence to the iPlayer record when it finds a pair corresponding to the same episode, and boosting results for more recent programmes.
This means that you should generally see the iPlayer items at the top of the list, followed by "coming up" programmes, followed by older episodes. We had to be careful with the boosting - we don't want to favour iPlayer and/or recent programmes too much, because a recent programme isn't necessarily the one most relevant to your query.
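The idea can be sketched in a few lines of Python. This is not how the search engine does it internally: the `Hit` fields, the half-life decay and the `max_boost` cap are all assumptions chosen to show the two steps, namely collapsing iPlayer/PIPs pairs and then applying a bounded recency boost.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List


@dataclass
class Hit:
    episode_pid: str
    source: str            # "iplayer" or "pips"
    relevance: float       # text-match score from the engine
    first_broadcast: date


def merge_and_rank(hits: Iterable[Hit], today: date,
                   half_life_days: float = 30.0,
                   max_boost: float = 0.25) -> List[Hit]:
    """Keep one hit per episode (iPlayer wins), then rank with a capped boost."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.episode_pid)
        if current is None or (hit.source == "iplayer"
                               and current.source != "iplayer"):
            best[hit.episode_pid] = hit

    def score(hit: Hit) -> float:
        # Distance from today in either direction, so "coming up"
        # programmes count as recent too.
        age_days = abs((today - hit.first_broadcast).days)
        boost = max_boost * 0.5 ** (age_days / half_life_days)
        return hit.relevance * (1.0 + boost)

    return sorted(best.values(), key=score, reverse=True)
```

Because the boost can never add more than `max_boost` (25% in this sketch) to the relevance score, a much better text match for an old programme still beats a weak match for a new one.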
We're planning to extend this system in the next few weeks to add other searchable data such as actor and presenter names, and also to record details of "top level editorial objects" - brands and standalone series - so they can be featured more prominently on the results page.
This release is the culmination of many other individual projects that have been ongoing for the last several months. Here's a brief overview:
Most obviously, we've given the user interface a spring clean, bringing it into line with the new "Visual Language" used across other BBC websites.
We've upgraded all the search servers to the latest version of Autonomy IDOL on brand new servers, migrating a lot of separate indexes spread across various server pools into three pools: Core, Media and Collections. I've lost track of the number of servers used to host the various parts of the Search system - it's easily more than 40 - so this upgrade gives us a welcome opportunity to decommission some venerable old servers that have been giving our Operations team nightmares!
We've improved the text encoding support across the site and in the indexing systems - you should no longer see corrupted UTF-8 characters anywhere.
We've implemented date biasing for results in the main results page - that means that recent content should appear higher up in the results.
We've noticed that people often search for BBC programmes using slightly different spellings to our preferred style, for example "Dr Who" vs the preferred form "Doctor Who". In these cases we search for the preferred style but give the option to search again using the original spelling. This system has previously only been in place on iPlayer search, but we think it's a useful addition across the other scopes.
We have a new Click Through Tracking system. You'll see evidence of that in the links on the first results page presented for each query. It's really useful for us to know which links are being used, so that we can understand how people use the site and improve it over time. Don't worry: we aren't storing any personally-identifiable data.
One last little thing I'd wanted to do for a long time was to change the search results URL from the dated and purely historical /cgi-bin/search/results.pl to simply /search. This proved to be one of those tasks that initially looks like an easy win, but because the old address is used internally for some specially-handled searches including CBBC Find, it turned out to be gnarlier than we expected!
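As an illustration of one item from the overview, the preferred-spelling behaviour could be sketched like this in Python. The lookup table, function name and return type are invented for the example; only the "Dr Who" to "Doctor Who" pair comes from our real data.

```python
from typing import Mapping, NamedTuple, Optional

# Illustrative table only: variant spelling -> preferred style.
PREFERRED_SPELLINGS = {"dr who": "doctor who"}


class RewrittenQuery(NamedTuple):
    query: str               # what we actually search for
    original: Optional[str]  # offered as "search again using..." if set


def _normalise(query: str) -> str:
    """Lower-case and collapse whitespace so lookups are forgiving."""
    return " ".join(query.lower().split())


def apply_preferred_spelling(
        query: str,
        table: Mapping[str, str] = PREFERRED_SPELLINGS) -> RewrittenQuery:
    preferred = table.get(_normalise(query))
    if preferred is None:
        return RewrittenQuery(query, None)   # nothing to rewrite
    return RewrittenQuery(preferred, query)  # keep the original for the link
```

Keeping the original query alongside the rewritten one is what lets the results page offer the option to search again with the spelling you actually typed.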
That's all for now. Please do let us know what you think of the search system - we're here to make it better!
Andy Webb is a Senior Software Engineer in Search & Navigation, BBC Future Media & Technology