Research & Development

Posted by Michael Smethurst on , last updated

The BBC Internet blog recently ran a series of posts on search engine optimisation at the BBC:

  1. Duncan blogged about insights to be found in search logs, the benefits of well-written meta descriptions and the importance of honesty and trust
  2. Oli blogged about the /food site rebuild and the importance of a redirect strategy
  3. Martin tackled copywriting and particularly headline writing for the BBC News site

Over in R&D our job is to experiment with possible futures, so this post covers some work we've been doing to help describe TV and radio in schema.org, but also some thoughts on how search (and optimisation for search) works now and how that might be about to change. So: post number 4 in a series of 3.

How search (and optimisation) works now

Most modern search engines work in the same way. They send out a "bot" to crawl the web, following links from one page to the next (hence the importance of redirects when pages move) and indexing the content they find on their travels (hence the importance of audience-appropriate copywriting). I've blogged about search bots in the past, describing them then as "your least able user", and that still holds true today. Build a site that's accessible to all users and you're 99% of the way to having a site that's optimised for search engines.

The results are also always the same. A user asks a question and the search engine returns a list of pages some of which can hopefully answer the question. SEO has two aspects:

  1. Increasing visibility of your content by getting your pages to appear further up search engine results than the pages of your competitors
  2. Making your results more attractive to users in the hope they're more likely to click through

There are various techniques employed to meet the former, predominantly:

  1. increasing link density around your content (inbound and outbound)
  2. making the titles of those links as descriptive of the content as possible
  3. making pages about all the things you know your users are interested in
  4. not duplicating content across many URLs
  5. not hiding content where search bots can't get to it (CSS hiding, reliance on javascript rather than progressive enhancement, Flash)

And there are a couple of techniques to meet the latter: crafting page titles and meta descriptions (which both appear on search engine result pages) to make the content appealing whilst conforming to the constraints of search engine result display; and using Rich Snippets to mark up your pages in a way that allows search engines to extract meaning and display appropriate information. The BBC's /food site used the latter with great success to annotate recipes with a photo and cooking times. Elsewhere it was widely reported that Best Buy saw a 30% increase in traffic when they added RDFa descriptions to their product pages; Yahoo! reported a 15% increase in click-throughs for enriched links; and Yahoo! Research reported that 31% of webpages and 5% of domains contain some embedded metadata.
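As a sketch of what that kind of markup looks like, here's an invented recipe fragment using the schema.org Recipe type in microdata (name, image and cookTime are real schema.org property names; the recipe, image path and times are made up):

```html
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Lemon drizzle cake</h1>
  <img itemprop="image" src="/images/lemon-drizzle.jpg" alt="Lemon drizzle cake">
  <!-- PT45M is an ISO 8601 duration: 45 minutes -->
  <p>Cooking time: <meta itemprop="cookTime" content="PT45M">45 minutes</p>
</div>
```

A crawler that understands microdata can lift the recipe name, the photo and a machine-readable cooking time straight out of the page and use them to build an annotated search result.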

In all of this the metrics for success are also the same: more users clicking more links to your content, more traffic, more page impressions, more uniques etc. But as Rich Snippets evolve into schema.org, knowledge graphs and particularly instant answers, these metrics might be set to change.

Freebase, Knowledge Graphs, Google Maps, Google+...

In 2010 Google acquired Metaweb and with it Freebase, a community maintained graph database of information taken from Wikipedia, MusicBrainz and a host of other sources. It was their first step away from search engines as an index of pages and toward search engines as a repository of knowledge. In similar moves Bing, Duck Duck Go and Apple (through Siri) chose to partner with Wolfram Alpha.

The various acquisitions and partnerships were intended to bootstrap the search companies' "knowledge engines" (the obvious example being Google's Knowledge Graph) with a baseline of "facts". Rather than seeing any of these search companies as a set of products (maps, social networks, product search) it probably makes more sense to see them as a massively intertwingled graph of data. Everything they do is designed to expand and make more links in this graph (even the move into mobile operating systems could be seen as a way to stitch contextual information around location into the graph). In that sense Google+ is less a social network and more an additional source of links between people and people, and between people and things.


The final piece in the jigsaw was the 2011 announcement of schema.org by a consortium of search engine companies (Google, Yahoo, Bing and Yandex), with support from the W3C, providing a way for websites to mark up content so that search engines can extract meaning from it.

Out of the box, HTML provides a way to mark up document semantics (headings, paragraphs, lists, tables etc.). If you want additional semantics to describe real-world things (people, businesses, events, postal addresses) the choices on offer can seem like a bit of a minefield. It's much simpler if you break it down into two layers: the syntax and the vocabularies. On the syntax layer there are three choices (all of which can be parsed to RDF):

  1. microformats is a syntax but with a built-in set of community-defined vocabularies
  2. RDFa provides a way to embed the RDF model into HTML using any combination of RDF vocabularies
  3. Microdata is the standard HTML5 approach
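To make the syntax/vocabulary split concrete, here's the same (invented) statement about a person expressed in each of the three syntaxes; the person is made up, but the attribute and property names are the real ones from each syntax:

```html
<!-- microformats (hCard): the vocabulary is baked into the class names -->
<p class="vcard">My name is <span class="fn">Alice Example</span>.</p>

<!-- RDFa (1.1 Lite attributes): any vocabulary can be plugged in -->
<p vocab="http://schema.org/" typeof="Person">
  My name is <span property="name">Alice Example</span>.
</p>

<!-- Microdata: HTML5's itemscope/itemtype/itemprop attributes -->
<p itemscope itemtype="http://schema.org/Person">
  My name is <span itemprop="name">Alice Example</span>.
</p>
```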

On top of these sit the vocabularies, some community defined, some more tied to specific implementations. Facebook's Open Graph is a vocabulary built on top of RDFa. Twitter's Cards build on top of Open Graph. And schema.org is a set of vocabularies to describe blog posts, comments, movies, books, diets, music and much, much more, built on top of microdata and RDFa. The schema.org vocabularies published so far are really straw-man building blocks for what's needed and not meant to be complete. If you have an interest or are a domain expert in any area covered (or not covered) there's a W3C mailing list to browse and/or join.

By marking up your content with schema.org vocabularies, when a search bot crawls your pages it can pick out the entities and their properties and stitch your assertions into the wider "knowledge graph", so search engines can answer more questions, more quickly, from more users.
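A rough sketch of what a crawler's extraction step does, assuming microdata markup: walk the HTML, and whenever an element opens an itemscope, collect the itemprop name/value pairs inside it. This is a deliberately minimal illustration using only Python's standard library (real extractors handle nested items, itemref, URL properties and much more); the episode here is invented:

```python
from html.parser import HTMLParser

class MicrodataSketch(HTMLParser):
    """Collects flat {property: value} dicts, one per itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []            # one dict per itemscope found
        self._pending_prop = None  # itemprop waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append({"type": attrs.get("itemtype", "")})
        if "itemprop" in attrs and self.items:
            if "content" in attrs:
                # meta-style property: value is in the content attribute
                self.items[-1][attrs["itemprop"]] = attrs["content"]
            else:
                # element-style property: value is the element's text
                self._pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._pending_prop and self.items:
            self.items[-1][self._pending_prop] = data.strip()
            self._pending_prop = None

html = """
<div itemscope itemtype="http://schema.org/TVEpisode">
  <h1 itemprop="name">An Example Episode</h1>
  <meta itemprop="episodeNumber" content="1">
</div>
"""
parser = MicrodataSketch()
parser.feed(html)
print(parser.items)
```

The crawler ends up with a small bundle of typed assertions (a TVEpisode, its name, its episode number) that can be merged into the wider graph.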

Using schema.org to describe TV and radio

In amongst the schema.org vocabularies were definitions for TVSeries, TVSeason and TVEpisode, which we were asked to comment on. Our candidate changes can be found here, and the background discussion is here.

Being the BBC, the first request was to put radio on a level footing with TV. Over the years we've found (with PIPs and the Programmes Ontology) that, at least in terms of data description, radio and TV have more similarities than differences. So we've requested the addition of Series, Season and Episode superclasses, with TVSeries, RadioSeries, TVSeason, RadioSeason, TVEpisode and RadioEpisode as subclasses.
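Under that proposal, a radio episode page might be marked up something like this. To be clear, this is a hypothetical sketch: RadioEpisode, RadioSeries and the partOfSeries link shown here are proposed terms, not accepted ones, and the programme is invented:

```html
<div itemscope itemtype="http://schema.org/RadioEpisode">
  <h1 itemprop="name">Episode 3</h1>
  <div itemprop="partOfSeries" itemscope itemtype="http://schema.org/RadioSeries">
    <span itemprop="name">The Hypothetical Hour</span>
  </div>
</div>
```

The shape mirrors the existing TV terms, which is the point of the request: the same structure, whether the thing broadcast is television or radio.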

We've asked for clearer definitions of start and end dates for series and seasons although that's probably not immediately apparent from the wording. If you read it slowly it does make sense...

We've also asked for better linking between episodes, series and seasons; first publication dates for episodes; the addition of clips; and the addition of broadcast services, broadcasts and ondemands (e.g. the data which determines catch-up availability in iPlayer).

There are still things missing which it would be nice to see, mostly around programme segmentation (interviews, scenes, tracklists), but what's there looks like a good starting point to build on.

Finally, yes, it does use North American terminology. A series is something like a po:Brand and a season is something like a po:Series. But from our search logs, lots of people in the UK seem to use "season" for "series" these days...

Knowledge Graphs, Instant Answers and Siri

So what might the future of search look like? One possibility is something like Apple's Siri. By which I don't mean all search will be done by voice. It could be voice or screen or a pair of location-aware, web-connected glasses which provide the interface. But the interface is less important than the switch from "ask a question, get some links to web pages" to "ask a question, get an answer". There, then, directly from the search engine. In use Siri doesn't feel much like traditional search but, bootstrapped by Wolfram Alpha, that's the job it's doing.

And the move from lists of links to direct answers in result pages is already starting to happen, rolled out for more and more entity types. If you search Google for Mo Farah the search results include an info box with a picture, date and place of birth, weight and height, education, spouse and children. There are similar results for football clubs and TV programmes. In a similar fashion Duck Duck Go brings in images and information from Wikipedia: Andy Murray, Oldham Athletic, Gardeners' World.

All this is powered by an underlying knowledge graph, and the knowledge graph will in turn be powered by the schema.org semantic assertions you make in your HTML. For at least some class of questions there's going to be less and less need for users to click through to web pages, when the questions they've asked can be answered directly by the search engine. Probably the best example of a direct question and answer is a Google search for how old is mo farah. 29, apparently.

The missing pieces: provenance and the economy of attribution

There are still some things which are unclear about the schema.org / knowledge graph / instant answers combination. From a broadcaster's perspective, how will territorially specific questions (when's Gardeners' World next on?) get answered? And how will provenance be flagged (this Friday at 20:30 on BBC Two, says the BBC)?

But the interesting part is how all this might change the metrics, not only for SEO but for websites in general. An example: in the main there are only two reasons I ever visit restaurant (or dentist or doctor or cattery) websites: to get the phone number and the opening hours. If you're prepared to accept anecdote as evidence and say I'm not alone in this, and if those sites can express this information semantically and search engines can extract it and present it directly on search results... then why would many people ever have to visit the website? The end result would be fewer visitors to the website but no fewer customers for the business.
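As a sketch, a restaurant page could expose exactly those two facts using schema.org terms (telephone and openingHours are real schema.org properties on LocalBusiness and its subtypes; the restaurant and phone number here are invented):

```html
<div itemscope itemtype="http://schema.org/Restaurant">
  <h1 itemprop="name">The Hypothetical Arms</h1>
  <p>Call us on <span itemprop="telephone">020 7946 0000</span>.</p>
  <!-- machine-readable form alongside the human-readable text -->
  <p>Open <meta itemprop="openingHours" content="Tu-Su 12:00-23:00">Tuesday to Sunday, noon to 11pm.</p>
</div>
```

A search engine that extracts this could put the phone number and opening hours straight onto the results page, with no click-through needed.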

Again from a broadcaster's perspective, we know (from offline audience enquiries and search logs) that there are a number of common questions users want to ask: when does programme X return, when's it broadcast, what was the theme music for programme Y, who played X in Y, can I buy programme X? Again, if this information is available from search results (in a consistent fashion across broadcasters) you might get fewer website visitors but no drop (and possibly a rise) in listening / viewing / purchase figures. Your website is still important, but for very different reasons. And in some extreme future the only visitors your website might ever have are search engine bots...

Obviously all this works less well for businesses which rely on page views for advertising revenue. Or even businesses built around original content online. And there's still a gap around provenance (who said that) and trust in that provenance. But it feels like we're moving from a web economy of page visits, uniques and reach toward an economy of attribution. How that economy works is still unclear but the W3C's work around provenance is probably a good starting point.