Vote 2014: data architecture and semantic tagging
Hello, I'm Paul Rissen, a Data Architect working with BBC News.
Recently, I've been involved in looking after the data architecture for the politics section of the BBC News website, and in particular, the Vote 2014 proposition. In this blog post, I'll take you through the work I've been doing, as well as introducing you to the world of Semantic Tagging and Linked Data.
What is Vote 2014?
Vote 2014 covers the main elections that are happening across the UK in May - local council elections, as well as voting for the European Parliament. But our aim has been to put the foundations in place for a system that we can use to cover any election in the future - especially with the Scottish Referendum and General Election coming up in the next twelve months. This means that when developing the data architecture, we have to strike a balance between delivering something bespoke for the May elections, and something more flexible for use in the long term.
Getting the Information Architecture right
My role on the project was two-fold. Firstly, I provided support on the information architecture of the site. This means that I helped the team define what the core concepts of interest to our users might be, and how they are stitched together. This is fairly simple at first - we have an election, a council, a candidate, a party and so on. But with each new concept you add, your site, and its development, gets slightly more complex. The advantage of working on Vote 2014 is that we had a very fixed deadline - the elections weren't going to be postponed if we didn't have our website ready in time.
So, rather than trying to model everything about an election, I tried to keep it down to the basics, whilst also not painting us into a corner when it comes to developing the site for future elections. Once this basic model was in place, I could deal with the second, main part of my role - supporting our journalists and our users through the use of semantic tagging.
[Image: What we can do with traditional tagging]
What is Semantic Tagging?
Tagging has been around on the Web for a good few years now. People come into contact with it every day with hashtags on Twitter, for instance. The concept is simple - everyone uses the same word or phrase to mark their Tweet as being about, or relevant to, a particular subject. The subject might be in the phrase, though it doesn't have to be, and the benefit is that anyone who is interested in that subject can click on the tag, and get everything, no matter where it's come from.
This kind of tagging only gets you so far - it can't really support anything more than being able to find everything tagged with that subject. Semantic tagging enables us to do two more things. Firstly, we define what type of thing the tag is - so we can say that not only is a phrase like 'David Cameron' a tag, but that 'David Cameron' is a Person. In the case of Vote 2014, the most important types for us were Councils, Constituencies and Elections.
The benefit to giving the tag a type is that we can build a richer, more useful and engaging experience for users based on the type of tag. For instance, if we know that the tag 'Birmingham City Council' is a Council type of tag, then when we give the user everything tagged with that subject, we can also provide more contextual information - because it's a council, we know it has an election history, a set of council members, and so on. If we know the tag is a Person, we can present biographical information, if a Place, a map, and so on.
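As a rough sketch of why typed tags help, the snippet below maps each tag type to the extra context a page could show. The type names come from the examples above; the `Tag` class, the `CONTEXT_BY_TYPE` table and the panel names are illustrative assumptions, not the BBC's actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mapping from tag type to the extra panels a page could show.
CONTEXT_BY_TYPE = {
    "Council": ["election history", "council members"],
    "Person": ["biography"],
    "Place": ["map"],
}


@dataclass(frozen=True)
class Tag:
    label: str
    tag_type: str  # e.g. "Council", "Person", "Place"


def context_panels(tag: Tag) -> List[str]:
    """Return the contextual panels suggested by the tag's type.

    A tag with an unknown type gets no extra context, which is
    all that traditional, untyped tagging can offer.
    """
    return CONTEXT_BY_TYPE.get(tag.tag_type, [])


birmingham = Tag("Birmingham City Council", "Council")
print(context_panels(birmingham))
```

The point of the design is that the page logic depends only on the type, so adding a new Council tag needs no new code.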
The second thing that semantic tagging allows us to do is to express a type of relationship between the article or clip the journalist has made, and the tag. In traditional tagging, there is no explicit type of relationship between the journalist's work and the tag. All we know is that, hopefully, the article has something to do with the tag. But with semantic tagging, we can be clear what the relationship is. We can say that this article is about Birmingham City Council, and mentions David Cameron, for instance.
[Image: What we can do with Linked Data]
Again, having this information allows us to be more useful in the user experience. If you wanted to know more about Birmingham City Council, chances are you'd be more interested in articles and clips which are about the Council, versus ones that just mention them - so we can prioritise the former, and suggest the latter once you're finished.
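To make the prioritisation concrete, here is a minimal sketch that orders content for a tag so that 'about' items come before 'mentions' items. The article titles, the `TAGGED` list and the `articles_for` function are invented for illustration; only the two relationship names come from the text.

```python
from typing import List, Tuple

# Each entry is (article title, relationship, tag label). Sample data is invented.
TAGGED: List[Tuple[str, str, str]] = [
    ("PM visits the Midlands", "mentions", "Birmingham City Council"),
    ("Council sets its budget", "about", "Birmingham City Council"),
    ("PM visits the Midlands", "about", "David Cameron"),
]

# Lower number sorts first, so "about" beats "mentions".
PRIORITY = {"about": 0, "mentions": 1}


def articles_for(tag: str, tagged: List[Tuple[str, str, str]]) -> List[str]:
    """Return titles tagged with `tag`, 'about' items first, then 'mentions'."""
    matches = [(PRIORITY[rel], title) for title, rel, t in tagged if t == tag]
    return [title for _, title in sorted(matches)]


print(articles_for("Birmingham City Council", TAGGED))
```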
My main role in BBC News Online, therefore, is to work with our teams to uncover which types of tags, which relationships, and then which instances (Birmingham City Council is an instance of a Council) we might need. Once I've done so, I construct what's known as an Ontology - a specification of the types and relationships that we're going to use - essentially like a dictionary, or a cook book, of allowed concepts. They don't all have to be used, and we can use as many or as few Ontologies as we need, but the types and relationships do need to be specified somewhere.
A Minimum Viable Ontology
As mentioned above, when working out a conceptual model for a website, it's very tempting to try and model it as completely as possible - in the case of Vote 2014, this might include not only Councils, Constituencies and Elections, but Candidates, Parties, Manifestos and so on. Each one of these types of things, and each individual instance, could have their own URL on the website, meaning they could be linked to and shared by our users. Whilst there may well be utility in doing this, it can be distracting when we have a fixed deadline, and, more importantly, it can lead to increased complexity early on in the project, when all you really need are the basics. On the other hand, we didn't want to come up with a model that would only work for these particular elections. So the model, and thus the ontology, that we created, had to be as streamlined, yet flexible, as possible.
The phrase 'Minimum Viable Product' has become fashionable recently in the world of software development, to describe an ethos whereby a development team concentrates on the smallest possible working, and useful, version of a project, releasing that to users with the appropriate provisos in place. By studying how real users interact with this minimum product, the development team can then learn, develop and release a new version sooner, repeating the process over and over again, rather than trying to capture all of the possible requirements for a perfect product up front, and not releasing anything for ages. In my world, we have a similar concept of Minimum Viable Ontology.
Admittedly, this was a new concept for me when I started on the project - but it was very useful. It keeps us honest about what features, and thus what types and relationships, are really necessary to provide users with the most crucial experience. And so although the ontology I produced has the name 'A Politics Ontology', it appears very small at first - and does not, perhaps, cover a lot of the obvious things that could have been modelled. However, it has what we need, and can be easily extended over time.
A Politics Ontology Explained
We knew we wanted to have a page where users could find the latest on the local elections in England, in Northern Ireland, the EU Parliament Elections, and a special UK-focused EU Election page too. We also wanted a page for every Council that was having an election in May, as well as the EU Parliamentary Constituencies within the UK, if not the rest of Europe. Most importantly, we wanted our journalists to tag their articles with the relevant Councils and Elections. Thus, the Ontology provides classes, i.e. types of tag, for Councils, Elections and what we're internally calling Statistical Geographies. Why that more complex name? Why not Constituency? Here, we decided to opt for flexibility over specificity. This is also the name that the Office for National Statistics (ONS) gives to this type of thing, which is an important point we'll return to soon.
We also included a relationship, also known as a property or a predicate, which allows us to link an area of land to the political organisation which has some kind of responsibility for it - this is the 'governsGSS' property that you can find in the ontology specification. Again, this is the same name as used by other organisations such as the ONS - a GSS code is a unique code for an area of land.
Using common identifiers
Although the BBC is interested in Councils and Constituencies, on behalf of its users, it is not necessarily the authority on those things. That responsibility belongs to organisations such as the Government and the Office for National Statistics. Thus, rather than reinventing the wheel when constructing our Ontology, we looked for terms that we could borrow, as long as their meaning was the same. Similarly, although the BBC will provide URIs, i.e. Web identifiers, for Councils and Constituencies, it is not the official place to go, or to point to, for those Councils and so on. Thus, we can use what's known as a sameAs relationship to link the terms and instances in the BBC's world to the more authoritative URIs for these things.
This may sound like cheating, but it is an important principle in Web design. Indeed, the whole Web is held together by hyperlinks - it would be remiss of us not to link to other sources that provide information on the same thing, and, if we use common identifiers, then users will benefit, as they can use a whole range of different websites, whilst only needing to use a single identifier to explain to a computer what they're looking for.
For the purposes of Vote 2014, we use sameAs links to connect our Council pages to the URIs and information provided by Open Data Communities, and we also connect Councils to the official ONS URI for the area of land which they govern. Similarly, for the EU Constituencies, as these are areas of land, we can use the sameAs relationship to point directly to the relevant ONS identifiers. For the Elections, where possible, I linked to the appropriate DBpedia or Wikidata URIs - these being structured data representations of Wikipedia, which are widely used across the Web as common identifiers.
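A small sketch of how these links might be expressed as data: one sameAs statement pointing at an authoritative URI, and one governsGSS statement pointing at the area of land. The `owl:sameAs` URI is the standard OWL one; every `example.org` URI, including the namespace used for `governsGSS`, is a placeholder, since the real URIs are not given here.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # standard OWL property
# Placeholder: the real URI of the governsGSS property is not given in the post.
GOVERNS_GSS = "http://example.org/ontology/governsGSS"


def council_links(council_uri: str, authority_uri: str, area_uri: str) -> List[Triple]:
    """Link a local Council URI to an authoritative URI and to the area it governs."""
    return [
        (council_uri, OWL_SAME_AS, authority_uri),
        (council_uri, GOVERNS_GSS, area_uri),
    ]


# All three URIs below are invented placeholders.
links = council_links(
    "http://example.org/bbc/councils/abc123",
    "http://example.org/authority/councils/abc123",
    "http://example.org/ons/areas/abc123",
)
for subject, predicate, obj in links:
    print(subject, predicate, obj)
```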
Putting it all together
Making sure we link to the right sources of information, and the right particular identifier for each Council and Constituency, was a manual task for me - involving a big spreadsheet which mapped them all together. Once I had done this, I had an even bigger task on my hands. For the project to be a success, I had to store the information about each Council, Constituency and Election in the BBC's Linked Data Platform. This meant translating from my spreadsheet to a format called RDF.
RDF stands for Resource Description Framework, and is a way of expressing information in triple statements - noun, verb, noun, or subject, predicate, object, to use the technical terms. "The cat sat on the mat" is a triple statement. The cat, and the mat, are concepts in this context, and 'sat on' is the verb, or predicate. Each of these would be identified by a URI - the cat and the mat would be instances of tags, and the verb 'sat on', would be given a URI in an ontology.
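For illustration, the cat-and-mat statement could be held as a three-part record and written out in N-Triples, one of the standard plain-text RDF formats. The `example.org` URIs and the `Triple` and `to_ntriples` names are assumptions made for this sketch.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the thing being described
    predicate: str  # URI of the relationship, defined in an ontology
    obj: str        # URI of the thing it is related to


def to_ntriples(t: Triple) -> str:
    """Serialise a URI-only triple as a single N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# Placeholder URIs for "The cat sat on the mat".
cat_on_mat = Triple(
    "http://example.org/things/cat",
    "http://example.org/ontology/satOn",
    "http://example.org/things/mat",
)
print(to_ntriples(cat_on_mat))
# <http://example.org/things/cat> <http://example.org/ontology/satOn> <http://example.org/things/mat> .
```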
So, for Vote 2014, I needed to produce a number of triple statements for each of the Councils, Constituencies and Elections - about two hundred tags in total, each with several triple statements. Rather than doing this manually, I decided to pick up my coding skills, lying dormant since University, and taught myself enough Python, a programming language, to create a script that would run through my spreadsheet and output the correct triple statements.
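The original script is not shown, so the following is only a guess at its general shape: read spreadsheet rows exported as CSV and emit a few N-Triples lines per row. The column names, the sample row and the `example.org` URIs are all assumptions; the `rdf:type`, `rdfs:label` and `owl:sameAs` URIs are the standard ones.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Invented stand-in for the spreadsheet, exported as CSV.
SAMPLE_CSV = """id,label,type_uri,same_as
council-1,Birmingham City Council,http://example.org/ontology/Council,http://example.org/authority/council-1
"""


def literal(text):
    """Quote a plain string as an N-Triples literal (escapes backslashes and quotes)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rows_to_ntriples(csv_text, base="http://example.org/things/"):
    """Turn each CSV row into type, label and (optional) sameAs statements."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base}{row['id']}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{row['type_uri']}> .")
        lines.append(f"{subject} <{RDFS_LABEL}> {literal(row['label'])} .")
        if row["same_as"]:  # skip rows with no authoritative URI
            lines.append(f"{subject} <{OWL_SAME_AS}> <{row['same_as']}> .")
    return lines


for line in rows_to_ntriples(SAMPLE_CSV):
    print(line)
```

With around two hundred rows, a loop like this replaces several hundred hand-written statements.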
After a few attempts, and some help from others, it worked, and I could load the tags, and all their associated information, into the Linked Data Platform. From that moment on, the tags were available for journalists to use, and we could build the rest of the Vote 2014 proposition, pulling in any pieces of content, across BBC News on TV, Radio and Online, regardless of regional team, that had been tagged with a relevant tag.
An oft-forgotten part of Web design is URL design. As the inventor of the Web, Sir Tim Berners-Lee, himself wrote, "Cool URIs don't change". Once a URL is out in the wild, although it may well have to change, the valid reasons for doing so are few and far between. We can never completely guarantee that a URL will be permanent, but it's good practice to try and design URLs so that they are less likely to need to change in the first place, avoiding future technical headaches, as well as the worst user experience of all - a 404. The BBC /programmes platform is a good example of a part of BBC Online that has been designed with long-lasting URLs in mind.
Earlier, I mentioned the challenge of building something that worked for the elections in May, whilst providing a foundation for future development. An important distinction to make here is the difference between things which are only relevant for the May elections, versus things that will be relevant for much longer. The Council pages are a good example of these, as the Council will live on after the elections, and for, hopefully, many elections in the future. How do we avoid having to rebuild the Council page for every election in the future?
We can tackle this, in part, by designing our URL structure so that the Council pages, for instance, don't live underneath the Vote 2014 proposition. To all intents and purposes, in the visual and interaction design, they can appear as part of Vote 2014 for the duration of May, but the URL pattern we use for them is bbc.co.uk/news/politics/councils/:id. Arguably, we could remove /politics/ (and indeed /news!) from the URL, as the more things we put into the URL, the more fragile it becomes, but again, practicality and pragmatism have to come into play - we can be sure that councils will almost always be political concerns, and the BBC is most likely to be featuring Councils and the like in the context of News, so that should be OK.
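As a tiny illustration of the pattern, the helper below builds a Council path that carries no election-specific segment, so the same address can serve every future election. The `council_path` function and the `abc123` identifier are hypothetical; only the path pattern comes from the text.

```python
from urllib.parse import quote

# Pattern from the text: bbc.co.uk/news/politics/councils/:id
COUNCIL_PATH = "/news/politics/councils/{council_id}"


def council_path(council_id: str) -> str:
    """Build the path for a Council page; the id is percent-encoded for safety."""
    return COUNCIL_PATH.format(council_id=quote(council_id, safe=""))


print(council_path("abc123"))  # "abc123" is a made-up identifier
```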
Finally, we want to ensure that the pages we've made are seen and found by our users. This means that when people use search engines, we want our pages to appear high enough in the results, and we want to make sure that the result is clear enough for people to trust and understand what they are going to get if and when they click on the link.
Human-readable URLs help, though permanence is the most important thing. One other way of helping search engines understand what information is on a page is to use RDFa - i.e. to embed some of the triple statements mentioned earlier in the HTML of the page itself. Last year, for the local elections which fell under the 'Vote 2013' banner, we prepared RDFa and this had a big impact on the search rankings of our pages. Therefore, the final part of my role on the project, aside from continuing to support the team throughout the May period, and beyond as we extend the Politics proposition, was to define a template for the development team, into which they could pour the relevant details for each Council, Constituency and Election, and get RDFa out.
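The actual template is not reproduced here, so this is only a sketch of the idea: a fill-in-the-blanks HTML fragment that uses the RDFa attributes `about`, `typeof`, `property` and `resource`. The markup layout, the `render_rdfa` function and all `example.org` URIs are assumptions.

```python
from html import escape
from string import Template

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical template: one typed resource, its label, and one sameAs link.
RDFA_TEMPLATE = Template(
    '<div about="$uri" typeof="$type_uri">\n'
    '  <h1 property="$label_prop">$label</h1>\n'
    '  <span property="$same_as_prop" resource="$same_as"></span>\n'
    '</div>'
)


def render_rdfa(uri, type_uri, label, same_as):
    """Fill the template, HTML-escaping every value that is poured in."""
    return RDFA_TEMPLATE.substitute(
        uri=escape(uri),
        type_uri=escape(type_uri),
        label_prop=RDFS_LABEL,
        label=escape(label),
        same_as_prop=OWL_SAME_AS,
        same_as=escape(same_as),
    )


print(render_rdfa(
    "http://example.org/bbc/councils/abc123",      # placeholder URI
    "http://example.org/ontology/Council",         # placeholder class URI
    "Birmingham City Council",
    "http://example.org/authority/councils/abc123",  # placeholder URI
))
```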
Beyond Vote 2014
As mentioned, the Politics Ontology is, for now, not much developed beyond its minimum viable state. Over the coming months, with the Scottish Referendum and General Election on the horizon, we'll be looking to expand it, and improve the tags we have in order to support the rest of the BBC News Online Politics proposition, bringing together all our journalism, no matter whether broadcast or on-demand, and eventually, relevant content from the rest of the BBC, in a way which is more meaningful, more useful, and more engaging, for all our audiences.
Thanks for reading.
Paul Rissen is a Data Architect, News Strategic Output, BBC News
If you've enjoyed Paul's post you may also be interested in this post from Sophia Angeloutou: "Linked data: new ontologies website".