
Building applications on large datasets

Chris Lowis Chris Lowis | 14:58 UK time, Monday, 21 March 2011

Recently there's been a lot of discussion in the technology scene about open-data and ways of working with large datasets. A recent high profile story was the UK government's data.gov.uk project, which aims to make reams of public data accessible on-line in electronic form. The opening up of this data suggests many possible use-cases.

In this article we look at the emerging "Data Marketplace" providers, and build a small example application for querying BBC Programmes data on top of Kasabi's new platform.

The data.gov.uk data is only a small part of a world of open-data, some of it free-form and difficult to search and index, some of it highly structured and semantic.

Developers who wish to build applications which analyse and present data from across a multitude of different datasets face considerable challenges. First the data must be collected, then normalised or cleansed into a usable form and finally stored in a datastore of some kind so that it can be queried and "mashed-up". The interest and promise of open-data really lies in the latter stage of this process, but so much work has to be invested in the former that many projects die an early death.

Collecting and cleansing datasets and providing scalable datastores is therefore a tempting market for a raft of upcoming start-ups and more traditional "big-data" specialists. A recent blog post from the start-up DataMarket gives an excellent overview of this emerging field and the companies that are hoping to find success within it.

Working with open datasets

Chris Needham and I were looking for a way to query BBC Programmes data to extract information related to a particular point in time. We were provided with early access to the Kasabi platform, currently in closed beta.

Kasabi is a start-up venture incubated within Talis, a company with a history of providing data services to libraries and other public institutions. While the platform remains in closed beta, we were allowed a look inside with the proviso that we would feed back ideas and suggestions and attempt to build a small prototype on top of it.

Behind the scenes, the Kasabi platform is built on the Talis datastore. This is a scalable graph database which can ingest data in RDF format and provides a SPARQL interface for querying and extracting subsets of the database. On top of this there is a Web front end written in Drupal which allows you to explore datasets (and their licences) and see what other people are building on them. The most interesting part of the platform, though, is a middle layer which allows developers to create APIs on top of datasets by, for example, creating a stored SPARQL query but presenting the results in JSON through a simple API call. We'll come onto an example of this later.

Kasabi has already ingested many datasets into their platform including DBPedia People, Foodista, a large set of Open Government data and, significantly for us, the BBC's /programmes, /music and /nature data.

What Was On? A simple BBC Schedule query API

Chris and I set about creating a small application using the Kasabi platform. We wanted to know: "given a point in time tell me what was on all of the BBC's TV and Radio services at that point". This is currently difficult to do with the /programmes API as it would involve fetching the schedules for each service individually, combining and then narrowing them down to a particular time.

We came up with a simple API which allows you to submit a date-time string and get back an array of JSON objects containing the broadcasts corresponding to that point in time. For example http://whatwason.heroku.com/2011-03-10T08:00:00Z returns broadcasts for 8am on 10th March 2011. You can add a ?limit=N query parameter to increase the number of results.
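To give a flavour of what comes back, here's a short Ruby sketch of consuming such a response. The field names in the sample JSON are illustrative, not the actual schema returned by the API:

```ruby
require 'json'
require 'time'

# Illustrative sample of the kind of JSON array the What Was On? API might
# return for 2011-03-10T08:00:00Z; field names here are assumptions.
sample = <<~JSON
  [
    {"title": "Today", "service": "BBC Radio 4",
     "broadcast_start": "2011-03-10T06:00:00Z",
     "broadcast_end":   "2011-03-10T09:00:00Z"},
    {"title": "BBC Breakfast", "service": "BBC One",
     "broadcast_start": "2011-03-10T06:00:00Z",
     "broadcast_end":   "2011-03-10T09:15:00Z"}
  ]
JSON

broadcasts = JSON.parse(sample)
broadcasts.each do |b|
  starts = Time.iso8601(b["broadcast_start"])
  ends   = Time.iso8601(b["broadcast_end"])
  puts "#{b['service']}: #{b['title']} (#{starts.strftime('%H:%M')}-#{ends.strftime('%H:%M')})"
end
```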

This API was created using a small Sinatra app on Heroku which makes a call to a Kasabi API we created and appends our private application key.
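The proxy's job is essentially URL construction: take the visitor's date-time, derive the date, and forward both to Kasabi along with the key. A minimal sketch of that step, using only the Ruby standard library (the endpoint path and the `apikey` parameter name are assumptions; the date/time split matches the call shown later in the post):

```ruby
require 'uri'

# Build the upstream Kasabi API URL from a visitor-supplied ISO 8601
# date-time string, appending our private application key.
def kasabi_url(time_str, api_key)
  date = time_str[0, 10]   # scope the query to a single schedule date
  URI::HTTP.build(
    host:  "api.kasabi.com",
    path:  "/api/",
    query: URI.encode_www_form(date: date, time: time_str, apikey: api_key)
  ).to_s
end

url = kasabi_url("2011-03-10T08:00:00Z", "SECRET-KEY")
# => "http://api.kasabi.com/api/?date=2011-03-10&time=2011-03-10T08%3A00%3A00Z&apikey=SECRET-KEY"
```

Keeping the key on the server side like this means it never appears in URLs visitors see.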

On the Kasabi side we created a stored SPARQL query, a detailed explanation of which is probably beyond the scope of this post. For those who understand SPARQL, here's the query in full; for those who don't, O'Reilly has a good book on the subject, and the specification itself is quite readable.

PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX po: <http://purl.org/ontology/po/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title ?short_synopsis ?service ?broadcast ?broadcast_start ?broadcast_end
WHERE
{ ?episode po:version ?version .
  ?episode po:short_synopsis ?short_synopsis .
  ?episode dc:title ?title .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?event_time .
  ?broadcast po:broadcast_on ?service .
  ?broadcast po:schedule_date ?date .
  ?event_time tl:start ?broadcast_start .
  ?event_time tl:end ?broadcast_end .
  FILTER ( (?broadcast_start < ?time) &&
           (?broadcast_end   > ?time) )
}
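The FILTER clause at the end is the heart of the query: it keeps only broadcasts whose interval contains the requested time. The same containment test, written in plain Ruby over some illustrative data, looks like this:

```ruby
require 'time'

# Illustrative broadcast intervals; the FILTER above keeps a broadcast when
# broadcast_start < time && broadcast_end > time.
broadcasts = [
  { title: "Today",        start: "2011-03-10T06:00:00Z", end: "2011-03-10T09:00:00Z" },
  { title: "Woman's Hour", start: "2011-03-10T10:00:00Z", end: "2011-03-10T10:45:00Z" }
]

query_time = Time.iso8601("2011-03-10T08:00:00Z")

on_air = broadcasts.select do |b|
  Time.iso8601(b[:start]) < query_time && Time.iso8601(b[:end]) > query_time
end

puts on_air.map { |b| b[:title] }.inspect   # => ["Today"]
```

Note that the strict inequalities mean a programme that starts exactly at the query time is excluded; whether that edge case matters depends on the application.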

In SPARQL, variables begin with a ?, and Kasabi uses this to allow parameters to be provided through query strings. The actual API call our Heroku app makes looks more like this: http://api.kasabi.com/api/?date=2011-03-10&time=2011-03-10T08:00:00Z

We had to provide separate date and time parameters to make the query execute in an acceptable time. Using the FILTER operation in SPARQL can be quite slow so Kasabi's Leigh helped us to improve performance by first scoping the data to a specific date by specifying the po:schedule_date relationship.
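The effect of that optimisation can be sketched in Ruby: rather than interval-testing every broadcast in the store, first narrow to one schedule date (a cheap lookup), then apply the interval test only to that day's much smaller set. The data and structure here are illustrative, not how the Talis datastore works internally:

```ruby
require 'time'

# Broadcasts indexed by schedule date, standing in for the po:schedule_date
# scoping in the real query.
by_date = {
  "2011-03-09" => [{ title: "Newsnight", start: "2011-03-09T22:30:00Z", end: "2011-03-09T23:20:00Z" }],
  "2011-03-10" => [{ title: "Today",     start: "2011-03-10T06:00:00Z", end: "2011-03-10T09:00:00Z" }]
}

time = Time.iso8601("2011-03-10T08:00:00Z")

# Step 1: cheap scoping by date (po:schedule_date).
candidates = by_date.fetch(time.strftime("%Y-%m-%d"), [])

# Step 2: the expensive interval test (the FILTER) runs on far fewer rows.
on_air = candidates.select do |b|
  Time.iso8601(b[:start]) < time && Time.iso8601(b[:end]) > time
end

puts on_air.first[:title]   # => Today
```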

The API we've produced is an interesting proof-of-concept that has applications to our work around RadioTAG. It also demonstrates the power of having all of our data in a form that is easy to search and query. Feel free to try some sample queries of your own, but as it stands the API comes with no guarantees of accuracy or availability!

One possible extension to this API would be to support querying the schedule data of other broadcasters. However, at the moment Kasabi do not support querying across multiple datasets, and as far as I am aware schedule data from other broadcasters is not available in a form that would be easy to ingest into a datastore. I can also imagine combining time-based searches of BBC data with other historical datasets, to find programmes that relate to particular events in the news, for example.

We haven't yet experimented with some of the other ways of working with the data, but many are provided including faceted search, augmentation of RSS feeds with data from the dataset, and an implementation of the Linked Data API. There's also a blog post describing other applications that have been built on the platform.

Conclusion

At the BBC we produce a lot of data and make some of it available via our APIs. We were also the first public sector organisation to publish linked data and are now starting to use semantic data platforms to power some of our websites. But the potential for aggregated and searchable data that spans multiple domains is much larger.

Last year, Georgi Kobilarov issued an interesting challenge: "If we had a web of data, what would you build?". The question generated a lot of exciting ideas on ReadWriteWeb and in the Open Data community. The enabling technology has been around for some time, but with the emergence of hosted service providers giving developers the ability to query across multiple datasets we are starting to see the potential for realising some of these ideas.
