When is a dataset not a dataset? The hackday project that crowdsourced data.gov.uk
When is a dataset not a dataset? Of the 3241 datasets now listed on data.gov.uk, how many are easy to open up and play with? How many are tables for computers to analyse, rather than PDF reports for people to read?
The Hacks and Hackers Hackday filled a Channel 4 office with journalists and developers on the final Friday in January. Our aim was to tell new stories with open data. Attendees already had form - the BBC's Open Secrets blogger Martin Rosenbaum, and data journalism teams from the Times, the Guardian, and the
Tom Morris was part of a team that looked into the quality of data.gov.uk. Although data.gov.uk advertises itself as a database of open datasets, many of the entries are actually PDF files. He built a prototype format checker that invites people to go through datasets and record the file format. You can listen to him explaining the checker to me and to the hackday, or reuse the interview under the BBC Backstage License.
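A rough automated first pass at the same question could guess each entry's format from the file extension in its URL. The sketch below is only an illustration, not Tom's checker: the `EXTENSION_LABELS` mapping and the `guess_format` function are hypothetical names, and the labels are borrowed from the categories in the breakdown. An extension says nothing about what the file really contains, which is why a human check is still needed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a format label.
# The labels reuse the category names from the crowdsourced breakdown.
EXTENSION_LABELS = {
    ".csv": "CSV",
    ".json": "JSON",
    ".xml": "XML",
    ".rdf": "RDF",
    ".xls": "Excel",
    ".xlsx": "Excel",
    ".pdf": "PDF",
    ".doc": "Word",
    ".docx": "Word",
    ".rtf": "RTF",
    ".odt": "OpenOffice",
    ".ods": "OpenOffice",
    ".htm": "HTML",
    ".html": "HTML",
}


def guess_format(url: str) -> str:
    """Guess a format label from the extension in the URL path alone."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    # Anything unrecognised, including URLs with no extension, needs a human.
    return EXTENSION_LABELS.get(suffix, "Something odd")


if __name__ == "__main__":
    for url in (
        "http://example.org/reports/annual.PDF",
        "http://example.org/stats.csv?year=2009",
        "http://example.org/",
    ):
        print(url, "->", guess_format(url))
```

The example URLs are placeholders, not real data.gov.uk entries.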
On Wednesday February 3rd, he put a completed quality checker online. By that Thursday, the crowd had gone through data.gov.uk and marked up all of the datasets.
Tom posted his initial breakdown to the data.gov.uk community on March 20th:
HTML - 252
XML - 5
Word - 4
RTF - 1
OpenOffice - 1
Something odd - 85
JSON - 9
Nothing there! - 190
CSV - 12
Multiple formats - 1211
PDF - 468
RDF - 10
Excel - 408
TOTAL - 2656
Sadly, this is over-optimistic. I've manually checked some of the data that has been categorised as JSON and RDF. Most of it is not actually correctly categorised - either people clicked, say, 'RDF' when they meant to click 'PDF', or they have seen an RSS or Atom feed and categorised it as RDF. What this admittedly imperfect dataset is basically saying is that the vast majority of the 'data' on data.gov.uk is not actually machine-readable data but human-readable documents.
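The arithmetic behind the breakdown can be checked with a short sketch. The counts are copied from Tom's figures; the `MACHINE_READABLE` grouping and the `share` helper are assumptions made for illustration, not his classification. The "Multiple formats" entries cannot be assigned to either side from these numbers alone, and Tom cautions that most of the JSON and RDF labels are wrong.

```python
# Counts as posted in the initial breakdown.
counts = {
    "HTML": 252,
    "XML": 5,
    "Word": 4,
    "RTF": 1,
    "OpenOffice": 1,
    "Something odd": 85,
    "JSON": 9,
    "Nothing there!": 190,
    "CSV": 12,
    "Multiple formats": 1211,
    "PDF": 468,
    "RDF": 10,
    "Excel": 408,
}

total = sum(counts.values())

# Assumption: these labels are treated as machine-readable for this sketch.
MACHINE_READABLE = {"XML", "JSON", "CSV", "RDF", "Excel"}


def share(labels):
    """Return the percentage of all marked-up entries carrying these labels."""
    return 100 * sum(counts[label] for label in labels) / total


if __name__ == "__main__":
    print(f"Total entries: {total}")
    print(f"Labelled machine-readable: {share(MACHINE_READABLE):.1f}%")
    print(f"PDF only: {share({'PDF'}):.1f}%")
    print(f"Multiple formats: {share({'Multiple formats'}):.1f}%")
```

Summing the individual counts reproduces the stated total of 2656, so the posted figures are internally consistent.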
He will be at the Open Knowledge Conference this weekend, where he will speak about Citizendium and might do the analysis, which he told me was the most important part. When it is done, it should make very interesting reading.