The Simple Joys of Web-Scale Identifiers
<aside>Second post of the day is quite a record for me but this one isn't about microformats so you can probably look away now...</aside>
This post is partly a response to Tom's post about URLs and partly the result of conversations with Matthew Wood, Chris Sizemore and John O'Donovan on our recent jaunt to Linked Data Planet. Now I think that most of our department would agree with Tom. After all we've been having these conversations for a few years now and when it comes to URL design we're standing on the shoulders of giants.
When you're building anything it's always good to admit that cleverer people than you or I (or even Tom) came before. In the case of the web those people gave us HTTP and HTTP is stateless. It's the whole beauty of the web: everyone, everywhere gets the same thing from the same place. The moment you pick a fight with this design you're probably gonna get beat.
So, like I say, I agree with the four Linked Data rules but I'd like to try to add a fifth: if possible, don't reinvent other people's web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall (http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html) that'll be d5da1841-9bc8-4813-9f89-11098090148e.
The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became http://www.bbc.co.uk/music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn't exist anywhere outside of /music. We'll (hopefully) never make that mistake again.
When we first partnered with MusicBrainz the big attraction was twofold:
- stable web-scale identifiers
- liberal data licensing - no separate deals to reuse data in APIs etc
So when the next version of /music goes live you'll see: http://www.bbc.co.uk/music/artists/d5da1841-9bc8-4813-9f89-11098090148e and the world will hopefully be a slightly better place.
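For the curious, the identifier swap can be sketched in a few lines of Python. This is only an illustration, not how /music is actually built: the function names extract_mbid and bbc_music_url are made up for this post, and the only things taken from the text above are the two URL patterns.

```python
import re
from urllib.parse import urlparse

# MusicBrainz identifiers are UUIDs: 8-4-4-4-12 hexadecimal characters.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def extract_mbid(url):
    """Return the UUID found in the URL path, or None if there isn't one."""
    path = urlparse(url).path.lower()
    match = UUID_RE.search(path)
    return match.group(0) if match else None


def bbc_music_url(mbid):
    """Build a /music artist URL following the pattern quoted above."""
    return f"http://www.bbc.co.uk/music/artists/{mbid}"


mbid = extract_mbid(
    "http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html"
)
print(bbc_music_url(mbid))
```

The old jb9x style URL has no UUID in it, so extract_mbid returns None for it, which is rather the point: there's nothing in that URL anyone else can reuse.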
Now I can already hear my old mentor saying:
Michael noooo! URIs are just identifiers for resources. They shouldn't reflect the taxonomy of the site. The resource should define its relationships to other resources not the URI. Call them anything you like but just keep them stable.
With which I also mostly agree, but if bbc.co.uk/programmes tagged content with the same vocabulary as bbc.co.uk/news we'd be able to cross-promote news stories from programmes and programmes from news stories by sharing APIs, not databases. Tie this into personalisation and the power goes exponential. Read six articles on reconstruction in Iraq? Then you might like this Panorama programme.
But if the vocabulary used to tag programmes and news was web-scale then The Times, The New York Times, Fox News etc (or someone in between) could start to aggregate stories around a shared sense of topic. This is what Chris's recent post on using Wikipedia / DBpedia as a controlled vocabulary begins to hint at. It's like Yahoo! Term Extraction or Open Calais except the terms returned are web-native, or web-scale, identifiers if you will.
So what's the practical benefit? Well, because the new /music URLs will be based on MusicBrainz identifiers, because /music will be interlinked with /programmes, and because the Last.fm API speaks in MusicBrainz identifiers, Patrick can spend a weekend at Mashed making something that takes your Last.fm user name, extracts your favourite artists, ties them to /music and recommends BBC programmes. Which is a pretty good hack.
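The heart of a hack like that is just a join on a shared identifier. Here's a rough sketch of the idea, and very much not Patrick's actual code. Everything in it is an assumption for illustration: the shape of the favourite-artists list isn't modelled on the real Last.fm response, the programme ids are placeholders, and the pairing of programmes with artists is invented.

```python
def recommend_programmes(favourite_artists, programme_artists):
    """Rank programme ids by how many favourite artists they feature.

    favourite_artists: list of dicts with an "mbid" key (hypothetical shape).
    programme_artists: dict of programme id -> list of artist MBIDs (hypothetical).
    """
    # Artists without an MBID can't be joined on, so they are skipped.
    favourite_mbids = {a["mbid"] for a in favourite_artists if a.get("mbid")}

    scored = []
    for programme_id, mbids in programme_artists.items():
        shared = favourite_mbids & set(mbids)
        if shared:
            scored.append((len(shared), programme_id))

    # Most shared artists first, ties broken alphabetically by programme id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [programme_id for _, programme_id in scored]


THE_FALL = "d5da1841-9bc8-4813-9f89-11098090148e"  # from the MusicBrainz URL above
SOMEONE_ELSE = "00000000-0000-0000-0000-000000000001"  # placeholder, not a real MBID

favourites = [
    {"name": "The Fall", "mbid": THE_FALL},
    {"name": "Artist with no MBID", "mbid": ""},
]
programmes = {
    "programme-1": [THE_FALL, SOMEONE_ELSE],  # invented pairing
    "programme-2": [SOMEONE_ELSE],
}

print(recommend_programmes(favourites, programmes))  # ['programme-1']
```

Neither side needs access to the other's database. Both just have to speak the same identifiers.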
Take another example, for those who wish to stalk Tom Scott. His blog is at derivadow.com, which is also his OpenID; you'll find his delicious account at del.icio.us/derivadow, his tweets at twitter.com/derivadow and, if you want to hire him, he's at www.linkedin.com/in/derivadow on LinkedIn. So derivadow is a web-scale identifier for Tom. It's not as strong or as powerful as a set of RDF-linked URIs but if you wanna aggregate Tom-ness it's a pretty good starting point. Sadly I can't find him anywhere on Last.fm but that's possibly a godsend.
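If you wanted to automate the stalking, a first pass might look like the sketch below. The URL patterns are the ones listed above with an http:// scheme assumed; the names PROFILE_TEMPLATES and candidate_profiles are made up. It only produces candidate URLs: a shared handle is a hint that the accounts belong to the same person, not proof.

```python
# URL patterns taken from the examples above; the http:// scheme is an assumption.
PROFILE_TEMPLATES = {
    "blog": "http://{handle}.com",
    "delicious": "http://del.icio.us/{handle}",
    "twitter": "http://twitter.com/{handle}",
    "linkedin": "http://www.linkedin.com/in/{handle}",
}


def candidate_profiles(handle):
    """Return service name -> guessed profile URL for a shared handle."""
    return {
        service: template.format(handle=handle)
        for service, template in PROFILE_TEMPLATES.items()
    }


for service, url in candidate_profiles("derivadow").items():
    print(service, url)
```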
The obvious question is: if web-scale identifiers are so good, why did the BBC mint its own for programmes? After all, the b00c4wxm used in /programmes and iPlayer is a BBC invention. And the answer is there were no suitable identifiers out there. I'd like to think that if Program(me)Brainz existed with stable identifiers we'd have put in the work to use those instead. But it didn't so we couldn't... But now we have stable identifiers out there on the web, free to use for anyone. It would be good, for example, to see these identifiers adopted by Speechification. Time will tell.
One argument against all this is that web-scale identifiers are often kinda ugly. After all, if Last.fm gets away with www.last.fm/music/The+Fall why do we need d5da1841-9bc8-4813-9f89-11098090148e? The answer is ambiguity. MusicBrainz has 16 Auroras. Which one(s) does the BBC play? Probably none actually but you get the point. If we want to be exact in what we point to we need to handle ambiguity. In general we follow 3 commandments:
- URLs should be human readable
- URLs should be hackable
- URLs should persistently point to one concept
And the greatest of these is persistence. If you can't maintain stable URLs per concept don't even bother with the first two. There are others who argue that URLs are part of the interface. If resolving ambiguity is not important to your business then I'd agree, but if you need to differentiate stuff with the same label you need unique identifiers - better yet web-scale identifiers.
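The ambiguity problem is easy to show with a toy lookup. In the sketch below the two Aurora records and their ids are invented placeholders (not real MusicBrainz identifiers), and the function names are hypothetical. Looking up by label gives you a pile of candidates, while looking up by identifier gives you exactly one thing.

```python
from collections import defaultdict

# Invented records: the two Aurora ids are placeholders, not real identifiers.
ARTISTS = [
    {"id": "placeholder-id-1", "name": "Aurora", "note": "first artist called Aurora"},
    {"id": "placeholder-id-2", "name": "Aurora", "note": "second artist called Aurora"},
    {"id": "d5da1841-9bc8-4813-9f89-11098090148e", "name": "The Fall", "note": "from the post"},
]

by_name = defaultdict(list)  # label -> every artist sharing that label
by_id = {}                   # identifier -> exactly one artist
for artist in ARTISTS:
    by_name[artist["name"]].append(artist)
    by_id[artist["id"]] = artist


def resolve_by_name(name):
    """A label can match several artists, so this returns a list."""
    return by_name.get(name, [])


def resolve_by_id(identifier):
    """An identifier matches at most one artist."""
    return by_id.get(identifier)


print(len(resolve_by_name("Aurora")))             # 2
print(resolve_by_id("placeholder-id-2")["note"])  # second artist called Aurora
```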
Now I guess the Linked Data people would say do this properly in RDF with owl:sameAs etc and we will do. But for hackers without PhDs the possibility of instant interoperability and quick mash-ups is irresistible. Obviously you'll still need to establish equivalency between this and this but luckily that's where the Linking Open Data people have done some of our work for us. And they're damn nice people to boot.
So I guess what I'm saying echoes Tom. Cleverer people than us have come up with ways to attach web-scale identifiers to content so why waste time reinventing? Whilst the BBC or *insert your organisation here* should own their data (whilst hopefully making it free - as in beer; as in speech) we don't have to own our identifiers. If we choose to use the power of web-scale identifiers we free our content to fly and leave it to other people to add value / make money in the middle. It's not exactly profound but it does feel like a small breakthrough to an aging BBC employee.