
The Simple Joys of Web-Scale Identifiers


Michael Smethurst | 13:26 UK time, Wednesday, 25 June 2008

<aside>Second post of the day is quite a record for me but this one isn't about microformats so you can probably look away now...</aside>

Bob Dylan with his MusicBrainz identifier

This post is partly a response to Tom's post about URLs and partly the result of conversations with Matthew Wood, Chris Sizemore and John O'Donovan on our recent jaunt to Linked Data Planet. Now I think that most of our department would agree with Tom. After all we've been having these conversations for a few years now and when it comes to URL design we're standing on the shoulders of giants.

When you're building anything it's always good to admit that cleverer people than you or I (or even Tom) came before. In the case of the web those people gave us HTTP and HTTP is stateless. It's the whole beauty of the web: everyone, everywhere gets the same thing from the same place. The moment you pick a fight with this design you're probably gonna get beat.

Which is not to say that people haven't picked this fight. Many websites (including the BBC) use cookies to preserve state across requests. So stateful web apps do get built but when you make that choice you need to be aware that all your user activity will remain uncaptured by the web - no browsability, no google goodness, no benefit to your organisation (beyond the obvious) and no caching.

So, like I say I agree with the four Linked Data rules but I'd like to try to add a fifth: if possible don't reinvent other people's web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall (http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html) that'll be d5da1841-9bc8-4813-9f89-11098090148e.
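To make "web identifier" a bit more concrete, here's a minimal sketch of pulling that fragment out of a URL. It assumes the identifier is UUID-shaped (as MusicBrainz ones are) and sits somewhere in the URL path; the function name `extract_identifier` is made up for illustration and isn't part of any BBC or MusicBrainz library.

```python
import re
import uuid
from urllib.parse import urlparse

# Canonical 8-4-4-4-12 hex layout of a UUID (assumption: identifiers look like this).
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def extract_identifier(url):
    """Return the UUID-shaped fragment of a URL path, or None if there isn't one."""
    path = urlparse(url).path
    match = UUID_RE.search(path)
    if match is None:
        return None
    # Round-trip through uuid.UUID to normalise the case of the hex digits.
    return str(uuid.UUID(match.group(0)))


fall_id = extract_identifier(
    "http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html"
)
print(fall_id)
```

Feed it a URL with no UUID-shaped fragment in the path and it returns None, so it won't mistake a locally minted identifier for a shared one.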

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became http://www.bbc.co.uk/music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn't exist anywhere outside of /music. We'll (hopefully) never make that mistake again.

When we first partnered with MusicBrainz the big attraction was 2 fold:

So when the next version of /music goes live you'll see: http://www.bbc.co.uk/music/artists/d5da1841-9bc8-4813-9f89-11098090148e and the world will hopefully be a slightly better place.

Now I can already hear my old mentor saying:

Michael noooo! URIs are just identifiers for resources. They shouldn't reflect the taxonomy of the site. The resource should define its relationships to other resources not the URI. Call them anything you like but just keep them stable.

With which I also mostly agree but - if bbc.co.uk/programmes tagged content with the same vocabulary as bbc.co.uk/news we'd be able to cross promote news stories from programmes and programmes from news stories by sharing APIs not databases. Tie this into personalisation and the power goes logarithmic. Read six articles on reconstruction in Iraq? Then you might like this Panorama programme.

But if the vocabulary used to tag programmes and news was web-scale then The Times, The New York Times, Fox News etc (or someone in between) could start to aggregate stories around a shared sense of topic. This is what Chris' recent post on using wikipedia / dbpedia as a controlled vocabulary begins to hint at. It's like Yahoo! Term Extraction or Open Calais except the terms returned are web native or web-scale identifiers if you will.

So what's the practical benefit: well because the new /music URLs will be based on MusicBrainz identifiers and because /music will be interlinked with /programmes and because the Last.fm API speaks in MusicBrainz identifiers Patrick can spend a weekend at Mashed making something that takes your Last.fm user name, extracts your favourite artists, ties them to /music and recommends BBC programmes. Which is a pretty good hack.
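The middle step of that hack is almost embarrassingly small, which is rather the point. Here's a rough sketch, not Patrick's actual code: it assumes you've already fetched a list of favourite artists, each carrying a MusicBrainz identifier under a key called `mbid`, and it assumes the /music URL pattern quoted earlier in this post. The sample records and the function name `bbc_music_urls` are hypothetical, and the programme recommendation step isn't shown.

```python
# Assumption: the new /music artist URLs follow the pattern quoted in this post.
BBC_ARTIST_URL = "http://www.bbc.co.uk/music/artists/{mbid}"


def bbc_music_urls(favourite_artists):
    """Map artist records to /music URLs, skipping any record without an identifier."""
    urls = {}
    for artist in favourite_artists:
        mbid = artist.get("mbid")
        if mbid:
            urls[artist["name"]] = BBC_ARTIST_URL.format(mbid=mbid)
    return urls


# Hypothetical sample data standing in for a listener's favourite artists.
sample_artists = [
    {"name": "The Fall", "mbid": "d5da1841-9bc8-4813-9f89-11098090148e"},
    {"name": "An artist with no identifier", "mbid": ""},
]

music_urls = bbc_music_urls(sample_artists)
for name, url in music_urls.items():
    print(name, url)
```

No lookup table and no fuzzy name matching: because both ends speak the same identifier, the join is a string format.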

Taking another example for those who wish to stalk Tom Scott. His blog is at derivadow.com which is also his OpenID, you'll find his delicious account at del.icio.us/derivadow, his tweets at twitter.com/derivadow and if you want to hire him he's at www.linkedin.com/in/derivadow on LinkedIn. So derivadow is a web-scale identifier for Tom. It's not as strong or as powerful as a set of RDF linked URIs but if you wanna aggregate Tom-ness it's a pretty good starting point. Sadly I can't find him anywhere on Last.fm but that's possibly a godsend.
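If you did want to aggregate Tom-ness, the starting point might look something like the sketch below. The site list is just the one from the paragraph above, the `http://` scheme is my assumption, and `profile_urls` is an invented name. It only builds candidate URLs; it doesn't check that the same person is actually behind each one.

```python
# URL templates for the sites mentioned above (the http:// scheme is an assumption).
PROFILE_TEMPLATES = {
    "blog": "http://{handle}.com",
    "delicious": "http://del.icio.us/{handle}",
    "twitter": "http://twitter.com/{handle}",
    "linkedin": "http://www.linkedin.com/in/{handle}",
}


def profile_urls(handle):
    """Build candidate profile URLs for one handle across several sites."""
    return {
        site: template.format(handle=handle)
        for site, template in PROFILE_TEMPLATES.items()
    }


for site, url in profile_urls("derivadow").items():
    print(site, url)
```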

The obvious question is if web-scale identifiers are so good why did the BBC mint its own for programmes? After all the b00c4wxm used in /programmes and iPlayer is a BBC invention. And the answer is there were no suitable identifiers out there. I'd like to think that if Program(me)Brainz existed with stable identifiers we'd have put in the work to use those instead. But it didn't so we couldn't... But now we have stable identifiers out there on the web free to use for anyone. It would be good for example to see these identifiers adopted by Speechification. Time will tell.

One argument against all this is that web-scale identifiers are often kinda ugly. After all if Last.fm gets away with www.last.fm/music/The+Fall why do we need d5da1841-9bc8-4813-9f89-11098090148e. The answer is ambiguity. MusicBrainz has 16 Auroras. Which one(s) does the BBC play? Probably none actually but you get the point. If we want to be exact in what we point to we need to handle ambiguity. In general we follow 3 commandments:

  1. URLs should be human readable
  2. URLs should be hackable
  3. URLs should persistently point to one concept

And the greatest of these is persistence. If you can't maintain stable URLs per concept don't even bother with 1 and 2. There are others that argue that URLs are part of the interface. If resolving ambiguity is not important to your business then I'd agree but if you need to differentiate stuff with the same label you need unique identifiers - better yet web-scale identifiers.
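Here's the ambiguity problem in miniature. This is a toy sketch with placeholder identifiers (they are not real MusicBrainz IDs): key your records on the label and artists sharing a name quietly collapse into one; key them on the identifier and they stay distinct.

```python
# Placeholder identifiers for illustration only; two different artists share a label.
artists = [
    ("placeholder-id-1", "Aurora"),
    ("placeholder-id-2", "Aurora"),
    ("placeholder-id-3", "The Fall"),
]

by_label = {}
by_identifier = {}
for identifier, label in artists:
    # Keyed on the label: the second "Aurora" overwrites the first.
    by_label[label] = identifier
    # Keyed on the identifier: every artist keeps its own entry.
    by_identifier[identifier] = label

print(len(by_label))       # 2
print(len(by_identifier))  # 3
```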

Now I guess the Linked Data people would say do this properly in RDF with owl:sameAs etc and we will do. But for hackers without PhDs the possibility of instant interoperability and quick mesh-ups is irresistible. Obviously you'll still need to establish equivalency between this and this but luckily that's where the Linking Open Data people have done some of our work for us. And they're damn nice people to boot.

So I guess what I'm saying echoes Tom. Cleverer people than us have come up with ways to attach web-scale identifiers to content so why waste time reinventing. Whilst the BBC or *insert your organisation here* should own their data (whilst hopefully making it free - as in beer; as in speech) we don't have to own our identifiers. If we choose to use the power of web-scale identifiers we free our content to fly and leave it to other people to add value / make money in the middle. It's not exactly profound but it does feel like a small breakthrough to an aging BBC employee.

Comments

  • Comment number 1.

    But of course, your programme identifiers (which humorously have recently all started with b00b!) aren't web-scale. Nobody else (e.g. ITV, Channel 4, HBO etc) can use your identifiers without risk of overlap.

    Perhaps the BBC should start a Program(me)Brainz system itself, or sponsor MusicBrainz to do so. You need someone who can be responsible for the identifiers and ensuring that two organisations don't use the same identifiers for different program(me)s...

  • Comment number 2.

    I actually believe that the 3 Asimov rules, I mean semantic standards, are of decreasing importance:

    1. URLs should be human readable
    2. URLs should be hackable
    3. URLs should persistently point to one concept

    If 1 and 3 are not both possible, then 1 takes priority.

    Wikipedia does this quite well. If we use a similarly musical example, 'Queen' by itself will bring up a disambiguation list. 'Queen (Band)' will bring up the artist and 'Queen (Chess)' will bring up the playing piece. I know where I'm going to end up. If someone told me article 'sdkjh221' was about the band, there's no way to know until I visit it.

    And let's say there were 2 bands called 'Queen' one would be 'Queen (1980s)' and the other would be 'Queen (2000s)'.

    My lecturer in my database normalisation classes always made sure we understood that you rarely need to assign a random identity to an object. Unfortunately real life doesn't work out that way, as you have found. But experience has taught me that the manipulation of the original unique identifier is better than making up your own.

    Whilst inter-site compatibility is very important and a very useful tool, without some sort of consensus, there will be anarchy. Eventually the strongest will survive, and I'm willing to bet the ones that will survive are the ones that people can read.

    I hope this makes some sense. It's a very abstract world and my explanation isn't very conclusive, certainly anecdotal. But I hope you can see that there are solutions out there that work, you just need a bit of imagination.

  • Comment number 3.

    hey ed, i think you (and maybe michael originally?) have misinterpreted what the Beeb's programme episode identifiers actually are?

    the unique identifier isn't: b00c53g4...

    it's: http://www.bbc.co.uk/programmes/b00c53g4

    thus it's web-scale and guaranteed unique.

    i think we're getting confused because the MusicBrainz IDs, both internal and external, are GUIDs...

    but URIs are just as unique, and arguably more web-scale, than GUIDs...

    "Nobody else (e.g. ITV, Channel 4, HBO etc) can use your identifiers without risk of overlap." -- yes they can, just keep the BBC namespace at the beginning. the whole URI is the identifier.

    mattcopp, i agree with you that Wikipedia has URLs to die for. luckily and fortuitously for them, their "business" process ensures this: consider the workflow of how wiki pages are created and named...

    but disagree with: "you just need a bit of imagination" -- imagination is at a surplus, believe me... what you actually need is a wiki-style URL naming scheme enforced inside an organisation, like Wikipedia lucked into, and so far the BBC hasn't been able to enforce a wiki-style unique naming scheme to its programme titles, nor has it been able to curate disambiguation pages (or afford the server load to automate such beasts -- and automated naming conventions lead to abstract IDs anyway... "let's say there were 2 bands called 'Queen' one would be 'Queen (1980s)' and the other would be 'Queen (2000s)'." what if there were 2 Queens in the 1980s? what would we call the 2nd? whatever we called it, a human would need to be involved, or else an automated process would start to name things in less-than-human-readable ways: Queen1, Queen2, Queen3, etc?)...

    nor has MusicBrainz, for that matter, thus the GUIDs...

    "let's say there were 2 bands called 'Queen' one would be 'Queen (1980s)' and the other would be 'Queen (2000s)'." -- fine, but some human being actually needs to name these things in that way by hand, or else some pretty clever software must do so -- but then that would depend on human-entered metadata in the system somewhere in the chain, no?

    plus many "episodes" simply don't have titles at all, much less unique ones.

    anyway, human-readable is a delight. persistent is a drop-dead requirement. Wikipedia lucked into both, because it has lots of free labor that literally has to give each page a unique name, by hand.

  • Comment number 4.

    @mattcopp + @onpause Wikipedia's URLs are lovely to look at, nice-ish to type and swallowed whole by google but they're not persistent. They move as concepts are disambiguated. Somewhere in my head lurks a figure of 5% but since I don't know if that's per month, per year or forever it's a pretty useless figure. But even if that's the total movement over all time given the size of wikipedia that's a fair bit of shift. And obviously cool URIs don't change:

    http://www.w3.org/Provider/Style/URI

  • Comment number 5.

    @ed + @onpause I suspect I've made myself misunderstood - I was certainly not claiming that b00c4wxm is globally unique.

    The only true web-scale identifiers are URLs. They're how we uniquely identify and locate resources. I think we're all agreed there? But that's not quite what I'm talking about here.

    In retrospect maybe this post should have been called 'Approximations of web-scale identifiers' but that's a bit of a mouthful. The bit I'm interested in is the fragment of the URL that's *almost* good enough to carry identity. So in the Panorama episode example a Google search for b00c4wxm returns 14 results and they're all about that episode on /programmes and iPlayer. Now we know that that ID isn't globally unique and it's certainly not as powerful as a URL but it clearly carries some degree of unambiguous meaning.

    The same with derivadow. Again not globally unique. We could all register on a social network that Tom's not joined yet (cough) as derivadow. Now a google search for derivadow gives a fair few results but all the ones I'm seeing are our Tom. A search for 'Tom Scott' on the other hand brings back judges, musicians and mystery buyers. So while neither b00c4wxm (which doesn't start with b00b) nor derivadow are true web-scale identifiers they are pretty damn good approximations.

  • Comment number 6.

    I should be able to type:

    www.bbc.co.uk/programmes/rubbadubbers

    and not have to remember:

    www.bbc.co.uk/programmes/b0072hw8

    If there is another programme called rubbadubbers then there should be disambiguation info at the readable url.

    The BBC partly encourages hackable urls. Why not extend that power to programmes and customers?

  • Comment number 7.

    @ritchielee there's a task on our board to do just that. Should be there soon. In the meantime:

    http://www.bbc.co.uk/programmes/a-z/rubbadubbers

    does the job

  • Comment number 8.

    yes, you are right, academic research confirms 5% drift over Wikipedia's lifetime.

    you know me, though, always looking on the bright side: i say hooray, 95% remained persistent! ... ;-)

    i'm sure MusicBrainz fares even better than 95%, but surely not 100% persistence?

    plus, Wikipedia *URLs* actually remain persistent, as in dereferenceable -- almost no 404s... it's the *resource* "behind them" that changed, which is actually what you are getting at, no? i agree that this is unfortunate, but they sometimes surmount this via redirects and implied redirects via disambiguation pages. but it's inconsistent and non-machine readable, alas.

    Wikipedia has changed the *meaning* of 5% of its persistent URLs. so they still point to something, but what they point to has changed.

    much thanks, as that's an important point to make, no pun intended.

  • Comment number 9.

    I find the Wikipedia example interesting for two reasons that have nothing to do with each other:

    1. The disambiguation aspect has been part of HTTP for a long time now via the 300 status code. Unfortunately because the main entry point to the web has been the browser for years now those kind of features have never really been implemented. They'd work well for automated processing.

    2. English speakers tend to forget that the rest of the world has to endure ugly URLs no matter what, due to encoding issues. So the audience for "readable URLs", although large, is limited in its scope.

  • Comment number 10.

    A follow-up, the 300 status code doesn't actually offer you a choice between multiple resources but between multiple representations of the same resource. That, of course, may not be what one wants.

  • Comment number 11.

    Ok, I find this an interesting topic. Much could be read and written about it, and is. Thinking of the internet in database form (or RESTful) is quite a handy analogy.

    I had to re-read your post to see what you were trying to achieve, Michael, and now I see the purpose of using unique identifiers.

    Musicians being creative types, database structure does not spring into their minds when they choose their names. That is why there are 16 very similar Auroras. Wikipedia and Last.fm do get very convoluted about that subject, especially Wikipedia as it deals with much more than just artists.

    Random and unique identifiers are quite a handy way of differentiating between them, and are really the only sensible way. The problems are the length and that they're not human readable, but this is not the BBC's fault; it's MusicBrainz who are trying to logicalize a very large and overlapping field.

    I understand why you use MusicBrainz and not Wikipedia, because MusicBrainz is a consistent source.

    I would ask though that when you are redirected to the artist page from the MusicBrainz style URL, you redirect the browser to the BBC's identifier for that artist. Rather than just dumping us at /artist as the ting tings link above does.

  • Comment number 12.

    > URLs should be human readable

    I really don't understand how you consider the following to be human readable:
    http://www.bbc.co.uk/music/artists/d5da1841-9bc8-4813-9f89-11098090148e

    URLs should certainly contain the name of the artist / song that it's describing.

    Otherwise the URL of this blog post should be:
    http://www.bbc.co.uk/blogs/radiolabs/2008/06/6fa26370-491f-11dd-ae16-0800200c9a66.shtml
    Right? Otherwise you could run into some instance where there are TWO posts with the same name of "the_simple_joys_of_webscale_id" !!!!

    Edge cases aren't worth trashing friendliness, hackability and usability.

    99% of the artists won't have collision problems. So why would you design a system for 1% of your data set?

