
In Search of Cultural Identifiers

Michael Smethurst | 10:49 UK time, Wednesday, 14 January 2009

Post updated following comments: thanks everyone

For Books....

A Rainbow of Books by Dawn Endico. Some rights reserved.

Late last year we got quite excited about Open Library. Using the open word always seems to tick our boxes. We chatted about the prospect of a comprehensive, coherent BBC books site heavily interlinked with BBC programmes. Every dramatisation of a novel, every poetry reading, every author interview and profile, every play linked to / from programmes. The prospect of new user journeys from programme episode to book to author to poem and back to episode still seems enticing. We started to wonder if we could use Open Library as the backbone of this new service in the same way we use MusicBrainz open data as the backbone of /music.

Unfortunately, when we looked more closely an obvious problem came to light. Open Library is based on Amazon book data and Amazon is based on products. Correction: Open Library is NOT based on Amazon data (see Tim's comment). For now it models books in a similar fashion to Amazon (as publications/products, not cultural artifacts). Open Library are looking to enhance this model to allow grouping of publications into works, which is fantastic news. If you can contribute code or knowledge I'd encourage you to do so. And the BBC isn't all that interested in products. Neither are users.

If I tell someone that I'm reading Crash they generally don't care if I'm reading this version or this version or this version. What's interesting isn't the product but the cultural artifact. It's the same story with programmes. Radio 7's David Copperfield isn't a dramatisation of this or this or this, it's a dramatisation of this - the abstract cultural artifact or work.

The problem is probably so obvious it hardly warrants a blog post but now I've started... Lots of websites exist to shift products. So when they're created the developers model products not looser cultural artifacts. And because the cultural artifact isn't modelled it doesn't have a URL, isn't aggregatable and can't be pointed at. As Tom Coates pointed out people use links to explain, disambiguate and clarify meaning. If something isn't given a URL it doesn't exist in the vocabulary of the web.

The problem is compounded by Amazon encouraging users to annotate its products with comments, tags and ratings. Why is the Penguin version of A Passage to India rated 5 stars whilst the Penguin Classics version is rated 3 stars? They're essentially the same thing, just differently packaged. Are users really judging the books by their covers? Anyway, it all leads to conversations which should be about cultural artifacts fragmenting into conversations about products. It also leads to a dilution of Google juice / PageRank as user attention gets split across these products.

I'm no library science expert but speaking to more library minded friends and colleagues it seems they use 3 levels of identification:

  • The Dewey Decimal system is used for general categorisation and classification.
  • The ISBN is used to identify a specific publication.
  • The bar code they scan when you take out a book is used to identify the individual physical item.

So there's something missing between the general classification schemas and the individual publication. Like Amazon, libraries have no means of identifying the abstract cultural artifact or work - only instantiations of that work in the form of publications. These publications map almost exactly to Amazon products, which is why we see 45 different Wide Sargasso Seas in Open Library.
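
The missing layer can be sketched in a few lines. This is only an illustration, not any real library or Open Library schema; the class names, ISBNs and publishers below are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Publication:
    """One ISBN-level product: what Amazon and library catalogues model."""
    isbn: str
    publisher: str

@dataclass
class Work:
    """The abstract cultural artifact: the thing people actually talk about."""
    title: str
    author: str
    publications: list = field(default_factory=list)

# One work, many products; the ISBNs here are made-up placeholders
work = Work("Wide Sargasso Sea", "Jean Rhys", [
    Publication("978-0-00-000001-1", "Penguin"),
    Publication("978-0-00-000002-8", "Penguin Classics"),
    Publication("978-0-00-000003-5", "W. W. Norton"),
])

# One canonical thing to point at, however many publications exist
print(work.title, len(work.publications))  # Wide Sargasso Sea 3
```

The point is simply that the work gets one identifier (and so one URL), with the publications hanging off it rather than standing in for it.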

So whilst Open Library's strapline is 'a page per book' (which feels strangely familiar), in reality it's a page per publication / product. It would be interesting to know if Open Library have any plans to allow users to group these publications into cultural artifacts. If they do then we'd really end up with one page per book and one canonical URL to identify it. Update: they do. This combination of open data and a model of interesting things is fantastic. At which point the prospect of links to and from BBC programmes (and Wikipedia / DBpedia) gets really interesting.

...and Music

So we've written in the past about using MusicBrainz as the data backbone for the new /music site. MusicBrainz models 3 main things (artists, releases and tracks) and provides web-scale identifiers for each. So why have we chosen to only expose artist pages? Why not a page per release or a page per track?

The problem is the same one as Amazon / Open Library. In the case of releases MusicBrainz models individual publications of a release. So instead of being able to identify and point to a single Rubber Soul you can point to this one or this one or this one. And in the case of tracks MusicBrainz really models audio signals. So this Wonderwall is different to this Wonderwall is different to this Wonderwall with no means of identifying them as the same song - for all we know they might have as much in common as Statement by Extreme Noise Terror and I Should Be So Lucky by Kylie. Which isn't a problem except that we want to say this programme played this song by this performer - which performance / mix is much less interesting. Same with reviews - most of the time we're not reviewing a publication but a cultural artifact.

So how do we get round this? We're currently working with MusicBrainz to implement the first part of its Next Generation Schema. This will allow users to group individual release publications into what we're calling cultural releases. So we'll have one Rubber Soul to point at. After that it's on to works and parts and acts and scenes and acts of composition etc with a single page and a single URL for each.
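
As a toy illustration of that grouping step (the publication ids below are invented, and the real Next Generation Schema work is far more involved - as noted above, real grouping needs human judgement, not string matching):

```python
from collections import defaultdict

# Invented stand-ins for MusicBrainz release publications: the same album
# appearing as three separate entries, each with its own id
publications = [
    {"id": "rel-001", "artist": "The Beatles", "title": "Rubber Soul", "country": "GB"},
    {"id": "rel-002", "artist": "The Beatles", "title": "Rubber Soul", "country": "US"},
    {"id": "rel-003", "artist": "The Beatles", "title": "Rubber Soul", "country": "JP"},
]

# Naive grouping on a normalised (artist, title) key
cultural_releases = defaultdict(list)
for pub in publications:
    key = (pub["artist"].casefold(), pub["title"].casefold())
    cultural_releases[key].append(pub["id"])

# One Rubber Soul to point at, holding all three publications
print(len(cultural_releases))  # 1
```

Each cultural release then gets its own page and URL, with the individual publications listed beneath it.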

...and Programmes

Again the problem resurfaces in the world of programmes. Most of our internal production systems deal with media assets and these assets aren't always grouped into cultural artifacts. But people outside the BBC aren't really interested in assets. If your friend recommends an episode of Horizon you're unlikely to care if they mean the slightly edited version, the version with sign language or the version with subtitles. Most of the time people talk about the abstract, platonic ideal of the programme.

Way back in time when /programmes was still known as PIPs a design decision was made to model both cultural artifacts and instantiations. If you look at the /programmes schema you'll see both programmes (brands, series and episodes - the cultural artifact) and versions (the specific instantiation). When we talk about one page per programme what we're really talking about is one page per episode. What we're definitely not talking about is one page per version or one page per broadcast.
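
A rough sketch of that split, with invented identifiers (the real /programmes schema is much richer; the episode/version distinction is the point):

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    """A specific instantiation: signed, subtitled, edited for repeat..."""
    pid: str
    kind: str

@dataclass
class Episode:
    """The cultural artifact: the thing a friend actually recommends."""
    pid: str
    title: str
    versions: list = field(default_factory=list)

    def url(self) -> str:
        # One page per episode; versions and broadcasts never get their own
        return f"/programmes/{self.pid}"

episode = Episode("ep1", "Horizon: An Example Episode", [
    Version("v1", "original"),
    Version("v2", "signed"),
    Version("v3", "subtitled"),
])

print(episode.url())  # /programmes/ep1
```

Whichever version gets broadcast or played, links and conversations all accrete around the one episode URL.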

Getting this stuff right is really the first job in any web project. Identify the objects you want to talk about and model the relations between those objects. The key point is to ensure the things you model map to users' mental models of the world. User-centred design starts here, and if you choose to model and expose things that users can't easily comprehend no amount of product requirements or personas or storyboards will help you out.

For want of a better label we A&M types often refer to this work as 'cultural identifiers'. One identifier, one URL, one page per cultural artifact, all interlinked. It's something that Wikipedia does better than anyone. One page per concept, one concept per page. bbc.co.uk could be a much more pleasant place to be if we can build something similar for the BBC.

Comments

  • Comment number 1.

    I completely agree with your analysis of the problems with the otherwise excellent Open Library. Mind you, there is no automatic way I can see (short of user intervention) to differentiate instantiations of a book that are meaningfully different (second editions that have been extensively revised) from instances that are essentially the same (second editions that are just reprints of the first edition). Should book, radio and TV versions of HHGTTG be grouped or separate?

    I do think that Open Library should offer the ability to provide a "cultural artefact" metapage for each book though to help users to navigate.

    This problem is also common to book sites like http://www.goodreads.com/ - none seems to have solved (or even noticed) it!

    I hope that your raising this on a BBC site will get a reaction from some of the people running these sites.

  • Comment number 2.

    Very interesting. If you're free on Saturday you should come and discuss it at Bookcamp - http://bookcamp.pbwiki.com/ It sounds like the perfect topic.

  • Comment number 3.

    Agree, this is a big problem! (Which, as you say, Tom Coates identified a while back).

    You could consider using Wikipedia/DBPedia identifiers - they do identify the 'cultural artefacts' (eg http://en.wikipedia.org/wiki/Rubber_Soul and http://en.wikipedia.org/wiki/Crash_%281973_novel%29 ). Unfortunately, their coverage isn't that great (the notability thing means that not every book or album can be included) and the identifiers aren't necessarily that persistent (pages can be moved).

    Incidentally, as I understand it, ISBNs have a 1:1 relationship with barcodes (EANs)... Bookshops have to use simple text searching or proprietary databases to identify different versions of the same book. It's a real pain!

    LibraryThing might fit your needs though - see http://www.librarything.com/work/7140 - it's not quite clear how they model the data though.

  • Comment number 4.

    I second FrankieRoberto's comment - check out LibraryThing. From their website:

    "All LibraryThing books belong to a "work," a cross-user and cross-edition concept designed to improve social contact, recommendations and cataloging quality."

    On a different point - I think that on some specialised music shows listeners would be interested in the exact version of a song played, and that artist/title doesn't give enough detail.

  • Comment number 5.

    @david - agreed. i can't think of a way for machines to solve this problem. it has to come down to human intervention. maybe they'll offer a way for users to group publications. and as you say, it is in all other respects excellent

    @amcewen - would love to come along but the penguin blog seems to indicate it's fully booked

    @frankie - we are working with dbpedia people to provide a semweb compliant controlled vocabulary for the bbc - more soon

    @frankie + @ickbinberliner - LibraryThing is also very good and does appear to have solved this problem. but i'm not sure how their data is licensed. maybe i'm missing something but i'm only seeing copyright not creative commons... if i register and add data is that my data or LibraryThing's or open?

    @ickbinberliner - yes, mixes are important for specialist shows. i'm not saying u shouldn't model down to this level. just that it's important to be able to group at a more abstract level. also a lot of the music we play is specifically edited for radio or even specially mixed for a dj. so it's a completely different signal even from that played on other shows. you've basically got a song. which is performed, which is recorded, which is mixed, which is then FURTHER edited for playout. it's an extra level of detail that stretches even the music ontology...

    Michael

  • Comment number 6.

  • Comment number 7.

    @setoub - that's very cool :)
    michael

  • Comment number 8.

    Hey. This is the founder of LibraryThing. A couple of points:

    1. Open Library is NOT based on Amazon data. The core of the data is library data (from all over). I believe there is some Amazon data on top of that, but it's not the core and they are not a stand-in for Amazon. They are a non-profit, and insofar as they are trying to come up with a site you go to first about books, they are competing with Amazon.

    2. Both libraries and booksellers are stuck in the business of managing inventory. That's how both end up with 20 different pages for what is essentially one "work." Every edition has a different ISBN, LCCN or other identifier. Cross-edition linking is minimal.

    3. Booksellers, publishers and libraries have been thinking about this problem for some time, coming up with different answers. Librarians are working on the "FRBR" system--something Open Library is experimenting with. The book industry has come up with the International Standard Text Code. It's early in the game, though.

    4. LibraryThing's "work" system is based on the "cocktail party test." Basically, if you've read _The Unbearable Lightness of Being_ and you're at a party where the attractive woman in the backless dress mentions she loves the book, the edition is completely immaterial. So too it would be immaterial if she read it in Czech and you read it in Finnish. Our system isn't perfect, but it gets away from the idea that books are about physical objects. As the world goes increasingly digital and the digital goes social that's going to look increasingly like the way to go.

    By the way, LibraryThing has an API to our work system. Send it an ISBN and it gives you the others. It's free for non-profit use. If the BBC wanted to use it for free, we'd be honored to. Send me a note?

  • Comment number 9.

    "If I tell someone that I'm reading Crash they generally don't care if I'm reading this version or this version or this version. What's interesting isn't the product but the cultural artifact."

    I am left puzzled: what exactly is the 'cultural artifact'? Later on there is a model based on broadcasting... 'programmes schema you'll see both programmes (brands, series and episodes - the cultural artifact)'. This seems both a way of cataloguing 'artifacts' and imposing order on them... so 'that's the way broadcasters think' and is part of their Weltanschauung to help them order their world. But why should I be forced to do it that way? Wasn't hypertext supposed to free me from hierarchical and linear impositions? In the Nineteenth Century an exactly analogous schema could be applied to Dickens's serial novels in Home Words, but for whom is the schema an appropriate account of this cultural object? Try it with 'Sex and the City'. And to raise a further hare... I am sure at the cocktail party I would choose a suitable episode in Kundera's work to flirt with the lady in question. That for the moment would be my cultural artifact in the frame of reference of amorous cocktail conversation leading to seduction (linked to from Woman's Hour perhaps).
    A better approach is to understand the frames of reference of users, or even their language games, so they can engage with the material. One might do worse than to generate a search from the various message boards and other evidence of users' reactions and extend it.


  • Comment number 10.

    "One identifier, one URL, one page per cultural artifact, all interlinked. It's something that Wikipedia does better than anyone. One page per concept, one concept per page. "

    As you suggest, resources and the URLs that point at them probably belong at the BBC for stuff like an Episode of Horizon.

    For other 'artefacts', resources/URLs exist and belong at Wikipedia or elsewhere: Bleak House, The Archers, Eddy Grundy, John Peel, Rubber Soul, Norwegian Wood, even storylines.

    "bbc.co.uk could be a much more pleasant place to be if we can build something similar for the BBC"

    I think it's tricky where to draw the line. I guess the danger is reinventing the (Wikipedia/DBpedia) wheel.

  • Comment number 11.

    While it might be a bit esoteric I wrote a short post recently about how Freebase is also distinguishing between what FRBR calls Works and Manifestations.

    "Like Amazon, libraries have no means of identifying the abstract cultural artifact or work - only instantiations of that work in the form of publications."

    If you s/identifiers/web identifiers/ I think you are partially right. Libraries have long practiced Authority Control which provides the plumbing for grouping editions under what's called a Uniform Title. The identifiers in this case were the string headings that were derived from the work, not URIs.

    More recently the Functional Requirements for Bibliographic Records (FRBR) have informed efforts like Tim's at LibraryThing, and OCLC's Worldcat, to essentially mint URIs for works like: http://www.worldcat.org/oclc/24630204 and http://www.librarything.com/work/27239.

    My hunch is that if OCLC and LibraryThing returned machine readable data from those URIs, or at least let clients autodiscover the data, we would be in a better linked-data-library world.

    OpenLibrary at least seem to get it.

  • Comment number 12.

    Great piece, and an interesting problem.

    Of course, the appropriate level to discuss at depends on who is discussing. For a broad audience like the BBC's, the work may be the most useful, but for academics discussing the differences between different editions of a Shakespeare play, the manifestation (in FRBR terms), and often the specific physical copy, is relevant.

    I wrote up some research on this a while ago: http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf

  • Comment number 13.

    Ed mentions Worldcat above. Here are some links that may be of interest, to Worldcat and to related resources.

    Worldcat is a union catalog containing over 120M records (for books, cds, maps, etc) which in turn represent over a billion items in libraries around the world. It is a destination in itself for research etc purposes and also links through to library collections.

    http://worldcat.org

    http://www.worldcat.org/oclc/2659341&tab=holdings?loc=se5+9ax#tabs

    (holdings we know about close to a camberwell postcode. better coverage in the US.)

    Managing similarity and difference ....

    This is a longtime discussion point. We know that people want access to *both* specific editions/copies/etc and to works.

    Here is a collection of editions/translations/etc:

    http://www.worldcat.org/oclc/24630204/editions

    Here is a specific one:

    http://www.worldcat.org/oclc/24630204

    xISBN ....

    Give it an ISBN and get back ISBNs from the same workset. Some other related identifier services are also available here.

    http://xisbn.worldcat.org/xisbnadmin/index.htm

    Identities ...

    Pull together what we know about an author/contributor.

    http://www.worldcat.org/identities/lccn-n80-13225

    (look at the source)

    FRBR ...

    Some background materials and exploratory projects showing principles in action at:

    http://www.oclc.org/research/projects/frbr/default.htm

    API, etc ...

    http://www.worldcat.org/affiliate/default.jsp

    Let me know if you would like to talk to somebody about any of the above or related

    Lorcan

    http://orweblog.oclc.org





  • Comment number 14.

    The issue is more complicated in music... there is the work (the sheet music and lyrics); there can be many covers / versions of that, and each one that is conceptually different is catalogued with an ISWC.

    These works can all be performed by different people in different performances, each performance can have many recordings, and each recording can be edited. All of which are "editorially" different (and could have different ISRC codes).

    So what is your cultural entity? The work? But there can be re-arrangements of that. The title? But two completely different cultural entities can share the same title.

    Traditionally it's all been based on money, and the most abstract entity that money could be charged for was the work. All works were treated as being the same object type, even those that are versions of another work, because computers and links didn't exist, so finer distinctions were unnecessary.

    Also, works exist outside musical genre taxonomies because they can be performed in different styles.

    Anyway, if you go here:

    http://iswcnet.cisac.org/ISWCNET-MWI/confirmLogin.do

    you can have some fun with the ISWC database, which does track recordings and creators. Performers get listed but are not referenced. Try "bird on a wire"...

    The book world has it easy: people don't often trace the performances of book readings, or create book mash-ups (yet).

  • Comment number 15.

    Hi. I'm Karen Coyle and I've been working on the Open Library project for about a year now. Tim is right that OL is not solely based on Amazon data, although we do have a lot of data from Amazon. We started with the Library of Congress "Books All" (meaning all languages) file, added in records from other libraries (see: http://openlibrary.org/about/help), records from Amazon, and from a few publishers. We've got more data in the pipeline, so the site will continue to transform, depending on what data we receive.

    We attempt to merge the records so you don't see duplicates of the same edition. That doesn't always work because we may not have sufficient information in the metadata to make that determination. I like the LibraryThing solution of letting humans make those decisions where algorithms fail, and we've talked about how we could do that in OL.

    We have a plan to provide a work-level view, as well as a need to present a view that works well with copyright data (something like the FRBR expression level). I should point out that it's not just bookstores that focus on the "product" level -- that's also the level where libraries do their cataloging. It's not because the product is important to libraries, though, it's because that's what you have in hand when you are cataloging. If you are interested in how catalogers approach their task, the Dublin Core/RDA project has produced a set of cataloger scenarios at http://dublincore.org/dcmirdataskgroup/Scenarios. In particular, Scenario 10 describes a situation where the cataloger doesn't have enough information to know what the work really is. If catalogers were omniscient, bibliographic databases would be much better. Alas....

  • Comment number 16.

    @LibraryThingTim - my bad on the amazon thing. i saw the amazon mention, did some comparative searches which returned very similar results and assumed you were using amazon as the spine of your data. good to know you're not. good luck with the competition : ) i've updated the post to reflect.
    i've come across frbr in the Music Ontology which we use to publish linked data from the /music site. my desk mate Yves is far more of an expert but i appreciate the difficulties
    i like the "cocktail party test". there's no perfect model for this but i think your solution will get a long way there.
    i tried to mail you today but no response so far. would love to chat about this. usual first_name.last_name@bbc.co.uk if ur interested


    @cping500 - i think uv misunderstood. what we're trying DESPERATELY to avoid is to make a website that reflects "the way broadcasters think". broadcasters think in terms of assets and time slots. we put in a lot of effort (user testing, focus groups, conversations like this) to try to model how non-broadcasters think. we don't always get it right but that's our intention
    hypertext does free you from hierarchical and linear impositions but to make hypertext you need links and to make links you need relationships. unless we know the relationship between the penguin edition and the penguin classic edition we can't make those links
    we're not talking about "an appropriate account" - we're talking about documents which describe and uris which identify
    if you want to identify an episode from a kundera novel you also need to identify that it comes from the kundera novel to give context. but it doesn't /just/ come from the penguin edition...

    @samdutton - agreed. it would be stupid to try to replicate (wiki|db)pedia. it does a good enough job of describing a large part of the world. we just want to comprehensively cover *things the bbc is interested in* and we want to link all that stuff up. an example: we currently publish a page for every artist in musicbrainz. but we only link to those that the bbc has an interest in (those we've played or reviewed etc). because other artists aren't linked to the only way to find them is to hack the url which means google can't find them so they're not really on the web. we're not trying to steal page rank or become an encyclopedia of everything - just better represent and make findable the diversity of bbc (mainly broadcast) content

    @esummers - interesting post. the freebase work in this area is really cool.
    and yes machine readable open data would be fantastic. thanks for the link to the lod mailing list post. i'd missed that. open book data on the lod cloud would be totally, completely amazing. especially if the frbr work pays off. open library definitely *get it*

    @mmmmmrob - again you're right. the context of the discussion tends to emphasise either the work or the expression. and (very?) occasionally the manifestation. as you say the bbc is broad church so often the work is the thing we're most interested in. thanks for the link

    @isldlisld - thanks for the links. i'm liking worldcat. are u uk based? would be good to chat about this sometime. also how is the data licensed?

    @jetski - i'm not sure the issue is more complicated in music... isn't it just complicated everywhere? the music ontology maps music frbr style with works, performances, sounds, recordings and signals. maybe it's because i spend more time thinking about music than books but books definitely feel more complicated to me. i hope to never have to model the bible. or after a conversation with a colleague today hamlet. again thanks for the link - never seen inside there - very interesting

    @everyone - thanks for all the comments. it would make a fascinating pub chat :)

  • Comment number 17.

    In my own simple terms I would have to feel that one page per concept and one concept per page seems the simple solution. However in reality I can see that this can be unworkable, especially when factors such as link juice are considered.

    To throw a half concept out there (which must already exist in some way): is there not some sort of method, similar to social bookmarking, whereby concepts can be bunched together by similarities that are not obvious from just the single concept on the page? I suppose a tagging system could be used to bunch similar single concepts.

    I'm confused, and feel well out of my depth. Think I should stick to keywords and stuff!

    Thanks for listening,

    Jon
    SEO specialist


Copyright © 2015 BBC. The BBC is not responsible for the content of external sites. Read more.
