Back in 2012 we launched our World Service Archive prototype featuring over 70,000 programmes from 60 years of radio, all available on the web. We previously wrote about how we built it, and in this post we analyse the data gathered over the past two years from this crowdsourcing experiment.
The experiment investigated an alternative approach to publishing a large media archive on the web. Our system started by automatically deriving descriptive tags from the archive audio. We used the resulting data to publish the archive online as a public prototype and to bootstrap the search and discovery mechanisms. We then asked users of the prototype to validate, correct and augment these automatically derived tags.
We have now analysed the data gathered over two years and we believe we have shown that a combination of automated tagging algorithms and crowdsourcing can be used to publish a large archive like this quickly and efficiently. The full results are in our article published in the August issue of the Journal of Web Semantics and this blog post summarises what we have learnt.
Launched in July 2012 to a limited panel of World Service listeners, the prototype was opened to all in January 2013. As of March 2014 it had 3685 registered users, and all data and results in this post are as of that date. Because of content-rights restrictions, only registered users can browse, listen to or tag programmes in the prototype, which also means we can track activity and usage. The users include a mix of radio listeners, industry professionals and BBC staff.
Cumulative registrations over time. Spikes in this graph generally correspond to publicity for the prototype. We suspended new registrations from February to May 2013.
Of the 70,234 programme records published in the archive, 35,820 are listenable programmes and 34,414 have no audio available to users, either because of rights issues or because we do not hold the associated audio file. As of March 2014, 14,867 (41%) of the listenable programmes had been listened to and 1458 users (40%) had listened to at least one programme.
People using the prototype can search, browse and listen to archive programmes and they can choose to improve the existing metadata. There are five ways to do this: voting on whether tags are correct or not, adding new tags, selecting the associated image, identifying voices in the programme and directly editing the programme's title and synopsis.
In total there have been 67,082 tagging actions. 9248 listenable programmes (26%) have had at least one tagging action and 666 users (18%) have carried out a tagging activity.
|Action||Total||Tags||Programmes||Users|
|Tag voted up||27649||8209||8107||612|
|Tag voted down||24289||3755||4372||645|
Table 1 - totals of user tagging activity. The Total column is the total number of activities associated with a given action. Tags is the number of unique tags affected. Programmes is the number of unique programmes affected. Users is the number of unique users having performed a given action.
Table 2 - totals for other activities.
We track new speaker identities being added or confirmed by users, and use them to quantify how accurate our automatically constructed speaker aggregations are and how well our identity propagation algorithm works. We are evaluating the precision and recall of this feature as user edits accumulate; as of March 2014 it had a precision of 85.1% and a recall of 43.1% (see below for more on precision and recall).
Cumulative % of edits by number of users
Looking at each user's activity we see that one prolific user has done 32% of the total edits, ten people have done 70% of the edits and 10% of the users have done 98% of the work. Other crowdsourcing studies have shown similar results. For example the steve.museum experiment found that 0.7% of users contributed 20% of tags, and in the Library of Congress Flickr Commons experiment 40% of tags were added by only 10 users. But the fact that a person has at some point done an action isn't a particularly good indication of current activity. Instead, we use what Wikipedia defines as "Active users" - "registered users who have performed an action in the last 30 days". By this measure, in March 2014, we had 44 active users.
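The "active users" measure above amounts to a simple filter over an action log. A minimal sketch, assuming a hypothetical log of (user, timestamp) pairs rather than our actual data model:

```python
from datetime import datetime, timedelta

def active_users(actions, as_of, window_days=30):
    """Registered users who performed an action in the last `window_days` days.

    `actions` is an iterable of (user_id, timestamp) pairs.
    """
    cutoff = as_of - timedelta(days=window_days)
    return {user for user, ts in actions if ts >= cutoff}

# Hypothetical action log:
log = [
    ("alice", datetime(2014, 3, 10)),
    ("bob", datetime(2014, 1, 2)),
    ("alice", datetime(2013, 12, 25)),
]
print(active_users(log, as_of=datetime(2014, 3, 31)))  # {'alice'}
```

Unlike cumulative edit counts, this measure naturally decays: a user drops out of the set a month after their last action.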
What kind of tags?
We can try to classify the tags added or voted upon by users into broad categories: people, places or "other concepts". This is relatively easy to do using DBpedia.
|Type of tag added||Number||Percentage of total|
Table 3 - tags added by category
|Type of tag voted up||Number||Percentage of total|
Table 4 - tags voted up by category
|Type of tag voted down||Number||Percentage of total|
Table 5 - tags voted down by category
More specifically, the 20 most frequently added tags are Panel game (88 adds), Comedy (79), 30 minutes (78), Play of the Week (70), Just a Minute (69), Ian Messiter (59), Drama (41), William Franklyn (37), Patricia Hughes (34), London (33), Interview (31), African Theatre (31), Australia (30), Quotation (29), Family (27), 5 minutes (26), India (24), Fletcher (24), Clement Freud (24) and 60 minutes (23).
This list is dominated by tags related to a radio panel game, “Just a Minute” (e.g. Ian Messiter) and to radio drama (e.g. “Play of the Week” and “African Theatre”) - both genres where significant fan groups interacted with the prototype. And we believe the time-related tags (e.g. "60 minutes") were people tagging with the duration of the show, even though the matched Wikipedia concept was the US television news programme. Obviously, the programmes most likely to be tagged are highly dependent on the users of the prototype and on whether the programme was featured on the prototype's homepage.
The 20 tags most frequently voted down are: Doctors (TV series) (223 down votes), Black (English band) (210), Play (theatre) (206), Moon (film) (177), Brother (film) (175), James (band) (165), Game (food) (161), Smile (The Beach Boys album) (147), Queen (band) (145), Michael (archangel) (135), Greek (TV series) (129), Ride (band) (111), Joe (singer) (109), Royal Dick School of Veterinary Studies (109), Heroes (David Bowie song) (106), Madness (band) (104), Hole (band) (99), Subjectivity (98), Bottom (TV series) (98) and Prince (musician) (97).
Most of these seem to be incorrect machine-generated tags pertaining to cultural artefacts (bands, albums, TV series etc.) named after very common words. We could use this feedback to refine the ranking generated by our automated tagging algorithm.
It also appears that disputed tags (i.e. with a similar number of up and down votes from users) are not disputed because of their accuracy (the disputed topics will usually all be mentioned during the course of the programme), but because of their relevance to describing the full programme. Is “Satellite” an appropriate tag for describing a documentary about the 1969 moon landing?
User-validated tags clustered by episode co-occurrence
How good are the tags?
It is difficult to evaluate this as the accuracy and usefulness of tags are pretty subjective, but we've tried!
Initially we asked 13 professional BBC archivists to independently tag a set of 95 programmes that had the most user activity. They weren't shown the existing tags (neither the machine-generated nor the user-generated ones) and started from a blank slate. We used this as a "ground truth" to evaluate the machine-generated and crowd-sourced tags as they evolved over time.
Table 6 - precision and recall of tagging over time.
Note on precision and recall: these are frequently used measures in information retrieval. In this case, precision is the percentage of the created tags that are found in the ground truth (as generated by the archivists) and recall is the percentage of the ground truth tags that are found in the created tags. More detail here.
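As a worked example, precision and recall over tag sets can be computed like this (the tag sets below are hypothetical, not taken from the archivists' ground truth):

```python
def precision_recall(created_tags, ground_truth):
    """Precision and recall of a set of created tags against a ground truth set."""
    created, truth = set(created_tags), set(ground_truth)
    matched = created & truth
    precision = len(matched) / len(created) if created else 0.0
    recall = len(matched) / len(truth) if truth else 0.0
    return precision, recall

p, r = precision_recall(
    {"Just a Minute", "Panel game", "Comedy", "Queen (band)"},
    {"Just a Minute", "Panel game", "Comedy", "Ian Messiter"},
)
print(p, r)  # 0.75 0.75
```

Note the trade-off the two numbers capture: an aggressive tagger that adds everything scores high recall but low precision, while a cautious one does the opposite.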
In August 2012 only the machine-generated tags are considered. By September 2013 the crowdsourcing experiment had been running for a year, and by March 2014 for 18 months, and we can evaluate the precision and recall at each point in time. We can see that as more users contribute data to the prototype, the precision and recall of the resulting tagging have improved.
Another way of evaluating the success of the tagging is to study the number of tag votes and tag addition over time, but normalise them by total user activity at that time so we can see how the proportion of each activity changes.
Normalised tag additions, up votes and down votes over time.
Over time we would hope that the overall quality of tags would improve and therefore we'd expect to see the up votes trend upwards (people would tend to agree with the tags more often as the tag quality improves), down votes trend downwards (as people see fewer tags to disagree with) and additions trend down (as the list of existing tags becomes more comprehensive). It's pretty noisy but looking at the above graph, up votes are trending upwards and down votes are trending downwards as hoped, but additions are fairly constant. This may be because the tagging space is so large there is always room for adding new tags.
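The normalisation step described above can be sketched with hypothetical monthly counts (our real data is bucketed differently, but the idea is the same):

```python
def normalise_activity(counts_by_month):
    """Convert raw monthly counts of each action into shares of that month's
    total activity, so trends stay comparable as overall usage rises and falls.

    counts_by_month: {month: {"add": n, "up": n, "down": n}}
    """
    shares = {}
    for month, counts in counts_by_month.items():
        total = sum(counts.values())
        shares[month] = {action: n / total if total else 0.0
                         for action, n in counts.items()}
    return shares

print(normalise_activity({"2013-09": {"add": 30, "up": 50, "down": 20}}))
```

Without this normalisation, a publicity spike that doubles overall activity would make every action type appear to trend upwards at once.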
We’ve noticed certain groups of people using the prototype, particularly a large community of radio drama enthusiasts and fans of the panel game, “Just a Minute”. Some users have even sent in four programmes that they had recorded off-air at the time of broadcast and that were missing from our archive.
The approach to crowdsourcing in this experiment is similar to that found on Wikipedia. The prototype is usable as just a reference resource, but with relatively prominent crowdsourcing features. It is less similar to more task-based crowdsourcing projects, such as GalaxyZoo, which are generally designed to enable people to do the work as efficiently (and enjoyably) as possible. It would be a valuable experiment to try a comparable task-focused approach to the crowdsourcing in our prototype.
We also use two crowdsourcing interface paradigms in the prototype. There is a wiki-like “last edit wins” approach for the synopsis editing and an aggregated voting approach for the tags.
There are relatively few other examples of crowdsourcing metadata for linear audio-visual media. There is the Waisda video labelling game, a “game with a purpose” from the Netherlands Institute for Sound and Vision which was applied to a number of long-form and short-form videos. And similar approaches are found in PopVideo, a project from Luis von Ahn at Carnegie Mellon University and the Yahoo! Video Tag Game from Yahoo! Research. It’s worth noting that these video tagging games often lead to tags that describe who and what is seen in the video, rather than identifying higher-level concepts such as “Barack Obama talking about the recent events in Syria”.
We are unaware of any other examples of crowdsourcing at scale using Linked Data tags, and we believe that using Linked Data in user- and machine-tagging has major benefits; for linking internal systems together, for ensuring tags are unambiguous, for linking out to other places on the web and for supplementing data by pulling from other linked data sources.
We are investigating automatically identifying speakers through projects like the "Speakerthon" event, which should eventually lead to an openly licensed database of voices of famous people. We want to investigate grouping topics into single events, as defined in the Storyline ontology, which would enable more precise event-based discovery within the archive. We are also working on automated programme segmentation as many programmes are fairly long and tackle multiple topics. Finally, we have started work on a platform for sharing our audio tagging tools and cloud-based processing framework with other content owners outside of the BBC.
This prototype is still live for now and we are working on using the data gathered to help publish the World Service radio archive permanently on the BBC's website.