Notes from the WWW 2012 conference
Last week I attended the International World Wide Web conference in Lyon, France. This conference is probably the largest one in that space: around 2,500 participants and 15 parallel tracks. I gave two presentations:
Automated Semantic Tagging of Speech Audio in the demo track, which focuses on the various tools we built to process very large archives with this algorithm, and on applications we built with MetaBroadcast using the resulting tags (slides).
I also contributed to a panel with Peter Mika from Yahoo! Research, Ivan Herman from the W3C, and Sir Tim Berners-Lee from MIT/W3C. The panel was entitled 'Microdata, RDFa, Web APIs, Linked Data: Competing or Complementary?' and looked at publishing statistics for structured data extracted from the Web Data Commons dataset and from a Yahoo! dataset, to try to understand which formats were used and for which use-cases. One of the main messages from this panel is that structured web data is already mainstream - Yahoo! reports that 25% of all web pages contain RDFa data and 7% contain Microdata.
From left to right, Peter Mika, Yves Raimond, Ivan Herman, Tim Berners-Lee (c) Inria / picture T. Fournier
I thought I would write my notes from the conference. Of course, I wasn't able to see everything so the selection of papers below just reflects the presentations I attended. Given the general quality of the papers, I strongly suggest going through the online proceedings.
Linked Data on the Web workshop
I spent the first day of the conference in the Linked Data on the Web workshop. Some personal highlights were the following papers:
- NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. As more and more online services for Named Entity Recognition are available, the NERD framework attempts to align them to provide a unified way of accessing their results as well as a way to compare them. It looks like most of them perform well in particular domains, and perhaps the best results could be obtained by combining several of them.
- Towards Interoperable Provenance Publication on the Linked Data Web. This position paper describes how the work done by the W3C Provenance Working Group could be used to express provenance as Linked Data. One interesting aspect was the application of 'follow-your-nose' principles to provenance data. Some data could be marked as derived from another set of data, identified by a URI. Dereferencing that URI would in turn yield further derivation information, ultimately leading to a full provenance trail for any derived data. This would be very useful for scientific datasets, but also for news articles, weather reports, etc.
- Using read/write Linked Data for Application Integration -- Towards a Linked Data Basic Profile. This paper introduces the Linked Data profile W3C member submission, defining a read-write Linked Data architecture, apparently already in use in some IBM products.
- Interacting with the Web of Data through a Web of Inter-connected Lenses. This paper introduces Mashpoint, a framework for pivoting selections of data (e.g. a list of countries) between data visualisation web applications. Mashpoint looks like a very promising tool for data journalism.
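The 'follow-your-nose' idea from the provenance paper above is easy to sketch: each derived resource points at its sources, and repeatedly dereferencing those links yields the full derivation trail. A minimal, purely illustrative Python sketch, where dereferencing is simulated by a dictionary lookup (in practice each step would be an HTTP GET returning RDF with prov:wasDerivedFrom links; all URIs below are invented):

```python
# Simulated Linked Data: each URI maps to the URIs it was derived from.
# In a real deployment this mapping would come from dereferencing each
# URI and reading its prov:wasDerivedFrom statements.
DERIVED_FROM = {
    "http://example.org/report/2012": ["http://example.org/dataset/b"],
    "http://example.org/dataset/b": ["http://example.org/dataset/a"],
    "http://example.org/dataset/a": [],  # a primary source
}

def provenance_trail(uri, seen=None):
    """Follow derivation links until we reach primary sources."""
    seen = set() if seen is None else seen
    if uri in seen:  # guard against cycles in the derivation graph
        return []
    seen.add(uri)
    trail = [uri]
    for source in DERIVED_FROM.get(uri, []):
        trail.extend(provenance_trail(source, seen))
    return trail

print(provenance_trail("http://example.org/report/2012"))
```

The cycle guard matters on the open Web, where nothing prevents two datasets from (incorrectly) claiming derivation from each other.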
AdMIRe and PhiloWeb workshops
On the second day I attended the AdMIRe workshop and the end of the PhiloWeb workshop. The former focused on advances in Music Information Retrieval, while the latter focused on the intersection of Web Science and Philosophy.
Music Retagging Using Label Propagation and Robust Principal Component Analysis. This paper uses content-based similarities between musical tracks to improve the quality of user tags on those tracks.
Melody, bassline and harmony representations for music version identification. This paper compares and combines a number of content-based features for the task of identifying different versions of the same musical work.
Power-Law Distribution in Encoded MFCC Frames of Speech, Music, and Environmental Sound Signals. This paper was particularly interesting in that it dealt with sound classification based on Mel-frequency cepstral coefficients (MFCCs), which we have used at BBC R&D for a couple of projects. Most sound similarity metrics using aggregates of MFCCs assume that their distribution is homogeneous. However, for a wide range of sounds the distribution of encoded MFCC frames fits a shifted power-law distribution, which means that very few selected frames can be used to obtain similar performance. Perhaps similarity measures which do not assume homogeneity could help take such biases towards particular combinations of coefficients into account?
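The practical consequence of a power-law distribution over encoded frames can be shown with a toy experiment (synthetic data, not the paper's method): if quantised MFCC 'codewords' follow a Zipf-like distribution, a small handful of codewords covers most of the frames, so a signature built from only those codewords loses very little:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate codeword indices for 10,000 frames drawn from a skewed
# (Zipf-like) distribution over a 100-word codebook.
codewords = rng.zipf(a=2.0, size=10_000) % 100

full_hist = np.bincount(codewords, minlength=100) / codewords.size

# Keep only frames whose codeword is among the 10 most frequent.
top = np.argsort(full_hist)[-10:]
kept = codewords[np.isin(codewords, top)]
print(f"{kept.size / codewords.size:.0%} of frames fall in the top 10 codewords")
```

Under a homogeneous (uniform) distribution the top 10 of 100 codewords would only cover about 10% of the frames; the skew is what makes aggressive frame selection viable.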
Xavier Serra's keynote described the CompMusic project. Two particularly interesting aspects of the project are that it focuses solely on non-Western music and that it contributes directly to making Musicbrainz better, a bit like what we do for the BBC Music website.
The Million Song Dataset Challenge. This paper describes a very large-scale dataset for the evaluation of music recommendation algorithms, providing a wide range of data about a million songs.
Main conference - day 1
The main conference started on the Wednesday with a very inspiring keynote by Tim Berners-Lee. He tackled a number of very interesting topics, such as the 'principle of least power' when designing new languages, the need for open mobile web applications and the issues around hierarchical systems such as DNS and PKI. He finished his keynote by talking about what he called the 'three sides of privacy': personal data held by businesses, personal data leaks (and the so-called 'jigsaw effect') and privacy invasion (e.g. through Deep Packet Inspection). He concluded by asking the audience to spend 90% of their time building new things, but 10% of their time protecting the open Web infrastructure and information accountability.
I attended the demo sessions all afternoon, where I was presenting our automated tagging framework. The Google Art Project gave the keynote of this session, describing the work they have been doing capturing a number of artworks from an international selection of museums. They demonstrated the ability to look at specific parts of artworks in detail, their 'street view' for museums, and the creation of personal collections of artworks. They also mentioned that an API to access the data will be released - we'll certainly keep an eye out for that! Rai also presented their personalised newscasts use-case within the NoTube project in the same session, as well as some archive-related work aiming to help journalists find news-related information in their archive.
Main conference - day 2
Thursday started with a keynote from Chris Welty (IBM Research), who was part of the team behind IBM Watson, which won the Jeopardy! quiz programme last year. A part of his keynote was spent describing the approach used for Watson, which is quite different from the traditional approach to automated question-answering, in which a question is translated into some formal language and the resulting query is executed on a large knowledge base. Watson never tries to understand the 'meaning' of the questions. Rather, it finds documents that could hold the answer and scores them along lots of dimensions. Then, it learns the best combination of those scores based on previous Jeopardy! games. Semantic technologies in Watson are used just for some of these scores, not as a goal in themselves. They are nonetheless an important tool, bringing a 10% performance boost.
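The combination step described in the keynote can be reduced to a drastically simplified sketch: each candidate answer gets a score from several independent evidence scorers, and a learned weight vector ranks the candidates. Here the weights are hand-picked stand-ins (Watson learned them from past Jeopardy! games), and all candidates and scores are invented for illustration:

```python
import numpy as np

# Rows = candidate answers, cols = evidence dimensions
# (e.g. passage match, type match, popularity).
candidates = ["Lyon", "Paris", "Marseille"]
scores = np.array([[0.7, 0.9, 0.3],
                   [0.6, 0.2, 0.9],
                   [0.4, 0.5, 0.5]])

# Stand-in for the weights Watson learns from previous games.
weights = np.array([0.5, 0.4, 0.1])

combined = scores @ weights
best = candidates[int(np.argmax(combined))]
print(best)
```

The key design point is that no single scorer needs to be right on its own; the learned combination decides how much each piece of evidence can be trusted.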
This keynote was followed by a panel on the open Web, introduced by a keynote from Neelie Kroes of the European Commission. The panel was very good, with a lot of controversial questions being tackled, such as the HADOPI law in France.
In the afternoon I attended the Entity Linking session. The LINDEN framework was presented first, describing a Named Entity Recognition technique using YAGO concepts as target identifiers. Candidate entities are generated, then disambiguated using a number of features, e.g. link probability (estimated using count information in the dictionary), semantic associativity (using the Wikipedia hyperlink structure), semantic similarity (derived from the YAGO taxonomy) and the topical coherence of a document around the candidate entity. The approach was interesting, but the paper suggests that a big part of the algorithm relies on concepts extracted by Wikipedia-Miner, which provide context for the disambiguation. It wasn't clear how LINDEN compares with that tool and whether it actually improves on the results first obtained by Wikipedia-Miner.
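The first of those features, link probability, is simple enough to sketch: it is the fraction of occurrences of a surface form that appear as link anchor text, estimated from dictionary counts. A hypothetical illustration (the counts below are invented) of why it separates entity-like phrases from ordinary words:

```python
# (surface form) -> (times seen as a link anchor, times seen in total)
# Invented counts; in practice these come from the Wikipedia dump.
counts = {
    "jaguar": (320, 5_000),   # often linked -> likely an entity mention
    "the": (2, 900_000),      # almost never linked -> not a mention
}

def link_probability(form):
    """Fraction of occurrences of a surface form used as anchor text."""
    linked, total = counts[form]
    return linked / total

print(link_probability("jaguar"), link_probability("the"))
```

A low link probability lets the system skip a phrase before running any of the more expensive disambiguation features.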
The second paper was about generating cross-lingual links in Wikipedia. A significant number of Wikipedia pages are lacking cross-lingual links, as everything is currently done manually. The algorithm presented in this paper exploits the fact that articles linked to or from equivalent articles tend to be equivalent.
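The intuition behind the cross-lingual paper can be captured in a toy scoring function: an English article and a French article are more likely to be equivalent if their linked neighbours are already known cross-lingual pairs. This is only an illustrative simplification of that idea, with invented article names and links:

```python
# Known cross-lingual article pairs (EN, FR). Invented for illustration.
known_pairs = {("Lyon", "Lyon_fr"), ("France", "France_fr")}

en_links = {"Rhone": {"Lyon", "France"}}           # outlinks of EN articles
fr_links = {"Rhone_fr": {"Lyon_fr", "France_fr"}}  # outlinks of FR articles

def equivalence_score(en, fr):
    """Fraction of the EN article's neighbours whose known translation
    is also a neighbour of the FR article."""
    matched = sum(1 for n in en_links[en]
                  if any((n, m) in known_pairs for m in fr_links[fr]))
    return matched / max(len(en_links[en]), 1)

print(equivalence_score("Rhone", "Rhone_fr"))
```

Newly accepted pairs can be fed back into the known set, so the process bootstraps itself across the link graph.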
The final paper of the session presented ZenCrowd, which uses probabilistic reasoning to combine automated and manual work (the latter done through Amazon Mechanical Turk, which came up a lot during the conference for user evaluations) for an RDFa enrichment task.
The last session I attended that day was specifically about Semantic Web technologies. It covered why SPARQL 1.1 property paths are not scalable as currently specified and why their semantics need to be changed (this paper won the conference's best paper award), template-based question answering (which addresses the problem in a very different way to IBM Watson, by translating full-text queries into SPARQL queries), and mapping relational databases to RDF.
Main conference - day 3
I attended the EU track on the Friday morning, where current EU projects were showcased, including LAWA (tracking entities through time in Web archives) and ARCOMEM (making use of the social web for identifying Web documents to archive).
Finally, I attended the Web Mining session in the afternoon. This session included three very interesting papers. The first one started from the premise that 'real stories are not linear' and described an algorithm for generating 'tube maps' for news stories. The second one tried to address the ambitious goal of predicting news events. Their system gathered a wide range of Linked Data and news articles, extracted causal links between events described within them, and tried to generalise those causal links. Then, given a particular input event, the generalised links can be used to predict future events: for example, "China overtakes Germany as world's biggest exporter" leads their system to predict that wheat prices will fall. The last paper mined the Google news archive, which holds several articles per day going back to 1895, and derived statistics about how long a person stays mentioned in the news. Apparently, the median duration of a person's fame in the news has consistently been 7 days over the last century.