Posted by Chris Newell on

Work on Natural Language Processing in IRFS has previously concentrated on extracting entities such as people, places and organisations from web pages, subtitles and Speech to Text streams. More recently we've had ambitions to extract higher-level, structured information using state-of-the-art tools such as spaCy's dependency parser and visualizer.

The first new project in this area concerns Quote Extraction and Attribution, inspired by a prototype we helped to develop at a News Labs Language Hack Day called "Who Said What". The aim of the project is to extract quotes and identify the correct speaker, allowing us to build a database of attributed quotes. This would then allow us to find what someone has said on a specific topic and whether this has changed over time.

The task is complicated by the fact that there are direct and indirect quotes (with and without quotation marks respectively) and mixed quotes (with direct and indirect parts). A further complication is that the speaker may be referred to by a personal pronoun (e.g. "she") requiring a process called coreference resolution to resolve this to an explicit speaker.

In the last sprint we completed the first two stages of our pipeline using techniques proposed by Pareti et al.:

  • a verb-cue classifier, which identifies speech verbs such as "said"
  • a quote sequence classifier, which identifies the content span of the quote

For the implementation we're using spaCy and CRFsuite with training data from the Penn Attribution Relations Corpus kindly provided by Silvia Pareti and the Institute for Language, Cognition and Computation. The next phase of work will concentrate on coreference resolution and attribution.
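To give a flavour of what the two stages work with, here is a minimal sketch in plain Python. It is not our implementation: the cue-word lookup is a hand-written stand-in for the trained verb-cue classifier, and the BIO labels are generated by a simple quotation-mark rule purely to illustrate the kind of target a quote sequence classifier might be trained to predict. The names `SPEECH_VERBS`, `tokenize`, `token_features` and `bio_labels` are hypothetical, and the sketch only handles direct quotes in straight double quotation marks.

```python
import re

# Hypothetical cue lexicon. A trained verb-cue classifier would replace this lookup.
SPEECH_VERBS = {"said", "says", "told", "added", "claimed", "announced"}

# Words (with an optional apostrophe suffix) or single punctuation characters.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def token_features(tokens, i):
    """Build a feature dictionary for token i, of the kind commonly fed to a CRF."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_cue": tok.lower() in SPEECH_VERBS,
        "is_quote_mark": tok == '"',
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


def bio_labels(tokens):
    """Label tokens between each pair of straight double quotation marks.

    The first token inside a pair gets B-QUOTE, the rest I-QUOTE, and
    everything else (including the quotation marks themselves) gets O.
    """
    labels = ["O"] * len(tokens)
    inside = False
    first = True
    for i, tok in enumerate(tokens):
        if tok == '"':
            inside = not inside
            first = True
            continue
        if inside:
            labels[i] = "B-QUOTE" if first else "I-QUOTE"
            first = False
    return labels


tokens = tokenize('"I am delighted," she said.')
features = [token_features(tokens, i) for i in range(len(tokens))]
labels = bio_labels(tokens)
```

Indirect and mixed quotes have no such surface markers, which is why a learned sequence model is needed rather than a rule like this one.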

In Other News

This sprint we welcomed Alicia Grandjean to IRFS as a software engineer. She'll initially be working with Chris to update the Peaks.js homepage.


Chris has been working on a new version of Peaks.js, our open-source library for rendering audio waveforms in the browser. The update fixes a number of long-standing issues, including zoom and scroll behaviour, and significantly improves performance when rendering large numbers of segments.

Talking With Machines

Joanne, Andrew, Chris, and Ant met to discuss the user data we can collect from the interactive radio drama trial, and how we might visualise this to understand how people interact with the content.

Personalised Radio

Tim has been investigating the use of historical schedules as training data for an automated programme scheduling algorithm. David has been reviewing existing BBC Radio & Music research work in this area, including Music Mental Models (behaviours and motivations around music) and BBC Radio personas.

Face Recognition

Ben has been collating a list of important people which could be used to train our face recognition model. He started by using data from the DBpedia project to gather a list of recent people, and then ordered them by the page rank of the corresponding Wikipedia article. This approach gives us a good starting list of generally important people, and also lets us categorise them, so for example we can gather a list of the top N cabinet members. The page rank does add some skew to the dataset, as not all articles or themes in Wikipedia are curated equally, so we're looking for other ways to rank these lists.
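The ranking and categorising step can be sketched roughly as follows. This is an illustration only: the `Person` record, the `top_n` helper, the category labels and the scores are all made up for the example, and no real DBpedia query or page rank data is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    category: str      # hypothetical label, e.g. "cabinet"
    page_rank: float   # score of the person's Wikipedia article


def top_n(people, n, category=None):
    """Return the n highest-ranked people, optionally limited to one category."""
    pool = [p for p in people if category is None or p.category == category]
    return sorted(pool, key=lambda p: p.page_rank, reverse=True)[:n]


# Placeholder records standing in for the gathered list.
people = [
    Person("Person A", "cabinet", 0.9),
    Person("Person B", "sport", 0.8),
    Person("Person C", "cabinet", 0.4),
    Person("Person D", "cabinet", 0.7),
]

top_cabinet = top_n(people, 2, category="cabinet")
```

Swapping in a different ranking signal would only mean changing the sort key, which is useful given the skew noted above.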

Speech To Text

Matt and Chris Newell have had discussions with several internal BBC projects interested in using our Speech to Text and Semantic Tagging systems together as a continuous pipeline, ideally from an IP Studio endpoint.

Chrissy has been trying various modes of feature normalization with the TensorFlow speech to text approach to try to improve results on the more difficult datasets we use for testing.
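As an illustration of what one such mode can look like, the sketch below applies per-utterance mean and variance normalization to a sequence of feature frames. This is a common choice in speech recognition, but it is an assumption for the example; it is not necessarily one of the modes being tested, and the function name `mean_variance_normalize` and the `eps` parameter are hypothetical.

```python
from statistics import fmean, pstdev


def mean_variance_normalize(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance.

    frames is a list of equal-length feature vectors, one per time step.
    Statistics are computed over the whole utterance; eps avoids division
    by zero for dimensions that never change.
    """
    if not frames:
        return []
    columns = list(zip(*frames))          # one tuple per feature dimension
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) for c in columns]   # population standard deviation
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


# Two frames with two feature dimensions; the second dimension is constant.
normalized = mean_variance_normalize([[1.0, 10.0], [3.0, 10.0]])
```

Other modes differ mainly in the window the statistics are computed over, for example a whole dataset or a sliding window rather than a single utterance.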

Standards Work

Chris, who co-chairs the W3C Media and Entertainment Interest Group, sent some feedback to a group looking at secure communication between media devices on home networks. He also joined the BBC R&D video codec standards meeting to share updates on standardisation activity for UHD in various industry groups, including MPEG, ITU, DVB and W3C. This is just one of the many standards activities in BBC R&D.

Interesting Links