Posted by Chris Newell on
This week we focus on a collaboration with BBC Four, where we're exploring and demonstrating the benefits and pitfalls of using Artificial Intelligence to support content curation and delivery.
Libby and Alicia ran a workshop for the IRFS and Connected Studio teams this week, to gain experience with Raspberry Pis. The task was to make e-ink badges integrated with our internal "Whereabouts" diary system.
Artificial Intelligence In Broadcasting
In the past couple of months, the Discovery team have been experimenting with various methods of Machine Learning (ML)-driven video analysis as part of a trial in collaboration with BBC Four. One of the project goals is to expose Artificial Intelligence (AI) and its various "ways of seeing" to BBC audiences, demystifying what is often misrepresented in popular culture as "black box" algorithms. Unlike a lot of similar (and extremely valuable) research in the field of ML-driven media analysis, we are not attempting to find the "perfect model" or achieve record-breaking accuracy metrics. Rather, we're interested in situations where an ML model fails, or produces a surprising result. We're hoping that by showing the kinds of things AI is good at, as well as the things it is particularly bad at, we can give BBC audiences a more honest account of what AI is and what it means for television and media in general.
Ways of Seeing
The first phase of the project revolves around exploring a number of ML techniques for extracting high-level information from archive material. With help from the IRFS Data team, we're processing large amounts of video material from Redux (an archive of BBC broadcasts), running several existing ML models. So far we've been experimenting with automatic image captioning using the Densecap model (Johnson, Karpathy et al.). This neural network system takes images as input and produces "dense" localised captions describing various objects in the picture. The pre-trained model is based on the Visual Genome dataset, where captions contain not just descriptions of objects in the scene, but also actions ("a woman sitting on a chair") and contextual information ("the photo is taken during the day"). The model therefore offers a detailed yet superficial view of the analysed material – it is not concerned with the semantics of the image as a whole, but merely with describing what it can "see".
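Densecap's actual interface isn't shown in this post, but the kind of post-processing one might apply to its output can be sketched in a few lines. The `RegionCaption` structure, its field names and the 0.5 confidence threshold below are illustrative assumptions, not the model's real API:

```python
from dataclasses import dataclass

@dataclass
class RegionCaption:
    text: str     # e.g. "a woman sitting on a chair"
    score: float  # model confidence for this caption
    box: tuple    # (x, y, width, height) of the captioned region

def summarise_frame(captions, min_score=0.5):
    """Keep high-confidence captions, deduplicated by text, best first."""
    seen, kept = set(), []
    for c in sorted(captions, key=lambda c: c.score, reverse=True):
        if c.score >= min_score and c.text not in seen:
            seen.add(c.text)
            kept.append(c)
    return kept

# Toy output for one frame: a duplicate region and a low-confidence "cow".
captions = [
    RegionCaption("a man wearing a suit", 0.92, (10, 20, 80, 200)),
    RegionCaption("a man wearing a suit", 0.71, (12, 22, 78, 198)),
    RegionCaption("a cow in a field", 0.31, (300, 40, 60, 50)),
]
summary = summarise_frame(captions)
# Only the best "a man wearing a suit" survives; the low-score "cow" is dropped.
```

Filtering like this is also where the amusing failures surface: lowering `min_score` lets the spurious "cow" captions through.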
A set of captions generated for a frame from a BBC Four programme
We found Densecap to be extremely interesting for demonstrating the inherent bias and intriguing failures of ML systems. The model is both impressively specific at times (determining the context in which scenes occur as well as objects and actions themselves), and amusingly inaccurate elsewhere.
Examples of sub-images that Densecap captioned with the word “cow”
Automated captioning helps us extract "embodied" semantics of archive material – looking at what's on screen while disregarding its context. The other techniques we're investigating include "shot type" detection, as a means to extract visual and compositional semantics, and topic modelling on subtitles, to obtain the semantics of the script itself, which is often completely separate from the visual content of the scene. Our hypothesis is that by combining these semantic layers into a composite dataset, we might be able to train models which approximate a greatly simplified idea of the editing process. The composite dataset can also be used as a source for exploring the archive material: we're experimenting with exploration methods including simple text similarity-based traversal, probabilistic modelling and recurrent neural networks to navigate the vast amount of extracted information. In addition to data analysis and exploration, we're investigating methods of displaying AI-generated streams of archive material and real-time metadata in the context of a broadcast.
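As a rough illustration of the simplest of these exploration methods, here's a minimal text similarity-based traversal over per-clip token lists (for instance words drawn from captions and subtitles). The TF-IDF weighting, the greedy jump-to-nearest strategy and the toy clip data are all assumptions made for this sketch, not the project's actual pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency per token
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def traverse(docs, start=0, steps=3):
    """Greedy walk: repeatedly jump to the most similar unvisited clip."""
    vecs = tfidf_vectors(docs)
    path, current = [start], start
    for _ in range(steps):
        candidates = [i for i in range(len(docs)) if i not in path]
        if not candidates:
            break
        current = max(candidates, key=lambda i: cosine(vecs[current], vecs[i]))
        path.append(current)
    return path

# Toy clips: two pastoral, two urban.
clips = [
    ["cow", "field", "grass"],
    ["cow", "field", "barn"],
    ["city", "street", "car"],
    ["street", "car", "night"],
]
path = traverse(clips, start=0, steps=1)
# Starting from the first pastoral clip, the walk jumps to the other one.
```

Swapping the greedy jump for sampling from a similarity-weighted distribution would move this sketch towards the probabilistic modelling mentioned above.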
Elsewhere in the IRFS team this week...
AI in Music and Radio
Tim and Jakub have been preparing a 'Singing with Machines' workshop which they're delivering at the Sónar+D conference in Barcelona this week. The workshop will guide participants through the process of creating a collaborative musical installation using smart speakers.
Tim has also submitted a paper about our Public Service Personalised Radio project to IBC 2018, the annual International Broadcasting Convention.
Kristine has made good progress with her work on creating machine learning models of editorial decisions in music radio. She's got great results classifying audio by broadcast station, and some promising results classifying it by show. She's now going to use this to develop a prototype 'local music station' using a sample of BBC Introducing music.
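Kristine's actual models aren't described here, but the general shape of classifying audio by station can be sketched with a nearest-centroid classifier over per-clip feature vectors. The two-dimensional features and the station labels below are entirely hypothetical stand-ins for real audio features:

```python
import math
from collections import defaultdict

def train_centroids(samples):
    """samples: list of (feature_vector, station) pairs.
    Returns the mean feature vector per station."""
    sums = {}
    counts = defaultdict(int)
    for vec, station in samples:
        if station not in sums:
            sums[station] = [0.0] * len(vec)
        sums[station] = [s + v for s, v in zip(sums[station], vec)]
        counts[station] += 1
    return {st: [s / counts[st] for s in sums[st]] for st in sums}

def classify(vec, centroids):
    """Assign vec to the station with the nearest centroid (Euclidean)."""
    return min(centroids, key=lambda st: math.dist(vec, centroids[st]))

# Hypothetical 2-D features per clip (e.g. tempo, spectral brightness).
samples = [
    ([0.9, 0.8], "Radio 1"), ([0.8, 0.9], "Radio 1"),
    ([0.2, 0.3], "Radio 3"), ([0.3, 0.2], "Radio 3"),
]
centroids = train_centroids(samples)
# A fast, bright clip lands nearest the "Radio 1" centroid.
```

A real system would of course use learned audio embeddings and a proper classifier, but the per-station decision boundary is the same idea.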
Talking With Machines
In the Experiences team Nicky Birch presented a talk on "Designing for Voice" at The Next Web conference. The theme was to encourage creatives to use the medium and push the boundaries of what can be achieved in voice. The talk went well and she had lots of positive responses afterwards. You can watch the talk here.
New Content Experiences
Barbara, Joanna and Joanne have been planning a New Content Experience workshop on Autonomous Vehicles later in June. Barbara also worked with UsTwo and a UCL MSc student on redefining the focus of their research project on the impact of media in cars.
Meanwhile Ben and Misa in the Data team have been refactoring their content analysis pipeline to use gstreamer, allowing it to analyse audio in a range of formats.
Chris was away at the W3C's Advisory Committee meeting last week. On Wednesday, he joined other R&D colleagues at the Media Web Symposium at Fraunhofer FOKUS.
On Thursday and Friday, he joined a meeting of the Second Screen Community Group, which is developing a set of open protocols to support the W3C Presentation API and Remote Playback API. The group is making good progress, with agreement on the discovery, transport, and message encoding formats, as well as an approach to interoperability with HbbTV.