Seeing isn't always believing: a deep dive into Deepfake
Senior Technology Demonstrator
Much has been written about the societal impact of AI, but far less has been penned about its creative potential.
I'm focusing on an AI experiment conducted in support of the BBC’s ‘Beyond Fake News’ season.
Our experiment took inspiration from this viral ‘Fake Obama’ clip produced at the University of Washington. Researchers used AI to precisely model how President Obama moves his mouth when he speaks.
This image synthesis technique is more popularly known as ‘Deepfake’. The term ‘Deepfake’ (a portmanteau of deep learning and fake) can be unhelpful and confusing, as the underlying technology has potential for both creative and nefarious use. It is the malicious use of the technology that grabs our attention: often-cited examples range from fake news to porn.
So why is this problem important for the BBC? Video reanimation can confuse (and impress) audiences and challenge our notion of truth, and it has the potential to sow widespread civil discord. It’s crucial for organisations like the BBC to get under the skin of the technology by understanding what it takes to create a compelling video reanimation and researching what can be done to detect manipulated media.
For our experiment, we wanted to push the technology’s creative boundaries by exploring whether a presenter could appear to be seamlessly speaking several languages. To make this happen we asked BBC World News presenter Matthew Amroliwala to record a short 20-second script. We then asked three presenters from the BBC World Service Hindi, Mandarin and Spanish services to record the same script in their native languages. We deliberately picked diverse languages in order to test how effective the technology is.
For the modelling and synthesis work we partnered with London AI startup Synthesia. Before recording his 20-second piece, we asked Matthew to read a prepared script which would tease out all of his facial movements. This was used as training data for the deep learning and computer vision algorithms. A generative network (a network used to generate new images of a person) was then trained to produce photorealistic images of Matthew’s face, which would form the basis of his new digital face.
Finally, to bring the digital face to life, the facial expressions and audio track from our World Service colleagues were transferred onto the new digital face - a process called digital puppeteering.
And that’s it. Take a look at the video below and see how convincing our reanimated video is.
So, what did I conclude about our experiment? Spanish Matthew looks convincing to me. However, is there a feeling that something is not quite right when viewing the Hindi and Mandarin Matthew? Is the reanimation not quite as finessed, or is my brain so unused to seeing him speak Mandarin that the suspension of disbelief is broken? Or is transferring non-European languages technically trickier?
But consider this: we now have a flexible digital copy of Matthew’s face. It would be possible for him to record a new video (perhaps in his kitchen) and for us to reanimate those words onto any other recording of Matthew - in the studio or reporting on location. The implications for a trusted broadcaster like the BBC are serious.
Technology has reached a point where video can be manipulated cheaply and quickly, in ways that make it difficult to tell apart from the original. We will need tools that can verify the authenticity of a video and prove it to the audience.
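To make the idea of a verification tool a little more concrete, here is a minimal sketch of one possible approach: a broadcaster publishes a cryptographic fingerprint of the original file through a channel the audience already trusts, and anyone can recompute the fingerprint of the copy they received and compare. This is an illustration only, not something the experiment used. The function names `fingerprint` and `matches_published` are hypothetical, and the byte strings stand in for real video files.

```python
import hashlib
import hmac


def fingerprint(video_bytes: bytes) -> str:
    """Return the SHA-256 digest of the raw file bytes as a hex string."""
    return hashlib.sha256(video_bytes).hexdigest()


def matches_published(video_bytes: bytes, published_fingerprint: str) -> bool:
    """Check a received file against the fingerprint the broadcaster published."""
    # compare_digest performs a constant-time comparison of the two hex strings
    return hmac.compare_digest(fingerprint(video_bytes), published_fingerprint)


# Stand-in bytes; a real tool would read the video file from disk instead.
original = b"stand-in bytes for the original broadcast file"
tampered = original + b" with one altered frame"

# Assumption: this value is shared through a channel the audience trusts.
published = fingerprint(original)

print(matches_published(original, published))
print(matches_published(tampered, published))
```

A fingerprint like this only shows that a file is bit-for-bit identical to the original. Any legitimate re-encoding or resizing would also change it, so it cannot by itself tell an innocent copy from a manipulated one.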
But what mechanism would instil confidence in our audiences? We are seeing academia and technology companies working on the problem of authenticity, but there is some way to go. For now, for the audience, there needs to be a heightened awareness of this technology’s capability. Seeing isn’t always believing.