Posted by Stephen Jolly
In November 2021, the AI in Media Production team at BBC Research & Development carried out a television shoot with a difference. A typical shoot produces audio and video that are then edited together to make a programme. This time, we intended to put that material into a dataset for academic researchers. Greater access to TV programmes' raw material would help many academic fields. One that is of particular interest to us is 'Intelligent Cinematography'. Its researchers are looking at how artificial intelligence and machine learning could help with production tasks like shot framing and editing.
Of course, the BBC makes TV programmes all the time, so it is reasonable to ask why we needed to do a special shoot. We have joined other BBC productions in the past and gained valuable material for our own research. This is a less useful approach when it comes to creating a dataset to share with others, though, for a few reasons:
- We need to ensure that we have permission to share the content with others. That requires consent from everyone involved in making it, which can be complicated for material we don’t own or commission ourselves.
- We want to control the script and direction of the programme. We want it to contain scenes that are easy for AI systems to process, and some more challenging ones.
- Most importantly, we needed to shoot the programme in a very different way from normal television.
To explain this last point: we wanted our dataset to support research into ways to frame shots. In a normal TV production, the director tells the camera operators how to frame the shots. This bakes the framing decisions into the recording, and it is not possible to revisit them. Instead, we used four static ultra-high resolution cameras with wide-angle lenses. We set these up to record the whole scene at once. This approach allows the framing of the shots to happen later on, by cropping. Using four cameras lets users and algorithms select different perspectives in post-production.
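As a rough illustration of how this kind of reframing might work, here is a minimal sketch in Python that crops a fixed-size "shot" out of a full wide-angle frame. The function name, the 8K frame size and the NumPy array representation are assumptions for illustration, not part of the dataset's actual tooling.

```python
import numpy as np

def crop_shot(frame, centre_x, centre_y, width, height):
    """Extract a virtual 'shot' from a full wide-angle frame by cropping.

    frame: an H x W x 3 array of pixels from one static camera.
    (centre_x, centre_y): desired centre of the framed shot, in pixels.
    width, height: size of the crop, e.g. 1920 x 1080 for an HD shot.
    """
    full_h, full_w = frame.shape[:2]
    # Clamp the crop window so it stays inside the recorded frame.
    x0 = min(max(centre_x - width // 2, 0), full_w - width)
    y0 = min(max(centre_y - height // 2, 0), full_h - height)
    return frame[y0:y0 + height, x0:x0 + width]

# A stand-in 8K frame (7680 x 4320); a real one would come from the camera files.
frame = np.zeros((4320, 7680, 3), dtype=np.uint8)
shot = crop_shot(frame, centre_x=3000, centre_y=2000, width=1920, height=1080)
print(shot.shape)  # (1080, 1920, 3)
```

Because the cameras are static and record the whole scene, the same source frame can yield many different crops, which is what lets framing decisions be deferred to post-production.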
The programme is a comedy about a celebrity quiz show, with a script by a professional writer. The quiz show host and contestants are played by professional actors. The participants in the show are school friends who have become famous in later life. They carry out challenges and talk about their pasts and their relationships with each other. The early scenes are calm, making life easier for any algorithms trying to make sense of them. The action builds throughout the show, and the final scenes are chaotic. This is an intentional challenge for the algorithms that will analyse them.
We have included the following artefacts from the shoot in our dataset:
- The video from every camera, with synchronised audio from on-camera microphones. We have provided the video in its original resolution and quality. We have also included a low-resolution, low-quality version. We hope this will be easier to store and review on less capable computers.
- Audio from the microphones worn by the actors. We have provided this in unprocessed, processed and stereo down-mixed forms. (The processing is noise reduction, plosive suppression, loudness levelling and trimming to length.) The unmixed audio may be useful for identifying who is speaking, so that shots can be framed around them. The mixed audio can act as the soundtrack for edited videos.
- The script. (Users of the dataset should bear in mind that some dialogue was ad-libbed or improvised.)
- A human-edited version of the programme for reference and benchmarking purposes.
- Various useful kinds of metadata. One example of this is “shot logging”, which identifies the audio and video from each take. It also provides basic guidance about which takes to use. We have also included AV sync metadata to help align the audio and video.
- Documentation to help users better understand the material and the shoot.
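To give a feel for how the AV sync metadata might be applied, here is a minimal sketch in Python that shifts a mono audio track onto a video timeline given a sync offset. The function name, the offset convention and the 48 kHz sample rate are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np

def align_audio(audio, offset_seconds, sample_rate=48000):
    """Shift a mono audio track onto the video timeline.

    offset_seconds is how long after the video the audio starts
    (negative means the audio starts before the video). This sign
    convention is an assumption made for this sketch.
    """
    offset = int(round(offset_seconds * sample_rate))
    if offset > 0:
        # Audio began late: pad the front with silence.
        return np.concatenate([np.zeros(offset, dtype=audio.dtype), audio])
    # Audio began early (or in sync): drop any leading samples.
    return audio[-offset:]

samples = np.array([1, 2, 3, 4], dtype=np.int16)
padded = align_audio(samples, 2 / 48000)    # pad with two samples of silence
trimmed = align_audio(samples, -2 / 48000)  # drop the first two samples
```

In practice, tools like this would use the per-take offsets recorded in the sync metadata to line up each camera's video with the actors' microphone audio before any editing or analysis.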