Posted by Barbara Zambrini
Back in January 1927, Professor T.H. Pear of the University of Manchester ran a pioneering experiment with the BBC to understand how people responded to disembodied voices on the radio. Voices were broadcast on all BBC radio stations, and 5,000 listeners provided feedback via a questionnaire in the Radio Times.
Ninety years later, how do audiences in 2020 feel about the use of synthetic voices in different types of media content, ranging from national news to entertainment? Do regionality and gender affect how BBC content is perceived when it is delivered by synthetic voices? How would that make people feel?
BBC Research & Development have just launched an online study called Synthetic Voice and Personality, which tests several bespoke synthetic voices with British regional accents on a wide public audience.
The study explores the ways synthetic voices could be used in different media contexts in the future and is part of our ongoing research into new forms of content driven by synthetic voices. It is a collaboration between BBC R&D, the University of Salford, BBC Science and BBC Radio 4, drawing on the expertise of the BBC’s Voice + AI team.
We consider this a follow-up to Professor Pear’s original experiment, and its results will be covered in a BBC Radio 4 programme later this year, which will revisit the two tests 90 years apart. To our knowledge, there are no other studies of this scale on the perception of regional accents in relation to synthetic voices; of the few published scientific studies into the perception of synthetic voices, none covers UK accents.
The study will run for eight weeks, during which time participants can listen to a range of audio samples from male and female synthetic voices we created solely for this study, with a range of regional accents from across the UK.
The study will answer some research questions about which voices people prefer in the examples we present to them. We are working closely with Professor Trevor Cox from the Department of Acoustic Engineering at the University of Salford to ensure the study is academically rigorous.
With this study, we want to explore the following:
- Regional accents
- Tone of voice
- Context of use (the type of content attached to specific voices)
- Perception of synthetic voices (what people think and how it makes them feel)
This experiment is part of broader research we in BBC R&D are conducting on new forms of interactive conversation and voice experiences. The data gathered from the study will be analysed, and the insights will form the basis of the Radio 4 programme that will be aired in summer.
As with all new R&D research, this is about starting to dig into an area in the hope of finding a gold nugget - there is no expectation that these insights will all lead to drastic changes. It is about sharing insights and making people aware of users' feedback. However, the BBC’s Voice + AI team - who are currently building the BBC’s voice assistant - are taking a keen interest in this study, and will be looking at how the results might inform how the BBC builds its voice services in the future.
Creating the voices - the process
R&D collaborated with BBC staff from all regional radio stations, local news teams, and the technology division to find volunteers with distinctive regional accents willing to record their voice for us. For the purpose and feasibility of the study, 12 regions were chosen, each with a male and a female voice for participants to choose from. As a result, we generated 24 synthetic voices.
Participants will have access to pre-recorded audio files of each of those 24 voices as part of the online study - it will not be possible to modify or synthesise speech in real time.
We aimed to design a compelling experience that allows participants to interact with the synthetic voices. During the study, users listen to different voices with a variety of accents, including ones that are similar to their own. We ask a series of questions to determine the voice they would prefer in different contexts. For example, would they prefer a voice similar to their own to read the local news? This proved to be an interesting design challenge as the voices that we present to participants need to be randomised throughout the study.
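The exact randomisation scheme is not described here, but one simple approach is to shuffle the voice list with a generator seeded per participant, so each participant sees a different but reproducible order. A minimal sketch, in which the voice names and the seeding-by-ID scheme are illustrative assumptions rather than the study's actual design:

```python
import random

# Illustrative subset of region/gender voice labels, not the study's real list.
VOICES = [f"{region}-{gender}"
          for region in ["belfast", "cardiff", "glasgow", "leeds"]
          for gender in ("female", "male")]

def presentation_order(participant_id: str) -> list:
    """Return a reproducible random ordering of the voices for one participant.

    Seeding with the participant ID keeps the order stable if the participant
    reloads the study, while differing between participants.
    """
    rng = random.Random(participant_id)  # deterministic per-participant seed
    order = list(VOICES)
    rng.shuffle(order)
    return order
```

Seeding per participant also makes the study easier to debug, since any reported ordering can be regenerated from the participant's ID.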
We also designed the study to ensure that people with visual impairments can take part; it should take 10 to 15 minutes to complete.
For the creation of the synthetic voices, we used an open-source text-to-speech machine learning model - a modified version of DC TTS, which is derived from the paper 'Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention'.
We have used BBC subtitles to create a phonetically balanced text corpus, a specially designed set of phrases covering the majority of phonemes and phoneme combinations in the English language. This acted as a script for the people who recorded their voices for us.
Recording the original audio for each voice took an average of 3 hours, with each person reading out a script of 22,915 words. Each recording was then used as data to train a machine learning model. This is a computationally demanding task: it takes around 16 hours to generate a synthesised voice that can then be used to produce new utterances. Some post-processing is done on the voice recordings to make them sound less metallic or robotic, to remove some other audio artefacts, and to apply EQ.
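The post itself does not detail the post-processing chain, but one typical EQ step of this kind is a gentle low-pass filter that softens harsh high-frequency ("metallic") content. A sketch of that single step, where the cutoff frequency and filter order are illustrative assumptions rather than the study's actual settings:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def soften_highs(audio: np.ndarray, sample_rate: int,
                 cutoff_hz: float = 7000.0) -> np.ndarray:
    """Apply a low-pass EQ to tame high-frequency 'metallic' artefacts.

    Uses a 4th-order Butterworth filter in second-order sections for
    numerical stability; cutoff_hz is an example value, not a tuned one.
    """
    sos = butter(4, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    return sosfilt(sos, audio)
```

A real chain would combine several such stages (de-noising, artefact removal, multi-band EQ); this shows only the general shape of one.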
We wanted to explore ways of creating a diverse set of voices faster and more cheaply, at a reasonable level of quality. It is a future-facing proof of concept demonstrating the level of quality that can be achieved in a very short amount of time.
We took subtitles in the English language from the BBC archives and automatically transcribed them phonetically using the BBC's pronunciation dictionary. From that, we were able to work out all the common combinations of phonemes - the different sounds made when pronouncing words - that would need to appear in a text intended to be recorded as training for a synthetic voice. We then searched for each combination of phonemes in the subtitles from the BBC archive, identifying a sentence where they appear, which we added to a script. In total, that gave us a script of a little over 1,000 sentences that proportionally cover the most common sounds in the English language. Each of our contributors needed to read this corpus for us to be able to make their synthetic voice counterpart say anything.
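The selection step described above is essentially a greedy set-cover over phoneme combinations. A minimal sketch, assuming phoneme bigrams as the unit of coverage; the function names, data shapes and phoneme labels are hypothetical, not the actual R&D pipeline:

```python
def phoneme_pairs(phones: list) -> set:
    """All adjacent phoneme pairs (bigrams) in one phonetic transcription."""
    return set(zip(phones, phones[1:]))

def select_corpus(candidates: dict, targets: set) -> list:
    """Greedily pick sentences until every target phoneme pair is covered.

    candidates maps each sentence to its phonetic transcription (e.g. from a
    pronunciation dictionary); targets is the set of phoneme bigrams the
    recording script must contain.
    """
    covered = set()
    script = []
    while covered < targets:
        # Pick the sentence contributing the most not-yet-covered bigrams.
        best = max(candidates,
                   key=lambda s: len((phoneme_pairs(candidates[s]) & targets)
                                     - covered))
        gain = (phoneme_pairs(candidates[best]) & targets) - covered
        if not gain:  # remaining targets unreachable with these sentences
            break
        covered |= gain
        script.append(best)
    return script
```

Greedy selection does not guarantee the shortest possible script, but it keeps the corpus compact while ensuring every required sound combination appears at least once.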
The system was originally trained on a "base" voice with a huge corpus (approximately 24 hours of voice recordings). Then, for each voice we added, we were able to use a smaller audio sample (approximately 3 hours) of the specially designed corpus created by R&D. This means each of our synthetic voices takes less time to record and can be trained in a shorter amount of time, achieving a form of transfer learning.
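The benefit of warm-starting can be illustrated with a deliberately tiny model (plain least-squares gradient descent, not the DC TTS network, with all numbers made up): starting from a "base" solution close to the new target reaches a far lower error within the same training budget than starting from scratch.

```python
import numpy as np

def fit(w0: np.ndarray, X: np.ndarray, y: np.ndarray,
        lr: float = 0.1, steps: int = 20) -> np.ndarray:
    """Plain gradient descent on mean-squared error from initial weights w0."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))
w_base = rng.standard_normal(5)                # stands in for the "base" voice model
w_new = w_base + 0.1 * rng.standard_normal(5)  # new voice: close to the base
y = X @ w_new

def loss(w: np.ndarray) -> float:
    return float(np.mean((X @ w - y) ** 2))

warm = fit(w_base, X, y)       # fine-tuning from the base model
cold = fit(np.zeros(5), X, y)  # training from scratch
# With the same number of updates, the warm start ends up much closer to the target.
```

The same logic, scaled up, is why fine-tuning a base TTS model on 3 hours of new-voice audio is so much cheaper than training a voice from nothing.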
This study is not the end of our work in this field - it builds on our existing knowledge and expertise, and the technical and UX work is a good foundation for the future. Our work will contribute to the wider literature in this field, as there is only a small amount of published work on HCI and regional accents in different countries. Additionally, as mentioned above, the BBC Voice + AI team will be looking at the results of the study to see how they might inform our voice products and projects in the future.
This post is part of the Internet Research and Future Services section