Update: Virtual Voice-Over Tool for Multilingual Journalists

Language Technology Producer

In December, BBC News Labs and BBC Digital Development launched a new pilot providing 'virtual voice-over' technology to offer translated video to audiences. Now, as the pilot sees the launch of a second language version, Language Technology Producer Susanne Weber gives an update on the work so far and how the system operates.

The BBC's pioneering online pilot service is driven by the latest language technology: text-to-speech voice synthesis and computer-assisted translation. This joint effort between BBC News Labs and BBC Digital Development has produced an innovative tool called “ALTO”, which assists multilingual journalists in re-versioning news video content. The pilot service, which offers experimental video clips, is now available on BBC Russian and BBC Japan.

Computer-assisted translation and voice synthesis

ALTO combines a number of cutting-edge language technologies to allow a single language journalist to generate multilingual voice-overs for a video story and script. The script is first pre-translated using machine translation (think Google Translate), and the results are then post-edited by the language journalist. This process is generally referred to as computer-assisted translation. Post-editing is necessary not only for linguistic reasons, but mainly because of the BBC’s editorial requirements, which leave no room for unedited, fully automated machine translation.

In the second step, the language journalist converts the translated script into a computer-generated voice track. This is done using off-the-shelf text-to-speech (TTS) technology provided via cloud services. The TTS voices are generated through unit selection synthesis – also known as concatenative speech synthesis – which makes them sound more ‘natural’. What this means is that each TTS voice was once the real voice of a person whose utterances were recorded and then segmented into tiny units (phones, morphemes etc.). When a new track is produced, these segments are joined into new utterances – i.e. synthesised.
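The idea behind unit selection can be illustrated with a toy sketch: a database of units segmented from real recordings is searched and joined to produce an utterance that was never recorded as a whole. The words and "units" below are invented placeholders for audio clips, not real TTS data.

```python
# Toy illustration of unit selection (concatenative) synthesis:
# pre-recorded units are looked up and joined into a new utterance.
# The inventory below stands in for a database of tiny audio segments.
UNIT_DATABASE = {
    "good": ["g", "uh", "d"],
    "morning": ["m", "ao", "r", "n", "ih", "ng"],
    "evening": ["iy", "v", "n", "ih", "ng"],
}

def synthesise(text: str) -> list[str]:
    """Concatenate recorded units into a new utterance."""
    units = []
    for word in text.lower().split():
        if word not in UNIT_DATABASE:
            raise KeyError(f"no recorded units for {word!r}")
        units.extend(UNIT_DATABASE[word])
    return units

# "good evening" is synthesised even though that phrase was never recorded
print(synthesise("good evening"))
```

A real engine selects among many candidate units per phone to minimise audible joins; this sketch only shows the concatenation principle.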

Selecting a synthetic voice from the dropdown menu

Occasionally, the synthetic voices mispronounce words – mostly names of people and places (‘proper nouns’). In such cases, the language journalist tweaks the spelling of the word to make the voice pronounce it correctly, or at least as correctly as possible. ALTO also uses Speech Synthesis Markup Language (SSML) to help our journalists insert pauses between words and sentences. This makes the new audio sound more intelligible.
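The two tricks described above could be sketched with standard SSML elements: `<break/>` for pauses and `<sub>` for substituting a phonetic respelling of a tricky proper noun. The helper function and the respelling table below are invented for illustration; ALTO's actual markup handling may differ.

```python
# Minimal sketch: add SSML pauses after sentences and substitute
# phonetic respellings for proper nouns the voice mispronounces.
# The respelling table is a made-up example.
RESPELLINGS = {
    "Beijing": "Bay-jing",
}

def to_ssml(sentences: list[str], pause_ms: int = 400) -> str:
    """Wrap sentences in SSML, adding a pause after each one."""
    parts = []
    for sentence in sentences:
        for word, spoken in RESPELLINGS.items():
            # <sub> tells the TTS engine to speak the alias instead
            sentence = sentence.replace(
                word, f'<sub alias="{spoken}">{word}</sub>'
            )
        parts.append(f'<s>{sentence}</s><break time="{pause_ms}ms"/>')
    return "<speak>" + "".join(parts) + "</speak>"

print(to_ssml(["The summit opened in Beijing."]))
```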

The journalists can choose from at least two voices in their language. This allows them to create a voice-over with voices of different genders, keeping it as close to the original audio as possible.

In the final step, the new audio is automatically attached to the video file. First, the original audio track is stripped from the video file. Then the new audio, containing the TTS voice audio files, is stitched to the video, and finally the original audio track is re-attached to the video at a lower audio level – i.e. it is ducked automatically. The stitching process takes less than a minute for a short 30-45 second clip.
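A common way to implement this mixing step is with ffmpeg; the article does not name the tool, so the command below is an illustrative sketch rather than ALTO's actual pipeline. It lowers the original track with the `volume` filter, mixes it with the TTS track via `amix`, and copies the video stream untouched.

```python
# Sketch (assuming ffmpeg): duck the original audio under the TTS track
# and mux the mixed audio back with the unchanged video stream.
def build_mix_command(video: str, tts_audio: str, output: str,
                      duck_level: float = 0.2) -> list[str]:
    """Return an ffmpeg argv that mixes ducked original audio with TTS audio."""
    filter_graph = (
        f"[0:a]volume={duck_level}[ducked];"      # lower the original track
        "[ducked][1:a]amix=inputs=2:duration=first[mixed]"
    )
    return [
        "ffmpeg", "-i", video, "-i", tts_audio,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[mixed]",
        "-c:v", "copy",  # no video re-encode, which keeps the step fast
        output,
    ]

print(" ".join(build_mix_command("story.mp4", "tts_ru.wav", "story_ru.mp4")))
```

Copying the video stream rather than re-encoding it is one reason such a step can finish in under a minute for a short clip.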

We are now developing a more flexible, dynamic auto-ducking of the original audio track. This is to accommodate the fact that the audio tracks vary in their dynamic range and, ideally, should be fine-tuned when mixed with different TTS voices.
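One plausible shape for such dynamic ducking is to replace the single fixed gain with a gain computed per clip, so the original bed always sits a fixed margin below the TTS voice regardless of how loud either track happens to be. The function, margin, and signal levels below are assumptions for illustration, not the approach the team has committed to.

```python
import math

# Hypothetical sketch of dynamic auto-ducking: compute a gain for the
# original track so its RMS level sits `margin_db` below the TTS voice.
def rms(samples: list[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def duck_gain(original: list[float], tts: list[float],
              margin_db: float = 12.0) -> float:
    """Linear gain that puts `original` margin_db below `tts` (capped at 1.0)."""
    target = rms(tts) / (10 ** (margin_db / 20))
    return min(1.0, target / rms(original))

# A loud background bed is ducked far more than a fixed setting would,
# because the TTS voice in this example is comparatively quiet.
loud_bed = [0.5, -0.5] * 100
quiet_voice = [0.1, -0.1] * 100
print(round(duck_gain(loud_bed, quiet_voice), 4))
```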

For more information, you can also watch a video explaining the 'virtual voice-over' news service here.
