Posted by Ben Clark

Weeknotes from IRFS: Understanding music mixes, new neural nets and a virtual voice

Rob presenting our Content Analysis work.

Synthetic Voices

The BBC's analysis of voices began in 1927 with a project by Professor Pear to capture public opinion about regional accents used around the UK. With the centenary of this project approaching, the Anansi team are working with BBC Radio 4, the BBC Science Unit and Professor Trevor Cox from the University of Salford.


The team want to find out if there is a relationship between people’s accents and their preferences for different synthetic voices. They also want to discover if preference for different voices varies with content type or assistant. The experiment will ask users to set up a hypothetical smart speaker using a web page. They will be asked to choose a voice from a set of options for different types of content, such as news, entertainment and sport.


For this experiment the team needed 20 different synthetic voices for the users to select from. The team opted to train their own voices using open source software. They tested two synthetic voice packages. Tacotron 2 from Google uses recurrent neural networks, which means it takes longer to train, making it expensive to produce each synthetic voice. The option selected by the team, DC-TTS, uses convolutional neural networks, which are quicker to train and thus cheaper.


They trained the model using transfer learning: an extra three hours of audio was used to transfer the style of the new voice onto a pre-trained voice model. To record that audio, the team needed a script of phonetically balanced sentences. Existing corpora were not suitable, so they assembled a new phonetically balanced corpus from the BBC’s subtitle database. The results are really impressive!
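To give a feel for how a recording script might be drawn from a large pool of subtitle sentences, here is a minimal sketch. It is not the team's method: it uses simple greedy phoneme coverage as a stand-in for proper phonetic balancing (which would also match phoneme frequencies), and the tiny lexicon, the function names and the example sentences are all made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy pronunciation lexicon (ARPAbet-style symbols). A real
# pipeline would use a full pronunciation dictionary or a G2P model.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "news": ["N", "UW", "Z"],
    "is": ["IH", "Z"],
    "good": ["G", "UH", "D"],
}


def phonemes_in(sentence: str) -> Set[str]:
    """Return the set of phonemes in a sentence, skipping unknown words."""
    units: Set[str] = set()
    for word in sentence.lower().split():
        units.update(TOY_LEXICON.get(word, []))
    return units


def select_script(candidates: List[str], max_sentences: int) -> List[str]:
    """Greedily pick sentences that add the most not-yet-covered phonemes."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = list(candidates)
    while remaining and len(chosen) < max_sentences:
        # Sentence contributing the largest number of new phonemes.
        best = max(remaining, key=lambda s: len(phonemes_in(s) - covered))
        gain = phonemes_in(best) - covered
        if not gain:
            break  # no remaining sentence adds new coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    pool = ["the cat sat", "the news is good", "the cat"]
    print(select_script(pool, max_sentences=3))
```

Greedy selection keeps the script short, which matters when every sentence has to be read aloud by a voice talent.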

Content Analysis

Matt, Misa and Ben have been trying to improve the speech-to-text system by training a new type of Kaldi model called nnet3-chain. This model is closer to the state of the art than the current system and looks promising so far. They have also been working with Platform Media Services to support the integration of the existing speech-to-text software into the BBC’s speech-to-text service. Rob and Mathieu presented the Content Analysis Toolkit to colleagues at a BBC Academy Fusion talk.


Chris has continued to develop his framework for evaluating recommender systems following meetings with the iPlayer team. He also ran a workshop for the whole of the data team introducing them to recommender systems. He has been working with Taner and Saba, who are investigating using audio, video and text features to improve recommendations. Denise has been trying to define a sentiment model which covers all BBC content to improve recommendations.

Music Mix Classification

The Better Internet team are revisiting the Shape of Mixes recommender and are trying to train models to classify mix brands (for example Radio 1’s Workout Anthems). Some mixes are quite similar musically, so for now they are classifying groups of mixes rather than individual mixes. Kristine is also trying to model the sequences of tracks played in the mixes.
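As an illustration of what modelling track sequences can mean in its simplest form, the sketch below estimates first-order transition probabilities (how often one track follows another) from a handful of mixes. This is an assumption made for the sake of example, not a description of Kristine's approach; the function name and the track labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(mixes: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next track | current track) from ordered track lists."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tracks in mixes:
        # Pair each track with the one played immediately after it.
        for current, following in zip(tracks, tracks[1:]):
            counts[current][following] += 1
    return {
        current: {
            nxt: n / sum(followers.values()) for nxt, n in followers.items()
        }
        for current, followers in counts.items()
    }


if __name__ == "__main__":
    example_mixes = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
    for track, followers in transition_probabilities(example_mixes).items():
        print(track, followers)
```

A model like this only looks one track back; longer-range structure in a mix would need a richer sequence model.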

Memory Making Study

Holly and David have been testing and refining the cultural probes which the team intend to send to 12 participants as part of the memory making study they want to carry out. This is an approach which the team hasn’t used before and they are learning a lot about how these physical objects can be used to understand users.

Finally, the blog post about the News Mood Filter was picked up by Le Figaro!