Posted by Tom Nixon
The World Service Radio Archive project aims to improve the metadata available for the archive by combining metadata extracted automatically from the audio with crowd-sourced metadata generated by users.
In 2012, Yves developed a speaker identification system for the archive, which automatically splits programmes up by voice. The voices in these tracks are then matched together between programmes, so that users can navigate between programmes containing the same person.
The archive contains about 50,000 programmes with audio, but until recently this feature was only available on around 1,000 of them. In December we expanded it to the whole archive; see, for example, all programmes containing Nelson Mandela or Brian Cox.
As with the rest of the World Service archive experiment, we'd like your help in improving and evaluating our algorithms.
Our Speaker ID System
This was built using diarize-jruby, itself built on top of the LIUM Speaker Diarization toolkit, to split up programmes and train a model for each voice found. Originally, these models were compared pairwise to find speakers with the same voice.
This works well, but does not scale to large numbers of programmes. If 1,000 programmes are processed and 4 speakers are found in each, then around 8 million pairs of speaker models have to be compared in total. While this is possible for small numbers of programmes, processing the entire World Service archive of 50,000 programmes would be infeasible using this technique.
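The quadratic growth is easy to check: with n speaker models there are n(n-1)/2 distinct pairs. A quick calculation shows why exhaustive comparison breaks down at archive scale:

```ruby
# Number of distinct pairwise comparisons for n speaker models: n * (n - 1) / 2
def pairwise_comparisons(n_programmes, speakers_per_programme)
  n = n_programmes * speakers_per_programme
  n * (n - 1) / 2
end

puts pairwise_comparisons(1000, 4)    # 7_998_000 -- roughly 8 million
puts pairwise_comparisons(50_000, 4)  # 19_999_900_000 -- around 20 billion
```

Going from 1,000 programmes to the full 50,000 multiplies the number of comparisons by roughly 2,500, not 50, which is why the exhaustive approach had to be replaced.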
To apply speaker identification to the entire World Service archive we use locality sensitive hashing (LSH) to search for matching speakers, rather than performing an exhaustive search. This technique is implemented in the ruby-lsh library. When queried, the LSH store returns a reasonably small list of candidate speakers that may be similar -- these are then compared as before to find matching speakers. Far fewer comparisons are performed, making the system around 18 times faster overall.
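To illustrate the idea (this is a minimal random-hyperplane sketch, not the actual ruby-lsh API or the archive's configuration), each speaker model is hashed so that similar vectors tend to land in the same bucket, and only the contents of the query's bucket are compared in detail:

```ruby
# Illustrative random-hyperplane LSH. Vectors that hash to the same
# bucket become candidates for the detailed pairwise comparison
# described above; everything else is skipped.
class LSHStore
  def initialize(dims, n_bits, seed: 42)
    rng = Random.new(seed)
    # Each hyperplane is a random direction; the sign of the dot
    # product with it contributes one bit to the hash key.
    @planes = Array.new(n_bits) { Array.new(dims) { rng.rand - 0.5 } }
    @buckets = Hash.new { |h, k| h[k] = [] }
  end

  def key(vec)
    @planes.map { |p| dot(p, vec) >= 0 ? '1' : '0' }.join
  end

  def add(id, vec)
    @buckets[key(vec)] << id
  end

  # Returns a small candidate list instead of every stored speaker.
  def candidates(vec)
    @buckets[key(vec)]
  end

  private

  def dot(a, b)
    a.zip(b).sum { |x, y| x * y }
  end
end

store = LSHStore.new(3, 8)
store.add(:speaker_a, [1.0, 0.2, 0.1])
store.add(:speaker_b, [0.9, 0.25, 0.12])
store.candidates([1.0, 0.21, 0.11])  # similar voices tend to share a bucket
```

Because a query only touches one bucket, lookup cost stays roughly constant as the archive grows, at the price of occasionally missing a match that fell into a neighbouring bucket -- hence candidates are still verified with the full comparison.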
An additional speed-up comes from distributing the speaker models across several LSH stores, each executing queries in parallel.
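In outline (a hypothetical sketch, with a stub standing in for a real LSH shard), the fan-out looks like this: each shard holds a subset of the speaker models, every query is sent to all shards in parallel threads, and the candidate lists are merged:

```ruby
# Stand-in for one LSH shard holding a subset of speaker models.
# A real shard would answer with its bucket contents; this stub
# just returns everything it holds.
Shard = Struct.new(:models) do
  def candidates(_query)
    models
  end
end

# Fan a query out to every shard in its own thread, then merge
# the candidate lists once all threads have finished.
def parallel_candidates(shards, query)
  shards.map { |s| Thread.new { s.candidates(query) } }
        .flat_map(&:value)
end

shards = [Shard.new([:a, :b]), Shard.new([:c])]
parallel_candidates(shards, nil)  # => [:a, :b, :c]
```

Sharding also bounds the memory each store needs, since no single store has to index all 200,000-odd speaker models.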
As far as we are aware, this is the largest speaker identification system of its kind.
How You Can Help
The speaker ID system tells us which speakers in different programmes are the same, but not who they actually are; that's where you come in! When a speaker name is added, edited, or confirmed, the name is propagated to matching speakers in other programmes, so a small number of edits goes a long way. So far, 1,287 speaker names have been edited or added, resulting in 24,076 speakers with names.
Here's how the process works:
If you can identify a speaker that doesn't have a name, add it like this.
Click on a speaker name to view other programmes that they may appear in.
Speakers that have been named automatically can be confirmed by clicking the tick, or edited if they are incorrect.
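The propagation step described above can be sketched as a graph walk (a hypothetical illustration -- the function and data shapes here are ours, not the archive's actual code): matched speakers form connected groups, and naming any member of a group names the rest.

```ruby
require 'set'

# matches: pairs of speaker ids the system judged to be the same voice
# names:   speaker_id => name, for speakers a user has named
def propagate_names(matches, names)
  # Build adjacency lists from the matched pairs.
  adj = Hash.new { |h, k| h[k] = [] }
  matches.each { |a, b| adj[a] << b; adj[b] << a }

  result = names.dup
  names.each do |speaker, name|
    # Breadth-first walk from each named speaker, naming any
    # unnamed speaker reachable through the match graph.
    queue, seen = [speaker], Set[speaker]
    until queue.empty?
      adj[queue.shift].each do |other|
        next if seen.include?(other)
        seen << other
        result[other] ||= name
        queue << other
      end
    end
  end
  result
end

matches = [[:s1, :s2], [:s2, :s3]]
propagate_names(matches, { s1: 'Nelson Mandela' })
# => { s1: 'Nelson Mandela', s2: 'Nelson Mandela', s3: 'Nelson Mandela' }
```

One edit naming :s1 reaches :s3 even though they were never directly matched, which is how 1,287 edits can yield over 24,000 named speakers.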
Adding speaker names also helps us to evaluate the performance of the system. To do this, we need to compare automatically generated speaker matches with known names of speakers. We can't tell if a matching pair of speakers is correct without knowing the names of both speakers, so the more speakers you identify, the more accurate the evaluation gets.
We primarily measure performance using precision and recall. Precision is the proportion of the matches found that are correct, while recall is the proportion of the matches that should exist that are actually found. The system is tuned for precision rather than recall -- we value correct matches more than finding every match -- and while the numbers are always changing as we get more data, the system currently achieves 87% precision and 45% recall.
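As a concrete illustration of the two measures (the counts below are hypothetical, chosen only to mirror the percentages quoted above -- they are not the archive's actual evaluation data):

```ruby
# Precision: of the matches the system found, how many were correct?
def precision(correct_found, total_found)
  correct_found.to_f / total_found
end

# Recall: of the matches that should exist, how many were found?
def recall(correct_found, total_expected)
  correct_found.to_f / total_expected
end

# If the system reported 100 matches, 87 of them correct, out of
# 193 matches that actually exist among the named speakers:
precision(87, 100)  # => 0.87
recall(87, 193)     # => roughly 0.45
```

Tuning for precision means raising the similarity threshold: fewer matches are reported, so more of them are right, at the cost of missing some genuine ones.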
We're really excited by the potential this technology has to make large media archives more accessible. In the future, we'd like to integrate this into COMMA, and investigate using dimensionality reduction to reduce the memory requirements of the index.