Research & Development

Posted by Henry Cooke

In BBC R&D, we’ve been running a project called Talking with Machines, which aims to understand how to design and build software and experiences for voice-driven devices - things like Amazon Alexa, Google Home and so on. The project has two main strands: a practical strand, which builds working software in order to understand the platforms, and a design research strand, which aims to devise a user experience language, a set of design patterns and a general approach to creating voice user interfaces (VUI), independent of any particular platform or device.

This is the second of two posts about our design research work. In the previous post, we discussed a prototyping methodology we've been developing for VUI. In this one, we'll outline some of our findings from doing the work - key considerations we've found useful to bear in mind while designing VUI prototypes.

Above: the team in a prototyping role-play session.

BBC Taster - Try The Inspection Chamber

BBC R&D - User Testing The Inspection Chamber

BBC R&D - The Unfortunates: Interacting with an Audio Story for Smart Speakers

Tone of voice

When your only method of communication with your users is through voice, then the tone of voice your application uses is super-important - as important as the choices made about colour, typeface and layout in a visual application.

Think about the vocabulary and writing style used in your voice prompts and responses. Is your application mostly concerned with responding to direct requests for specific information? Then you probably want to be pithy and concise with your responses. Is your application designed for people in a more relaxed, open frame of mind? Then you can afford to be more discursive and chatty.

If you’re using recorded talent (although this could apply to synthesized voices, to a lesser degree), think about timbre, intonation and delivery style - everything you’d think about when recruiting a voiceover artist.

Use of system synthesized voice vs. recorded talent voice

Let’s be honest - while the system voices on Alexa or Home are good, they’re optimised for short, pithy answers to requests. They’re not really suitable for reading out long chunks of text, especially if that reading requires natural intonation.

Using recorded human voice allows for much more natural speech, and allows for a reading with a particular tone: excitable / serious / etc. The downside is that you add a production overhead to your app, and once the voice is recorded, it can’t be changed. A voice app using recorded talent speech will never be as adaptable as one which generates speech on the fly.

It’s all about the data

This one applies more to data-driven applications than narrative experiences. One thing we’ve found when designing conversational user interfaces (CUIs) - text or speech - is that while it seems intuitive to be able to ask questions against a large dataset (say: the news, or a list of programmes), these types of application can only be built if there’s an extensive, well-tagged and searchable data source to query. In these cases, building the interface itself and parsing the user’s intent are relatively straightforward problems to solve - the really hard problems arise from collating, sifting and re-presenting the data required to answer the user’s questions. These kinds of application are a lot more about juggling data than they are about natural language.
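To make the point about tagging concrete, here is a minimal sketch in Python of answering a request from a tagged catalogue. The Programme type, the CATALOGUE entries and the find_programmes function are all hypothetical, invented for illustration; they are not taken from any BBC system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Programme:
    """A hypothetical catalogue entry: a title plus editorial tags."""
    title: str
    tags: frozenset


# Invented example data. A real catalogue would be far larger and
# would only be queryable like this if it were consistently tagged.
CATALOGUE = [
    Programme("Morning News Bulletin", frozenset({"news", "politics"})),
    Programme("Gardening Hour", frozenset({"gardening", "lifestyle"})),
    Programme("Election Special", frozenset({"news", "politics", "election"})),
]


def find_programmes(catalogue, wanted_tags):
    """Return the titles of programmes carrying every requested tag."""
    wanted = frozenset(tag.lower() for tag in wanted_tags)
    return [p.title for p in catalogue if wanted <= p.tags]
```

The lookup itself is a one-liner. Everything it can answer depends on how thoroughly the tags were applied in the first place, which is where most of the effort goes.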

The expectation of ‘smart’

Towards the end of 2016, we did some prototyping and user testing around CUI - mostly in text messaging channels, but with some VUI. One of the most striking things we found from the testing was the expectation that users had about the intelligence of the entity they were talking to. Essentially, since people were communicating with something that appeared to be smart enough to respond to natural language, and pretended to have a personality, they assumed it was also smart enough to be able to answer the kinds of question they’d ask another person.

This is an important thing to bear in mind when designing systems with which someone will converse, because it’s very rare that you will be able to deal with spontaneous, open speech. Most applications will have a limited domain of knowledge; a story about witches, say, or the programme catalogue of a large broadcaster. How are you going to communicate to people the limits of your system without driving them away?

Letting the user know what they can say / dealing with a limited vocabulary

Most VUI systems don’t allow completely free, spontaneous speech as input; as a developer, you have to register upfront the collection of phrases you expect a user to say in order to interact with your application, and keep it updated as unexpected variations creep in.
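As a rough illustration of what registering phrases upfront means, the Python sketch below keeps a table of expected phrases per intent and matches a recognised utterance against it. This is an assumption-laden toy: the intent names, the phrases and the functions normalise and match_intent are all hypothetical, and the real platforms each have their own format for declaring sample phrases.

```python
import re

# Hypothetical intents and the phrases registered for each one.
INTENT_PHRASES = {
    "next_chapter": ["next", "go forward", "skip ahead"],
    "previous_chapter": ["back", "go back", "previous"],
    "stop": ["stop", "that's enough"],
}


def normalise(utterance):
    """Lowercase, drop punctuation (keeping apostrophes), collapse spaces."""
    cleaned = re.sub(r"[^a-z' ]", " ", utterance.lower())
    return " ".join(cleaned.split())


def match_intent(utterance, phrases=INTENT_PHRASES):
    """Return the intent whose registered phrases include the utterance."""
    spoken = normalise(utterance)
    for intent, samples in phrases.items():
        if spoken in samples:
            return intent
    return None  # Nothing registered matches: the app needs a fallback.
```

When testing reveals an unexpected variation (someone says "carry on" instead of "next"), it has to be added to the table by hand, which is the maintenance burden described above.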

Given that this limitation exists, you have the problem of communicating to people what they can say to navigate your application. Some developers choose to do this upfront, listing possible commands when an application starts for the first time. However, this can sound clunky, provides a speed bump for people wanting to get started with the meat of your application and requires them to remember what you told them at the point where an interaction becomes available.

Another way to do this is to wait until an interaction is about to happen, and then tell a person what they can say: “you can say forward, backward or stop.” However, this can seem a little mechanical, and interrupts the flow of a longer conversation or fictional piece.
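If you do prompt at the point of interaction, one small practical step is to generate the prompt from the same list of options your application accepts, so the two can't drift apart. A minimal sketch in Python, where build_prompt is a hypothetical helper rather than part of any platform SDK:

```python
def build_prompt(options):
    """Turn a list of accepted words into a spoken-style prompt."""
    if not options:
        raise ValueError("at least one option is needed")
    if len(options) == 1:
        return f"You can say {options[0]}."
    # Join all but the last option with commas, then add "or <last>".
    return f"You can say {', '.join(options[:-1])} or {options[-1]}."
```

For example, build_prompt(["forward", "backward", "stop"]) gives "You can say forward, backward or stop."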

Things to try

  • In a fictional piece, you could set up a choice as an argument between two characters.
  • You could use a set of choices that is naturally limited, e.g. numbers from 1 to 10, or star signs.
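A naturally limited set of choices is also easy to handle in code. The sketch below, in Python, picks a number from 1 to 10 out of whatever the speech recogniser returns, assuming the recogniser hands back plain text; NUMBER_WORDS and parse_choice are hypothetical names used only for this example.

```python
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def parse_choice(utterance):
    """Return the first number from 1 to 10 in the utterance, or None."""
    for word in utterance.lower().split():
        token = word.strip(".,!?")
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
        # Recognisers may return digits rather than words.
        if token.isdecimal() and 1 <= int(token) <= 10:
            return int(token)
    return None
```

Because people already know what counts as a valid answer, the application never has to list the options out loud.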

Modes of address / managing the user, narrator and other characters

When a person interacts with a voice application, they’re always interacting with at least one voice. For simple applications, one voice is often enough - although Google Home’s model of handing off to other voices for different functions - “OK Google, talk to Uber about a ride” - is interesting, and helps someone understand when they’re shifting contexts. For more complex, narrative-driven applications, it’s likely there will be more than one character talking over the course of the experience. In these applications, managing how the characters talk to one another and how the user is addressed becomes a challenge with some subtleties.

In this case, there are a few questions you need to ask yourself:

  • Is the user present in the piece, or an unnoticed observer?
  • Is the user directly participating in the narrative with their voice, or distanced from the narrative and using their voice to navigate at a level of remove (the difference between speaking to characters directly, or using voice to choose branches in a storyline)?
  • Can all the characters in the piece address the user, or just one? Using a narrator / mediator to communicate with the user can simplify things, but it’s still important to consider how the user will know when a character is addressing them directly and when characters are talking between themselves (the ‘turning to the user’ problem).

Turn-taking / Letting the user know when they can speak

The ‘ding’

This seems like the most straightforward way to let someone know they can speak - “after the ding, say your choice”. However, there’s subtlety here: do you say “ding” or play the sound itself when referring to it? Is this confusing to the user? They have to understand the difference between a referential ding and a real one. If you say the word “ding”, do people understand that this means a “ding” sound when it’s played?

Audio mix

A more subtle way of letting the user know that they can speak in fictional, radio-like pieces is by using the audio mix. If you’re using music or sound beds for the action, you can drop these out at the time a character or narrator is addressing the user, signifying that our focus has moved away from the fiction and the user is alone with the narrator. Closer mic placement for a recorded narrator voice can also indicate closeness to the listener / user.

System voice

While we’ve identified some problems with using the system voice on VUI devices, it can be a useful way to let the user know they’re being asked a direct question, since that’s the voice they’re used to interacting with on a device. If you’re making a piece that includes many voices, consider using the system voice as a ‘bridge’ or ‘mediator’ between the user and the fiction world.

Maximum talk time

At the time of writing, you’re limited to 90 seconds of audio playback on Alexa (120 seconds on Google Home) between each user request. This means you cannot write a large chunk of dialogue to be played back as an audio file without having the user respond regularly. This is a constraint to bear in mind when designing your dialogue - how can you cause an interaction every minute or two without making it seem forced?
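One way to plan around this constraint is to group your recorded clips into segments that each fit within the limit, with a user interaction falling between segments. The Python sketch below is only a planning aid built on the figures quoted above, which may have changed since. The function plan_segments and its greedy grouping are our own illustration, not anything the platforms provide.

```python
ALEXA_LIMIT_SECONDS = 90  # Figure quoted in this post; check current docs.


def plan_segments(clip_lengths, limit=ALEXA_LIMIT_SECONDS):
    """Greedily group clip lengths (in seconds) into segments within limit.

    Clip order is preserved. Each segment boundary is a point where the
    script needs to prompt the user for a response.
    """
    segments, current, elapsed = [], [], 0
    for length in clip_lengths:
        if length > limit:
            raise ValueError(f"a {length}s clip cannot fit in {limit}s")
        if current and elapsed + length > limit:
            segments.append(current)
            current, elapsed = [], 0
        current.append(length)
        elapsed += length
    if current:
        segments.append(current)
    return segments
```

With clips of 40, 30, 35, 50 and 20 seconds, the 90-second limit gives three segments, so the dialogue needs at least two natural points for the user to speak. Passing limit=120 gives two segments and one such point.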

Thanks for reading!

These considerations have come up again and again in our design research work on Talking with Machines, and we've always found them useful to bear in mind when thinking about VUI design. We hope they're useful to you in your own work - we've also developed a prototyping methodology for voice which contains some practical VUI prototyping tips.

Thanks to the whole Talking with Machines project team for their work on the design research which led to these posts: Andrew Wood, Joanne Moore, Anthony Onumonu and Tom Howe. Thanks also to our colleagues in BBC Children’s who worked with us on a live prototyping project: Lisa Vigar, Liz Leakey, Suzanne Moore and Mark O'Hanlon.

BBC R&D - Talking with Machines

BBC R&D - Singing with Machines

BBC R&D - The Mermaid's Tears

BBC R&D - Better Radio Experiences

BBC R&D - Responsive Radio
