Posted by Henry Cooke on , last updated
Talking With Machines is a project which is exploring spoken interfaces - things like Amazon Alexa, Google Home or whatever Siri mutates into. Devices whose primary method of interaction is voice, and which pull together data and content for users in a way that makes sense when you’re talking to a disembodied voice.
Alongside some prototyping work, we’re also trying to figure out ways of thinking about experience design and UX for these kind of devices in general. This is an interesting problem – how do you design an interface for a device with no screen? This question becomes especially interesting when you consider that a lot of existing tools and techniques for UX have been developed and refined over years of the screen being the fundamental place where interactions happen with software. Amazon's Alexa developer documentation is very good, but we were interested in higher-level experience design and more general tools for thinking about voice UI (VUI).
We started at the beginning: what are the elements we need to be able to construct stories for voice UI? We identified a few key elements that, when combined, would allow us to build some example scenarios.
Some of these categories are familiar – people, context, devices. The really interesting ones when thinking about VUI are the emotional state of the person and the tone of voice being used by the device. Speech, unlike screen interaction, is a social form of interaction and comes with a lot of expectations about how it should feel. Get it wrong, and you’re into a kind of social uncanny valley. In fact, while we were developing this method, more than once we ran into a ‘creepiness problem’ – some of the scenarios, when worked through, just felt, well, creepy. An attentive, friendly personal assistant, present throughout the day, eventually felt invasive and stifling. A speaking toy for a child, connected to an AI over the internet, swiftly throws up all sorts of possible trickiness. Setting the context and tone for a conversation allows us to think more clearly about how it could unfold.
We filled out the categories with some useful sample elements and set about combining and recombining across categories to create some trial scenarios. This step is intended to be quick and playful - given the building blocks of a scenario, which can you mix and match into something which looks worth developing further?
The next step was to write example scripts for each scenario. This was another quick step; using different coloured postits for the parts of human users and talking devices we constructed a few possible conversations on a board. We’re not copywriters by any stretch, but sticking the steps up allowed us to swiftly block out conversations and move and change things that didn’t look right.
Text to speech testing
Once we’d mapped out our scripts, the next step was to test how they sounded; we needed to listen to the conversations to hear if they flowed naturally. Tom created a tool using Python and the OS X system voices which allowed us to do just this. This tool proved extremely useful, allowing us to quickly iterate versions of the scripts and hear where the holes were - some things which looked fine in text just didn’t work when spoken. Using text-to-speech does end up sounding a bit odd (like two robots having a conversation), but it meant that we didn’t have to get a human or two to read every tweaked iteration of the script and allowed us to listen as observers. It also made us think about the intent of the individual scenarios - is this how an expert guide would phrase this bit of speech? How about a teaching assistant?
Storyboarding & Animatics
The text-to-speech renderings of the scripts (auralisations?) were giving us a good sense of how the conversations themselves would sound, but we realised that in order to communicate the whole scenario in context, we’d need something more. Andrew developed a technique for rapidly storyboarding key scenes from the scenarios which allowed everyone in the team (not just those talented with a pencil) to draw frames of a subject. More on that technique in a later post.
Building on the storyboards and audio, we were able to create lo-fi video demos – animatics – which simply and effectively communicate a whole scenario. Using video enables us to illustrate how a scenario works over time, and allowed us to add extra audio cues which can be part of a VUI – things like sound effects, alerts and earcons.
A nice example of this richer audio is illustrated in our Proms scenario. A user builds a custom programme in relaxed conversation with her device while ambient sounds of the auditorium and orchestra can be heard in the background. The extra audio really helps set the tone of the interaction and makes for a warmer, more rounded experience.
We’re really happy with the prototyping methods we’ve developed – they’ve given us the ability to quickly sketch out possible scenarios for voice-driven applications and flesh out those scenarios in useful ways.
Some of these scenarios could be implemented on the VUI devices available today, some are a bit further into the future and that’s great! We want a set of tools that helps us think about the things we could be building, as well as the things we could build now.
We think we’ve got the start of a rich and useful set of methods for thinking about, designing and prototyping VUIs and we’re keen to develop them further on projects with our colleagues around the BBC. We’d also love to hear from anyone who finds them useful on their own projects!