
UX Planet — Medium | Kore
When designing a good voice user interface, it is always advantageous to know how the technology works.
Knowing what goes on behind the scenes enables you to make design decisions that take into account the current limitations and advantages of the technology.
One key component of a voice user interface (VUI) is automated speech recognition (ASR), which translates users' speech into text. A number of free and paid services provide ASR engines. When choosing an engine, it is important to keep the following two things in mind:
Not all ASR tools will have advanced features like N-best lists, end-of-speech timeouts, or the ability to incorporate custom vocabularies. It might be quicker to start with the cheapest ASR tool; however, if the recognition accuracy is sub-standard or the endpoint detection does not work well, your users will get frustrated and eventually give up on the product.
Barge-in is the ability built into a VUI that allows the user to interrupt the system at any time during the conversation. The decision to enable it will depend greatly on the type of VUI you are planning. It is advantageous if your VUI is going to read out a long list of menu items, tell a story, or generally be verbose; users might want to interrupt and stop it midway.
When deciding on a barge-in strategy, you need to decide whether to enable barge-in on anything the user says, or only on the wake word. Most common VUIs use the latter strategy. When Alexa is playing a song, the user can barge in at any time to stop it. Without barge-in, there would be no way to stop playback with a voice command.
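The wake-word strategy can be sketched in a few lines of Python. This is a minimal illustration of the decision logic only; the class, method names, and wake word are assumptions for the example, not a real VUI framework API.

```python
# Illustrative sketch of wake-word barge-in: while the system is playing
# audio, recognized speech is ignored unless it contains the wake word.

class VoiceSession:
    WAKE_WORD = "alexa"  # assumed wake word for this sketch

    def __init__(self):
        self.playing = False

    def start_playback(self):
        self.playing = True

    def on_transcript(self, text: str) -> str:
        """Decide what to do with recognized speech."""
        words = text.lower().split()
        if self.playing:
            if self.WAKE_WORD in words:
                self.playing = False       # user barged in: stop playback
                return "interrupted"
            return "ignored"               # no wake word: keep playing
        return "handled"                   # normal turn-taking otherwise
```

An "anything the user says" strategy would simply drop the wake-word check, at the cost of false interruptions from background speech.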
VUIs need to know when a user starts speaking as well as when the speech has ended. Detecting that the user has stopped speaking is handled with a timeout: after a certain period of silence, the system assumes the utterance is complete. Choosing an optimal timeout is critical to a good VUI experience. Think of a video call where the other person's voice lagged and it was difficult to follow the conversation.
There are different conditions under which the ASR engine can decide to time out.
Incorporating timeouts is essential to knowing when the user has stopped speaking.
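The silence-based timeout described above can be sketched as a scan over audio frames. This is a conceptual illustration under simple assumptions (per-frame energy values and an energy threshold); production engines use trained voice-activity-detection models rather than raw energy.

```python
# Sketch of end-of-speech detection: speech is considered finished once a
# run of `timeout_frames` consecutive quiet frames follows some speech.

def find_end_of_speech(frame_energies, silence_threshold=0.1,
                       timeout_frames=5):
    """Return the index of the frame where speech ended, or None if the
    user never spoke or never stopped within the recording."""
    silent_run = 0
    spoke = False
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            spoke = True               # user is (still) speaking
            silent_run = 0
        else:
            silent_run += 1
            if spoke and silent_run >= timeout_frames:
                # the silence run started timeout_frames - 1 frames ago
                return i - timeout_frames + 1
    return None
```

Tuning `timeout_frames` is the trade-off the article alludes to: too short and the system cuts users off mid-sentence; too long and every turn feels laggy, like the delayed video call.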
When a user speaks to the VUI, the speech recognition system returns more than one candidate for what was said. It assigns a confidence value to each result and usually picks the one with the highest confidence. In simple terms, a confidence value is a percentage that indicates how certain the system is about a particular result. For example, when you say “Read me a book,” the system might interpret it as follows:
If you’ve designed your VUI to read books, the system would pick the first result.
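In code, the default behavior is just a max over confidence scores. The hypothesis texts and scores below are invented for illustration; they are not output from any real ASR engine.

```python
# Illustrative ASR output: each hypothesis carries a confidence score,
# and by default the engine surfaces the most confident one.

def best_hypothesis(hypotheses):
    """Pick the transcription with the highest confidence score."""
    return max(hypotheses, key=lambda h: h["confidence"])

results = [
    {"transcript": "read me a book",  "confidence": 0.92},
    {"transcript": "red me a book",   "confidence": 0.61},
    {"transcript": "reed me a brook", "confidence": 0.34},
]
```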
A recognition engine often does not return only one result. It returns an N-best list: what it thinks the user might have said, ordered by likelihood. It is usually the top 5–10 results, each with a confidence score.
N-best lists are useful in cases where you’ve designed the system to answer in a narrow domain. For example, in a VUI that gives information about animals, when you say “Show me a Badger,” the ASR tool might interpret it as follows:
Since you already know that this VUI is about animal information, it can search the list for animal names and pick the second result even if it does not have the highest confidence.
Another use of N-best lists is correcting information when the first answer turns out to be invalid.
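The animal-information example can be sketched as a simple rerank over the N-best list. The vocabulary, hypotheses, and scores are made up for this sketch; real systems typically rescore with a domain language model rather than keyword matching.

```python
# Sketch of domain-aware N-best reranking: prefer the highest-confidence
# hypothesis that mentions a word from the known domain vocabulary.

ANIMALS = {"badger", "beaver", "bat"}

def rerank(n_best, vocabulary=ANIMALS):
    """Return the first hypothesis, in confidence order, that contains a
    domain word; fall back to the raw top result if none match."""
    ranked = sorted(n_best, key=lambda h: h["confidence"], reverse=True)
    for hyp in ranked:
        if vocabulary & set(hyp["transcript"].lower().split()):
            return hyp
    return ranked[0]

n_best = [
    {"transcript": "show me a badge",  "confidence": 0.81},  # raw top result
    {"transcript": "show me a badger", "confidence": 0.77},  # domain match
    {"transcript": "show me a bandit", "confidence": 0.40},
]
```

Here the second hypothesis wins despite its lower confidence, because "badger" is in the domain vocabulary while "badge" is not.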
Many studies show that automated speech recognition tools have an accuracy of more than 90%; however, this is under ideal conditions. An ideal condition for an ASR tool is an adult male speaker in a quiet room with a good microphone. Needless to say, most real-life conditions are not ideal.
Handling noise is currently one of the most difficult challenges for ASR engines. Noise refers to situations like a noisy environment, a television in the background, multiple people talking at the same time, or side speech, where the user talks to another person while the VUI is listening. Some VUIs can detect noise and ask the user to move to a less noisy environment. Beyond that, however, there is not much you can do apart from waiting for the technology to improve.
It is much more difficult for ASR tools to accurately recognize children's voices. Children have higher-pitched voices owing to shorter vocal tracts, and as yet there is less training data for that type of speech. Children also often stutter and repeat words, which is another challenge for ASR tools. Much like the noise problem, things are improving.
It is easier for ASR tools to recognize longer phrases like “yes I will” than shorter responses like “yes.” Names, spellings, and alphanumeric strings are also hard to recognize, and it often makes sense to offer a GUI input for these. The name ‘Karan’ is often misinterpreted as ‘Karen,’ so asking the VUI to call Karan might accidentally call the wrong person. Again, the technology is improving.
When designing voice interfaces, it is important not to store private data unless absolutely necessary. When you do store such data, make sure the user is aware and has the choice to deny access. It might be tempting to store conversations and use that information to improve the experience, but even if the device is constantly listening, no data from before the user says the wake word should be stored or sent to the cloud. Users expect privacy, and breaking your users' trust is the worst thing you can do.
Understanding speech recognition to design better Voice interfaces was originally published in UX Planet on Medium, where people are continuing the conversation by highlighting and responding to this story.