Google's Bard Fail and the Major Usability Problem With Conversational AIs
Google’s conversational AI “Bard” failed spectacularly in its very first demo by providing false information about the James Webb Space Telescope, a mistake that caused parent company Alphabet’s stock value to drop by more than 7%.
It’s strange that Google didn’t fact-check the demo, since it’s a well-known problem that conversational AIs make up information and confidently present it as fact. What exactly goes wrong is complex, but fundamentally it’s a problem with the algorithms, and it’s not easily fixed.
False information is a big issue for conversational AIs like Bard and ChatGPT, both because of the ethical and larger societal consequences, but also because it’s simply bad usability to have a system where it’s so unclear whether the system is stating facts or not.
The very real financial consequences of Google’s Bard error made me curious about the usability of conversational AIs. An analysis of the usability might tell us something about how to avoid spreading false information like Google accidentally did.
I decided to make a structured analysis using Jakob Nielsen’s 10 design heuristics, slightly adapted to my purposes. The heuristics are more than 30 years old, but because they are based on user needs, they're still relevant today, even though the systems are very different from the GUIs of the 1990s. I don’t have access to Bard yet, so my analysis is mostly based on OpenAI’s ChatGPT, which has similar problems.
The analysis provides insights into what good design for conversational AI interaction looks like, both in terms of the general usability and how to handle errors.
Let's start with the first heuristic and go from there:
1. Visibility of system status: The user should always be able to tell what the system is doing.
ChatGPT gets visibility of system status right on the surface, but, like similar AIs, very wrong on a deeper level.
The Good: When you interact with ChatGPT, three animated dots and the gradually appearing text let users know that the system is generating a response. Because of ChatGPT’s popularity, response times can be quite long, so it’s important for users to know whether the AI is working or not.
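The streaming pattern behind this status display can be sketched in a few lines. This is a minimal illustration with hypothetical names, not ChatGPT’s actual implementation: the point is simply that the UI is updated with each partial chunk instead of waiting for the full answer.

```python
import time

def stream_response(tokens, delay=0.0, on_token=print):
    """Show partial output as it arrives, so the user can see the system working.

    `tokens` stands in for a model's token stream; `on_token` is whatever
    the UI uses to append a chunk to the visible answer (here just print).
    """
    received = []
    for token in tokens:
        received.append(token)
        on_token(token)    # update the visible answer with each chunk
        time.sleep(delay)  # stands in for model/network latency
    return "".join(received)
```

Even when the full answer takes many seconds, the user is never left staring at a blank screen wondering whether the system has stalled.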
The Bad: Nobody knows how conversational AIs arrive at the answers they generate. We know something about how the AI was trained, but exactly what happens when the algorithm produces an answer is a black box. It’s unknown to users, to developers, and to the AI itself. That means the AI cannot explain how it got a weird result, whether it’s making stuff up, or what its sources are. The consequence is that conversational AIs unknowingly spread misinformation and provide false answers, just like Bard did.
This is a huge problem for conversational AIs, because they seem equally confident when they are reporting facts and when they are just making stuff up. So confident, apparently, that Bard managed to convince Google. The problem can be somewhat mitigated by design, which we’ll get back to, but it’s a challenge for AI developers to make their algorithms aware of their own process, so misinformation doesn’t happen in the first place.
2. Match between system and the real world: Use language and phrases that are familiar to the user.
Conversational AIs excel at this. Interaction with a conversational AI feels like a natural conversation rather than interaction with a technology. So much so that users start expecting the AI to have the same capabilities as another person. That’s problematic, because the AI will inevitably disappoint those expectations at some point, and the match between system and the real world will no longer be there.
For a conversational AI, match between system and real world arguably becomes about managing the users’ expectations. ChatGPT tries to manage expectations by constantly reminding the user that it’s a language model and not a person.
ChatGPT also doesn’t have a cute name or fun avatar, so there’s nothing in the interface design that signals that ChatGPT is somehow alive. Google’s conversational AI has been named Bard, which invites more anthropomorphism. I don’t know much about the rest of their design, so it will be interesting to see how they handle the users’ expectations.
3. User control and freedom: Make it easy for users to undo errors and exit interactions
ChatGPT gives users control. The interface clearly shows how to stop the AI’s answer generation if you can see that what it’s producing isn’t relevant.
While the AI remembers your previous conversation, you can also easily ask it to disregard that history and start from scratch.
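Mechanically, a “stop generating” control can be as simple as a flag checked between tokens. The sketch below is an assumption about how such a control could work, not a description of ChatGPT’s internals:

```python
import threading

class StoppableGeneration:
    """A 'stop generating' button modeled as a flag checked between tokens."""

    def __init__(self):
        self._stop = threading.Event()  # set by the UI's stop button

    def stop(self):
        self._stop.set()

    def generate(self, tokens):
        out = []
        for token in tokens:
            if self._stop.is_set():  # user pressed stop: abandon the rest
                break
            out.append(token)
        return "".join(out)
```

The design point is that the check happens at every token boundary, so the user regains control almost immediately instead of waiting for the full answer to finish.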
4. Consistency and Standards: Users shouldn’t have to learn new standards just for your product
This still seems like a work in progress. It’s easy to remember how to interact with conversational AIs because users use natural language, but there are still prompt-writing techniques that work particularly well for generating usable content. OpenAI has written guidelines for how to write effective prompts, and hopefully there will one day be a shared convention for effective prompts, so they don’t differ from AI to AI.
5. Error Prevention: Make it difficult for users to commit errors
Conversational AIs are simple to use because they use natural language, so it’s not easy for the average user to commit errors. ChatGPT handles grammatical and spelling errors pretty well, but I assume the text-based interaction can present problems for users who have severe dyslexia or who are illiterate. In the future, conversational AIs will probably include speech interfaces, though.
While it’s difficult for users to make errors in the traditional sense, it’s another matter if we stretch Nielsen’s heuristic to also include the user error of misinterpreting the conversational AI’s answers as facts. AIs should make it clear to users that they cannot always be trusted, to prevent users from unknowingly spreading misinformation.
6. Recognition rather than recall: Don’t make the user remember stuff, show them what they can do
Conversational AIs differ from more traditional GUIs in how they provide information: most of the user’s action possibilities are not visible.
In terms of recognition rather than recall, it’s positive that ChatGPT provides users with answers in real-time, eliminating the need for users to recall information from previous interactions. There’s no easy way for the user to recognize the best way to prompt the system, though, so it might improve the interaction if the interface provided some visually recognizable shortcuts for creating effective prompts.
7. Flexibility and Efficiency of Use: Make the system easy to use for beginners as well as experts
Flexibility and efficiency of use seem to be built into conversational AIs. The system automatically responds to the kinds of prompts it gets. It works for complete beginners as well as for expert prompt writers. You have to be able to read and write to interact with it, but it’s pretty forgiving of language errors. ChatGPT even speaks a multitude of languages, and the same is probably true of Bard.
8. Aesthetic and Minimalist Design: Don’t show unnecessary information
Conversational AIs only require prompts, so it’s pretty easy to create aesthetic and minimalist designs. As we’ve seen earlier, ChatGPT doesn’t have any visual reminders of good prompts, and finding an elegant way to add those might create better usability (according to heuristic 6).
9. Help users recognize, diagnose, and recover from errors
It's easy for the user to recognize, diagnose, and recover from errors. The AI can inform the user when it's unable to fulfill a request or if it doesn't understand a question and ask the user to rephrase.
It's obviously a different matter if we stretch user errors beyond their original meaning again, to include the user misunderstanding answers as facts. ChatGPT, and by the looks of it also Bard, make it impossible for the user to recognize and diagnose when they provide misinformation, because the AI sounds equally confident when it’s right and when it’s making stuff up. Hopefully, future conversational AIs will be able to avoid creating misinformation, or at least provide some form of truthful confidence level for their responses, but let’s not assume that will happen anytime soon. Instead, conversational AI designers can try to mitigate the problem by:
Reminding users that they need to fact check information with other sources.
Reminding users that answers might be incorrect.
Changing the wording of answers to make them less confident.
Explaining the importance of having credible scientific sources.
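The first two mitigations above amount to attaching a warning to answers before showing them. Here is a minimal sketch; the `confidence` parameter is hypothetical, since today’s models don’t expose a reliable confidence score, which is exactly why the default is to always warn:

```python
def add_disclaimer(answer, confidence=None):
    """Append a fact-check reminder to a model answer before display.

    `confidence` is a hypothetical score a future model might report;
    absent one, every answer gets the warning.
    """
    warning = ("Note: this answer may contain errors. "
               "Please verify important facts with other sources.")
    if confidence is not None and confidence >= 0.9:
        return answer  # only a (hypothetical) high-confidence answer skips it
    return f"{answer}\n\n{warning}"
```

A blanket warning is blunt, but it is the only honest option as long as the system can’t tell the difference between its facts and its fabrications.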
From what I’ve seen, OpenAI actually does what it can to remind the user that the AI sometimes makes up information. A warning about misinformation is the first thing the user sees when they open ChatGPT.
It’s also stated multiple times in their FAQs. ChatGPT itself also informs users from time to time that they should check other sources or that something lacks scientific consensus.
The problem is that it’s unpredictable. Sometimes ChatGPT is very cautious; at other times it confidently provides convincing misinformation without any warnings or hesitations. That makes it almost impossible to avoid making potentially costly errors, no matter how many warnings are inserted into the system.
10. Help and Documentation: Make Sure That the User Can Get Help if Needed
OpenAI has an easily discoverable FAQ that focuses on explaining why users shouldn’t always trust the information provided by the AI. They also have a Discord community where it’s possible to discuss their products.
My conclusion to all of this is that OpenAI gets a lot right in the interaction design of ChatGPT, but as long as the algorithm fails so spectacularly at heuristic 1 (visibility of system status) and provides misinformation, it’s going to be problematic to use conversational AIs to provide facts. It’s simply not an appropriate tool for that right now. That shouldn’t stop us from using it for all the other things it’s actually good at.
Finally, I asked ChatGPT to write a blog post evaluating itself using Jakob Nielsen’s heuristics. Unsurprisingly, its conclusion is more unconditionally positive than mine:
The answer ended abruptly when the system stopped responding because of an overwhelming demand by other users. Hopefully, it didn't provide too many of them with misinformation.