Google's conversational AI "Bard" failed spectacularly in its very first demo by providing false information about the James Webb Space Telescope, a mistake that caused parent company Alphabet's stock value to drop by more than 7%.
It's strange that Google didn't fact-check the demo, since it's a well-known problem that conversational AIs make up information and confidently present it as fact. Exactly what goes wrong is complex, but in essence it's a problem with the algorithms, and it's not easily fixed.
False information is a big issue for conversational AIs like Bard and ChatGPT, both because of the ethical and larger societal consequences, and because it's simply bad usability to have a system where it's so unclear whether the system is stating facts or not.
The very real financial consequences of Google's Bard error made me curious about the usability of conversational AIs. An analysis of the usability might tell us something about how to avoid spreading false information like Google accidentally did.
Usability Analysis
I decided to make a structured analysis using Jakob Nielsen's 10 design heuristics, slightly adapted to my purposes. The heuristics are more than 30 years old, but because they are based on user needs, they're still relevant today, even though the systems are very different from the GUIs of the 1990s. I don't have access to Bard yet, so my analysis is mostly based on OpenAI's ChatGPT, which has similar problems.
The analysis provides insights into what good design for conversational AI interaction looks like, both in terms of the general usability and how to handle errors.
Let's start with the first heuristic and go from there:
1. Visibility of system status: The user should always be able to tell what the system is doing.
ChatGPT gets visibility of system status right on the surface, but, like similar AIs, very wrong on a deeper level.
The good: When you interact with ChatGPT, three animated dots and the gradually appearing text let users know that the system is generating a response. Because of ChatGPT's popularity, response times can be quite long, so it's important for users to know whether the AI is working or not.
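This kind of visibility is cheap to build. A minimal sketch of the token-by-token display pattern, with the model output simulated rather than fetched from a real API (the token list and delay are stand-ins, not ChatGPT's actual internals):

```python
import sys
import time

def stream_response(tokens, delay=0.0):
    """Print tokens one at a time so the user sees progress immediately,
    instead of staring at a blank screen until the full answer is ready."""
    received = []
    for token in tokens:
        sys.stdout.write(token)
        sys.stdout.flush()   # show each token as soon as it "arrives"
        time.sleep(delay)    # stand-in for model/network latency
        received.append(token)
    sys.stdout.write("\n")
    return "".join(received)

# Simulated model output, pre-split into tokens
answer = stream_response(["Generating ", "an ", "answer", "..."], delay=0.05)
```

The design point is the flush after every token: the partial answer doubles as the progress indicator, which is exactly the signal heuristic 1 asks for.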
The bad: Nobody knows how conversational AIs arrive at the answers they generate. We know something about how the AI was trained, but exactly what happens when the algorithm produces an answer is a black box. It's unknown to users, to developers, and to the AI itself. That means it cannot explain how it got a weird result, whether it's making stuff up, or what its sources are. The consequence is that conversational AIs unknowingly spread misinformation and provide false answers, just like Bard did.
This is a huge problem for conversational AIs, because they seem equally confident when they are reporting facts and when they are just making stuff up. So confident, apparently, that Bard managed to convince Google. The problem can be somewhat mitigated by design, which we'll get back to, but it's a challenge for AI developers to make their algorithms aware of their own process, so misinformation doesn't happen in the first place.
2. Match between system and the real world: Use language and phrases that are familiar to the user.
Conversational AIs excel at this. Interacting with a conversational AI feels like a natural conversation rather than interaction with a technology. So much so that users start expecting the AI to have the same capabilities as another person. That's problematic, because the AI will inevitably disappoint those expectations at some point, and the match between system and the real world will no longer be there.
For a conversational AI, match between system and real world arguably becomes about managing the users' expectations. ChatGPT tries to manage expectations by constantly reminding the user that it's a language model and not a person.
ChatGPT also doesn't have a cute name or fun avatar, so there's nothing in the interface design that signals that ChatGPT is somehow alive. Google's conversational AI has been named Bard, which invites more anthropomorphism. I don't know much about the rest of their design, so it will be interesting to see how they handle the users' expectations.
3. User control and freedom: Make it easy for users to undo errors and exit interactions
ChatGPT gives users control. The interface clearly shows how to stop the AI's answer generation if you can see that what it's producing isn't relevant.
While the AI remembers your previous conversation, you can also easily ask it to disregard that history and start from scratch.
4. Consistency and Standards: Users shouldn't have to learn new standards just for your product
This still seems like a work in progress. It's easy to remember how to interact with conversational AIs because users use natural language, but there are still methods of writing prompts that work particularly well for generating usable content. ChatGPT has written guidelines for how to write effective prompts, and hopefully there will be a future convention for effective prompts, so it doesn't differ from AI to AI.
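No such convention exists yet, but a hypothetical helper can illustrate what one might look like: the persona/task/constraints pattern below is one common informal approach to prompt writing, not an official standard, and the function and its arguments are my own invention for illustration.

```python
def build_prompt(role, task, constraints):
    """Assemble a prompt from a recurring informal pattern:
    a persona, a concrete task, and explicit constraints."""
    lines = [f"You are {role}.", f"Task: {task}", "Constraints:"]
    lines.extend(f"- {c}" for c in constraints)
    return "\n".join(lines)

prompt = build_prompt(
    role="an experienced travel writer",
    task="Suggest a three-day itinerary for Lisbon.",
    constraints=["Keep it under 200 words", "Assume a modest budget"],
)
print(prompt)
```

If a cross-AI convention ever emerges, it will presumably look like this: a small, shared vocabulary of prompt parts that users can carry from one system to another.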
5. Error Prevention: Make it difficult for users to commit errors
Conversational AIs are simple to use because they rely on natural language, so it's not easy for the average user to commit errors. ChatGPT handles grammatical and spelling errors pretty well, but I assume the text-based interaction can present problems for users who have severe dyslexia or who cannot read. In the future, though, conversational AIs will probably include speech interfaces.
While it's difficult for users to make errors in the traditional sense, it's another matter if we stretch Nielsen's heuristic to also include the user error of misinterpreting the conversational AI's answers as facts. AIs should make it clear to users that their answers cannot be blindly trusted, to prevent users from unknowingly spreading misinformation.
6. Recognition rather than recall: Don't make the user remember stuff, show them what they can do
Conversational AIs differ from more traditional GUIs in how they provide information: most of the user's action possibilities are not visible.
In terms of recognition rather than recall, it's positive that ChatGPT provides users with answers in real time, eliminating the need for users to recall information from previous interactions. There's no easy way for the user to recognize the best way to prompt the system, though, so it might improve the interaction if the interface provided some visually recognizable shortcuts for creating effective prompts.
7. Flexibility and Efficiency of Use: Make the system easy to use for beginners as well as experts
Flexibility and efficiency of use seem to be built into conversational AIs. The system automatically responds to the kinds of prompts it gets. It works for complete beginners as well as for expert prompt writers. You have to be able to read and write to interact with it, but it's pretty forgiving of language errors. ChatGPT even speaks a multitude of languages, and the same is probably true of Bard.
8. Aesthetic and Minimalist Design: Don't show unnecessary information
Conversational AIs only require prompts, so it's pretty easy to create aesthetic and minimalist designs. As we've seen earlier, ChatGPT doesn't have any visual reminders of good prompts, and finding an elegant way to add them might create better usability (see heuristic 6).
9. Help users recognize, diagnose, and recover from errors
In the traditional sense, it's easy for the user to recognize, diagnose, and recover from errors. The AI can inform the user when it's unable to fulfill a request, or when it doesn't understand a question, and ask the user to rephrase.
It's obviously a different matter if we stretch user errors beyond their original meaning again, to include the user misunderstanding answers as facts. ChatGPT, and by the looks of it also Bard, make it impossible for the user to recognize and diagnose misinformation, because the AI sounds equally confident when it's right and when it's making stuff up. Hopefully, future conversational AIs will be able to avoid creating misinformation, or at least provide some form of truthful confidence level for their responses, but let's not assume that will happen anytime soon. Instead, conversational AI designers can try to mitigate the problem by:
Reminding users that they need to fact check information with other sources.
Reminding users that answers might be incorrect.
Changing the wording of answers to make them less confident.
Explaining the importance of having credible scientific sources.
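The last two points can at least be approximated in the interface layer. A hypothetical sketch (the hedge list, function name, and disclaimer text are all my own invention, and the naive word replacement is only a crude illustration; it does nothing to make the underlying answer more truthful):

```python
# Hypothetical interface-layer mitigation: soften overconfident wording and
# append a standing reminder before the answer reaches the user.
HEDGES = {
    "definitely": "likely",
    "always": "usually",
    "proves": "suggests",
}

DISCLAIMER = "\n\nNote: this answer may contain errors. Please verify it with other sources."

def soften(answer: str) -> str:
    # Naive substring replacement, used here only to illustrate the idea;
    # a real system would need word boundaries and context awareness.
    for confident, hedged in HEDGES.items():
        answer = answer.replace(confident, hedged)
    return answer + DISCLAIMER

print(soften("This definitely proves the theory."))
```

Mitigations like this change how confident the answer sounds, not how correct it is, which is exactly why they are a stopgap rather than a fix.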
From what I've seen, OpenAI actually does what it can to remind the user that the AI sometimes makes up information. A warning about misinformation is the first thing the user sees when they open ChatGPT.
It's also stated multiple times in their FAQs. ChatGPT itself also occasionally informs users that they should check other sources, or that something lacks scientific consensus.
The problem is that it's unpredictable. Sometimes ChatGPT is very cautious; at other times it confidently provides convincing misinformation without any warnings or hesitations. That makes it almost impossible to avoid potentially costly errors, no matter how many warnings are inserted into the system.
10. Help and Documentation: Make sure that the user can get help if needed
OpenAI has an easily discoverable FAQ that focuses on explaining why users shouldn't always trust the information provided by the AI. They also have a Discord community where it's possible to discuss their products.
In Conclusion
My conclusion to all of this is that OpenAI gets a lot right in their interaction design of ChatGPT, but as long as the algorithm fails so spectacularly at heuristic 1 (visibility of system status) and provides misinformation, it's going to be problematic to use conversational AIs to provide facts. It's simply not an appropriate tool for that right now. That shouldn't stop us from using it for all the other things it's actually good at.
Finally, I asked ChatGPT to write a blog post evaluating itself using Jakob Nielsen's heuristics. Unsurprisingly, its conclusion is more unconditionally positive than mine:
The answer ended abruptly when the system stopped responding because of overwhelming demand from other users. Hopefully, it didn't provide too many of them with misinformation.