Researchers at Cornell University, the University of Washington, the University of Waterloo and the Allen Institute for Artificial Intelligence have developed a benchmark to measure how often AI chatbots give confident responses that are not supported by their training data, a phenomenon known as hallucinating.

Using the benchmark, the researchers tested 15 different large language models (LLMs), including ChatGPT, Llama and Command R. Even the best models produced hallucination-free text in only 35% of cases, showing that LLM output is not very reliable.
For topics without a Wikipedia page, the probability of hallucinating was higher, because many models were trained on data from that site. The researchers therefore deliberately ensured that 50% of their questions could not be answered using Wikipedia. As a result, their findings differ from the reliability claims made by AI companies.
Subject matter also had an influence: the models gave more correct answers on topics such as geography and computer science, while questions about celebrities and finance proved difficult for the LLMs. No model did well across all subjects.
Claude 3 Haiku gave the most fact-based answers, but mainly because it declined to answer in 28% of cases. With those refusals excluded, OpenAI's models turn out to be the most reliable.
AI chatbots' answers can sound very convincing, but they are not always accurate: chatbots can make things up or rely on incorrect information. Especially as AI chatbots take on a growing role in society, it is important to realize how unreliable their answers can be.
AI hallucinations could become dangerous if people accept chatbot answers as true. "Policies and regulations should be developed to ensure that human experts are always involved in the process of verifying and validating the information generated by generative AI models," said lead researcher Wenting Zhao.
