Audit of Hallucinations in LLM-based Models and Solutions
Gen AI QA Director
Recently, the development of large language models (LLMs) has ushered in a new era in artificial intelligence, making it possible to build solutions that generate coherent, contextually relevant text, answer questions, and tackle complex problems. Along with these capabilities, however, the models have introduced a critically important problem: hallucinations, the generation of seemingly plausible but factually incorrect or fabricated information.
Simply put, a hallucination is a situation in which a model confidently asserts non-existent facts, invents quotes, or links to non-existent sources. This not only undermines trust in the system but can also lead to serious consequences for the brand.
A striking example of the severity of this problem was an incident in November 2025, when Google was forced to remove its Gemma model from AI Studio after it generated entirely fictitious allegations against US Senator Marsha Blackburn, backed by non-existent links to news articles. The senator described this not as a harmless hallucination, but as defamation created and propagated by an AI model owned by Google. That is why the development of methods for auditing and evaluating hallucinations remains a priority area of research in AI reliability.
It is also important to note that hallucinations affect not only standalone LLMs but also more complex systems built on top of them, such as RAG (Retrieval-Augmented Generation) pipelines and autonomous AI agents. Each type of system adds its own nuances to the problem and requires specific approaches to detection and measurement. In this article, I will walk through a comprehensive approach to hallucination auditing that covers these different types of systems and the corresponding evaluation methodologies.
Before moving on to measurement methods, it is important to understand the nature of hallucinations in LLMs. Unlike human hallucinations, which are perceptual distortions, LLM hallucinations arise from fundamental features of model architecture and training. Models do not think the way humans do; they are trained to predict the most likely continuation of a text based on patterns learned from training data and the instructions they receive. This sometimes leads to output that looks plausible in terms of language patterns but is not true. Most often, the problem occurs when the model cannot find the relevant information in its training data and fills the gap on its own, simply generating the most likely next words one after another.
In the case of the Gemma model and Senator Blackburn, it is likely that, when asked about possible allegations against the politician, the model found no real information. Instead of acknowledging that it had no data, it synthesized a typical political-scandal narrative from patterns it had seen in training data about other public figures, producing a detailed but entirely fictitious story with specific dates, circumstances, and even bogus references to sources.
Benchmarks for evaluating hallucinations
One approach to evaluating hallucinations is to use benchmarks: large pools of questions designed to provoke hallucinations or to confirm that the model recognizes when it lacks the data to answer. One of the most notable is TruthfulQA, which contains questions specifically crafted to tempt the model into reproducing common misconceptions or false information. This benchmark evaluates not only the model's factual accuracy but also its ability to withhold an answer when it is unsure of the information.
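As an illustration, here is a minimal sketch of what a TruthfulQA-style audit loop might look like. It assumes the Hugging Face `datasets` package; the field names follow the public dataset card, and `ask_model()` is a hypothetical wrapper around the model under test, not part of any specific product.

```python
# A minimal sketch of a TruthfulQA-style audit loop.
from datasets import load_dataset

def ask_model(question: str) -> str:
    """Hypothetical call into the model being audited."""
    raise NotImplementedError

def run_truthfulqa_audit() -> None:
    data = load_dataset("truthful_qa", "generation")["validation"]
    flagged = []
    for row in data:
        answer = ask_model(row["question"])
        # Naive screen: flag answers that echo a known misconception verbatim.
        # A real audit would use a judge model or human review instead.
        if any(bad.lower() in answer.lower() for bad in row["incorrect_answers"]):
            flagged.append((row["question"], answer))
    print(f"Potentially hallucinated answers: {len(flagged)} / {len(data)}")
```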
HaluEval takes a more comprehensive approach, including hallucination detection tasks across a variety of contexts, from text summarization to dialogue systems. What distinguishes this benchmark is that it not only measures the frequency of hallucinations but also evaluates a model's ability to identify hallucinations in generated text on its own, which opens the door to self-testing and self-analyzing systems.
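In the same spirit as HaluEval's detection task, a self-checking step can be sketched as an LLM-as-judge call. The prompt wording and the `judge_llm()` helper below are assumptions for illustration only.

```python
# Sketch of a self-detection check: a judging model decides whether a
# candidate answer is supported by the given source text.

JUDGE_PROMPT = """You are checking an answer for hallucinations.
Source document:
{document}

Candidate answer:
{answer}

Reply with exactly one word: "SUPPORTED" if every claim in the answer
is backed by the source, otherwise "HALLUCINATED"."""

def judge_llm(prompt: str) -> str:
    """Placeholder for a call to your judging model."""
    raise NotImplementedError

def detect_hallucination(document: str, answer: str) -> bool:
    verdict = judge_llm(JUDGE_PROMPT.format(document=document, answer=answer))
    return verdict.strip().upper().startswith("HALLUCINATED")
```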
Another solution for identifying hallucinations is FactScore, which focuses on fine-grained evaluation of factual accuracy in long texts, especially biographies. The methodology breaks the generated response of an AI solution down into atomic facts and tests each of them against reliable sources. This enables a granular evaluation of accuracy and helps you understand which types of facts the model tends to distort most often.
For example, when analyzing the biography of a famous person, the system can identify dozens of individual statements from date of birth and place of study to career achievements and citations, and assign each a binary credibility rating. The result is not just an overall percentage of accuracy, but also a detailed map of the model’s reliability in various categories of information, which allows developers to target system weaknesses.
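A FactScore-style score can be sketched as follows; both helper functions are placeholders (typically an LLM-based claim decomposer and a retrieval-backed checker), so this shows the shape of the calculation rather than the reference implementation.

```python
# FactScore-style scoring sketch: split a long answer into atomic facts and
# verify each one against a trusted corpus.

def decompose_into_atomic_facts(text: str) -> list[str]:
    """Placeholder: typically an LLM prompt that returns one claim per line."""
    raise NotImplementedError

def is_supported_by_corpus(fact: str) -> bool:
    """Placeholder: retrieve evidence (e.g. from an encyclopedia) and verify."""
    raise NotImplementedError

def fact_score(generated_biography: str) -> float:
    facts = decompose_into_atomic_facts(generated_biography)
    if not facts:
        return 0.0
    supported = sum(is_supported_by_corpus(f) for f in facts)
    return supported / len(facts)  # share of atomic facts that check out
```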
Metrics for RAG systems and AI agents
But even if an LLM is tested on benchmarks and shows good results, this does not guarantee that hallucinations will disappear when the same LLM is used inside a RAG pipeline or an AI agent. The architecture of a modern AI solution introduces its own hallucination risks, specific to the technology involved. In RAG, for example, hallucinations can take several forms: the model can overlook relevant information from the context, misinterpret it, or introduce information that is not present in the sources.
One key metric for evaluating hallucinations is Faithfulness, which measures the extent to which the generated response is grounded in the provided context. At our company, we measure it with our proprietary EvalTool solution and its underlying algorithm: each statement in the response is analyzed to determine whether it is supported by information from the retrieved context. A high Faithfulness score indicates that the system generates responses that stay close to the sources, which reduces the risk of hallucinations. You can find out more about this metric in our open-source library.
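Conceptually, a faithfulness score of this kind can be computed as the fraction of claims in the answer that are entailed by the retrieved context. The sketch below mirrors that idea, not the EvalTool implementation itself; the two helpers stand in for a claim extractor and an entailment check.

```python
# Minimal faithfulness sketch: supported claims divided by total claims.

def extract_claims(answer: str) -> list[str]:
    """Placeholder: split the answer into individual factual statements."""
    raise NotImplementedError

def is_entailed(claim: str, context: str) -> bool:
    """Placeholder: NLI model or LLM judge deciding if the context supports the claim."""
    raise NotImplementedError

def faithfulness(answer: str, retrieved_context: str) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 1.0  # nothing asserted, nothing to contradict
    supported = sum(is_entailed(c, retrieved_context) for c in claims)
    return supported / len(claims)
```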
It is also important to track Context Precision and Context Recall, a pair of metrics that assess the quality of information retrieval. Context Precision measures the proportion of relevant information in the retrieved context, while Context Recall evaluates how completely the required information is covered. An imbalance between these metrics can point to potential sources of hallucinations that the development team needs to address.
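In simplified, set-based form the two retrieval metrics look like this; production implementations often weight precision by rank, and the relevance labels themselves come from a human annotator or a judge model.

```python
# Simplified sketch of the retrieval-side metrics. A "relevant" chunk is one
# that was marked as needed for the reference answer; labelling is out of scope.

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the required information that made it into the context."""
    if not relevant:
        return 1.0
    return sum(chunk in relevant for chunk in retrieved) / len(relevant)
```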
Autonomous AI agents present a more complex case for evaluating hallucinations than RAG systems, as they not only generate text but also make decisions and take actions based on their understanding of the situation.
A common failure mode is intermediate hallucination, where the agent generates information as it runs, suggesting data that does not exist or creating artificial connections between the pieces of data it finds. Hallucinated search results can also occur, where the agent believes it has found information that was never actually retrieved from the sources. This may arise from confusion between the content it actually found and the model's pretrained knowledge, or from “filling in” expected results when the search did not return the required information. For example, an AI agent might claim to have found precise financial figures in a document, even though the document only contains a general discussion of the topic.
Another challenge with agentic AI is the risk of output that is plausible but drifts from its actual context. An agent may correctly identify a technical document but then summarize it with simplifications or generalizations that introduce inaccuracies not present in the original source. Hallucination of connections occurs when the agent invents relationships between found pieces of information that do not exist in reality. For example, an agent might state that “Document A’s results are validated by Document B,” while Document B neither references nor cross-checks Document A’s work.
Searching for and detecting hallucinations in an AI agent is much more complicated. It is not always possible in a fully automatic mode, so the AI engineer often has to manually analyze execution traces for each request in the evaluation dataset.
In terms of quality metrics, we use the following to evaluate hallucinations (a minimal computation sketch follows the list):
- Claim Traceability is the ratio of claims with traceable sources to the total number of factual claims. This metric shows how consistently the system backs its claims with specific sources, which is crucial for the transparency and verifiability of the agent's output.
- Reference Accuracy is the ratio of correct references to the total number of cited sources. A high value indicates that the agent not only cites sources but does so correctly, without distorting the context or attributing to sources information that is not there.
- Hallucination Rate is the proportion of unsubstantiated statements among all statements produced by the system. This aggregate metric provides an overall picture of the agent's reliability and helps track progress in reducing hallucinations as the system is iteratively improved.
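The sketch below shows how these three ratios can be computed once claims have been annotated. The `Claim` structure and its labels are illustrative assumptions; in practice the annotations come from trace analysis, automated or manual.

```python
# Sketch of the three agent-level ratios, computed from annotated claims.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    cited_sources: list[str]   # sources the agent pointed to
    traceable: bool            # can the claim be traced to any real source?
    citations_correct: bool    # do the cited sources actually say this?
    supported: bool            # is the claim substantiated at all?

def claim_traceability(claims: list[Claim]) -> float:
    return sum(c.traceable for c in claims) / len(claims) if claims else 0.0

def reference_accuracy(claims: list[Claim]) -> float:
    cited = [c for c in claims if c.cited_sources]
    return sum(c.citations_correct for c in cited) / len(cited) if cited else 1.0

def hallucination_rate(claims: list[Claim]) -> float:
    return sum(not c.supported for c in claims) / len(claims) if claims else 0.0
```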
In addition to direct hallucination metrics, a critical aspect of AI agent reliability is robustness: the ability to give consistent answers to different formulations of the same request. Unstable behavior under minimal changes in wording often correlates with an increased tendency to hallucinate, as it indicates a shallow understanding of the context. To surface such problems, we employ metamorphic testing, which detects hallucinations by analyzing the consistency of responses across query variations.
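A metamorphic consistency check of this kind can be sketched as follows; `ask_agent()`, `semantic_similarity()`, the paraphrase list, and the 0.8 threshold are all illustrative assumptions rather than fixed parameters of our process.

```python
# Metamorphic consistency sketch: the same question is asked in several
# paraphrased forms, and divergent answers flag potential hallucination risk.

def ask_agent(question: str) -> str:
    """Placeholder for a call to the agent under test."""
    raise NotImplementedError

def semantic_similarity(a: str, b: str) -> float:
    """Placeholder: e.g. cosine similarity between sentence embeddings."""
    raise NotImplementedError

def consistency_score(paraphrases: list[str], threshold: float = 0.8) -> float:
    """Share of paraphrases whose answer agrees with the first one."""
    answers = [ask_agent(q) for q in paraphrases]
    baseline = answers[0]
    agreeing = sum(
        semantic_similarity(baseline, other) >= threshold for other in answers[1:]
    )
    return agreeing / max(len(answers) - 1, 1)

# Example variations of one underlying question (hypothetical domain):
PARAPHRASES = [
    "What is the refund policy for annual plans?",
    "If I bought an annual plan, how do refunds work?",
    "Explain how refunds are handled on yearly subscriptions.",
]
```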
Hallucination detection methods
At First Line Software, dealing with hallucinations in our solutions is a critical process. Our integrated approach to detecting and preventing them combines metric evaluation, using our own AI system evaluation tool and an open Python library of evaluation algorithms, with careful preparation for the evaluation itself.
That preparation requires in-depth knowledge of the principles for evaluating AI solutions and LLMs, which our specialists possess. It includes extensive analytical work to map out the conditions under which the AI system operates, selecting the most relevant evaluation approaches, and creating datasets focused on detecting hallucinations across different situations and environments.
Our work is not always fully automated; it depends on the solution's complexity. Expert judgment and analysis of the AI system's behavior remain an important part of our evaluation process, especially for identifying complex, context-sensitive hallucinations. We therefore combine automatic methods for initial evaluation using EvalTool with human verification of the most critical or questionable cases.
From a hallucination mitigation perspective, it is important to remember that built-in data and context verification mechanisms are an essential part of an AI system or LLM deployment, providing additional protection against unexpected hallucinations. Better data structuring, along with explicit reasoning mechanisms, can also help models track the sources of information and avoid unsubstantiated claims.
It is worth noting that even prompt engineering remains an essential tool for reducing hallucinations at the application level. Techniques such as chain-of-thought prompting, few-shot learning with examples of correct behavior in the face of uncertainty, and explicit instructions to indicate missing information can significantly improve reliability.
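As an illustration of the last point, here is a prompt pattern that combines an explicit instruction for handling missing information with one few-shot example of declining to answer. The wording and the example domain are assumptions meant to show the pattern, not a prescribed formula.

```python
# Illustrative prompt template for uncertainty-aware answering.

PROMPT_TEMPLATE = """You answer strictly from the provided context.
If the context does not contain the answer, reply exactly:
"I don't have enough information to answer that."
Never invent names, numbers, dates, or citations.

Example
Context: The 2023 annual report covers revenue and headcount only.
Question: What was the company's 2023 marketing budget?
Answer: I don't have enough information to answer that.

Context: {context}
Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(context=context, question=question)
```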
What’s Next?
The problem of hallucinations in large language models poses a fundamental challenge to creating reliable and trustworthy AI systems. As previously discussed, modern methods for auditing and evaluating hallucinations encompass a range of approaches, including standardized benchmarks and specialized metrics tailored to different types of systems, as well as automated detection methods.
It is essential to understand that there is no universal solution to the problem of hallucinations. At First Line Software, we take a targeted approach to analyzing and eliminating this problem. This approach depends on the type of AI system, whether it is an isolated language model, a RAG system, or an autonomous agent. Moreover, the criticality of hallucinations can be highly dependent on the field of application. For example, what may be acceptable in creative systems is absolutely unacceptable in medical diagnosis or legal advice.
Ultimately, hallucination auditing should always be an integral part of the development and operational lifecycle of any LLM-based system. Just as security testing has become standard practice in software development, systematic evaluation and mitigation of hallucinations should become a mandatory element in the development of artificial intelligence systems. On our team, this practice is mandatory in our projects.
If you would like to learn more about how we prevent hallucinations from occurring in our company, you can schedule a meeting with us to discuss our experience and help you build a reliable AI solution that is not susceptible to hallucinations.
FAQ
What is a hallucination in large language models (LLMs)?
A hallucination occurs when an LLM confidently generates information that is factually incorrect, fabricated, or unsupported by real data. This may include invented facts, quotes, or non-existent sources.
Why are hallucinations dangerous?
They erode user trust and can cause reputational or legal harm. Google removed its Gemma model after it was accused of making defamatory allegations about U.S. Senator Marsha Blackburn in 2025.
Why do LLMs hallucinate?
LLMs predict the most likely next words based on patterns from training data. When no accurate information exists, the model may “fill in gaps” with plausible-sounding but incorrect details.
Which benchmarks are used to evaluate hallucinations?
Key benchmarks include TruthfulQA, which tests for factual accuracy and confidence; HaluEval, which assesses hallucination detection across tasks; and FactScore, which breaks outputs into atomic facts and verifies each one.
How are hallucinations evaluated in RAG systems?
RAG hallucinations are assessed using metrics like Faithfulness, Context Precision, and Context Recall, measuring whether answers align with retrieved sources and how well retrieval covers required information.
What types of hallucinations occur in AI agents?
These include intermediate hallucinations, hallucinated search results, simplified or incorrect contextual summaries, and fabricated connections between documents. Agents may assert finding data that does not exist or infer relationships that are not present in the sources.
Which metrics measure hallucinations in autonomous agents?
Metrics include Claim Traceability, Reference Accuracy, and Hallucination Rate, all of which evaluate how well the agent supports, cites, and verifies its statements.
What is metamorphic testing, and why is it used?
Metamorphic testing checks whether an agent produces consistent answers across variations of the same question. Instability often correlates with hallucination risk.
How does First Line Software detect hallucinations in AI systems?
FLS uses a hybrid approach combining automated evaluation via EvalTool and our open Python library, plus expert manual review for complex cases requiring contextual understanding.
What methods help reduce hallucinations?
Effective approaches include data and context verification mechanisms, structured data preparation, explicit reasoning processes, and prompt engineering techniques like chain-of-thought, few-shot examples, and instructions for expressing uncertainty.
November 2025