Introduction
Large language models (LLMs) are useful for many applications, including question answering, translation, summarization, and much more, and recent advances have only increased their potential. As you are undoubtedly aware, however, LLMs sometimes provide factually incorrect answers, especially when the correct response for a given prompt is not represented in the model's training data. These fabricated answers are what we call hallucinations.
To mitigate the hallucination problem, retrieval augmented generation (RAG) was developed. This technique retrieves data from a knowledge base that can help satisfy a user prompt's instructions. While RAG is a powerful technique, hallucinations can still manifest in its output. This is why detecting hallucinations in RAG systems, and having a plan to alert the user or otherwise handle them, is of the utmost importance.
Because trust in a model's responses is the foremost concern with contemporary LLM systems, detecting and handling hallucinations has become more important than ever.
In a nutshell, RAG works by retrieving information from a knowledge base using various types of search, such as sparse or dense retrieval techniques. The most relevant results are then passed to the LLM alongside the user prompt in order to generate the desired output (a minimal sketch of this flow follows the list below). However, hallucinations can still occur in the output for several reasons, including:
- The LLM receives accurate information but fails to generate a correct response. This often happens when the output requires reasoning over the given information.
- The retrieved information is incorrect or does not contain the relevant facts. In this case, the LLM may still try to answer the question and hallucinate.
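To make the flow concrete, here is a minimal, hypothetical sketch of a RAG pipeline. The toy keyword retriever, the KNOWLEDGE_BASE list, and the build_prompt helper are illustrative stand-ins rather than part of any particular library; real systems typically use sparse (e.g., BM25) or dense (embedding-based) retrieval and an actual LLM call at the end.

# Illustrative sketch of the RAG flow: retrieve relevant documents,
# then combine them with the user prompt before calling an LLM.
KNOWLEDGE_BASE = [
    "The Great Wall of China was built to protect Chinese states from nomadic invasions.",
    "Apollo 11 landed on the moon on July 20, 1969.",
    "The Eiffel Tower was completed in 1889 for the Paris World's Fair.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap and return the top-k."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc.lower().split())), doc)
        for doc in KNOWLEDGE_BASE
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Combine the retrieved context with the user prompt for the LLM."""
    context_block = "\n".join(f"- {doc}" for doc in contexts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )

query = "When did Apollo 11 land on the moon?"
prompt = build_prompt(query, retrieve(query))
print(prompt)  # This prompt would then be sent to the LLM of your choice.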
Since hallucination is our focus, we will concentrate on detecting it in the responses generated by RAG systems, rather than on fixing the retrieval side. In this article, we will explore hallucination detection techniques that can help you build better RAG systems.
Hallucination Metrics
The first technique we will try is the hallucination metric from the DeepEval library. The hallucination metric is a simple, comparison-based approach to determining whether the model generates factually correct information: it is calculated by dividing the number of contexts the output contradicts by the total number of contexts provided. For example, if the output contradicts two out of four contexts, the score is 0.5.
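As a rough back-of-the-envelope illustration of that formula (not DeepEval's internal implementation), the calculation looks like this:

# Toy illustration of the scoring idea: the score is the fraction of
# provided contexts that the output contradicts.
contexts = ["fact A", "fact B", "fact C", "fact D"]
contradicted = ["fact B", "fact D"]  # contexts the output disagrees with

hallucination_score = len(contradicted) / len(contexts)
print(hallucination_score)  # 0.5 -> half of the contexts are contradicted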
Now let's try the metric itself with some code. First, we need to install the DeepEval library.
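If you do not have it yet, DeepEval can be installed from PyPI:

pip install -U deepeval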
The evaluation itself is performed by an LLM, which means we need a model to act as the evaluator. For this example, we will use the OpenAI model that DeepEval uses by default; you can check the DeepEval documentation if you want to switch to another LLM. As such, you will need to make your OpenAI API key available.
import os

os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"
With the library installed, we will try to detect a hallucination in an LLM output. First, let's set up the context, that is, the facts the output should be grounded in. We will also hard-code the model's actual output so we can dictate exactly what we are testing.
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

context = [
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, "
    "generally built along an east-to-west line across the historical northern borders of China to protect the Chinese states "
    "and empires against the raids and invasions of the nomadic groups of the Eurasian Steppe."
]

actual_output = (
    "The Great Wall of China is made entirely of gold and was built in a single year by the Ming Dynasty to store treasures."
)
Next, we will set up the test case and the hallucination metric. The threshold controls how much hallucination you are willing to tolerate; if you want strictly no hallucination, set it to zero.
test_case = LLMTestCase(
    input="What is the Great Wall of China made of and why was it built?",
    actual_output=actual_output,
    context=context
)

halu_metric = HallucinationMetric(threshold=0.5)
Let’s run the test and see the result.
halu_metric.measure(test_case)
print("Hallucination Metric:")
print("  Score: ", halu_metric.score)
print("  Reason: ", halu_metric.reason)

Output:

Hallucination Metric:
  Score:  1.0
  Reason:  The score is 1.00 because the actual output contains significant contradictions with the context, such as incorrect claims about the materials and purpose of the Great Wall of China, indicating a high level of hallucination.
The hallucination metric returns a score of 1.0, which means the output completely contradicts the given context. DeepEval also provides the reasoning behind the score.
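Note that we imported the evaluate function from DeepEval earlier but have not used it yet. It can run one or more test cases against one or more metrics in a single call and print a summary report; a minimal usage sketch looks like the following (check the DeepEval documentation for the exact behavior in your installed version).

# Sketch: run the same test case through DeepEval's evaluate() helper,
# which batches test cases and metrics and prints a report.
evaluate(test_cases=[test_case], metrics=[halu_metric])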
G-Eval
G-Eval is a framework that uses an LLM with chain-of-thought (CoT) prompting to automatically evaluate LLM output against multi-step criteria that we define. We will use DeepEval's G-Eval implementation with our own criteria to test the RAG system's generated outputs and determine whether they are hallucinating.
With G-Eval, we need to define the metric ourselves, based on our criteria and evaluation steps. Here is how we set it up.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually accurate, logically consistent, and sufficiently detailed based on the expected output.",
    evaluation_steps=[
        "Check if the 'actual output' aligns with the facts in 'expected output' without any contradictions.",
        "Identify whether the 'actual output' introduces new, unsupported facts or logical inconsistencies.",
        "Evaluate whether the 'actual output' omits critical details needed to fully answer the question.",
        "Ensure that the response avoids vague or ambiguous language unless explicitly required by the question."
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
)
Next, we will set up a test case to simulate the RAG process. It consists of the user input, the generated (actual) output, the expected output, and finally the retrieval context, which is the information retrieved by the RAG system.
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When did the Apollo 11 mission land on the moon?",
    actual_output="Apollo 11 landed on the moon on July 21, 1969, marking humanity's first successful moon landing.",
    expected_output="Apollo 11 landed on the moon on July 20, 1969, marking humanity's first successful moon landing.",
    retrieval_context=[
        """The Apollo 11 mission achieved the first successful moon landing on July 20, 1969. Astronauts Neil Armstrong and Buzz Aldrin spent 21 hours on the lunar surface, while Michael Collins orbited above in the command module."""
    ]
)
Now, let’s use the G-Eval framework we have set up previously.
correctness_metric.measure(test_case)

print("Score:", correctness_metric.score)
print("Reason:", correctness_metric.reason)
Output:
Score: 0.7242769207695651
Reason: The actual output provides the correct description but has an incorrect date, contradicting the expected output
With the G-Eval metric we set up, we can see that it detects the hallucination coming from the RAG output. The DeepEval documentation provides further explanation of how the score is calculated.
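Roughly speaking, and following the approach described in the original G-Eval paper rather than DeepEval's exact implementation, the judge LLM assigns a rubric score and the final value is a probability-weighted average of the candidate scores, normalized to the 0-1 range. A toy illustration with made-up numbers:

# Toy illustration of G-Eval-style scoring (numbers are invented, and this
# is not DeepEval's exact code). The judge LLM scores on a 1-10 rubric;
# token probabilities over candidate scores act as weights, and the
# weighted average is normalized to 0-1.
candidate_scores = [6, 7, 8]
probabilities = [0.2, 0.5, 0.3]  # hypothetical token probabilities

weighted = sum(s * p for s, p in zip(candidate_scores, probabilities))  # 7.1
normalized = weighted / 10
print(normalized)  # 0.71, in the same 0-1 range as the score reported above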
Faithfulness Metric
If you want more quantifiable measurements, you can try DeepEval's RAG-specific metrics, which test whether the retrieval process and the generation built on top of it are good. These include a metric designed to detect hallucination, called faithfulness.
There are five RAG-specific metrics available in DeepEval:
- Contextual precision, which evaluates the reranker
- Contextual recall, which evaluates whether the embedding model captures and retrieves relevant information accurately
- Contextual relevancy, which evaluates the text chunk size and the top-K setting
- Answer relevancy, which evaluates whether the prompt is able to instruct the LLM to generate a relevant answer
- Faithfulness, which evaluates whether the LLM generates output that does not hallucinate or contradict any information in the retrieved context
These metrics differ from the hallucination metric previously discussed, as they focus on the RAG process and its output. Let's try them out with the test case from the example above to see how they perform.
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()

contextual_precision.measure(test_case)
print("Contextual Precision:")
print("  Score: ", contextual_precision.score)
print("  Reason: ", contextual_precision.reason)

contextual_recall.measure(test_case)
print("\nContextual Recall:")
print("  Score: ", contextual_recall.score)
print("  Reason: ", contextual_recall.reason)

contextual_relevancy.measure(test_case)
print("\nContextual Relevancy:")
print("  Score: ", contextual_relevancy.score)
print("  Reason: ", contextual_relevancy.reason)

answer_relevancy.measure(test_case)
print("\nAnswer Relevancy:")
print("  Score: ", answer_relevancy.score)
print("  Reason: ", answer_relevancy.reason)

faithfulness.measure(test_case)
print("\nFaithfulness:")
print("  Score: ", faithfulness.score)
print("  Reason: ", faithfulness.reason)
Output:
Contextual Precision:
  Score:  1.0
  Reason:  The score is 1.00 because the node in the retrieval context perfectly matches the input with accurate and relevant information. Great job maintaining relevance and precision!

Contextual Recall:
  Score:  1.0
  Reason:  The score is 1.00 because every detail in the expected output is perfectly supported by the nodes in retrieval context. Great job!

Contextual Relevancy:
  Score:  0.5
  Reason:  The score is 0.50 because while the retrieval context contains the relevant date 'July 20, 1969' for when the Apollo 11 mission landed on the moon, other details about the astronauts' activities are not directly related to the date of the landing.

Answer Relevancy:
  Score:  1.0
  Reason:  The score is 1.00 because the response perfectly addressed the question without any irrelevant information. Great job!

Faithfulness:
  Score:  0.5
  Reason:  The score is 0.50 because the actual output incorrectly states that Apollo 11 landed on the moon on July 21, 1969, while the retrieval context correctly specifies the date as July 20, 1969.
The results show that the RAG pipeline performs well on most metrics, with lower scores for contextual relevancy and faithfulness. In particular, the faithfulness metric detects the hallucination that occurs in the RAG output and provides the reasoning behind its score.
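In practice, these scores can be used to decide what to do with a generated answer before it reaches the user. Below is a minimal sketch of that idea, reusing the test_case defined earlier; the helper name, the 0.8 threshold, and the warn-or-return behavior are illustrative choices, not a DeepEval feature.

# Sketch: gate a RAG answer on its faithfulness score before returning it,
# so a potentially hallucinated answer triggers a warning (or a retry)
# instead of being shown silently.
from deepeval.metrics import FaithfulnessMetric

def check_faithfulness(test_case, threshold: float = 0.8):
    metric = FaithfulnessMetric(threshold=threshold)
    metric.measure(test_case)
    if metric.score < threshold:
        # Alert the user, or trigger a retry / fallback path.
        return f"Warning: possible hallucination. {metric.reason}"
    return test_case.actual_output

print(check_faithfulness(test_case))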
Summary
This article has explored different techniques for detecting hallucinations in RAG systems, focusing on three main approaches:
- hallucination metrics using the DeepEval library
- the G-Eval framework with chain-of-thought prompting
- RAG-specific metrics including faithfulness evaluation
We have looked at some practical code examples for implementing each technique, demonstrating how they can measure and quantify hallucinations in LLM outputs, with a particular emphasis on comparing generated responses against known context or expected outputs.
Best of luck optimizing your RAG systems, and I hope this article has helped.