FaithJudge benchmarks hallucinations generated by LLMs in Retrieval-Augmented Generation (RAG) tasks, focusing on reliability and factual accuracy.
We evaluate LLMs across three key RAG tasks:
- Summarization
- Question Answering
- Data-to-Text Generation
This benchmark helps assess how frequently LLMs introduce hallucinations when summarizing documents, answering questions using context, and generating detailed overviews from structured data in the JSON format.
Here's our current leaderboard based on hallucination rates across all evaluated tasks.
Rank | Model | Organization | # Parameters | Overall Hallucination Rate | FaithBench (Summarization) | RagTruth (Summarization) | RagTruth (Question-Answering) | RagTruth (Data-to-Text Writing) |
---|---|---|---|---|---|---|---|---|
1 | gemini-2.5-flash | Google | ? | 6.26% | 19.44% (14/72) | 4.67% (7/150) | 2.88% (4/139) | 4.67% (7/150) |
2 | gemini-2.5-pro | Google | ? | 6.65% | 25.00% (18/72) | 4.67% (7/150) | 1.44% (2/139) | 4.67% (7/150) |
3 | gemini-2.5-pro-exp-03-25 | Google | ? | 7.63% | 25.00% (18/72) | 9.33% (14/150) | 0.72% (1/139) | 4.00% (6/150) |
4 | r1-0528 | DeepSeek | 37B active / 671B total | 9.78% | 20.83% (15/72) | 8.00% (12/150) | 4.32% (6/139) | 11.33% (17/150) |
5 | gemini-2.0-flash-001 | Google | ? | 10.18% | 29.17% (21/72) | 6.67% (10/150) | 0.72% (1/139) | 13.33% (20/150) |
6 | o3-mini-medium-2025-01-31 | OpenAI | ? | 11.55% | 33.33% (24/72) | 9.33% (14/150) | 6.47% (9/139) | 8.00% (12/150) |
7 | gpt-4.1-2025-04-14 | OpenAI | ? | 11.94% | 36.11% (26/72) | 11.33% (17/150) | 4.32% (6/139) | 8.00% (12/150) |
8 | gpt-4.5-preview-2025-02-27 | OpenAI | ? | 11.94% | 37.50% (27/72) | 10.00% (15/150) | 5.04% (7/139) | 8.00% (12/150) |
9 | o3-mini-low-2025-01-31 | OpenAI | ? | 11.94% | 33.33% (24/72) | 6.67% (10/150) | 7.19% (10/139) | 11.33% (17/150) |
10 | o3-mini-high-2025-01-31 | OpenAI | ? | 12.52% | 34.72% (25/72) | 8.00% (12/150) | 6.47% (9/139) | 12.00% (18/150) |
11 | claude-opus-4-thinking-20250514 | Anthropic | ? | 14.09% | 38.89% (28/72) | 11.33% (17/150) | 5.04% (7/139) | 13.33% (20/150) |
12 | gpt-3.5-turbo-0125 | OpenAI | ? | 14.87% | 44.44% (32/72) | 8.67% (13/150) | 5.76% (8/139) | 15.33% (23/150) |
13 | grok-3 | xAI | ? | 15.26% | 41.67% (30/72) | 12.00% (18/150) | 6.47% (9/139) | 14.00% (21/150) |
14 | gpt-4.1-mini-2025-04-14 | OpenAI | ? | 15.66% | 37.50% (27/72) | 10.67% (16/150) | 3.60% (5/139) | 21.33% (32/150) |
15 | gpt-4o-2024-11-20 | OpenAI | ? | 15.85% | 40.28% (29/72) | 10.00% (15/150) | 5.04% (7/139) | 20.00% (30/150) |
16 | claude-3-7-sonnet-20250219 | Anthropic | ? | 16.05% | 38.89% (28/72) | 14.67% (22/150) | 9.35% (13/139) | 12.67% (19/150) |
17 | claude-3-7-sonnet-thinking-20250219 | Anthropic | ? | 16.24% | 45.83% (33/72) | 10.67% (16/150) | 9.35% (13/139) | 14.00% (21/150) |
18 | Llama-3.3-70B-Instruct | Llama | 70B | 16.44% | 44.44% (32/72) | 8.67% (13/150) | 4.32% (6/139) | 22.00% (33/150) |
19 | phi-4 | Microsoft | 14B | 17.03% | 44.44% (32/72) | 8.00% (12/150) | 4.32% (6/139) | 24.67% (37/150) |
20 | Mistral-Small-24B-Instruct-2501 | Mistral AI | 24B | 17.03% | 43.06% (31/72) | 10.00% (15/150) | 10.07% (14/139) | 18.00% (27/150) |
21 | o3-medium-2025-04-16 | OpenAI | ? | 17.81% | 36.11% (26/72) | 19.33% (29/150) | 9.35% (13/139) | 15.33% (23/150) |
22 | claude-sonnet-4-thinking-20250514 | Anthropic | ? | 18.20% | 43.06% (31/72) | 15.33% (23/150) | 7.19% (10/139) | 19.33% (29/150) |
23 | gpt-4o-mini-2024-07-18 | OpenAI | ? | 18.59% | 51.39% (37/72) | 11.33% (17/150) | 6.47% (9/139) | 21.33% (32/150) |
24 | o3-high-2025-04-16 | OpenAI | ? | 18.59% | 43.06% (31/72) | 20.67% (31/150) | 3.60% (5/139) | 18.67% (28/150) |
25 | claude-sonnet-4-20250514 | Anthropic | ? | 18.59% | 48.61% (35/72) | 13.33% (20/150) | 5.04% (7/139) | 22.00% (33/150) |
26 | Qwen2.5-32B-Instruct | Qwen | 32B | 19.18% | 36.11% (26/72) | 13.33% (20/150) | 6.47% (9/139) | 28.67% (43/150) |
27 | claude-opus-4-20250514 | Anthropic | ? | 19.57% | 45.83% (33/72) | 12.67% (19/150) | 10.07% (14/139) | 22.67% (34/150) |
28 | llama-4-maverick | Llama | 17B active / 109B total | 20.55% | 51.39% (37/72) | 13.33% (20/150) | 9.35% (13/139) | 23.33% (35/150) |
29 | o3-low-2025-04-16 | OpenAI | ? | 20.55% | 47.22% (34/72) | 22.67% (34/150) | 7.19% (10/139) | 18.00% (27/150) |
30 | Qwen2.5-72B-Instruct | Qwen | 72B | 20.74% | 43.06% (31/72) | 12.67% (19/150) | 18.71% (26/139) | 20.00% (30/150) |
31 | QwQ-32B | Qwen | 32B | 24.66% | 50.00% (36/72) | 30.00% (45/150) | 6.47% (9/139) | 24.00% (36/150) |
32 | glm-4-9b-chat-hf | THUDM | 9B | 25.44% | 38.89% (28/72) | 9.33% (14/150) | 12.23% (17/139) | 47.33% (71/150) |
33 | o4-mini-medium-2025-04-16 | OpenAI | ? | 25.83% | 44.44% (32/72) | 23.33% (35/150) | 11.51% (16/139) | 32.67% (49/150) |
34 | o4-mini-low-2025-04-16 | OpenAI | ? | 27.98% | 44.44% (32/72) | 30.00% (45/150) | 16.55% (23/139) | 28.67% (43/150) |
35 | Llama-3.1-8B-Instruct | Llama | 8B | 28.38% | 44.44% (32/72) | 12.67% (19/150) | 12.23% (17/139) | 51.33% (77/150) |
36 | Qwen2.5-14B-Instruct | Qwen | 14B | 28.96% | 54.17% (39/72) | 18.00% (27/150) | 6.47% (9/139) | 48.67% (73/150) |
37 | o4-mini-high-2025-04-16 | OpenAI | ? | 29.94% | 54.17% (39/72) | 23.33% (35/150) | 17.99% (25/139) | 36.00% (54/150) |
38 | Ministral-8B-Instruct-2410 | Mistral AI | 8B | 30.92% | 56.94% (41/72) | 16.67% (25/150) | 11.51% (16/139) | 50.67% (76/150) |
39 | Phi-4-mini-instruct | Microsoft | 3.8B | 38.36% | 61.11% (44/72) | 21.33% (32/150) | 5.04% (7/139) | 75.33% (113/150) |
40 | Qwen2.5-7B-Instruct | Qwen | 7B | 38.55% | 62.50% (45/72) | 26.00% (39/150) | 15.83% (22/139) | 60.67% (91/150) |
41 | AI21-Jamba-mini-1.6 | AI21 Labs | 12B active / 52B total | 39.14% | 38.89% (28/72) | 25.33% (38/150) | 30.22% (42/139) | 61.33% (92/150) |
42 | Llama-3.2-3B-Instruct | Llama | 3B | 46.18% | 72.22% (52/72) | 28.67% (43/150) | 13.67% (19/139) | 81.33% (122/150) |
43 | Qwen2.5-3B-Instruct | Qwen | 3B | 55.97% | 69.44% (50/72) | 42.00% (63/150) | 27.34% (38/139) | 90.00% (135/150) |
44 | Qwen2.5-1.5B-Instruct | Qwen | 1.5B | 66.73% | 84.72% (61/72) | 60.67% (91/150) | 40.29% (56/139) | 88.67% (133/150) |
45 | Llama-3.2-1B-Instruct | Llama | 1B | 67.71% | 75.00% (54/72) | 64.67% (97/150) | 44.60% (62/139) | 88.67% (133/150) |
46 | Qwen2.5-0.5B-Instruct | Qwen | 0.5B | 76.32% | 88.89% (64/72) | 74.00% (111/150) | 58.27% (81/139) | 89.33% (134/150) |
Our framework combines the FaithBench and RagTruth benchmarks to offer evaluation over diverse RAG tasks.
- FaithBench (Summarization): FaithBench provides hallucination annotations for summaries generated by 10 different LLMs, including `GPT-3.5`, `GPT-4`, `Gemini-1.5-Flash`, `Claude-3.5-Sonnet`, `Command-R`, `Llama-3.1-70B`, `Llama-3.1-8B`, `Qwen2.5-7B`, `Phi-3-mini-4k`, and `Mistral-7B`. The annotations of responses from diverse LLMs allow for the analysis of diverse types of hallucinations.
- RagTruth (Summarization, Question Answering, and Data-to-Text Writing): RagTruth provides hallucination annotations for summaries, answers to questions, and overviews of structured data in the JSON format generated by 6 different LLMs, including `GPT-3.5`, `GPT-4`, `Llama-2 (7B, 13B, and 70B)`, and `Mistral-7B`.
We rank models according to their overall hallucination rate across all the above tasks, reflecting reliability in practical RAG deployments. The detailed methodology can be found in our accompanying paper (Coming Soon!).
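The overall rate in the leaderboard is consistent with pooling hallucinated responses across all four evaluation sets (72 + 150 + 139 + 150 = 511 responses per model). A minimal sketch of that pooled computation, using the gemini-2.5-flash counts from the table above:

```python
# Sketch: overall hallucination rate computed by pooling across the four evaluation sets.
# The counts below are the gemini-2.5-flash row from the leaderboard above.
per_task = {
    "faithbench_summarization": (14, 72),
    "ragtruth_summarization": (7, 150),
    "ragtruth_qa": (4, 139),
    "ragtruth_data_to_text": (7, 150),
}

hallucinated = sum(h for h, _ in per_task.values())
total = sum(n for _, n in per_task.values())
print(f"Overall hallucination rate: {hallucinated / total:.2%}")  # ~6.26%
```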
For each task, we prompt the LLM to respond based strictly on the information that it is given. We measure hallucinations when the LLM adds claims, details, implications, or contexts that are unsupported or contradicted by the provided source information.
Summarization Prompt:
System Prompt: "You must respond based strictly on the information in a provided passage. Do not incorporate any external knowledge or infer any details beyond what is given in the passage."
User Prompt: "Provide a concise summary of the following passage, covering the core pieces of information described."
Question-Answering Prompt:
System Prompt: "You must respond based strictly on the information in provided passages. Do not incorporate any external knowledge or infer any details beyond what is given in the passages."
User Prompt: "Provide a concise answer to the following question based on the information in the provided passages."
Data-to-Text Writing Prompt:
System Prompt: "You must respond based strictly on the information in the provided structured data in the JSON format. Do not incorporate any external knowledge or infer any details beyond what is given in the data."
User Prompt: "Write a concise, objective overview of the following local business, based solely on the structured data provided in JSON format. You should include important details and cover key information mentioned in the customers' reviews.")
Unlike previous work in RAG hallucination and faithfulness benchmarking, which relies on fine-tuned hallucination detection models or LLM judges prompted in a zero-shot manner, we make use of FaithJudge. We prompt an LLM judge (currently `o3-mini-high`) with human-annotated examples of LLM responses for the same task. For example, to evaluate generated summaries, we prompt the LLM judge with LLM-generated summaries of the same article, along with the corresponding hallucination annotations from human annotators, which include hallucinated spans and brief explanatory notes. We find that these annotations help guide the LLM judge and allow for stronger agreement with the gold-standard human annotations.
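For intuition, a judge prompt of this kind might be assembled roughly as in the sketch below. The data fields and the verdict format are assumptions for illustration; the released `eval.py` defines the actual prompt:

```python
# Hypothetical sketch of a FaithJudge-style few-shot judge prompt.
# Field names and the verdict format are illustrative assumptions, not the
# exact format used in eval.py.
from dataclasses import dataclass

@dataclass
class AnnotatedExample:
    summary: str                    # an LLM-generated summary of the same source article
    hallucinated_spans: list[str]   # spans flagged by human annotators
    notes: str                      # brief explanatory notes from the annotators

def build_judge_prompt(source: str, examples: list[AnnotatedExample], candidate: str) -> str:
    """Pack human-annotated examples for the same article into a judge prompt."""
    parts = [
        "You are judging whether a summary is faithful to the source article.",
        f"Source article:\n{source}",
        "Human-annotated examples of summaries of this article:",
    ]
    for i, ex in enumerate(examples, 1):
        spans = "; ".join(ex.hallucinated_spans) or "none"
        parts.append(
            f"Example {i}:\nSummary: {ex.summary}\n"
            f"Hallucinated spans: {spans}\nNotes: {ex.notes}"
        )
    parts.append(f"Now judge this new summary:\n{candidate}")
    parts.append("Answer with a verdict (faithful / hallucinated) and a brief explanation.")
    return "\n\n".join(parts)
```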
Evaluating hallucinations in LLMs for RAG tasks is crucial for improving trustworthiness. By quantifying hallucinations in models, our benchmark helps researchers and practitioners make informed decisions about model selection and deployment.
- Explore Responses and Judgements: We make the responses from LLMs available in `generated_outputs/`. We also make the LLM judge's evaluations available in `eval_results/`. The LLM judge provides its reasoning in its evaluations, which may be of interest.
- Reproduce Results: We provide scripts to generate responses (`generate_responses.py`) and to evaluate them (`eval.py`).

To generate responses:
```bash
python3 generate_responses.py --model openai/gpt-4o-2024-11-20
```
To evaluate responses:
```bash
python3 eval.py --model openai/gpt-4o-2024-11-20 --judge_model o3-mini
```
Please check out our paper, *Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards*:
```bibtex
@article{tamber2025benchmarking,
  title={Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards},
  author={Tamber, Manveer Singh and Bao, Forrest Sheng and Xu, Chenyu and Luo, Ge and Kazi, Suleman and Bae, Minseok and Li, Miaoran and Mendelevitch, Ofer and Qu, Renyi and Lin, Jimmy},
  journal={arXiv preprint arXiv:2505.04847},
  year={2025}
}
```
- Vectara's Hallucination Leaderboard: We build upon our past hallucination leaderboard.
- FaithBench: We make use of hallucination annotations from FaithBench.
- RagTruth: We make use of hallucination annotations from RagTruth.
- Google's FACTS Grounding Leaderboard: FACTS Grounding also benchmarks hallucinations in LLMs. We recommend checking out their work as well!