FaithJudge benchmarks hallucinations generated by LLMs in Retrieval-Augmented Generation (RAG) tasks, focusing on reliability and factual accuracy.
We evaluate LLMs across three key RAG tasks:
- Summarization
- Question Answering
- Data-to-Text Generation
This benchmark helps assess how frequently LLMs introduce hallucinations when summarizing documents, answering questions using context, and generating detailed overviews from structured data in the JSON format.
Here's our current leaderboard based on hallucination rates across all evaluated tasks.
Rank | Model | Organization | # Parameters | Overall Hallucination Rate | FaithBench (Summarization) | RagTruth (Summarization) | RagTruth (Question-Answering) | RagTruth (Data-to-Text Writing) |
---|---|---|---|---|---|---|---|---|
1 | gemini-2.5-flash | Google | ? | 6.26% | 19.44% (14/72) | 4.67% (7/150) | 2.88% (4/139) | 4.67% (7/150) |
2 | gemini-2.5-pro | Google | ? | 6.65% | 25.00% (18/72) | 4.67% (7/150) | 1.44% (2/139) | 4.67% (7/150) |
3 | gemini-2.5-pro-exp-03-25 | Google | ? | 7.63% | 25.00% (18/72) | 9.33% (14/150) | 0.72% (1/139) | 4.00% (6/150) |
4 | r1-0528 | DeepSeek | 37B active / 671B total | 9.78% | 20.83% (15/72) | 8.00% (12/150) | 4.32% (6/139) | 11.33% (17/150) |
5 | gemini-2.0-flash-001 | Google | ? | 10.18% | 29.17% (21/72) | 6.67% (10/150) | 0.72% (1/139) | 13.33% (20/150) |
6 | o3-mini-medium-2025-01-31 | OpenAI | ? | 11.55% | 33.33% (24/72) | 9.33% (14/150) | 6.47% (9/139) | 8.00% (12/150) |
7 | gpt-4.1-2025-04-14 | OpenAI | ? | 11.94% | 36.11% (26/72) | 11.33% (17/150) | 4.32% (6/139) | 8.00% (12/150) |
8 | gpt-4.5-preview-2025-02-27 | OpenAI | ? | 11.94% | 37.50% (27/72) | 10.00% (15/150) | 5.04% (7/139) | 8.00% (12/150) |
9 | o3-mini-low-2025-01-31 | OpenAI | ? | 11.94% | 33.33% (24/72) | 6.67% (10/150) | 7.19% (10/139) | 11.33% (17/150) |
10 | o3-mini-high-2025-01-31 | OpenAI | ? | 12.52% | 34.72% (25/72) | 8.00% (12/150) | 6.47% (9/139) | 12.00% (18/150) |
11 | claude-opus-4-thinking-20250514 | Anthropic | ? | 14.09% | 38.89% (28/72) | 11.33% (17/150) | 5.04% (7/139) | 13.33% (20/150) |
12 | gpt-3.5-turbo-0125 | OpenAI | ? | 14.87% | 44.44% (32/72) | 8.67% (13/150) | 5.76% (8/139) | 15.33% (23/150) |
13 | grok-3 | xAI | ? | 15.26% | 41.67% (30/72) | 12.00% (18/150) | 6.47% (9/139) | 14.00% (21/150) |
14 | gpt-4.1-mini-2025-04-14 | OpenAI | ? | 15.66% | 37.50% (27/72) | 10.67% (16/150) | 3.60% (5/139) | 21.33% (32/150) |
15 | gpt-4o-2024-11-20 | OpenAI | ? | 15.85% | 40.28% (29/72) | 10.00% (15/150) | 5.04% (7/139) | 20.00% (30/150) |
16 | claude-3-7-sonnet-20250219 | Anthropic | ? | 16.05% | 38.89% (28/72) | 14.67% (22/150) | 9.35% (13/139) | 12.67% (19/150) |
17 | claude-3-7-sonnet-thinking-20250219 | Anthropic | ? | 16.24% | 45.83% (33/72) | 10.67% (16/150) | 9.35% (13/139) | 14.00% (21/150) |
18 | Llama-3.3-70B-Instruct | Llama | 70B | 16.44% | 44.44% (32/72) | 8.67% (13/150) | 4.32% (6/139) | 22.00% (33/150) |
19 | phi-4 | Microsoft | 14B | 17.03% | 44.44% (32/72) | 8.00% (12/150) | 4.32% (6/139) | 24.67% (37/150) |
20 | Mistral-Small-24B-Instruct-2501 | Mistral AI | 24B | 17.03% | 43.06% (31/72) | 10.00% (15/150) | 10.07% (14/139) | 18.00% (27/150) |
21 | o3-medium-2025-04-16 | OpenAI | ? | 17.81% | 36.11% (26/72) | 19.33% (29/150) | 9.35% (13/139) | 15.33% (23/150) |
22 | claude-sonnet-4-thinking-20250514 | Anthropic | ? | 18.20% | 43.06% (31/72) | 15.33% (23/150) | 7.19% (10/139) | 19.33% (29/150) |
23 | gpt-4o-mini-2024-07-18 | OpenAI | ? | 18.59% | 51.39% (37/72) | 11.33% (17/150) | 6.47% (9/139) | 21.33% (32/150) |
24 | o3-high-2025-04-16 | OpenAI | ? | 18.59% | 43.06% (31/72) | 20.67% (31/150) | 3.60% (5/139) | 18.67% (28/150) |
25 | claude-sonnet-4-20250514 | Anthropic | ? | 18.59% | 48.61% (35/72) | 13.33% (20/150) | 5.04% (7/139) | 22.00% (33/150) |
26 | Qwen2.5-32B-Instruct | Qwen | 32B | 19.18% | 36.11% (26/72) | 13.33% (20/150) | 6.47% (9/139) | 28.67% (43/150) |
27 | claude-opus-4-20250514 | Anthropic | ? | 19.57% | 45.83% (33/72) | 12.67% (19/150) | 10.07% (14/139) | 22.67% (34/150) |
28 | llama-4-maverick | Llama | 17B active / 109B total | 20.55% | 51.39% (37/72) | 13.33% (20/150) | 9.35% (13/139) | 23.33% (35/150) |
29 | o3-low-2025-04-16 | OpenAI | ? | 20.55% | 47.22% (34/72) | 22.67% (34/150) | 7.19% (10/139) | 18.00% (27/150) |
30 | Qwen2.5-72B-Instruct | Qwen | 72B | 20.74% | 43.06% (31/72) | 12.67% (19/150) | 18.71% (26/139) | 20.00% (30/150) |
31 | QwQ-32B | Qwen | 32B | 24.66% | 50.00% (36/72) | 30.00% (45/150) | 6.47% (9/139) | 24.00% (36/150) |
32 | glm-4-9b-chat-hf | THUDM | 9B | 25.44% | 38.89% (28/72) | 9.33% (14/150) | 12.23% (17/139) | 47.33% (71/150) |
33 | o4-mini-medium-2025-04-16 | OpenAI | ? | 25.83% | 44.44% (32/72) | 23.33% (35/150) | 11.51% (16/139) | 32.67% (49/150) |
34 | o4-mini-low-2025-04-16 | OpenAI | ? | 27.98% | 44.44% (32/72) | 30.00% (45/150) | 16.55% (23/139) | 28.67% (43/150) |
35 | Llama-3.1-8B-Instruct | Llama | 8B | 28.38% | 44.44% (32/72) | 12.67% (19/150) | 12.23% (17/139) | 51.33% (77/150) |
36 | Qwen2.5-14B-Instruct | Qwen | 14B | 28.96% | 54.17% (39/72) | 18.00% (27/150) | 6.47% (9/139) | 48.67% (73/150) |
37 | o4-mini-high-2025-04-16 | OpenAI | ? | 29.94% | 54.17% (39/72) | 23.33% (35/150) | 17.99% (25/139) | 36.00% (54/150) |
38 | Ministral-8B-Instruct-2410 | Mistral AI | 8B | 30.92% | 56.94% (41/72) | 16.67% (25/150) | 11.51% (16/139) | 50.67% (76/150) |
39 | Phi-4-mini-instruct | Microsoft | 3.8B | 38.36% | 61.11% (44/72) | 21.33% (32/150) | 5.04% (7/139) | 75.33% (113/150) |
40 | Qwen2.5-7B-Instruct | Qwen | 7B | 38.55% | 62.50% (45/72) | 26.00% (39/150) | 15.83% (22/139) | 60.67% (91/150) |
41 | AI21-Jamba-mini-1.6 | AI21 Labs | 12B active / 52B total | 39.14% | 38.89% (28/72) | 25.33% (38/150) | 30.22% (42/139) | 61.33% (92/150) |
42 | Llama-3.2-3B-Instruct | Llama | 3B | 46.18% | 72.22% (52/72) | 28.67% (43/150) | 13.67% (19/139) | 81.33% (122/150) |
43 | Qwen2.5-3B-Instruct | Qwen | 3B | 55.97% | 69.44% (50/72) | 42.00% (63/150) | 27.34% (38/139) | 90.00% (135/150) |
44 | Qwen2.5-1.5B-Instruct | Qwen | 1.5B | 66.73% | 84.72% (61/72) | 60.67% (91/150) | 40.29% (56/139) | 88.67% (133/150) |
45 | Llama-3.2-1B-Instruct | Llama | 1B | 67.71% | 75.00% (54/72) | 64.67% (97/150) | 44.60% (62/139) | 88.67% (133/150) |
46 | Qwen2.5-0.5B-Instruct | Qwen | 0.5B | 76.32% | 88.89% (64/72) | 74.00% (111/150) | 58.27% (81/139) | 89.33% (134/150) |
Our framework combines the FaithBench and RagTruth benchmarks to offer evaluation over diverse RAG tasks.
- FaithBench (Summarization): FaithBench provides hallucination annotations for summaries generated by 10 different LLMs, including `GPT-3.5`, `GPT-4`, `Gemini-1.5-Flash`, `Claude-3.5-Sonnet`, `Command-R`, `Llama-3.1-70B`, `Llama-3.1-8B`, `Qwen2.5-7B`, `Phi-3-mini-4k`, and `Mistral-7B`. The annotations of responses from diverse LLMs allow for the analysis of diverse types of hallucinations.
- RagTruth (Summarization, Question Answering, and Data-to-Text Writing): RagTruth provides hallucination annotations for summaries, answers to questions, and overviews of structured data in the JSON format generated by 6 different LLMs, including `GPT-3.5`, `GPT-4`, `Llama-2 (7B, 13B, and 70B)`, and `Mistral-7B`.
We rank models according to their overall hallucination rate across all the above tasks, reflecting reliability in practical RAG deployments. The detailed methodology can be found in our accompanying paper (Coming Soon!).
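The overall rate in the leaderboard is consistent with pooling hallucinated responses across all four evaluation sets (72 + 150 + 139 + 150 = 511 responses per model). A minimal sketch of that pooled computation, using the gemini-2.5-flash counts from the table above:

```python
# Sketch: overall hallucination rate computed by pooling across the four evaluation sets.
# The counts below are the gemini-2.5-flash row from the leaderboard above.
per_task = {
    "faithbench_summarization": (14, 72),
    "ragtruth_summarization": (7, 150),
    "ragtruth_qa": (4, 139),
    "ragtruth_data_to_text": (7, 150),
}

hallucinated = sum(h for h, _ in per_task.values())
total = sum(n for _, n in per_task.values())
print(f"Overall hallucination rate: {hallucinated / total:.2%}")  # ~6.26%
```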
For each task, we prompt the LLM to respond based strictly on the information that it is given. We measure hallucinations when the LLM adds claims, details, implications, or contexts that are unsupported or contradicted by the provided source information.
Summarization Prompt:
System Prompt: "You must respond based strictly on the information in a provided passage. Do not incorporate any external knowledge or infer any details beyond what is given in the passage."
User Prompt: "Provide a concise summary of the following passage, covering the core pieces of information described."
Question-Answering Prompt:
System Prompt: "You must respond based strictly on the information in provided passages. Do not incorporate any external knowledge or infer any details beyond what is given in the passages."
User Prompt: "Provide a concise answer to the following question based on the information in the provided passages."
Data-to-Text Writing Prompt:
System Prompt: "You must respond based strictly on the information in the provided structured data in the JSON format. Do not incorporate any external knowledge or infer any details beyond what is given in the data."
User Prompt: "Write a concise, objective overview of the following local business, based solely on the structured data provided in JSON format. You should include important details and cover key information mentioned in the customers' reviews.")
Unlike previous work in RAG hallucination and faithfulness benchmarking, which relies on fine-tuned hallucination detection models or LLM judges prompted in a zero-shot manner, we make use of FaithJudge. We prompt an LLM judge (currently `o3-mini-high`) with human-annotated examples of LLM responses for the same task. For example, to evaluate generated summaries, we prompt the LLM judge with LLM-generated summaries of the same article, along with the corresponding hallucination annotations from human annotators, which include hallucinated spans and brief explanatory notes. We find that these annotations help guide the LLM judge and allow for stronger agreement with the gold-standard human annotations.
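For intuition, a judge prompt of this kind might be assembled roughly as in the sketch below. The data fields and the verdict format are assumptions for illustration; the released `eval.py` defines the actual prompt:

```python
# Hypothetical sketch of a FaithJudge-style few-shot judge prompt.
# Field names and the verdict format are illustrative assumptions, not the
# exact format used in eval.py.
from dataclasses import dataclass

@dataclass
class AnnotatedExample:
    summary: str                    # an LLM-generated summary of the same source article
    hallucinated_spans: list[str]   # spans flagged by human annotators
    notes: str                      # brief explanatory notes from the annotators

def build_judge_prompt(source: str, examples: list[AnnotatedExample], candidate: str) -> str:
    """Pack human-annotated examples for the same article into a judge prompt."""
    parts = [
        "You are judging whether a summary is faithful to the source article.",
        f"Source article:\n{source}",
        "Human-annotated examples of summaries of this article:",
    ]
    for i, ex in enumerate(examples, 1):
        spans = "; ".join(ex.hallucinated_spans) or "none"
        parts.append(
            f"Example {i}:\nSummary: {ex.summary}\n"
            f"Hallucinated spans: {spans}\nNotes: {ex.notes}"
        )
    parts.append(f"Now judge this new summary:\n{candidate}")
    parts.append("Answer with a verdict (faithful / hallucinated) and a brief explanation.")
    return "\n\n".join(parts)
```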
Evaluating hallucinations in LLMs for RAG tasks is crucial for improving trustworthiness. By quantifying hallucinations in models, our benchmark helps researchers and practitioners make informed decisions about model selection and deployment.
- Explore Responses and Judgements: We make the responses from LLMs available in `generated_outputs/`. We also make the LLM judge's evaluations available in `eval_results/`. The LLM judge provides its reasoning in its evaluations, which may be of interest.
- Reproduce Results: We provide scripts to generate responses (`generate_responses.py`) and to evaluate them (`eval.py`).

To generate responses:
```bash
python3 generate_responses.py --model openai/gpt-4o-2024-11-20
```
To evaluate responses:
```bash
python3 eval.py --model openai/gpt-4o-2024-11-20 --judge_model o3-mini
```
Please check out our paper, *Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards*:
```bibtex
@article{tamber2025benchmarking,
  title={Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards},
  author={Tamber, Manveer Singh and Bao, Forrest Sheng and Xu, Chenyu and Luo, Ge and Kazi, Suleman and Bae, Minseok and Li, Miaoran and Mendelevitch, Ofer and Qu, Renyi and Lin, Jimmy},
  journal={arXiv preprint arXiv:2505.04847},
  year={2025}
}
```
- Vectara's Hallucination Leaderboard: We build upon our past hallucination leaderboard.
- FaithBench: We make use of hallucination annotations from FaithBench.
- RagTruth: We make use of hallucination annotations from RagTruth.
- Google's FACTS Grounding Leaderboard: FACTS Grounding also benchmarks hallucinations in LLMs. We recommend checking out their work as well!