
FaithJudge Hallucinations Benchmark

FaithJudge benchmarks hallucinations generated by LLMs in Retrieval-Augmented Generation (RAG) tasks, focusing on reliability and factual accuracy.

We evaluate LLMs across three key RAG tasks:

  • Summarization
  • Question Answering
  • Data-to-Text Generation

This benchmark helps assess how frequently LLMs introduce hallucinations when summarizing documents, answering questions over provided context, and generating detailed overviews from structured data in JSON format.

Leaderboard

Here's our current leaderboard based on hallucination rates across all evaluated tasks.

| Rank | Model | Organization | # Parameters | Overall Hallucination Rate | FaithBench (Summarization) | RagTruth (Summarization) | RagTruth (Question-Answering) | RagTruth (Data-to-Text Writing) |
|---|---|---|---|---|---|---|---|---|
| 1 | gemini-2.5-flash | Google | ? | 6.26% | 19.44% (14/72) | 4.67% (7/150) | 2.88% (4/139) | 4.67% (7/150) |
| 2 | gemini-2.5-pro | Google | ? | 6.65% | 25.00% (18/72) | 4.67% (7/150) | 1.44% (2/139) | 4.67% (7/150) |
| 3 | gemini-2.5-pro-exp-03-25 | Google | ? | 7.63% | 25.00% (18/72) | 9.33% (14/150) | 0.72% (1/139) | 4.00% (6/150) |
| 4 | r1-0528 | DeepSeek | 37B active / 671B total | 9.78% | 20.83% (15/72) | 8.00% (12/150) | 4.32% (6/139) | 11.33% (17/150) |
| 5 | gemini-2.0-flash-001 | Google | ? | 10.18% | 29.17% (21/72) | 6.67% (10/150) | 0.72% (1/139) | 13.33% (20/150) |
| 6 | o3-mini-medium-2025-01-31 | OpenAI | ? | 11.55% | 33.33% (24/72) | 9.33% (14/150) | 6.47% (9/139) | 8.00% (12/150) |
| 7 | gpt-4.1-2025-04-14 | OpenAI | ? | 11.94% | 36.11% (26/72) | 11.33% (17/150) | 4.32% (6/139) | 8.00% (12/150) |
| 8 | gpt-4.5-preview-2025-02-27 | OpenAI | ? | 11.94% | 37.50% (27/72) | 10.00% (15/150) | 5.04% (7/139) | 8.00% (12/150) |
| 9 | o3-mini-low-2025-01-31 | OpenAI | ? | 11.94% | 33.33% (24/72) | 6.67% (10/150) | 7.19% (10/139) | 11.33% (17/150) |
| 10 | o3-mini-high-2025-01-31 | OpenAI | ? | 12.52% | 34.72% (25/72) | 8.00% (12/150) | 6.47% (9/139) | 12.00% (18/150) |
| 11 | claude-opus-4-thinking-20250514 | Anthropic | ? | 14.09% | 38.89% (28/72) | 11.33% (17/150) | 5.04% (7/139) | 13.33% (20/150) |
| 12 | gpt-3.5-turbo-0125 | OpenAI | ? | 14.87% | 44.44% (32/72) | 8.67% (13/150) | 5.76% (8/139) | 15.33% (23/150) |
| 13 | grok-3 | xAI | ? | 15.26% | 41.67% (30/72) | 12.00% (18/150) | 6.47% (9/139) | 14.00% (21/150) |
| 14 | gpt-4.1-mini-2025-04-14 | OpenAI | ? | 15.66% | 37.50% (27/72) | 10.67% (16/150) | 3.60% (5/139) | 21.33% (32/150) |
| 15 | gpt-4o-2024-11-20 | OpenAI | ? | 15.85% | 40.28% (29/72) | 10.00% (15/150) | 5.04% (7/139) | 20.00% (30/150) |
| 16 | claude-3-7-sonnet-20250219 | Anthropic | ? | 16.05% | 38.89% (28/72) | 14.67% (22/150) | 9.35% (13/139) | 12.67% (19/150) |
| 17 | claude-3-7-sonnet-thinking-20250219 | Anthropic | ? | 16.24% | 45.83% (33/72) | 10.67% (16/150) | 9.35% (13/139) | 14.00% (21/150) |
| 18 | Llama-3.3-70B-Instruct | Llama | 70B | 16.44% | 44.44% (32/72) | 8.67% (13/150) | 4.32% (6/139) | 22.00% (33/150) |
| 19 | phi-4 | Microsoft | 14B | 17.03% | 44.44% (32/72) | 8.00% (12/150) | 4.32% (6/139) | 24.67% (37/150) |
| 20 | Mistral-Small-24B-Instruct-2501 | Mistral AI | 24B | 17.03% | 43.06% (31/72) | 10.00% (15/150) | 10.07% (14/139) | 18.00% (27/150) |
| 21 | o3-medium-2025-04-16 | OpenAI | ? | 17.81% | 36.11% (26/72) | 19.33% (29/150) | 9.35% (13/139) | 15.33% (23/150) |
| 22 | claude-sonnet-4-thinking-20250514 | Anthropic | ? | 18.20% | 43.06% (31/72) | 15.33% (23/150) | 7.19% (10/139) | 19.33% (29/150) |
| 23 | gpt-4o-mini-2024-07-18 | OpenAI | ? | 18.59% | 51.39% (37/72) | 11.33% (17/150) | 6.47% (9/139) | 21.33% (32/150) |
| 24 | o3-high-2025-04-16 | OpenAI | ? | 18.59% | 43.06% (31/72) | 20.67% (31/150) | 3.60% (5/139) | 18.67% (28/150) |
| 25 | claude-sonnet-4-20250514 | Anthropic | ? | 18.59% | 48.61% (35/72) | 13.33% (20/150) | 5.04% (7/139) | 22.00% (33/150) |
| 26 | Qwen2.5-32B-Instruct | Qwen | 32B | 19.18% | 36.11% (26/72) | 13.33% (20/150) | 6.47% (9/139) | 28.67% (43/150) |
| 27 | claude-opus-4-20250514 | Anthropic | ? | 19.57% | 45.83% (33/72) | 12.67% (19/150) | 10.07% (14/139) | 22.67% (34/150) |
| 28 | llama-4-maverick | Llama | 17B active / 109B total | 20.55% | 51.39% (37/72) | 13.33% (20/150) | 9.35% (13/139) | 23.33% (35/150) |
| 29 | o3-low-2025-04-16 | OpenAI | ? | 20.55% | 47.22% (34/72) | 22.67% (34/150) | 7.19% (10/139) | 18.00% (27/150) |
| 30 | Qwen2.5-72B-Instruct | Qwen | 72B | 20.74% | 43.06% (31/72) | 12.67% (19/150) | 18.71% (26/139) | 20.00% (30/150) |
| 31 | QwQ-32B | Qwen | 32B | 24.66% | 50.00% (36/72) | 30.00% (45/150) | 6.47% (9/139) | 24.00% (36/150) |
| 32 | glm-4-9b-chat-hf | THUDM | 9B | 25.44% | 38.89% (28/72) | 9.33% (14/150) | 12.23% (17/139) | 47.33% (71/150) |
| 33 | o4-mini-medium-2025-04-16 | OpenAI | ? | 25.83% | 44.44% (32/72) | 23.33% (35/150) | 11.51% (16/139) | 32.67% (49/150) |
| 34 | o4-mini-low-2025-04-16 | OpenAI | ? | 27.98% | 44.44% (32/72) | 30.00% (45/150) | 16.55% (23/139) | 28.67% (43/150) |
| 35 | Llama-3.1-8B-Instruct | Llama | 8B | 28.38% | 44.44% (32/72) | 12.67% (19/150) | 12.23% (17/139) | 51.33% (77/150) |
| 36 | Qwen2.5-14B-Instruct | Qwen | 14B | 28.96% | 54.17% (39/72) | 18.00% (27/150) | 6.47% (9/139) | 48.67% (73/150) |
| 37 | o4-mini-high-2025-04-16 | OpenAI | ? | 29.94% | 54.17% (39/72) | 23.33% (35/150) | 17.99% (25/139) | 36.00% (54/150) |
| 38 | Ministral-8B-Instruct-2410 | Mistral AI | 8B | 30.92% | 56.94% (41/72) | 16.67% (25/150) | 11.51% (16/139) | 50.67% (76/150) |
| 39 | Phi-4-mini-instruct | Microsoft | 3.8B | 38.36% | 61.11% (44/72) | 21.33% (32/150) | 5.04% (7/139) | 75.33% (113/150) |
| 40 | Qwen2.5-7B-Instruct | Qwen | 7B | 38.55% | 62.50% (45/72) | 26.00% (39/150) | 15.83% (22/139) | 60.67% (91/150) |
| 41 | AI21-Jamba-mini-1.6 | AI21 Labs | 12B active / 52B total | 39.14% | 38.89% (28/72) | 25.33% (38/150) | 30.22% (42/139) | 61.33% (92/150) |
| 42 | Llama-3.2-3B-Instruct | Llama | 3B | 46.18% | 72.22% (52/72) | 28.67% (43/150) | 13.67% (19/139) | 81.33% (122/150) |
| 43 | Qwen2.5-3B-Instruct | Qwen | 3B | 55.97% | 69.44% (50/72) | 42.00% (63/150) | 27.34% (38/139) | 90.00% (135/150) |
| 44 | Qwen2.5-1.5B-Instruct | Qwen | 1.5B | 66.73% | 84.72% (61/72) | 60.67% (91/150) | 40.29% (56/139) | 88.67% (133/150) |
| 45 | Llama-3.2-1B-Instruct | Llama | 1B | 67.71% | 75.00% (54/72) | 64.67% (97/150) | 44.60% (62/139) | 88.67% (133/150) |
| 46 | Qwen2.5-0.5B-Instruct | Qwen | 0.5B | 76.32% | 88.89% (64/72) | 74.00% (111/150) | 58.27% (81/139) | 89.33% (134/150) |

📊 Methodology

Our framework combines the FaithBench and RagTruth benchmarks to evaluate models across diverse RAG tasks.

  • FaithBench (Summarization): FaithBench provides hallucination annotations for summaries generated by 10 different LLMs, including GPT-3.5, GPT-4, Gemini-1.5-Flash, Claude-3.5-Sonnet, Command-R, Llama-3.1-70B, Llama-3.1-8B, Qwen2.5-7B, Phi-3-mini-4k, and Mistral-7B. Annotating responses from this diverse set of LLMs enables analysis of a wide range of hallucination types.

  • RagTruth (Summarization, Question Answering, and Data-to-Text Writing): RagTruth provides hallucination annotations for summaries, answers to questions, and overviews of structured data in JSON format, generated by 6 different LLMs, including GPT-3.5, GPT-4, Llama-2 (7B, 13B, and 70B), and Mistral-7B.

We rank models by their overall hallucination rate across all of the above tasks, reflecting reliability in practical RAG deployments. The overall rate pools the hallucinated responses across the four evaluation sets (72 + 150 + 139 + 150 = 511 responses in total), as illustrated in the sketch below. The detailed methodology can be found in our accompanying paper (see the Paper section below).
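For concreteness, a model's overall rate can be reproduced by pooling its per-task counts from the leaderboard. The sketch below uses the gemini-2.5-flash row; the dictionary keys are illustrative labels, not field names from this repository.

```python
# Illustrative only: reproduce the pooled overall hallucination rate from the
# per-task counts shown in the leaderboard (numbers for gemini-2.5-flash).
per_task_counts = {
    "faithbench_summarization": (14, 72),    # (hallucinated, total)
    "ragtruth_summarization": (7, 150),
    "ragtruth_question_answering": (4, 139),
    "ragtruth_data_to_text": (7, 150),
}

hallucinated = sum(h for h, _ in per_task_counts.values())
total = sum(n for _, n in per_task_counts.values())
print(f"Overall hallucination rate: {hallucinated / total:.2%}")  # -> 6.26%
```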

For each task, we prompt the LLM to respond based strictly on the information that it is given. We measure hallucinations when the LLM adds claims, details, implications, or contexts that are unsupported or contradicted by the provided source information.

Summarization Prompt:

System Prompt: "You must respond based strictly on the information in a provided passage. Do not incorporate any external knowledge or infer any details beyond what is given in the passage."

User Prompt: "Provide a concise summary of the following passage, covering the core pieces of information described."

Question-Answering Prompt:

System Prompt: "You must respond based strictly on the information in provided passages. Do not incorporate any external knowledge or infer any details beyond what is given in the passages."

User Prompt: "Provide a concise answer to the following question based on the information in the provided passages."

Data-to-Text Writing Prompt:

System Prompt: "You must respond based strictly on the information in the provided structured data in the JSON format. Do not incorporate any external knowledge or infer any details beyond what is given in the data."

User Prompt: "Write a concise, objective overview of the following local business, based solely on the structured data provided in JSON format. You should include important details and cover key information mentioned in the customers' reviews."
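As a rough sketch of how such a prompt pair might be sent to a model, the snippet below pairs the summarization prompts quoted above with the OpenAI Python SDK. The function name, model choice, and temperature setting are assumptions for illustration, not necessarily what generate_responses.py does.

```python
# A minimal sketch (not the repository's exact code) showing how the
# summarization prompts above could be turned into a chat request.
from openai import OpenAI

SYSTEM_PROMPT = (
    "You must respond based strictly on the information in a provided passage. "
    "Do not incorporate any external knowledge or infer any details beyond "
    "what is given in the passage."
)
USER_PROMPT = (
    "Provide a concise summary of the following passage, covering the core "
    "pieces of information described."
)

def summarize(passage: str, model: str = "gpt-4o-2024-11-20") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{USER_PROMPT}\n\n{passage}"},
        ],
        temperature=0,  # assumption: deterministic decoding; actual settings may differ
    )
    return response.choices[0].message.content
```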

Unlike previous work on RAG hallucination and faithfulness benchmarking, which relies on fine-tuned hallucination detection models or LLM judges prompted in a zero-shot manner, we use FaithJudge. We prompt an LLM judge (currently o3-mini-high) with human-annotated examples of LLM responses for the same task. For example, to evaluate a generated summary, we prompt the LLM judge with other LLM-generated summaries of the same article, together with the corresponding human hallucination annotations, which mark hallucinated spans and include brief explanatory notes. We find that these annotations help guide the LLM judge and yield stronger agreement with the gold-standard human annotations.
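For intuition, the sketch below shows one way annotated examples for the same article could be folded into a judge prompt. The data structure, function name, and wording are hypothetical illustrations, not the actual FaithJudge prompt.

```python
# A minimal, hypothetical sketch of building a few-shot judge prompt from
# human-annotated summaries of the same source article.
from dataclasses import dataclass

@dataclass
class AnnotatedExample:
    summary: str
    hallucinated_spans: list[str]
    notes: str

def build_judge_prompt(article: str,
                       examples: list[AnnotatedExample],
                       candidate_summary: str) -> str:
    parts = [f"Source article:\n{article}\n", "Human-annotated reference summaries:"]
    for i, ex in enumerate(examples, 1):
        spans = "; ".join(ex.hallucinated_spans) or "none"
        parts.append(
            f"Example {i}:\nSummary: {ex.summary}\n"
            f"Hallucinated spans: {spans}\nAnnotator notes: {ex.notes}\n"
        )
    parts.append(
        "Now judge the following summary of the same article. Explain your "
        "reasoning, then state whether it contains any content that is "
        "unsupported by or contradicts the article.\n"
        f"Summary to judge: {candidate_summary}"
    )
    return "\n".join(parts)
```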

Why It Matters

Evaluating hallucinations in LLMs for RAG tasks is crucial for improving trustworthiness. By quantifying hallucinations in models, our benchmark helps researchers and practitioners make informed decisions about model selection and deployment.

📖 How to Use This Repository

  • Explore Responses and Judgements: We make the responses from LLMs available in generated_outputs/. We also make the LLM Judge's evaluations available in eval_results/. The LLM Judge provides its reasoning in its evaluations, which may be of interest.

  • Reproduce Results: We provide scripts to generate responses (generate_responses.py) and to evaluate them (eval.py).

    To generate responses:

      python3 generate_responses.py --model openai/gpt-4o-2024-11-20
    

    To evaluate responses:

      python3 eval.py --model openai/gpt-4o-2024-11-20 --judge_model o3-mini
    

Paper

Please check out our paper Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards:

@article{tamber2025benchmarking,
  title={Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards},
  author={Tamber, Manveer Singh and Bao, Forrest Sheng and Xu, Chenyu and Luo, Ge and Kazi, Suleman and Bae, Minseok and Li, Miaoran and Mendelevitch, Ofer and Qu, Renyi and Lin, Jimmy},
  journal={arXiv preprint arXiv:2505.04847},
  year={2025}
}

🔗 Also check out
