A comprehensive study on reducing hallucinations in Large Language Models through strategic prompt engineering techniques.
## Table of Contents

- Overview
- Problem Statement
- Methodology
- Results
- Installation
- Prompt Strategies
- Evaluation Framework
- Contributing
- Citation
- Team
## Overview

Large Language Models (LLMs) such as GPT-4, Claude, and Mistral demonstrate remarkable capabilities but suffer from hallucinations: generating factually incorrect or fabricated content with high confidence. This project explores prompt engineering strategies as a lightweight, non-invasive way to reduce hallucinations in open-domain question answering.
- 🎯 Best F1 score of 0.15 with the hybrid prompting strategy, up from 0.05 for vanilla QA
- 📉 Significant reduction in hallucination rates across multiple LLMs
- 🔧 Zero-shot approach requiring no model retraining or fine-tuning
- 📊 Comprehensive evaluation across 4+ different LLM architectures
## Problem Statement

Despite advances in model scaling and fine-tuning, LLMs continue to produce hallucinations that:
- Reduce user trust in AI systems
- Create risks in critical domains (healthcare, education, legal)
- Limit practical deployment in production environments
- Generate confident but incorrect responses
**Our Solution:** Develop prompt engineering strategies that improve factual accuracy without requiring access to model internals or expensive retraining.
## Methodology

### Dataset

- Curated QA dataset spanning general knowledge, science, and abstract reasoning
- Balanced across different difficulty levels and domains
- Manual verification of ground truth answers
### Models Evaluated

- GPT-3.5-turbo (baseline)
- Mistral-7B Instruct (open-weight)
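
For the open-weight model, a minimal loading sketch with Hugging Face `transformers` could look like the following; the checkpoint name, precision, and generation settings here are illustrative assumptions, not taken from the project code.

```python
# Illustrative sketch: loading an open-weight Mistral-7B Instruct checkpoint.
# The exact checkpoint version and settings used in the experiments may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a 16 GB GPU such as the P100
).to("cuda")

# Mistral Instruct expects [INST] ... [/INST] chat formatting.
prompt = "[INST] Who wrote 'On the Origin of Species'? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```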
### Prompting Strategies Compared

| Strategy | Description | Key Feature |
|---|---|---|
| Vanilla | Direct question-answer format | Baseline approach |
| Chain-of-Thought (CoT) | Step-by-step reasoning | Encourages logical flow |
| Chain-of-Verification (CoVe) | Self-verification stage | Validates answers |
| Hybrid (CoT + CoVe) | Combined reasoning + verification | Best performance |
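
As a rough illustration, the individual CoT and CoVe strategies can be written as prompt templates along the following lines; these are sketches of the general idea, not the exact templates used in the experiments.

```python
# Illustrative prompt templates for the individual strategies (not the exact
# wording used in the experiments).
COT_TEMPLATE = """
Question: {question}
Let's think step-by-step, then state the final answer on its own line.
Final Answer:
"""

COVE_TEMPLATE = """
Question: {question}
First draft an answer. Then list verification questions that check each
factual claim, answer them independently, and revise the draft if any
check fails.
Final Answer:
"""
```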
## Results

| Prompt Strategy | F1 Score ↑ | Recall ↑ | Hallucination Rate ↓ |
|---|---|---|---|
| Vanilla QA | 0.05 | 0.03 | High |
| Chain-of-Thought | 0.11 | 0.08 | Moderate |
| Chain-of-Verification | 0.13 | 0.10 | Low |
| Hybrid (CoT + CoVe) | 0.15 | 0.10 | Lowest |
### Key Findings

- ✅ Hybrid prompting consistently outperformed individual strategies
- ✅ GPT-4 showed best absolute performance with structured prompts
- ✅ Open-source models (Mistral-7B) demonstrated significant improvements
- ✅ Reasoning quality improved alongside factual accuracy
## Installation

```bash
# Clone the repository
git clone https://github.com/meghajbhat/Reducing-Hallucinations-in-LLMs-using-Prompt-Engineering-Strategies.git
cd Reducing-Hallucinations-in-LLMs-using-Prompt-Engineering-Strategies

# Install dependencies (package versions are listed under Requirements below)
pip install -r requirements.txt
```

The experiments were run on Kaggle with a P100 GPU.
### Requirements

```
openai>=1.0.0
anthropic>=0.8.0
transformers>=4.30.0
torch>=2.0.0
pandas>=1.5.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
seaborn>=0.11.0
tqdm>=4.64.0
```
## 🎯 Prompt Strategies
### Hybrid Prompt Template
```python
HYBRID_TEMPLATE = """
Question: {question}
Let's think step-by-step:
1. [Reasoning step 1]
2. [Reasoning step 2]
3. [Reasoning step 3]
Now let me verify this answer:
- Does this make logical sense?
- Are there any contradictions?
- Is this consistent with known facts?
Final Answer: [Verified answer]
"""
## Evaluation Framework

- F1 Score: Harmonic mean of precision and recall
- Recall: Coverage of relevant information
- Hallucination Rate: Frequency of factually incorrect statements
- Explanation Coherence: Alignment between reasoning and final answer
- Accuracy: Proportion of answers that match the ground truth
- Precision: Proportion of generated content that is correct
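
For reference, a minimal token-overlap implementation of precision, recall, and F1 (SQuAD-style) is sketched below; the project's actual scoring script may normalize text differently.

```python
# Sketch: token-overlap precision, recall, and F1 for a single QA pair.
from collections import Counter

def token_scores(prediction: str, ground_truth: str) -> dict:
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

print(token_scores("Charles Darwin wrote it in 1859", "Charles Darwin"))
```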
## Contributing

We welcome contributions!
## Citation

If you use this work in your research, please cite this repository.
## Team

| Name | SRN |
|---|---|
| Megha Bhat | PES1UG22CS344 |
| Hrishita Patra | PES1UG22CS241 |
| Jahnavi Bobba | PES1UG22CS246 |
| Keerthi K | PES1UG22CS284 |
**Supervisor:** Dr. Surabhi Narayan, PES University, Bengaluru
For questions or collaboration opportunities:
- 📧 Email: meghajbhat@gmail.com
- 💼 LinkedIn: [Megha Bhat](https://www.linkedin.com/in/megha-bhat-20baaa293/)
⭐ Star this repository if you find it helpful!