Overview
I've implemented a modular Python package for evaluating the factuality of LLM responses based on the research paper "Long-form factuality in large language models" by Wei et al. (2024). This package (SAF-Eval) breaks down responses into atomic facts and evaluates them against retrieved evidence to provide detailed factuality scoring.
https://github.com/chandralegend/saf-eval
Features implemented
- 🔍 Atomic Fact Extraction: Extracts atomic factual claims from responses with optional few-shot learning
- ✅ Self-Containment Processing: Automatically detects and fixes non-self-contained facts
- 🧠 Flexible Retrieval: Integrates with any document retrieval system (with a simple reference implementation)
- 🔄 Fact Deduplication: Eliminates redundant or highly similar facts
- 📊 Comprehensive Evaluation: Classifies facts and calculates factuality scores
- 📝 Detailed Logging: Full pipeline logging for analysis and debugging
- 🧩 Modular Architecture: All components can be customized or replaced
Technical details
- Fully async implementation
- Provider-agnostic (works with any LLM)
- Supports OpenAI's structured output format
- Includes a unified configuration system
- Comes with a comprehensive test suite
- Includes example scripts for all key features (a rough sketch of the async evaluation flow follows below)
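
For orientation, here is a minimal, self-contained sketch of that evaluation flow. All names and stub functions below are illustrative placeholders rather than SAF-Eval's actual API; see the repository's example scripts for the real interface.

```python
# Illustrative sketch of the async evaluation flow (not SAF-Eval's API).
# The LLM and retrieval calls are stubbed out so the skeleton runs as-is.
import asyncio
from dataclasses import dataclass


@dataclass
class FactResult:
    fact: str
    label: str  # e.g. "supported", "not_supported", "irrelevant"


async def extract_atomic_facts(response: str) -> list[str]:
    # In the real pipeline an LLM splits the response into atomic claims
    # (self-containment fixing and deduplication are omitted here for brevity).
    return [s.strip() for s in response.split(".") if s.strip()]


async def retrieve_evidence(fact: str) -> list[str]:
    # Placeholder for any document retrieval backend.
    return [f"evidence for: {fact}"]


async def classify_fact(fact: str, evidence: list[str]) -> FactResult:
    # Placeholder for the LLM-based classification step.
    return FactResult(fact=fact, label="supported")


async def evaluate(response: str) -> list[FactResult]:
    facts = await extract_atomic_facts(response)

    # Facts are independent, so retrieval and classification can run concurrently.
    async def _check(fact: str) -> FactResult:
        evidence = await retrieve_evidence(fact)
        return await classify_fact(fact, evidence)

    return await asyncio.gather(*(_check(f) for f in facts))


if __name__ == "__main__":
    results = asyncio.run(
        evaluate("Paris is the capital of France. The Seine flows through it.")
    )
    for result in results:
        print(result)
```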
This implementation follows the methodology outlined in the cited paper, in which responses are evaluated by:
- Breaking them down into atomic facts
- Ensuring each fact is self-contained
- Deduplicating similar facts
- Retrieving evidence for each fact
- Classifying facts based on evidence
- Calculating an overall factuality score (see the F1@K sketch below)
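
For the scoring step, the cited paper aggregates per-fact labels into an F1@K score: precision over the classified facts combined with recall against a target count K of supported facts. A minimal sketch of that aggregation (the function name and example numbers are illustrative):

```python
# F1@K aggregation as described in Wei et al. (2024): precision over the
# classified facts, recall measured against a target count K of supported facts.
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Example: 38 supported facts, 4 not supported, K = 64  ->  ~0.717
print(round(f1_at_k(38, 4, 64), 3))
```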
I'd appreciate feedback on the implementation and would welcome any suggestions for improvements or additional features.