
Modular Implementation of Search-Augmented Factuality Evaluator #20

@chandralegend

Overview

I've implemented a modular Python package, SAF-Eval, that evaluates the factuality of LLM responses, following the research paper "Long-form factuality in large language models" by Wei et al. (2024). The package breaks responses down into atomic facts and evaluates each fact against retrieved evidence to produce a detailed factuality score.

https://github.com/chandralegend/saf-eval

Features implemented

  • 🔍 Atomic Fact Extraction: Extracts atomic factual claims from responses with optional few-shot learning
  • Self-Containment Processing: Automatically detects and fixes non-self-contained facts
  • 🧠 Flexible Retrieval: Integrates with any document retrieval system (with a simple reference implementation)
  • 🔄 Fact Deduplication: Eliminates redundant or highly similar facts
  • 📊 Comprehensive Evaluation: Classifies facts and calculates factuality scores
  • 📝 Detailed Logging: Full pipeline logging for analysis and debugging
  • 🧩 Modular Architecture: All components can be customized or replaced
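
To make the deduplication feature concrete, here is a minimal sketch of one way near-duplicate facts could be dropped. It is not taken from the SAF-Eval code: the function name `deduplicate_facts`, the `threshold` parameter, and the character-level similarity from the standard library's `difflib` are all assumptions chosen to keep the example dependency-free. A real implementation might well use embeddings or an LLM judgment instead.

```python
from difflib import SequenceMatcher


def deduplicate_facts(facts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each fact and drop later near-duplicates.

    Hypothetical helper: similarity is a plain character-level ratio,
    used here only so the sketch needs no third-party packages.
    """
    kept: list[str] = []
    for fact in facts:
        normalized = fact.strip().lower()
        # A fact is redundant if it is close enough to any fact already kept.
        is_duplicate = any(
            SequenceMatcher(None, normalized, other.strip().lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(fact)
    return kept


facts = [
    "Paris is the capital of France.",
    "paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
unique_facts = deduplicate_facts(facts)
```

The comparison is quadratic in the number of facts, which is usually acceptable for the handful of facts extracted from a single response.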

Technical details

  • Fully async implementation
  • Provider-agnostic (works with any LLM)
  • Supports the new OpenAI structured output format
  • Includes a unified configuration system
  • Comes with a comprehensive test suite
  • Includes example scripts for all key features
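
As an illustration of what "fully async" and "provider-agnostic" can look like together, the sketch below defines a small structural interface and classifies several facts concurrently. The names `LLMClient`, `EchoClient`, `complete`, and `classify_all` are hypothetical and are not claimed to match the package's actual API; `EchoClient` is a stand-in that returns canned labels rather than calling any provider.

```python
import asyncio
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical minimal interface: any provider wrapper with this method fits."""

    async def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Stand-in provider for the sketch; returns a canned label, calls no API."""

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return "supported" if "Paris" in prompt else "not_supported"


async def classify_all(client: LLMClient, facts: list[str]) -> list[str]:
    """Classify every fact concurrently; results keep the input order."""
    prompts = [f"Is this claim supported by the evidence? {fact}" for fact in facts]
    return await asyncio.gather(*(client.complete(p) for p in prompts))


labels = asyncio.run(
    classify_all(EchoClient(), ["Paris is in France.", "The moon is made of cheese."])
)
```

Because `LLMClient` is a `Protocol`, a wrapper around any provider satisfies it without inheriting from a shared base class, which is one common way to keep components swappable.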

This implementation follows the methodology outlined in the cited paper, where responses are evaluated by:

  1. Breaking them down into atomic facts
  2. Ensuring each fact is self-contained
  3. Deduplicating similar facts
  4. Retrieving evidence for each fact
  5. Classifying facts based on evidence
  6. Calculating an overall factuality score
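
Steps 5 and 6 above can be sketched as follows, assuming the score is the F1@K metric defined in the cited paper: precision is the share of supported facts among supported and not-supported facts, and recall is the number of supported facts relative to a target count K, capped at 1. Whether SAF-Eval computes exactly this score is an assumption, and the function name `f1_at_k` and the label strings are hypothetical.

```python
from collections import Counter


def f1_at_k(labels: list[str], k: int) -> float:
    """Compute F1@K from per-fact labels.

    Assumed label strings: "supported", "not_supported", "irrelevant".
    Irrelevant facts affect neither precision nor recall.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts = Counter(labels)
    supported = counts["supported"]
    not_supported = counts["not_supported"]
    if supported == 0:
        return 0.0  # no supported facts: the score is defined as zero
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)  # K is the desired number of supported facts
    return 2 * precision * recall / (precision + recall)


labels = ["supported", "supported", "supported", "not_supported", "irrelevant"]
score = f1_at_k(labels, k=4)  # precision 3/4, recall 3/4
```

K expresses how many supported facts a reader would want from a response, so a short but fully accurate answer is penalized on recall rather than precision.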

I'd appreciate feedback on the implementation and would welcome any suggestions for improvements or additional features.
