
👋 Hi, everyone!
We are the ByteDance Seed team.


APE-Bench I: An Automated Proof Engineering Benchmark

APE-Bench I is a comprehensive benchmark and accompanying codebase for evaluating the capabilities of Large Language Models (LLMs) in the domain of automated proof engineering within the Lean 4 theorem proving environment. This project is the official implementation of the research paper "APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries".

The benchmark focuses on realistic proof engineering tasks, such as bug fixing, feature implementation, and refactoring within the context of the Mathlib4 library. A core component of this project is Eleanstic, an efficient, version-aware Lean/Mathlib environment designed to manage and verify Lean code against specific historical Mathlib commit states with significantly reduced computational and storage overhead.

Welcome to explore the codebase and contribute to the advancement of AI in formal mathematics!

Table of Contents

  • Introduction
  • Key Features
  • Project Structure
  • Getting Started
  • Basic Workflow
  • Automated Workflow Script
  • Documentation
  • License
  • Citation

Introduction

Automated proof engineering aims to leverage AI to assist mathematicians and developers in creating, maintaining, and verifying formal proofs. APE-Bench I provides a standardized set of tasks derived from real-world Mathlib4 development activities, enabling rigorous assessment of LLMs on their ability to generate correct and meaningful code patches.

This repository contains:

  • The APE-Bench I dataset specification.
  • The Eleanstic system for efficient, version-aware Lean verification.
  • The APE-Bench core codebase for task execution, patch generation via LLMs, the DiffRepair utility for patch normalization, and a two-stage evaluation pipeline (syntactic and semantic).
  • Scripts and configurations to reproduce paper results and run new experiments.
  • Comprehensive documentation.

Key Features

  • Realistic Benchmark Tasks: A curated set of tasks covering bug fixes, feature implementations, and refactoring in Lean 4, based on Mathlib4 history.
  • Eleanstic: A novel system for managing multiple Mathlib4 versions efficiently. It uses content-addressable storage (CAS) and snapshotting to enable rapid restoration and verification of Lean code against specific commit states, drastically reducing disk space and setup time. A conceptual sketch of this idea follows after this list.
  • DiffRepair Utility: A robust tool to parse, clean, and apply noisy, LLM-generated diff patches, significantly improving patch application success rates.
  • Two-Stage Evaluation:
    1. Syntactic Verification: Uses Eleanstic to compile and check the patched Lean code against the correct Mathlib version.
    2. Semantic Judgement: Employs an "LLM-as-a-Judge" approach to assess whether syntactically valid patches correctly fulfill the task's natural language instruction.
  • Modular and Extensible Codebase: Designed to facilitate the integration of new LLMs, custom datasets, and novel evaluation methodologies.
  • Reproducibility: Provides tools and configurations to replicate the experiments and results presented in the APE-Bench I paper.
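
To make the content-addressable storage idea behind Eleanstic concrete, here is a minimal, purely illustrative shell sketch. It is not Eleanstic's actual implementation: the build directory (.lake/build), the store layout, and the file names are all assumptions. Each unique file body is stored once under its content hash, and a snapshot of a commit is just a manifest mapping paths to hashes, so restoring a commit's build state is a cheap copy rather than a full rebuild.

# Conceptual sketch only -- not Eleanstic's real implementation or on-disk layout.
STORE=.cas_store            # where unique content blobs live (illustrative path)
MANIFEST=snapshot.manifest  # path -> hash manifest describing one commit's files
mkdir -p "$STORE"
: > "$MANIFEST"

# Snapshot: hash every build artifact (assumes a build directory such as .lake/build).
find .lake/build -type f | while read -r f; do
  h=$(sha256sum "$f" | awk '{print $1}')
  [ -e "$STORE/$h" ] || cp "$f" "$STORE/$h"   # store each unique blob exactly once
  printf '%s %s\n' "$h" "$f" >> "$MANIFEST"   # record path -> content hash
done

# Restore: materialise the snapshot into a fresh tree by copying blobs back.
while read -r h f; do
  mkdir -p "restore/$(dirname "$f")"
  cp "$STORE/$h" "restore/$f"
done < "$MANIFEST"

Because identical files across commits hash to the same blob, storage grows with the number of unique file contents rather than the number of commits, which is what lets many Mathlib versions stay available for verification at low cost.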

Project Structure

.
├── datasets/                 # Directory for APE-Bench I dataset files (e.g., downloaded by helper scripts)
├── docs/                     # Comprehensive project documentation
│   ├── 01_introduction.md
│   ├── 02_setup.md
│   ├── ...
│   └── 04_core_components/
│       ├── 04_1_eleanstic.md
│       └── ...
├── src/
│   ├── apebench/             # Core APE-Bench logic (inference, evaluation, etc.)
│   │   ├── config/
│   │   ├── data/
│   │   ├── evaluation_pipelines/
│   │   ├── inference/
│   │   └── scripts/
│   └── eleanstic/            # Eleanstic system for version-aware Lean verification
│       ├── core/
│       ├── utils/
│       └── main.py
├── README.md                 # This file
├── requirements.txt          # Python dependencies
└── ...                       # Other configuration and script files

Getting Started

For detailed instructions on setting up the environment, installing dependencies, and downloading the dataset, please refer to docs/02_setup.md.

For a comprehensive guide on the main workflow, including running the benchmark pipeline (patch generation, verification, and evaluation), please see the Automated Workflow Script section below.

Basic Workflow

The typical workflow for using APE-Bench I involves:

  1. Patch Generation: Use an LLM to generate patches for tasks in the APE-Bench I dataset.
    python -m src.apebench.scripts.1_generate_patches --config path/to/your_config.yaml ...
  2. Syntactic Verification: Use Eleanstic to check that the patched code compiles and type-checks against the correct Mathlib version.
    python -m src.apebench.scripts.2_verify_patches --config path/to/your_config.yaml ...
  3. Semantic Evaluation: Assess whether syntactically correct patches fulfill the task's intent using an LLM-as-a-Judge.
    python -m src.apebench.scripts.3_evaluate_patches --config path/to/your_config.yaml ...
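
Taken together, a full run amounts to invoking the three scripts in order against the same configuration file. For example (the configuration path below is a placeholder; adapt it to your setup, or use the wrapper script described in the next section):

# Run all three stages against one configuration file (placeholder path).
CONFIG=configs/config.yaml

python -m src.apebench.scripts.1_generate_patches --config "$CONFIG"
python -m src.apebench.scripts.2_verify_patches --config "$CONFIG"
python -m src.apebench.scripts.3_evaluate_patches --config "$CONFIG"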

Automated Workflow Script

For convenience, we provide run_ape_bench_example.sh, a bash script that automates the complete APE-Bench workflow:

  1. Setup: Clones the necessary repositories (mathlib4 and the APE-Bench dataset).
  2. Eleanstic Build: Uses Eleanstic to preprocess the Mathlib commits required for evaluation (identified from the dataset file).
  3. APE-Bench Pipeline:
    • Generates patches using the configured LLM
    • Verifies patches using Eleanstic
    • Evaluates patches using the LLM-as-Judge

To use this script:

# Make the script executable
chmod +x run_ape_bench_example.sh

# Edit the script to set your configuration paths
# Then run it
./run_ape_bench_example.sh

You'll need to customize the configuration parameters at the top of the script, in particular the paths to your mathlib4 clone and to the configuration files. Also make sure that the main APE-Bench configuration file (default: configs/config.yaml) and the Eleanstic configuration file (default: src/eleanstic/config.yaml), whose paths are likewise set in the script, are themselves configured consistently; for example, their internal Mathlib path and dataset file settings should match the script's variables.
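
As a rough illustration of what that customization looks like, the snippet below shows the kind of variable block you would edit at the top of run_ape_bench_example.sh. The variable names are hypothetical, not necessarily the ones the script actually uses; check the script itself before relying on them.

# Hypothetical configuration block at the top of run_ape_bench_example.sh
# (names are illustrative only -- consult the real script).
MATHLIB_REPO_PATH=/path/to/mathlib4          # your local mathlib4 clone
APEBENCH_CONFIG=configs/config.yaml          # main APE-Bench configuration file
ELEANSTIC_CONFIG=src/eleanstic/config.yaml   # Eleanstic configuration file
# The Mathlib path and dataset file referenced inside these two YAML files
# should match the values set here.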

Documentation

Comprehensive documentation for the APE-Bench I project, including detailed setup instructions, explanations of core components, workflow guides, and development information, can be found in the ./docs directory.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use APE-Bench I or Eleanstic in your research, please cite our paper:

@article{xin2025apebench,
    title={{APE-Bench I}: Towards File-level Automated Proof Engineering of Formal Math Libraries},
    author={Huajian Xin and Luming Li and Xiaoran Jin and Jacques Fleuriot and Wenda Li},
    year={2025},
    journal={arXiv preprint arXiv:2504.19110}
}

Founded in 2023, the ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research group and to make significant contributions to the advancement of science and society.


