Recovery-Bench is a benchmark for evaluating the capability of LLM agents to recover from mistakes. This repository provides the tools to generate Recovery-Bench traces and run replay/recovery agents on terminal-based tasks.
🔗 Read more on our blog: letta.com/blog/recovery-bench
python3 -m recovery-bench.generate_traces openai/gpt-4o-mini
python3 -m recovery-bench.run_replay_agent \
--trajectory-folder runs/gpt-4o-mini-collected-20250714_232243 \
--model-name anthropic/claude-sonnet-4-20250514
Generate complete recovery-bench traces for a model:
python3 -m recovery-bench.generate_traces openai/gpt-4o-mini \
--dataset-version 0.2.15 \
--min-episodes 10 \
--n-concurrent 4 \
--max-iterations 3
Key options:
- --dataset-version: Dataset version (default: 0.2.15)
- --min-episodes: Minimum episodes per task (default: 10)
- --n-concurrent: Number of concurrent processes (default: 4)
- --max-iterations: Maximum replay iterations (default: 3)
- --run-initial: Only run initial traces, skip replay iterations
- --task-folder: Path to task definitions folder (default: ./terminal-bench/tasks)
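The options above map naturally onto a command-line parser. The following is a minimal sketch of how such a parser might look, not the repository's actual implementation: the function name build_parser, the positional argument name model, and the treatment of --run-initial as a boolean switch are assumptions made for illustration. Only the flag names and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented generate_traces options."""
    parser = argparse.ArgumentParser(prog="generate_traces")
    # Positional model name, e.g. openai/gpt-4o-mini (argument name is assumed).
    parser.add_argument("model")
    parser.add_argument("--dataset-version", default="0.2.15")
    parser.add_argument("--min-episodes", type=int, default=10)
    parser.add_argument("--n-concurrent", type=int, default=4)
    parser.add_argument("--max-iterations", type=int, default=3)
    # Assumed to be a switch: when set, replay iterations are skipped.
    parser.add_argument("--run-initial", action="store_true")
    parser.add_argument("--task-folder", default="./terminal-bench/tasks")
    return parser


# Override one option and leave the rest at their documented defaults.
args = build_parser().parse_args(["openai/gpt-4o-mini", "--min-episodes", "5"])
print(args.min_episodes, args.max_iterations, args.run_initial)  # 5 3 False
```

Omitting every optional flag yields the documented defaults, so the shortest invocation only needs the model name.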
Run the replay/recovery agent on collected traces:
python3 -m recovery-bench.run_replay_agent \
--trajectory-folder runs/gpt-4o-mini-collected-20250714_232243 \
--model-name anthropic/claude-sonnet-4-20250514 \
--run-id sonnet-correction-1 \
--n-concurrent 4
Key options:
- --trajectory-folder: Path to the trajectory folder (required)
- --model-name: Model name to use (required)
- --run-id: Custom run identifier
- --n-concurrent: Number of concurrent processes
- --task-folder: Path to task definitions folder (default: ./terminal-bench/tasks)
- --cleanup-container: Clean up Docker containers before running
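When scripting several recovery runs, it can help to assemble the replay command programmatically. The sketch below builds the argument list from the documented flags. The helper name replay_command is hypothetical, and --cleanup-container is assumed to be a switch that takes no value; neither is confirmed by the repository.

```python
import shlex
from typing import List, Optional


def replay_command(
    trajectory_folder: str,
    model_name: str,
    run_id: Optional[str] = None,
    n_concurrent: Optional[int] = None,
    task_folder: Optional[str] = None,
    cleanup_container: bool = False,
) -> List[str]:
    """Build the argv list for run_replay_agent (hypothetical helper)."""
    cmd = [
        "python3", "-m", "recovery-bench.run_replay_agent",
        "--trajectory-folder", trajectory_folder,  # required
        "--model-name", model_name,                # required
    ]
    # Optional flags are appended only when the caller supplies them.
    if run_id is not None:
        cmd += ["--run-id", run_id]
    if n_concurrent is not None:
        cmd += ["--n-concurrent", str(n_concurrent)]
    if task_folder is not None:
        cmd += ["--task-folder", task_folder]
    if cleanup_container:
        cmd.append("--cleanup-container")  # assumed to take no value
    return cmd


cmd = replay_command(
    "runs/gpt-4o-mini-collected-20250714_232243",
    "anthropic/claude-sonnet-4-20250514",
    run_id="sonnet-correction-1",
    n_concurrent=4,
)
print(shlex.join(cmd))  # shell-quoted form of the command, for inspection
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the run. Passing a list rather than a single string avoids shell-quoting problems in folder names.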
TODOs:
- Bump the terminal-bench version (blocked, pending tb 2.0)
- Add a ReplayAgent for terminus2
- Look into SWE-bench