LoCoDiff is a novel long-context benchmark for evaluating language models' ability to understand git history and reconstruct code. Developed by the Mentat AI team, this benchmark offers several unique strengths:
- Utilizes naturally interconnected content, not artificially generated or padded context
- No junk context: every part of the context is required for the task
- Tests a real skill critical for coding agents: keeping track of the state of edited files
- Prompt generation and output evaluation are simple and easy to understand
- Challenges models' capacity to generate long-form outputs
- Surprisingly difficult even for reasoning models
- Easy to procedurally generate: any file in any git repo can be made into a benchmark case (see the sketch below)
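
As a rough illustration of how simple case generation and scoring can be, the sketch below builds a prompt from a file's full git history and scores a model's output by exact match against the file's current contents. The prompt wording, the `generate_case`/`score_output` names, and the use of `git log -p --reverse` are illustrative assumptions, not the benchmark's exact implementation.

```python
# Minimal sketch of case generation and exact-match scoring.
# Illustrative only; prompt text and helper names are assumptions.
import subprocess
from pathlib import Path


def generate_case(repo_path: str, file_path: str) -> dict:
    """Build a case: the prompt is the file's git history (oldest commit first),
    and the expected output is the file's current contents."""
    history = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--reverse", "--", file_path],
        capture_output=True, text=True, check=True,
    ).stdout
    expected = (Path(repo_path) / file_path).read_text(encoding="utf-8")
    prompt = (
        "Below is the complete git history of a file. "
        "Reconstruct the file's exact current contents.\n\n" + history
    )
    return {"prompt": prompt, "expected": expected}


def score_output(model_output: str, expected: str) -> bool:
    """Exact-match scoring (whether trailing whitespace is normalized is an
    assumption here): the reconstruction must match the file on disk."""
    return model_output.strip() == expected.strip()
```

Because scoring reduces to comparing the model's output against the file on disk, there is no judge model or test harness to maintain, which is what keeps prompt generation and evaluation easy to understand.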
For results, methodology, and analysis, see the benchmark website.
For instructions on running the benchmark yourself, see the benchmark pipeline README.