LoCoBench: Long-Context Code Evaluation Benchmark

LoCoBench is a benchmark for evaluating long-context Large Language Models (LLMs) in complex software development scenarios. It provides 8,000 evaluation scenarios across 10 programming languages, with context lengths ranging from 10K to 1M tokens.

🚀 Quick Start

Prerequisites

Python 3.8 or higher
Git

Installation

# Clone the repository
git clone https://github.com/SalesforceAIResearch/LoCoBench.git
cd LoCoBench

# Install dependencies
pip install -r requirements.txt

# Install LoCoBench package
pip install -e .

Environment Setup

Configure API Keys

Create an api.sh file (gitignored) with your LLM API credentials:

# Copy the template
cp api.sh.template api.sh

# Edit api.sh with your API keys
export OPENAI_API_KEY="your_openai_key_here"
export ANTHROPIC_API_KEY="your_anthropic_key_here"
export GOOGLE_API_KEY="your_google_key_here"

# Source the file
source api.sh
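
Before launching an evaluation, it can help to confirm that the variables exported by api.sh are visible to the Python process. The following is a minimal sketch and not part of LoCoBench: the helper name missing_api_keys is hypothetical, and the only variable names it assumes are the three shown in the example above.

```python
import os

# Variable names taken from the api.sh example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

To check a subset of providers, pass a shorter tuple as the required argument of this helper.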

📊 Running Evaluations

Option 1: Quick Evaluation (Recommended)

Run evaluation on pre-generated scenarios:

# Evaluate a single model on all scenarios
locobench evaluate --model gpt-4o --config-path config.yaml

# Evaluate specific task categories
locobench evaluate --model claude-sonnet-4 --task-category architectural_understanding --difficulty hard

# Evaluate multiple models in parallel
locobench evaluate --model gpt-4o,claude-sonnet-4,gemini-2.5-pro --config-path config.yaml

Option 2: Custom Evaluation

# Evaluate on specific programming languages
locobench evaluate --model gpt-4o --languages python,java,cpp

# Evaluate specific domains
locobench evaluate --model gemini-2.5-pro --domains web_applications,ml_systems
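
For scripted sweeps, the commands above can be assembled programmatically. This is a sketch under stated assumptions: build_evaluate_command is a hypothetical helper, it uses only the flags that appear in this README (--model, --config-path, --languages, --domains), and it only prints the command rather than running it.

```python
import shlex


def build_evaluate_command(model, config_path=None, languages=None, domains=None):
    """Assemble a `locobench evaluate` argument list from the flags shown above."""
    command = ["locobench", "evaluate", "--model", model]
    if config_path:
        command += ["--config-path", config_path]
    if languages:
        command += ["--languages", ",".join(languages)]
    if domains:
        command += ["--domains", ",".join(domains)]
    return command


cmd = build_evaluate_command("gpt-4o", languages=["python", "java", "cpp"])
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell-quoting problems if it is later passed to subprocess.run.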

Evaluation Results

Results are saved in the evaluation_results/ directory:

evaluation_results/
├── gpt4o_evaluation_results.json          # Detailed results
├── gpt4o_evaluation_results_summary.md    # Human-readable summary
└── all_model_results.csv                  # Comparative analysis
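
The schema of the JSON result files is not described here, so the sketch below makes no assumptions about field names. It only lists the JSON files in a results directory and reports the top-level structure of each one. The function name summarize_results is hypothetical.

```python
import json
from pathlib import Path


def summarize_results(results_dir="evaluation_results"):
    """Map each *.json file name to a short description of its top-level structure."""
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        if isinstance(data, dict):
            # For JSON objects, report the sorted top-level keys.
            summary[path.name] = sorted(data.keys())
        else:
            # For arrays or scalars, report the Python type name.
            summary[path.name] = type(data).__name__
    return summary


if __name__ == "__main__":
    for name, structure in summarize_results().items():
        print(name, "->", structure)
```

Inspecting the top-level keys first is a low-risk way to learn the actual layout before writing any analysis code against it.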

📈 Understanding Results

LoCoBench Score (LCBS)

The unified score (0-5 scale) combines 17 metrics across 4 dimensions:

Software Engineering Excellence (40%): ACS, DTA, CFRD, STS, RS, CS, IS, SES
Functional Correctness (30%): Compilation, Unit Tests, Integration Tests, IDC
Code Quality Assessment (20%): Security Analysis, Code Issues, Style Adherence
Long-Context Utilization (10%): ICU, MMR
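
The dimension weights above can be read as a weighted average. The exact aggregation formula is not spelled out here, so the following sketch assumes that each dimension score is already on the 0-5 scale and that the four scores are combined linearly. The dictionary keys and the function name lcbs are illustrative, not LoCoBench identifiers.

```python
# Dimension weights as listed above; the dictionary keys are illustrative names.
WEIGHTS = {
    "software_engineering": 0.40,
    "functional_correctness": 0.30,
    "code_quality": 0.20,
    "long_context": 0.10,
}


def lcbs(dimension_scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each assumed to lie in [0, 5]."""
    if set(dimension_scores) != set(weights):
        raise ValueError("expected exactly these dimensions: " + ", ".join(weights))
    for name, score in dimension_scores.items():
        if not 0.0 <= score <= 5.0:
            raise ValueError(f"{name} score {score} is outside the 0-5 scale")
    return sum(weights[name] * dimension_scores[name] for name in weights)


example = {
    "software_engineering": 3.0,
    "functional_correctness": 4.0,
    "code_quality": 2.5,
    "long_context": 2.0,
}
print(round(lcbs(example), 2))  # 0.4*3.0 + 0.3*4.0 + 0.2*2.5 + 0.1*2.0 = 3.1
```

Because the weights sum to 1, the combined score stays on the same 0-5 scale as its inputs under this assumption.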

Key Metrics Explained

ACS (Architectural Coherence Score): System-level design consistency
DTA (Dependency Traversal Accuracy): Cross-file reasoning ability
CFRD (Cross-File Reasoning Depth): Multi-file understanding
ICU (Information Coverage Utilization): Effective use of long context
MMR (Multi-Session Memory Retention): Context persistence across sessions

📚 Documentation

Generation Guide: How to generate custom scenarios (Phases 1-4)
Contributing: How to contribute to LoCoBench

📄 Citation

@article{locobench2024,
  title={LoCoBench: A Benchmark for Evaluating Long-Context LLMs in Complex Software Development Tasks},
  author={Qiu, Jielin and others},
  journal={arXiv preprint arXiv:2025.XXXXX},
  year={2025}
}

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

Salesforce AI Research for supporting this research
