ALE-Bench is a benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real-world tasks from the AtCoder Heuristic Contest (AHC), ALE-Bench presents optimization problems (e.g., routing and scheduling) that are computationally hard and for which no efficient exact algorithm is known.
Note: This repository is not an official product of SakanaAI or AtCoder and is therefore not officially supported.
- **Install Docker**: Follow the official instructions at docker.com.
- **Install CairoSVG Dependencies**: Refer to the CairoSVG documentation.

  ```bash
  # Linux
  sudo apt install libcairo2-dev libffi-dev

  # macOS
  brew install cairo libffi pkgconf
  export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig:/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"
  export DYLD_LIBRARY_PATH="/usr/local/lib:/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
  ```

  Note: These paths might vary depending on your macOS version and Homebrew installation. If you encounter issues, verify the correct paths for `cairo` and `libffi` installed by Homebrew.

- **Install Python (3.9 - 3.13) and the ALE-Bench Toolkit**:

  ```bash
  # Install via this GitHub repository
  pip install git+https://github.com/SakanaAI/ALE-Bench.git

  # Or clone this GitHub repository and install locally
  git clone https://github.com/SakanaAI/ALE-Bench.git
  cd ALE-Bench
  pip install .

  # Using uv (recommended for faster environment management)
  git clone https://github.com/SakanaAI/ALE-Bench.git
  cd ALE-Bench
  uv venv --python 3.12.9  # Or any supported Python version (3.9 ~ 3.13)
  uv sync
  source .venv/bin/activate
  ```
- **Build Docker Images**: This script builds the necessary Docker execution images for ALE-Bench. It automatically pulls pre-built base images from Docker Hub (repository: `yimjk/ale-bench`) and then creates local images tagged as `ale-bench:<language>-<version>` with appropriate permissions for your user.

  ```bash
  bash ./scripts/docker_build_all.sh $(id -u) $(id -g)
  ```

  If you prefer to pull all base images beforehand, you can optionally run:

  ```bash
  bash ./scripts/docker_pull_all.sh
  ```
- **[Optional] Download Data via the Hugging Face Repository**:

  ```bash
  # Create a directory for the data
  mkdir -p /tmp/data && cd /tmp/data
  git lfs install
  git clone https://huggingface.co/datasets/SakanaAI/ALE-Bench

  # Set the ALE_BENCH_DATA environment variable to use this local copy.
  # If not set, data will be downloaded on demand using hf_hub_download (default).
  export ALE_BENCH_DATA=/tmp/data/ALE-Bench
  ```

- **[Optional] Install AWS CLI and Terraform**: For cloud-based evaluations, install the AWS CLI and Terraform.
The `Session` object is central to ALE-Bench, encapsulating the state and functionality of an evaluation session on a specific problem. It facilitates input case generation, code execution, visualization, and evaluation.

A session is initiated using the `ale_bench.start()` function:
```python
import ale_bench
import datetime as dt
session = ale_bench.start(
problem_id="ahc001", # Target problem ID
lite_version=False, # Use full dataset (True for a smaller subset)
num_workers=13, # Parallel workers for judging (adjust based on CPU cores)
run_visualization_server=True, # Enable visualization server
visualization_server_port=8080, # Port for the visualization server (None to disable)
session_duration=dt.timedelta(hours=2) # Optional: set a duration for the session
)
```
Key Initialization Parameters for `ale_bench.start()`:

- `problem_id (str)`: The ID of the problem to start a session for. This is a required parameter.
- `lite_version (bool, optional)`: If `True`, uses a smaller "lite" version of seeds and problem data for quicker evaluations. Defaults to `False`.
- `use_same_time_scale (bool, optional)`: If `True`, the session simulates contest time progression (e.g., limiting the frequency of `public_eval` calls to the submission interval of an actual contest). Defaults to `False`.
- `maximum_num_case_gen (int, optional)`: Maximum number of input cases that can be generated using `session.case_gen()` or `session.case_gen_eval()`. Defaults to a very large number (effectively unlimited).
- `maximum_num_case_eval (int, optional)`: Maximum number of input cases that can be evaluated using `session.case_eval()` or `session.case_gen_eval()`. Defaults to a very large number.
- `maximum_execution_time_case_eval (float, optional)`: Cumulative maximum execution time (in seconds) for `session.case_eval()` or `session.case_gen_eval()`. Defaults to a very large number.
- `maximum_num_call_public_eval (int, optional)`: Maximum number of times `session.public_eval()` can be called. Defaults to a very large number (but is overridden by problem-defined limits if `use_same_time_scale` is `True`).
- `session_duration (dt.timedelta | int | float, optional)`: Sets a maximum duration for the entire session. Can be a `datetime.timedelta` object, or seconds as an `int` or `float`. Defaults to `None` (uses the problem's predefined duration).
- `num_workers (int, optional)`: The number of worker processes used to run judge evaluations in parallel. Defaults to `1`.
- `run_visualization_server (bool, optional)`: If `True`, attempts to start a local visualization server for the problem. Defaults to `False`.
- `visualization_server_port (int | None, optional)`: Specifies the port for the visualization server. If `None` and `run_visualization_server` is `True`, a free port between 9000 and 65535 is selected automatically. Defaults to `None`.
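A minimal sketch of a more constrained session that uses only the parameters listed above (the specific limits are illustrative, not recommended defaults):

```python
import ale_bench

# Simulate contest-like conditions with explicit resource limits
constrained_session = ale_bench.start(
    problem_id="ahc001",
    use_same_time_scale=True,         # throttle public_eval as in a real contest
    maximum_num_case_gen=1000,        # cap the number of generated input cases
    maximum_num_call_public_eval=20,  # cap the number of public evaluations
    session_duration=4 * 60 * 60,     # 4 hours, given as seconds
    num_workers=4,
)
```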
Each method is described below with its parameters and return values.
`session.case_gen()` generates input case(s) based on the provided seed(s) and generation arguments.

Parameters:

- `seed (list[int] | int, optional)`: The seed or list of seeds for case generation. Defaults to `0`.
- `gen_kwargs (dict, optional)`: Dictionary of arguments for the case generator. Defaults to an empty dictionary.

Returns:

- `list[str] | str`: The generated case(s) as string(s). Returns a single string if `seed` is an `int`, or a list of strings if `seed` is a `list[int]`.
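A minimal sketch, assuming a session started as above:

```python
# Generate a single input case from seed 0 (returns a str)
single_case = session.case_gen(seed=0)

# Generate several input cases at once (returns a list[str])
batch_cases = session.case_gen(seed=[1, 2, 3])
print(len(batch_cases))  # 3
```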
`session.case_eval()` evaluates the provided code against the given input string(s). This method is intended for local evaluation, allowing users to specify custom time and memory limits.

Parameters:

- `input_str (list[str] | str)`: The input string or list of input strings for the evaluation.
- `code (str)`: The source code to be evaluated.
- `code_language (CodeLanguage | str)`: The programming language of the code. Can be a `CodeLanguage` enum member or its string representation (e.g., "python", "cpp17").
- `judge_version (JudgeVersion | str, optional)`: The version of the judge to use. Defaults to `None` (uses the latest or problem-specific default).
- `time_limit (float, optional)`: Custom time limit for execution in seconds. Defaults to `None` (uses the problem-specific default).
- `memory_limit (int | str, optional)`: Custom memory limit for execution (e.g., `256_000_000` for 256 MB, or `"256m"`). Defaults to `None` (uses the problem-specific default).
- `skip_local_visualization (bool, optional)`: If `True`, skips generating local visualizations even if available. Defaults to `False`.

Returns:

- `Result`: A `Result` object containing the evaluation details, including scores, execution time, and memory usage for each case.
`session.case_gen_eval()` is a convenience method that first generates test case(s) using the specified seeds and generation arguments, and then immediately evaluates the provided code against these newly generated cases.

Parameters:

- `code (str)`: The source code to be evaluated.
- `code_language (CodeLanguage | str)`: The programming language of the code.
- `judge_version (JudgeVersion | str, optional)`: The judge version. Defaults to `None`.
- `seed (list[int] | int, optional)`: Seed(s) for case generation. Defaults to `0`.
- `time_limit (float, optional)`: Custom time limit in seconds. Defaults to `None`.
- `memory_limit (int | str, optional)`: Custom memory limit. Defaults to `None`.
- `gen_kwargs (dict, optional)`: Arguments for the case generator. Defaults to an empty dictionary.
- `skip_local_visualization (bool, optional)`: If `True`, skips local visualizations. Defaults to `False`.

Returns:

- `Result`: A `Result` object with the evaluation outcome.
`session.public_eval()` evaluates the provided code against the predefined set of public test cases for the current problem.

Parameters:

- `code (str)`: The source code to evaluate.
- `code_language (CodeLanguage | str)`: The programming language of the code.
- `judge_version (JudgeVersion | str, optional)`: The judge version. Defaults to `None`.
- `skip_local_visualization (bool, optional)`: If `True`, skips local visualizations. Defaults to `True` for public evaluations.

Returns:

- `Result`: A `Result` object detailing the performance on public test cases.
`session.private_eval()` evaluates the provided code against the predefined set of private test cases. This is typically called during the final evaluation step to determine the official score, rank, and performance.

Parameters:

- `code (str)`: The source code to evaluate.
- `code_language (CodeLanguage | str)`: The programming language of the code.
- `judge_version (JudgeVersion | str, optional)`: The judge version. Defaults to `None`.

Returns (as a tuple):

- `Result`: A `Result` object detailing the performance on private test cases.
- `int`: The new rank achieved with this submission.
- `int`: The new performance score.
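A minimal sketch; note that the session must be started with `lite_version=False` for rank and performance calculation (see the full example below):

```python
private_result, rank, performance = session.private_eval(
    my_code, code_language="python"
)
print(private_result.overall_absolute_score, rank, performance)
```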
`session.save()` saves the current state of the session to a JSON file. This allows the session to be paused and resumed later using `ale_bench.restart(filepath)`.

Parameters:

- `filepath (str | os.PathLike, optional)`: The path where the session file will be saved. Defaults to `"session.json"`.

Returns:

- `None`
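A minimal save/resume sketch:

```python
session.save("my_session.json")

# ... later, possibly in a new process ...
import ale_bench
resumed_session = ale_bench.restart("my_session.json")
```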
`session.close()` terminates the current session and cleans up all associated resources, including stopping the running visualization server and removing the temporary directory used by the current session.

Parameters:

- None

Returns:

- `None`
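A minimal sketch that guarantees cleanup even if evaluation raises an exception (`my_code` is illustrative):

```python
session = ale_bench.start(problem_id="ahc001")
try:
    result = session.case_gen_eval(code=my_code, code_language="python", seed=0)
finally:
    session.close()  # release the visualization server and temporary directory
```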
The `Session` object also exposes the following properties:

- `problem (Problem)`: Accesses the `Problem` object associated with the session, containing details such as the problem statement and constraints.
- `problem_id (str)`: The ID of the current problem.
- `lite_version (bool)`: Indicates whether the session is running in "lite" mode.
- `public_seeds (list[int])`: The list of seeds used for public test cases.
- `num_public_cases (int)`: The number of public test cases.
- `num_private_cases (int)`: The number of private test cases. (Note: `private_seeds` itself is not directly accessible.)
- `tool_dir (Path)`: The directory where tools for the current problem are stored.
- `rust_src_dir (Path)`: Path to the source code of Rust-based tools, if applicable.
- `maximum_resource_usage (ResourceUsage)`: The configured maximum resource limits for the session.
- `current_resource_usage (ResourceUsage)`: The current accumulated resource usage.
- `remaining_resource_usage (ResourceUsage)`: The difference between the maximum and current resource usage.
- `action_log (list[str])`: A log of all actions performed during the session (e.g., `case_gen`, `public_eval`).
- `session_duration (dt.timedelta)`: The total configured duration for the session.
- `session_started_at (dt.datetime)`: Timestamp of when the session was initiated.
- `session_remaining_time (dt.timedelta)`: The time remaining before the session expires.
- `session_finished (bool)`: Returns `True` if the session has concluded (by either the time or resource limits).
- `run_visualization_server (bool)`: Indicates whether the visualization server is active.
- `visualization_server_port (int | None)`: The port number of the visualization server, or `None` if it is not running.
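A minimal sketch that inspects a few of these properties:

```python
print(session.problem_id)                # e.g., "ahc001"
print(session.num_public_cases)          # number of public test cases
print(session.session_remaining_time)    # datetime.timedelta until the session expires
print(session.remaining_resource_usage)  # remaining generation/evaluation budget
```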
For fair and reproducible performance comparisons, we strongly recommend running evaluations on a consistent, specified AWS instance (e.g., `c6i.32xlarge`).
```python
import ale_bench
import ale_bench.utils
import datetime as dt
# Start a new evaluation session
session = ale_bench.start(
problem_id="ahc001",
lite_version=False,
num_workers=13, # Adjust based on your machine's physical cores
run_visualization_server=True,
visualization_server_port=8080
)
# NOTE: While the `session` object contains attributes like `private_seeds`,
# `rank_performance_map`, and `standings`, these (and any other attributes
# prefixed with an underscore, e.g., `_private_inputs`) MUST NOT be accessed
# or used during your experiment to ensure fair evaluation.
# Access problem details
problem = session.problem
problem_statement_md = problem.statement # Markdown-formatted problem statement
problem_images = problem.statement_images # Associated images
problem_constraints_obj = problem.constraints # Structured constraints
# --- Your Agent's Logic Begins ---
# Example: Constructing an initial prompt for an LLM/LMM
# (Replace with your agent's prompt engineering)
initial_messages = my_agent.construct_initial_prompt(
problem_statement_md,
problem_images,
problem_constraints_obj
)
# Utility for parsing problem statements (e.g., for OpenAI models)
parsed_content = ale_bench.utils.parse_statement(
problem_statement_md, problem_images, return_openai=True
)
# Obtain a solution from your LLM/LMM agent
agent_response = my_agent.get_llm_response(initial_messages)
extracted_code = my_agent.parse_code_from_response(agent_response)
detected_language = my_agent.detect_code_language(extracted_code)
# Ensure detected_language is one of: "cpp17", "cpp20", "cpp23", "python", "rust"
# Evaluate against public test cases
public_result = session.public_eval(extracted_code, code_language=detected_language)
print(f"Initial Public Score: {public_result.overall_absolute_score}")
# Iterative refinement loop (example)
solution_attempts = [(extracted_code, public_result)]
current_best_code = extracted_code
# Define your maximum refinement iterations, e.g., MAX_REFINEMENT_ITERATIONS = 5
for i in range(MAX_REFINEMENT_ITERATIONS):
feedback_prompt = my_agent.construct_feedback_prompt(
problem, current_best_code, public_result
)
refined_response = my_agent.get_llm_response(feedback_prompt)
refined_code = my_agent.parse_code_from_response(refined_response)
if refined_code: # Agent might not always produce new code
public_result = session.public_eval(refined_code, code_language=detected_language)
solution_attempts.append((refined_code, public_result))
# Update current_best_code based on problem's score type (minimize/maximize)
# (Implementation depends on your agent's strategy)
current_best_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
else:
print(f"Iteration {i+1}: No new code generated.")
break # Or implement other logic like re-prompting
# Select the final submission based on overall public performance
final_submission_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
# --- Your Agent's Logic Ends ---
# Evaluate the final submission against private test cases
# Ensure `lite_version=False` during session start for rank and performance calculation.
private_result, final_rank, final_performance = session.private_eval(
final_submission_code, code_language=detected_language
)
print(f"Final Private Score: {private_result.overall_absolute_score}")
print(f"Rank: {final_rank}, Performance: {final_performance}")
# Monitor resource consumption
print(f"Current Resource Usage: {session.current_resource_usage}")
print(f"Remaining Resources: {session.remaining_resource_usage}")
# Inspect local Rust tool sources (if applicable)
if session.problem.metadata.problem_type == "reactive": # Example condition
ale_bench.utils.print_dir_tree(session.rust_src_dir)
# Persist session state for later analysis or resumption
session.save("my_ahc001_session.json")
# Explicitly close the session to release resources
session.close()
# To resume a saved session:
# resumed_session = ale_bench.restart("/path/to/my_ahc001_session.json")
# To clear all cached ALE-Bench data (problem data, toolchains):
# ale_bench.clear_cache()
```
ALE-Bench also provides utilities for calculating ratings and rankings based on contest performance.
The `RatingCalculator` class helps estimate a user's rating based on their performance in various contests. It uses a formula similar to the one described in the official AHC rating document.
Initialization:
```python
from ale_bench.data import RatingCalculator

rating_calculator = RatingCalculator()
```
Core Method: `calculate_rating`
Calculates the rating based on a dictionary of performances and the ID of the final contest considered.
Parameters:

- `performances (dict[str, int])`: A dictionary where keys are problem IDs (e.g., "ahc001") and values are the performance scores achieved in those problems.
- `final_contest (str)`: The problem ID of the last contest to be included in the rating calculation. Performances from contests ending after this contest's end date are ignored.
Returns:

- `int`: The calculated rating, rounded to the nearest integer.
Example:
```python
performances = {
    "ahc001": 2000,
    "ahc002": 2200,
    "ahc003": 1800
}

# Assuming ahc003 is the latest contest to consider for this rating calculation
final_rating = rating_calculator.calculate_rating(performances, "ahc003")
print(f"Calculated Rating: {final_rating}")
```
The `RankingCalculator` class allows you to determine a user's rank based on their average performance or overall rating, compared against a pre-compiled dataset of existing user rankings. This dataset is automatically downloaded from the Hugging Face Hub.
Initialization:
```python
from ale_bench.data import RankingCalculator

# Initialize with a minimum number of contest participations required
# to be included in the ranking pool (default is 5)
ranking_calculator = RankingCalculator(minimum_participation=5)
```
Core Methods:
`calculate_avg_perf_rank` calculates the rank based on average performance.

Parameters:

- `avg_perf (float)`: The average performance score.

Returns:

- `int`: The calculated rank. Lower is better.
`calculate_rating_rank` calculates the rank based on an overall rating.

Parameters:

- `rating (int)`: The overall rating.

Returns:

- `int`: The calculated rank. Lower is better.
Example:
```python
# Example average performance and rating
my_avg_performance = 2150.75
my_rating = 2345

avg_perf_rank = ranking_calculator.calculate_avg_perf_rank(my_avg_performance)
rating_rank = ranking_calculator.calculate_rating_rank(my_rating)

print(f"Rank based on Average Performance ({my_avg_performance}): {avg_perf_rank}")
print(f"Rank based on Rating ({my_rating}): {rating_rank}")
```
For standardized benchmarking, we leverage Terraform by HashiCorp to provision a consistent AWS environment.
Consult the Terraform configuration, variables file, and setup script for detailed information. We recommend the use of Amazon EC2 C6i Instances for optimal performance.
Recommended `num_workers` (parallel evaluations): Set `num_workers` to at most the number of physical cores of your instance, as most solutions are CPU-bound.
| Instance | vCPUs | Memory (GiB) | Max `num_workers` | Max `num_workers` (w/ Vis Server) |
|---|---|---|---|---|
| `c6i.xlarge` | 4 | 8 | 2 | 1 |
| `c6i.2xlarge` | 8 | 16 | 4 | 3 |
| `c6i.4xlarge` | 16 | 32 | 8 | 7 |
| `c6i.8xlarge` | 32 | 64 | 16 | 15 |
| `c6i.12xlarge` | 48 | 96 | 24 | 23 |
| `c6i.16xlarge` | 64 | 128 | 32 | 31 |
| `c6i.24xlarge` | 96 | 192 | 48 | 47 |
| `c6i.32xlarge` | 128 | 256 | 64 | 63 |
| `c6i.metal` | 128 | 256 | 64 | 63 |
Workflow:
- **Initialize & Apply Terraform**: Before applying, ensure you have an AWS key pair ready. You will need to provide the path to your public key. It is also highly recommended to restrict SSH access to your IP address for security.

  ```bash
  cd cloud
  terraform init
  terraform apply \
    -var "ssh_public_key_path=</path/to/your/key.pub>" \
    -var "aws_key_name=your-key-pair-name" \
    -var "allowed_ssh_cidr=YOUR_IP_ADDRESS/32" \
    -var "instance_type=c6i.32xlarge" \
    -var "region=us-east-1"
  # Replace </path/to/your/key.pub> with the actual path to your public key.
  # Replace your-key-pair-name with a unique name for the key pair that will be created in AWS.
  # Replace YOUR_IP_ADDRESS/32 with your actual public IP address CIDR (e.g., 123.45.67.89/32).
  # Confirm with 'yes' or use -auto-approve.
  ```

  The `aws_key_name` variable specifies the name of the key pair to be created in AWS using your provided public key. Ensure the corresponding private key is used for SSH access.

- **Connect to the EC2 Instance**:

  ```bash
  ssh -i /path/to/your/key.pem ubuntu@<INSTANCE_PUBLIC_IP>
  ```

- **Verify Setup**:

  ```bash
  # Check cloud-init logs for successful completion (this setup takes ~10-20 minutes)
  cat /var/log/cloud-init-output.log  # Look for "ALE-Bench setup completed!" in green text

  # Confirm the ALE-Bench directory and activate the virtual environment
  ls /home/ubuntu/ALE-Bench
  source /home/ubuntu/ALE-Bench/.venv/bin/activate
  ```

- **Terminate Instance**:

  ```bash
  cd cloud  # (Run from your local machine, where the Terraform state is located)
  terraform destroy -var "ssh_public_key_path=</path/to/your/key.pub>" -var "region=us-east-1"
  # Confirm with 'yes' or use -auto-approve
  ```
- **Environment Setup**:

  ```bash
  git clone https://github.com/SakanaAI/ALE-Bench.git
  cd ALE-Bench
  pip install ".[dev]"

  # Using uv
  uv venv --python 3.12.9
  uv sync --extra dev
  source .venv/bin/activate
  ```

- **Docker Image Management**:

  ```bash
  # Build a base image (see scripts/docker_build_base_all.sh)
  # Specify --platform linux/amd64 if building on ARM for x86 compatibility
  docker build ./dockerfiles -t yimjk/ale-bench:python-202301-base -f ./dockerfiles/Dockerfile_python_202301_base

  # Push to Docker Hub (see scripts/docker_push_all.sh)
  docker image push yimjk/ale-bench:python-202301-base

  # Build a user-specific image with correct permissions (see scripts/docker_build_all.sh)
  docker build ./dockerfiles -t ale-bench:python-202301 -f ./dockerfiles/Dockerfile_python_202301 --build-arg UID=$(id -u) --build-arg GID=$(id -g)
  ```

  Note: When pushing to Docker Hub, please change the image tag prefix `yimjk/` to your own username or organization name as appropriate (e.g., `your-username/ale-bench` or `your-organization/ale-bench`).

- **Python Library Development**:

  ```bash
  # Linting
  ruff check src tests

  # Formatting
  ruff format src tests

  # Static Type Checking
  mypy src tests

  # Running Tests
  pytest
  pytest -m "not docker"  # Exclude tests requiring Docker
  ```
Please cite ALE-Bench as follows:
```bibtex
@misc{imajuku2025ale-bench,
    title = {{ALE-Bench}: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering},
    author = {Imajuku, Yuki and Horie, Kohki and Iwata, Yoichi and Aoki, Kensho and Takahashi, Naohiro and Akiba, Takuya},
    url = {https://github.com/SakanaAI/ALE-Bench},
    year = {2025}
}
```