ALE-Bench


ALE-Bench is a benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real-world tasks from the AtCoder Heuristic Contest (AHC), ALE-Bench presents optimization problems (e.g., routing and scheduling) that are computationally hard and admit no known exact solution.

Note: This repository is not an official product of SakanaAI or AtCoder and is therefore not officially supported.

Setup

  1. Install Docker: Follow the official instructions at docker.com.

  2. Install CairoSVG Dependencies: Refer to the CairoSVG documentation.

    # Linux
    sudo apt install libcairo2-dev libffi-dev
    # macOS
    brew install cairo libffi pkgconf
    export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig:/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"
    export DYLD_LIBRARY_PATH="/usr/local/lib:/opt/homebrew/lib:$DYLD_LIBRARY_PATH"

    Note: These paths might vary depending on your macOS version and Homebrew installation. If you encounter issues, verify the correct paths for cairo and libffi installed by Homebrew.

  3. Install Python (3.9 - 3.13) and ALE-Bench Toolkit:

    # Install via this GitHub repository
    pip install git+https://github.com/SakanaAI/ALE-Bench.git
    
    # Or clone this GitHub repository and install locally
    git clone https://github.com/SakanaAI/ALE-Bench.git
    cd ALE-Bench
    pip install .
    
    # Using uv (recommended for faster environment management)
    git clone https://github.com/SakanaAI/ALE-Bench.git
    cd ALE-Bench
    uv venv --python 3.12.9  # Or any supported Python version (3.9 ~ 3.13)
    uv sync
    source .venv/bin/activate
  4. Build Docker Images: This script will build the necessary Docker execution images for ALE-Bench. It automatically pulls pre-built base images from Docker Hub (repository: yimjk/ale-bench) and then creates local images tagged as ale-bench:<language>-<version> with appropriate permissions for your user.

    bash ./scripts/docker_build_all.sh $(id -u) $(id -g)

    If you prefer to pull all base images beforehand, you can optionally run:

    bash ./scripts/docker_pull_all.sh
  5. [Optional] Download Data via Hugging Face Repository:

    # Create a directory for the data
    mkdir -p /tmp/data && cd /tmp/data
    git lfs install
    git clone https://huggingface.co/datasets/SakanaAI/ALE-Bench
    # Set the ALE_BENCH_DATA environment variable to use this local copy.
    # If not set, data will be downloaded on demand using hf_hub_download (default).
    export ALE_BENCH_DATA=/tmp/data/ALE-Bench
  6. [Optional] Install AWS CLI and Terraform: For cloud-based evaluations, install the AWS CLI and Terraform.

The Session Object

The Session object is central to ALE-Bench, encapsulating the state and functionalities for an evaluation session on a specific problem. It facilitates input case generation, code execution, visualization, and evaluation.

Initialization

A session is initiated using the ale_bench.start() function:

import ale_bench
import datetime as dt

session = ale_bench.start(
    problem_id="ahc001",              # Target problem ID
    lite_version=False,               # Use full dataset (True for a smaller subset)
    num_workers=13,                   # Parallel workers for judging (adjust based on CPU cores)
    run_visualization_server=True,    # Enable visualization server
    visualization_server_port=8080,   # Port for the visualization server (None to disable)
    session_duration=dt.timedelta(hours=2) # Optional: set a duration for the session
)

Key Initialization Parameters for ale_bench.start():

  • problem_id (str): The ID of the problem to start a session for. This is a required parameter.
  • lite_version (bool, optional): If True, uses a smaller "lite" version of seeds and problem data for quicker evaluations. Defaults to False.
  • use_same_time_scale (bool, optional): If True, the session simulates contest-time progression (e.g., limiting the frequency of public_eval calls to match the submission interval of an actual contest). Defaults to False.
  • maximum_num_case_gen (int, optional): Maximum number of input cases that can be generated using session.case_gen() or session.case_gen_eval(). Defaults to a very large number (effectively unlimited).
  • maximum_num_case_eval (int, optional): Maximum number of input cases that can be evaluated using session.case_eval() or session.case_gen_eval(). Defaults to a very large number.
  • maximum_execution_time_case_eval (float, optional): Maximum cumulative execution time (in seconds) consumed across all session.case_eval() and session.case_gen_eval() calls. Defaults to a very large number.
  • maximum_num_call_public_eval (int, optional): Maximum number of times session.public_eval() can be called. Defaults to a very large number (but is overridden by problem-defined limits if use_same_time_scale is True).
  • session_duration (dt.timedelta | int | float, optional): Sets a maximum duration for the entire session. Can be a datetime.timedelta object, or seconds as an int or float. Defaults to None (uses the problem's predefined duration).
  • num_workers (int, optional): The number of worker processes to use for running judge evaluations in parallel. Defaults to 1.
  • run_visualization_server (bool, optional): If True, attempts to start a local visualization server for the problem. Defaults to False.
  • visualization_server_port (int | None, optional): Specifies the port for the visualization server. If None and run_visualization_server is True, a free port between 9000-65535 will be automatically selected. Defaults to None.
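
For reference, a minimal sketch of a budget-constrained session using the parameters above (the specific limits are illustrative, not recommended values):

import ale_bench
import datetime as dt

session = ale_bench.start(
    problem_id="ahc001",
    lite_version=True,                       # smaller seed/problem data for quick runs
    use_same_time_scale=True,                # pace public_eval calls like a real contest
    maximum_num_case_gen=1_000,              # cap on generated input cases
    maximum_num_case_eval=1_000,             # cap on locally evaluated input cases
    maximum_num_call_public_eval=20,         # cap on public_eval submissions
    session_duration=dt.timedelta(hours=4),  # overall wall-clock budget
    num_workers=4,
)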

Core Methods

Each method is described below with its parameters and return values.

case_gen

Generates input case(s) based on the provided seed(s) and generation arguments.

Parameters:

  • seed (list[int] | int, optional): The seed or list of seeds for case generation. Defaults to 0.
  • gen_kwargs (dict, optional): Dictionary of arguments for the case generator. Defaults to an empty dictionary.

Returns:

  • list[str] | str: The generated case(s) as string(s). Returns a single string if seed is an int, or a list of strings if seed is a list[int].
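
Example (a minimal sketch, assuming a session started as shown above; the seeds are arbitrary):

# A single seed returns a single input string
single_input = session.case_gen(seed=0)

# A list of seeds returns a list of input strings in the same order
batch_inputs = session.case_gen(seed=[1, 2, 3])
assert isinstance(single_input, str) and len(batch_inputs) == 3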

case_eval

Evaluates the provided code against the given input string(s). This method is intended for local evaluation, allowing users to specify custom time and memory limits.

Parameters:

  • input_str (list[str] | str): The input string or list of input strings for the evaluation.
  • code (str): The source code to be evaluated.
  • code_language (CodeLanguage | str): The programming language of the code. Can be a CodeLanguage enum member or its string representation (e.g., "python", "cpp17").
  • judge_version (JudgeVersion | str, optional): The version of the judge to use. Defaults to None (uses the latest or problem-specific default).
  • time_limit (float, optional): Custom time limit for execution in seconds. Defaults to None (uses problem-specific default).
  • memory_limit (int | str, optional): Custom memory limit for execution (e.g., 256_000_000 for 256MB, or "256m"). Defaults to None (uses problem-specific default).
  • skip_local_visualization (bool, optional): If True, skips generating local visualizations even if available. Defaults to False.

Returns:

  • Result: A Result object containing the evaluation details, including scores, execution time, and memory usage for each case.
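
Example (a minimal sketch; my_code is a placeholder solution string, the custom limits are illustrative, and the score attribute follows the Result usage shown in the Evaluation Example below):

input_str = session.case_gen(seed=0)
result = session.case_eval(
    input_str=input_str,
    code=my_code,            # your solution source code as a string
    code_language="python",
    time_limit=4.0,          # custom time limit in seconds
    memory_limit="256m",     # custom memory limit
)
print(result.overall_absolute_score)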

case_gen_eval

A convenience method that first generates test case(s) using specified seeds and generation arguments, and then immediately evaluates the provided code against these newly generated cases.

Parameters:

  • code (str): The source code to be evaluated.
  • code_language (CodeLanguage | str): The programming language of the code.
  • judge_version (JudgeVersion | str, optional): The judge version. Defaults to None.
  • seed (list[int] | int, optional): Seed(s) for case generation. Defaults to 0.
  • time_limit (float, optional): Custom time limit in seconds. Defaults to None.
  • memory_limit (int | str, optional): Custom memory limit. Defaults to None.
  • gen_kwargs (dict, optional): Arguments for the case generator. Defaults to an empty dictionary.
  • skip_local_visualization (bool, optional): If True, skips local visualizations. Defaults to False.

Returns:

  • Result: A Result object with the evaluation outcome.
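
Example (a minimal sketch; my_code is a placeholder solution string):

# Generate cases for seeds 0-9 and evaluate the code on them in one call
result = session.case_gen_eval(
    code=my_code,
    code_language="cpp20",
    seed=list(range(10)),
)
print(result.overall_absolute_score)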

public_eval

Evaluates the provided code against the predefined set of public test cases for the current problem.

Parameters:

  • code (str): The source code to evaluate.
  • code_language (CodeLanguage | str): The programming language of the code.
  • judge_version (JudgeVersion | str, optional): The judge version. Defaults to None.
  • skip_local_visualization (bool, optional): If True, skips local visualizations. Defaults to True for public evaluations.

Returns:

  • Result: A Result object detailing the performance on public test cases.
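
Example (a minimal sketch; my_code is a placeholder solution string):

public_result = session.public_eval(my_code, code_language="python")
print(public_result.overall_absolute_score)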

private_eval

Evaluates the provided code against the predefined set of private test cases. This is typically called during the final evaluation step to determine the official score, rank, and performance.

Parameters:

  • code (str): The source code to evaluate.
  • code_language (CodeLanguage | str): The programming language of the code.
  • judge_version (JudgeVersion | str, optional): The judge version. Defaults to None.

Returns:

  • Result: A Result object detailing the performance on private test cases.
  • int: The new rank achieved with this submission.
  • int: The new performance score.
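
Example (a minimal sketch; my_code is a placeholder solution string):

private_result, new_rank, new_performance = session.private_eval(
    my_code, code_language="python"
)
print(private_result.overall_absolute_score, new_rank, new_performance)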

save

Saves the current state of the session to a JSON file. This allows the session to be paused and resumed later using ale_bench.restart(filepath).

Parameters:

  • filepath (str | os.PathLike, optional): The path where the session file will be saved. Defaults to "session.json".

Returns:

  • None
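
Example (a minimal sketch; the filename is illustrative):

session.save("ahc001_checkpoint.json")
# ...later, possibly in a new process...
session = ale_bench.restart("ahc001_checkpoint.json")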

close

Terminates the current session and cleans up all associated resources. This includes stopping the running visualization server and removing the temporary directory used by the current session.

Parameters:

  • None

Returns:

  • None

Key Properties

  • problem (Problem): Accesses the Problem object associated with the session, containing details such as the problem statement and constraints.
  • problem_id (str): The ID of the current problem.
  • lite_version (bool): Indicates if the session is running in "lite" mode.
  • public_seeds (list[int]): The list of seeds used for public test cases.
  • num_public_cases (int): The number of public test cases.
  • num_private_cases (int): The number of private test cases. (Note: private_seeds itself is not directly accessible).
  • tool_dir (Path): The directory where tools for the current problem are stored.
  • rust_src_dir (Path): Path to the source code of Rust-based tools, if applicable.
  • maximum_resource_usage (ResourceUsage): The configured maximum resource limits for the session.
  • current_resource_usage (ResourceUsage): The current accumulated resource usage.
  • remaining_resource_usage (ResourceUsage): The difference between maximum and current resource usage.
  • action_log (list[str]): A log of all actions performed during the session (e.g., case_gen, public_eval).
  • session_duration (dt.timedelta): The total configured duration for the session.
  • session_started_at (dt.datetime): Timestamp of when the session was initiated.
  • session_remaining_time (dt.timedelta): The time remaining before the session expires.
  • session_finished (bool): Returns True if the session has concluded (either by time or resource limits).
  • run_visualization_server (bool): Indicates whether the visualization server is active.
  • visualization_server_port (int | None): The port number of the visualization server, or None if not running.
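
For reference, a short sketch that reads a few of these properties on a live session:

print(session.problem_id, session.lite_version)
print(f"{session.num_public_cases} public / {session.num_private_cases} private cases")
print(f"Time remaining: {session.session_remaining_time}")
print(f"Resources remaining: {session.remaining_resource_usage}")
if session.run_visualization_server:
    print(f"Visualization server port: {session.visualization_server_port}")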

Evaluation Example

For fair and reproducible performance comparisons, we strongly recommend running evaluations on a consistent, specified AWS instance (e.g., c6i.32xlarge).

import ale_bench
import ale_bench.utils
import datetime as dt

# Start a new evaluation session
session = ale_bench.start(
    problem_id="ahc001",
    lite_version=False,
    num_workers=13,  # Adjust based on your machine's physical cores
    run_visualization_server=True,
    visualization_server_port=8080
)

# NOTE: While the `session` object contains attributes like `private_seeds`,
# `rank_performance_map`, and `standings`, these (and any other attributes
# prefixed with an underscore, e.g., `_private_inputs`) MUST NOT be accessed
# or used during your experiment to ensure fair evaluation.

# Access problem details
problem = session.problem
problem_statement_md = problem.statement  # Markdown-formatted problem statement
problem_images = problem.statement_images  # Associated images
problem_constraints_obj = problem.constraints  # Structured constraints

# --- Your Agent's Logic Begins ---

# Example: Constructing an initial prompt for an LLM/LMM
# (Replace with your agent's prompt engineering)
initial_messages = my_agent.construct_initial_prompt(
    problem_statement_md,
    problem_images,
    problem_constraints_obj
)

# Utility for parsing problem statements (e.g., for OpenAI models)
parsed_content = ale_bench.utils.parse_statement(
    problem_statement_md, problem_images, return_openai=True
)

# Obtain a solution from your LLM/LMM agent
agent_response = my_agent.get_llm_response(initial_messages)
extracted_code = my_agent.parse_code_from_response(agent_response)
detected_language = my_agent.detect_code_language(extracted_code)
# Ensure detected_language is one of: "cpp17", "cpp20", "cpp23", "python", "rust"

# Evaluate against public test cases
public_result = session.public_eval(extracted_code, code_language=detected_language)
print(f"Initial Public Score: {public_result.overall_absolute_score}")

# Iterative refinement loop (example)
solution_attempts = [(extracted_code, public_result)]
current_best_code = extracted_code

# Define your maximum refinement iterations, e.g., MAX_REFINEMENT_ITERATIONS = 5
for i in range(MAX_REFINEMENT_ITERATIONS):
    feedback_prompt = my_agent.construct_feedback_prompt(
        problem, current_best_code, public_result
    )
    refined_response = my_agent.get_llm_response(feedback_prompt)
    refined_code = my_agent.parse_code_from_response(refined_response)

    if refined_code: # Agent might not always produce new code
        public_result = session.public_eval(refined_code, code_language=detected_language)
        solution_attempts.append((refined_code, public_result))
        # Update current_best_code based on problem's score type (minimize/maximize)
        # (Implementation depends on your agent's strategy)
        current_best_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
    else:
        print(f"Iteration {i+1}: No new code generated.")
        break # Or implement other logic like re-prompting

# Select the final submission based on overall public performance
final_submission_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)

# --- Your Agent's Logic Ends ---

# Evaluate the final submission against private test cases
# Ensure `lite_version=False` during session start for rank and performance calculation.
private_result, final_rank, final_performance = session.private_eval(
    final_submission_code, code_language=detected_language
)
print(f"Final Private Score: {private_result.overall_absolute_score}")
print(f"Rank: {final_rank}, Performance: {final_performance}")

# Monitor resource consumption
print(f"Current Resource Usage: {session.current_resource_usage}")
print(f"Remaining Resources: {session.remaining_resource_usage}")

# Inspect local Rust tool sources (if applicable)
if session.problem.metadata.problem_type == "reactive": # Example condition
    ale_bench.utils.print_dir_tree(session.rust_src_dir)

# Persist session state for later analysis or resumption
session.save("my_ahc001_session.json")

# Explicitly close the session to release resources
session.close()

# To resume a saved session:
# resumed_session = ale_bench.restart("/path/to/my_ahc001_session.json")

# To clear all cached ALE-Bench data (problem data, toolchains):
# ale_bench.clear_cache()

RatingCalculator and RankingCalculator

ALE-Bench also provides utilities for calculating ratings and rankings based on contest performance.

RatingCalculator

The RatingCalculator class helps estimate a user's rating based on their performance in various contests. It uses a formula similar to the one described in the official AHC rating document.

Initialization:

from ale_bench.data import RatingCalculator

rating_calculator = RatingCalculator()

Core Method: calculate_rating

Calculates the rating based on a dictionary of performances and the ID of the final contest considered.

Parameters:

  • performances (dict[str, int]): A dictionary where keys are problem IDs (e.g., "ahc001") and values are the performance scores achieved in those problems.
  • final_contest (str): The problem ID of the last contest to include in the rating calculation. Performances from contests held after this one are ignored.

Returns:

  • int: The calculated rating, rounded to the nearest integer.

Example:

performances = {
    "ahc001": 2000,
    "ahc002": 2200,
    "ahc003": 1800
}
# Assuming ahc003 is the latest contest to consider for this rating calculation
final_rating = rating_calculator.calculate_rating(performances, "ahc003")
print(f"Calculated Rating: {final_rating}")

RankingCalculator

The RankingCalculator class allows you to determine a user's rank based on their average performance or overall rating, compared against a pre-compiled dataset of existing user rankings. This dataset is automatically downloaded from the Hugging Face Hub.

Initialization:

from ale_bench.data import RankingCalculator

# Initialize with a minimum number of contest participations to be included in the ranking pool
# (default is 5)
ranking_calculator = RankingCalculator(minimum_participation=5)

Core Methods:

calculate_avg_perf_rank

Calculates the rank based on average performance.

Parameters:

  • avg_perf (float): The average performance score.

Returns:

  • int: The calculated rank. Lower is better.

calculate_rating_rank

Calculates the rank based on an overall rating.

Parameters:

  • rating (int): The overall rating.

Returns:

  • int: The calculated rank. Lower is better.

Example:

# Example average performance and rating
my_avg_performance = 2150.75
my_rating = 2345

avg_perf_rank = ranking_calculator.calculate_avg_perf_rank(my_avg_performance)
rating_rank = ranking_calculator.calculate_rating_rank(my_rating)

print(f"Rank based on Average Performance ({my_avg_performance}): {avg_perf_rank}")
print(f"Rank based on Rating ({my_rating}): {rating_rank}")

Cloud Evaluation with AWS

For standardized benchmarking, we leverage Terraform by HashiCorp to provision a consistent AWS environment.

Consult the Terraform configuration, variables file, and setup script for detailed information. We recommend the use of Amazon EC2 C6i Instances for optimal performance.

Recommended num_workers (parallel evaluations): Set num_workers to at most the number of physical cores of your instance, as most solutions are CPU-bound; see the table below, followed by a small sketch for computing a value on an arbitrary machine.

Instance      vCPUs  Memory (GiB)  Max num_workers  Max num_workers (w/ Vis Server)
c6i.xlarge        4             8                2                                1
c6i.2xlarge       8            16                4                                3
c6i.4xlarge      16            32                8                                7
c6i.8xlarge      32            64               16                               15
c6i.12xlarge     48            96               24                               23
c6i.16xlarge     64           128               32                               31
c6i.24xlarge     96           192               48                               47
c6i.32xlarge    128           256               64                               63
c6i.metal       128           256               64                               63
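
A small sketch for picking num_workers on the current machine (psutil is not an ALE-Bench requirement and is assumed to be installed separately; reserving one core for the visualization server mirrors the table above):

import os
import psutil

run_vis_server = True
physical_cores = psutil.cpu_count(logical=False) or os.cpu_count() or 1
# Reserve one physical core for the visualization server, if it is running.
num_workers = max(1, physical_cores - (1 if run_vis_server else 0))
print(f"num_workers = {num_workers}")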

Workflow:

  1. Initialize & Apply Terraform: Before applying, ensure you have an AWS key pair ready. You will need to provide the path to your public key. It is also highly recommended to restrict SSH access to your IP address for security.

    cd cloud
    terraform init
    terraform apply \
      -var "ssh_public_key_path=</path/to/your/key.pub>" \
      -var "aws_key_name=your-key-pair-name" \
      -var "allowed_ssh_cidr=YOUR_IP_ADDRESS/32" \
      -var "instance_type=c6i.32xlarge" \
      -var "region=us-east-1"
    # Replace </path/to/your/key.pub> with the actual path to your public key.
    # Replace your-key-pair-name with a unique name for the key pair that will be created in AWS.
    # Replace YOUR_IP_ADDRESS/32 with your actual public IP address CIDR (e.g., 123.45.67.89/32).
    # Confirm with 'yes' or use -auto-approve.

    The aws_key_name variable specifies the name of the key pair to be created in AWS using your provided public key. Ensure the corresponding private key is used for SSH access.

  2. Connect to the EC2 Instance:

    ssh -i /path/to/your/key.pem ubuntu@<INSTANCE_PUBLIC_IP>
  3. Verify Setup:

    # Check cloud-init logs for successful completion (This setup takes ~10-20 minutes)
    cat /var/log/cloud-init-output.log
    # Look for "ALE-Bench setup completed!" in green text
    
    # Confirm ALE-Bench directory and activate virtual environment
    ls /home/ubuntu/ALE-Bench
    source /home/ubuntu/ALE-Bench/.venv/bin/activate
  4. Terminate Instance:

    cd cloud # (Run from your local machine, where the terraform state is located)
    terraform destroy -var "ssh_public_key_path=</path/to/your/key.pub>" -var "region=us-east-1"
    # Confirm with 'yes' or use -auto-approve

Development

  • Environment Setup:

    git clone https://github.com/SakanaAI/ALE-Bench.git
    cd ALE-Bench
    pip install ".[dev]"
    
    # Using uv
    uv venv --python 3.12.9
    uv sync --extra dev
    source .venv/bin/activate
  • Docker Image Management:

    # Build a base image (see scripts/docker_build_base_all.sh)
    # Specify --platform linux/amd64 if building on ARM for x86 compatibility
    docker build ./dockerfiles -t yimjk/ale-bench:python-202301-base -f ./dockerfiles/Dockerfile_python_202301_base
    
    # Push to Docker Hub (see scripts/docker_push_all.sh)
    docker image push yimjk/ale-bench:python-202301-base
    
    # Build a user-specific image with correct permissions (see scripts/docker_build_all.sh)
    docker build ./dockerfiles -t ale-bench:python-202301 -f ./dockerfiles/Dockerfile_python_202301 --build-arg UID=$(id -u) --build-arg GID=$(id -g)

    Note: When pushing to Docker Hub, please change the image tag prefix yimjk/ to your own username or organization name as appropriate (e.g., your-username/ale-bench or your-organization/ale-bench).

  • Python Library Development:

    # Linting
    ruff check src tests
    
    # Formatting
    ruff format src tests
    
    # Static Type Checking
    mypy src tests
    
    # Running Tests
    pytest
    pytest -m "not docker"  # Exclude tests requiring Docker

Citation

Please cite ALE-Bench as follows:

@misc{imajuku2025ale-bench,
    title = {{ALE-Bench}: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering},
    author = {Imajuku, Yuki and Horie, Kohki and Iwata, Yoichi and Aoki, Kensho and Takahashi, Naohiro and Akiba, Takuya},
    url = {https://github.com/SakanaAI/ALE-Bench},
    year = {2025}
}
