ALE-Bench is a benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real-world tasks from the AtCoder Heuristic Contest (AHC), ALE-Bench presents optimization problems (e.g., routing and scheduling) that are computationally hard and admit no known exact solution.
Note: This repository is not an official product of SakanaAI or AtCoder and is therefore not officially supported.
Important: Please do not use this repository to participate in AHCs (AtCoder Heuristic Contest Generative AI Usage Rules - Version 20250616).
(Overview video: `ale_bench_overview.mp4`)
- Install Docker: Follow the official instructions at docker.com.
- Install CairoSVG Dependencies: Refer to the CairoSVG documentation.

  ```bash
  # Linux
  sudo apt install libcairo2-dev libffi-dev

  # macOS
  brew install cairo libffi pkgconf
  export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig:/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"
  export DYLD_LIBRARY_PATH="/usr/local/lib:/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
  ```

  Note: These paths might vary depending on your macOS version and Homebrew installation. If you encounter issues, verify the correct paths for `cairo` and `libffi` installed by Homebrew.
- Install Python (3.9 - 3.13) and ALE-Bench Toolkit:

  ```bash
  # Install via this GitHub repository
  pip install git+https://github.com/SakanaAI/ALE-Bench.git

  # Or clone this GitHub repository and install locally
  git clone https://github.com/SakanaAI/ALE-Bench.git
  cd ALE-Bench
  pip install .

  # Using uv (recommended for faster environment management)
  git clone https://github.com/SakanaAI/ALE-Bench.git
  cd ALE-Bench
  uv venv --python 3.12.9  # Or any supported Python version (3.9 ~ 3.13)
  uv sync
  source .venv/bin/activate
  ```
- Build Docker Images: This script builds the necessary Docker execution images for ALE-Bench. It automatically pulls pre-built base images from Docker Hub (repository: `yimjk/ale-bench`) and then creates local images tagged as `ale-bench:<language>-<version>` with appropriate permissions for your user.

  ```bash
  bash ./scripts/docker_build_all.sh $(id -u) $(id -g)
  ```

  If you prefer to pull all base images beforehand, you can optionally run:

  ```bash
  bash ./scripts/docker_pull_all.sh
  ```
- [Optional] Download Data via Hugging Face Repository:

  ```bash
  # Create a directory for the data
  mkdir -p /tmp/data && cd /tmp/data
  git lfs install
  git clone https://huggingface.co/datasets/SakanaAI/ALE-Bench

  # Set the ALE_BENCH_DATA environment variable to use this local copy.
  # If not set, data will be downloaded on demand using hf_hub_download (default).
  export ALE_BENCH_DATA=/tmp/data/ALE-Bench
  ```
For fair and reproducible performance comparisons, we strongly recommend running evaluations on a consistent, specified AWS instance (e.g., `c6i.32xlarge`).
We provide a Terraform configuration to set up the necessary environment, including the ALE-Bench toolkit and required dependencies. Please refer to the AWS Evaluation Guide for detailed instructions on setting up and running evaluations in AWS.
We also provide an MCP (Model Context Protocol) server feature to simplify the use of ALE-Bench as a tool. For setup and usage instructions, please refer to the MCP Server documentation.
```python
import ale_bench
import ale_bench.utils
import datetime as dt
# Start a new evaluation session
session = ale_bench.start(
problem_id="ahc001",
lite_version=False,
num_workers=13, # Adjust based on your machine's physical cores
run_visualization_server=True,
visualization_server_port=8080
)
# NOTE: While the `session` object contains attributes like `private_seeds`,
# `rank_performance_map`, and `standings`, these (and any other attributes
# prefixed with an underscore, e.g., `_private_inputs`) MUST NOT be accessed
# or used during your experiment to ensure fair evaluation.
# Access problem details
problem = session.problem
problem_statement_md = problem.statement # Markdown-formatted problem statement
problem_images = problem.statement_images # Associated images
problem_constraints_obj = problem.constraints # Structured constraints
# --- Your Agent's Logic Begins ---
# Example: Constructing an initial prompt for an LLM/LMM
# (Replace with your agent's prompt engineering)
initial_messages = my_agent.construct_initial_prompt(
problem_statement_md,
problem_images,
problem_constraints_obj
)
# Utility for parsing problem statements (e.g., for OpenAI models)
parsed_content = ale_bench.utils.parse_statement(
problem_statement_md, problem_images, return_openai=True
)
# Obtain a solution from your LLM/LMM agent
agent_response = my_agent.get_llm_response(initial_messages)
extracted_code = my_agent.parse_code_from_response(agent_response)
detected_language = my_agent.detect_code_language(extracted_code)
# Ensure detected_language is one of: "cpp17", "cpp20", "cpp23", "python", "rust"
# Evaluate against public test cases
public_result = session.public_eval(extracted_code, code_language=detected_language)
print(f"Initial Public Score: {public_result.overall_absolute_score}")
# Iterative refinement loop (example)
solution_attempts = [(extracted_code, public_result)]
current_best_code = extracted_code
MAX_REFINEMENT_ITERATIONS = 5  # Maximum number of refinement iterations (adjust as needed)
for i in range(MAX_REFINEMENT_ITERATIONS):
feedback_prompt = my_agent.construct_feedback_prompt(
problem, current_best_code, public_result
)
refined_response = my_agent.get_llm_response(feedback_prompt)
refined_code = my_agent.parse_code_from_response(refined_response)
if refined_code: # Agent might not always produce new code
public_result = session.public_eval(refined_code, code_language=detected_language)
solution_attempts.append((refined_code, public_result))
# Update current_best_code based on problem's score type (minimize/maximize)
# (Implementation depends on your agent's strategy)
current_best_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
else:
print(f"Iteration {i+1}: No new code generated.")
break # Or implement other logic like re-prompting
# Select the final submission based on overall public performance
final_submission_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
# --- Your Agent's Logic Ends ---
# Evaluate the final submission against private test cases
# Ensure `lite_version=False` during session start for rank and performance calculation.
private_result, final_rank, final_performance = session.private_eval(
final_submission_code, code_language=detected_language
)
print(f"Final Private Score: {private_result.overall_absolute_score}")
print(f"Rank: {final_rank}, Performance: {final_performance}")
# Monitor resource consumption
print(f"Current Resource Usage: {session.current_resource_usage}")
print(f"Remaining Resources: {session.remaining_resource_usage}")
# Inspect local Rust tool sources (if applicable)
if session.problem.metadata.problem_type == "reactive": # Example condition
ale_bench.utils.print_dir_tree(session.rust_src_dir)
# Persist session state for later analysis or resumption
session.save("my_ahc001_session.json")
# Explicitly close the session to release resources
session.close()
# To resume a saved session:
# resumed_session = ale_bench.restart("/path/to/my_ahc001_session.json")
# To clear all cached ALE-Bench data (problem data, toolchains):
# ale_bench.clear_cache()
```
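The walkthrough above leaves `my_agent.select_best_code` unspecified. One possible shape for that helper is sketched below; it assumes each attempt is a `(code, public_result)` pair as collected in the refinement loop, and that `problem.metadata.score_type` indicates whether lower or higher scores are better (the exact values of that field are an assumption here, so adapt the check to your setup).

```python
def select_best_code(solution_attempts, score_type):
    """Pick the code string whose public score is best for this problem.

    `solution_attempts` is a list of (code, result) pairs built in the
    refinement loop above. `score_type` is assumed to read like "minimize"
    or "maximize"; the string check below is a guess, not the official API.
    """
    lower_is_better = "min" in str(score_type).lower()
    best_code, _best_result = (min if lower_is_better else max)(
        solution_attempts,
        key=lambda attempt: attempt[1].overall_absolute_score,
    )
    return best_code
```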
For more details about ALE-Bench, please refer to the docs/ directory.
Please see the CONTRIBUTING.md file.
Please cite ALE-Bench as follows:
```bibtex
@article{imajuku2025ale-bench,
  title   = {ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering},
  author  = {Imajuku, Yuki and Horie, Kohki and Iwata, Yoichi and Aoki, Kensho and Takahashi, Naohiro and Akiba, Takuya},
  journal = {arXiv preprint arXiv:2506.09050},
  year    = {2025}
}
```