ALE-Bench is a benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real-world tasks from the AtCoder Heuristic Contest (AHC), ALE-Bench presents optimization problems (e.g., routing and scheduling) that are computationally hard and admit no known exact solution.
Note: This repository is not an official product of SakanaAI or AtCoder and is therefore not officially supported.
Important: Please do not use this repository to participate in AHCs (AtCoder Heuristic Contest Generative AI Usage Rules - Version 20250616).
(Overview video: `ale_bench_overview.mp4`)
- Install Docker: Follow the official instructions at docker.com.
- Install CairoSVG Dependencies: Refer to the CairoSVG documentation.

  ```bash
  # Linux
  sudo apt install libcairo2-dev libffi-dev

  # macOS
  brew install cairo libffi pkgconf
  export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig:/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"
  export DYLD_LIBRARY_PATH="/usr/local/lib:/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
  ```

  Note: These paths might vary depending on your macOS version and Homebrew installation. If you encounter issues, verify the correct paths for `cairo` and `libffi` installed by Homebrew.
- Install Python (3.9 - 3.13) and ALE-Bench Toolkit:

  ```bash
  # Install via this GitHub repository
  pip install git+https://github.com/SakanaAI/ALE-Bench.git

  # Or clone this GitHub repository and install locally
  git clone https://github.com/SakanaAI/ALE-Bench.git
  cd ALE-Bench
  pip install .

  # Using uv (recommended for faster environment management)
  git clone https://github.com/SakanaAI/ALE-Bench.git
  cd ALE-Bench
  uv venv --python 3.12.9  # Or any supported Python version (3.9 ~ 3.13)
  uv sync
  source .venv/bin/activate
  ```
- Build Docker Images: This script builds the necessary Docker execution images for ALE-Bench. It automatically pulls pre-built base images from Docker Hub (repository: `yimjk/ale-bench`) and then creates local images tagged as `ale-bench:<language>-<version>` with appropriate permissions for your user.

  ```bash
  bash ./scripts/docker_build_all.sh $(id -u) $(id -g)
  ```

  If you prefer to pull all base images beforehand, you can optionally run:

  ```bash
  bash ./scripts/docker_pull_all.sh
  ```
- [Optional] Download Data via Hugging Face Repository:

  ```bash
  # Create a directory for the data
  mkdir -p /tmp/data && cd /tmp/data
  git lfs install
  git clone https://huggingface.co/datasets/SakanaAI/ALE-Bench

  # Set the ALE_BENCH_DATA environment variable to use this local copy.
  # If not set, data will be downloaded on demand using hf_hub_download (default).
  export ALE_BENCH_DATA=/tmp/data/ALE-Bench
  ```
For fair and reproducible performance comparisons, we strongly recommend running evaluations on a consistent, specified AWS instance (e.g., `c6i.32xlarge`).
We provide a Terraform configuration to set up the necessary environment, including the ALE-Bench toolkit and required dependencies. Please refer to the AWS Evaluation Guide for detailed instructions on setting up and running evaluations in AWS.
We also provide an MCP (Model Context Protocol) server feature to simplify the use of ALE-Bench as a tool. For setup and usage instructions, please refer to the MCP Server documentation.
```python
import ale_bench
import ale_bench.utils
import datetime as dt
# Start a new evaluation session
session = ale_bench.start(
problem_id="ahc001",
lite_version=False,
num_workers=13, # Adjust based on your machine's physical cores
run_visualization_server=True,
visualization_server_port=8080
)
# NOTE: While the `session` object contains attributes like `private_seeds`,
# `rank_performance_map`, and `standings`, these (and any other attributes
# prefixed with an underscore, e.g., `_private_inputs`) MUST NOT be accessed
# or used during your experiment to ensure fair evaluation.
# Access problem details
problem = session.problem
problem_statement_md = problem.statement # Markdown-formatted problem statement
problem_images = problem.statement_images # Associated images
problem_constraints_obj = problem.constraints # Structured constraints
# --- Your Agent's Logic Begins ---
# Example: Constructing an initial prompt for an LLM/LMM
# (Replace with your agent's prompt engineering)
initial_messages = my_agent.construct_initial_prompt(
problem_statement_md,
problem_images,
problem_constraints_obj
)
# Utility for parsing problem statements (e.g., for OpenAI models)
parsed_content = ale_bench.utils.parse_statement(
problem_statement_md, problem_images, return_openai=True
)
# Obtain a solution from your LLM/LMM agent
agent_response = my_agent.get_llm_response(initial_messages)
extracted_code = my_agent.parse_code_from_response(agent_response)
detected_language = my_agent.detect_code_language(extracted_code)
# Ensure detected_language is one of: "cpp17", "cpp20", "cpp23", "python", "rust"
# Evaluate against public test cases
public_result = session.public_eval(extracted_code, code_language=detected_language)
print(f"Initial Public Score: {public_result.overall_absolute_score}")
# Iterative refinement loop (example)
solution_attempts = [(extracted_code, public_result)]
current_best_code = extracted_code
MAX_REFINEMENT_ITERATIONS = 5  # Maximum number of refinement iterations (adjust as needed)
for i in range(MAX_REFINEMENT_ITERATIONS):
feedback_prompt = my_agent.construct_feedback_prompt(
problem, current_best_code, public_result
)
refined_response = my_agent.get_llm_response(feedback_prompt)
refined_code = my_agent.parse_code_from_response(refined_response)
if refined_code: # Agent might not always produce new code
public_result = session.public_eval(refined_code, code_language=detected_language)
solution_attempts.append((refined_code, public_result))
# Update current_best_code based on problem's score type (minimize/maximize)
# (Implementation depends on your agent's strategy)
current_best_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
else:
print(f"Iteration {i+1}: No new code generated.")
break # Or implement other logic like re-prompting
# Select the final submission based on overall public performance
final_submission_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
# --- Your Agent's Logic Ends ---
# Evaluate the final submission against private test cases
# Ensure `lite_version=False` during session start for rank and performance calculation.
private_result, final_rank, final_performance = session.private_eval(
final_submission_code, code_language=detected_language
)
print(f"Final Private Score: {private_result.overall_absolute_score}")
print(f"Rank: {final_rank}, Performance: {final_performance}")
# Monitor resource consumption
print(f"Current Resource Usage: {session.current_resource_usage}")
print(f"Remaining Resources: {session.remaining_resource_usage}")
# Inspect local Rust tool sources (if applicable)
if session.problem.metadata.problem_type == "reactive": # Example condition
ale_bench.utils.print_dir_tree(session.rust_src_dir)
# Persist session state for later analysis or resumption
session.save("my_ahc001_session.json")
# Explicitly close the session to release resources
session.close()
# To resume a saved session:
# resumed_session = ale_bench.restart("/path/to/my_ahc001_session.json")
# To clear all cached ALE-Bench data (problem data, toolchains):
# ale_bench.clear_cache()
```
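The walkthrough above leaves `my_agent.select_best_code` unspecified. One possible shape for that helper is sketched below; it assumes each attempt is a `(code, public_result)` pair as collected in the refinement loop, and that `problem.metadata.score_type` indicates whether lower or higher scores are better (the exact values of that field are an assumption here, so adapt the check to your setup).

```python
def select_best_code(solution_attempts, score_type):
    """Pick the code string whose public score is best for this problem.

    `solution_attempts` is a list of (code, result) pairs built in the
    refinement loop above. `score_type` is assumed to read like "minimize"
    or "maximize"; the string check below is a guess, not the official API.
    """
    lower_is_better = "min" in str(score_type).lower()
    best_code, _best_result = (min if lower_is_better else max)(
        solution_attempts,
        key=lambda attempt: attempt[1].overall_absolute_score,
    )
    return best_code
```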
For more details about ALE-Bench, please refer to the docs/ directory.
Please see the CONTRIBUTING.md file.
Please cite ALE-Bench as follows:
```bibtex
@article{imajuku2025ale-bench,
  title   = {ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering},
  author  = {Imajuku, Yuki and Horie, Kohki and Iwata, Yoichi and Aoki, Kensho and Takahashi, Naohiro and Akiba, Takuya},
  journal = {arXiv preprint arXiv:2506.09050},
  year    = {2025}
}
```