
ALE-Bench


ALE-Bench is a benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real-world tasks from the AtCoder Heuristic Contest (AHC), ALE-Bench presents optimization problems (e.g., routing and scheduling) that are computationally hard and admit no known exact solution.

Note: This repository is not an official product of SakanaAI or AtCoder and is therefore not officially supported.

Important: Please do not use this repository to participate in AHCs (AtCoder Heuristic Contest Generative AI Usage Rules - Version 20250616).

Overview video: ale_bench_overview.mp4

Table of Contents

  - Setup
  - Evaluation
  - Documentation
  - Development and Contributing
  - Citation

Setup

  1. Install Docker: Follow the official instructions at docker.com.

  2. Install CairoSVG Dependencies: Refer to the CairoSVG documentation.

    # Linux
    sudo apt install libcairo2-dev libffi-dev
    # macOS
    brew install cairo libffi pkgconf
    export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig:/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"
    export DYLD_LIBRARY_PATH="/usr/local/lib:/opt/homebrew/lib:$DYLD_LIBRARY_PATH"

    Note: These paths might vary depending on your macOS version and Homebrew installation. If you encounter issues, verify the correct paths for cairo and libffi installed by Homebrew.

  3. Install Python (3.9 - 3.13) and ALE-Bench Toolkit:

    # Install via this GitHub repository
    pip install git+https://github.com/SakanaAI/ALE-Bench.git
    
    # Or clone this GitHub repository and install locally
    git clone https://github.com/SakanaAI/ALE-Bench.git
    cd ALE-Bench
    pip install .
    
    # Using uv (recommended for faster environment management)
    git clone https://github.com/SakanaAI/ALE-Bench.git
    cd ALE-Bench
    uv venv --python 3.12.9  # Or any supported Python version (3.9 ~ 3.13)
    uv sync
    source .venv/bin/activate
  4. Build Docker Images: This script will build the necessary Docker execution images for ALE-Bench. It automatically pulls pre-built base images from Docker Hub (repository: yimjk/ale-bench) and then creates local images tagged as ale-bench:<language>-<version> with appropriate permissions for your user.

    bash ./scripts/docker_build_all.sh $(id -u) $(id -g)

    If you prefer to pull all base images beforehand, you can optionally run:

    bash ./scripts/docker_pull_all.sh
  5. [Optional] Download Data via Hugging Face Repository:

    # Create a directory for the data
    mkdir -p /tmp/data && cd /tmp/data
    git lfs install
    git clone https://huggingface.co/datasets/SakanaAI/ALE-Bench
    # Set the ALE_BENCH_DATA environment variable to use this local copy.
    # If not set, data will be downloaded on demand using hf_hub_download (default).
    export ALE_BENCH_DATA=/tmp/data/ALE-Bench
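
To sanity-check the installation, the short sketch below starts a session, prints the beginning of the problem statement, and closes the session. It uses only calls that also appear in the full example script under Evaluation; `lite_version=True` is used here purely to keep the check lightweight (the example below notes when `lite_version=False` is required).

import ale_bench

# Minimal installation check: start a session, peek at the statement, close.
session = ale_bench.start(problem_id="ahc001", lite_version=True)
print(session.problem.statement[:300])  # first few hundred characters of the Markdown statement
session.close()  # release session resources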

Evaluation

For fair and reproducible performance comparisons, we strongly recommend running evaluations on a consistent, specified AWS instance (e.g., c6i.32xlarge).

We provide a Terraform configuration to set up the necessary environment, including the ALE-Bench toolkit and required dependencies. Please refer to the AWS Evaluation Guide for detailed instructions on setting up and running evaluations in AWS.

We also provide an MCP (Model Context Protocol) server to simplify using ALE-Bench as a tool. For setup and usage instructions, please refer to the MCP Server documentation.

Example Evaluation Script

import ale_bench
import ale_bench.utils

# Start a new evaluation session
session = ale_bench.start(
    problem_id="ahc001",
    lite_version=False,
    num_workers=13,  # Adjust based on your machine's physical cores
    run_visualization_server=True,
    visualization_server_port=8080
)

# NOTE: While the `session` object contains attributes like `private_seeds`,
# `rank_performance_map`, and `standings`, these (and any other attributes
# prefixed with an underscore, e.g., `_private_inputs`) MUST NOT be accessed
# or used during your experiment to ensure fair evaluation.

# Access problem details
problem = session.problem
problem_statement_md = problem.statement  # Markdown-formatted problem statement
problem_images = problem.statement_images  # Associated images
problem_constraints_obj = problem.constraints  # Structured constraints

# --- Your Agent's Logic Begins ---

# Example: Constructing an initial prompt for an LLM/LMM
# (Replace with your agent's prompt engineering)
initial_messages = my_agent.construct_initial_prompt(
    problem_statement_md,
    problem_images,
    problem_constraints_obj
)

# Utility for parsing problem statements (e.g., for OpenAI models)
parsed_content = ale_bench.utils.parse_statement(
    problem_statement_md, problem_images, return_openai=True
)

# Obtain a solution from your LLM/LMM agent
agent_response = my_agent.get_llm_response(initial_messages)
extracted_code = my_agent.parse_code_from_response(agent_response)
detected_language = my_agent.detect_code_language(extracted_code)
# Ensure detected_language is one of: "cpp17", "cpp20", "cpp23", "python", "rust"

# Evaluate against public test cases
public_result = session.public_eval(extracted_code, code_language=detected_language)
print(f"Initial Public Score: {public_result.overall_absolute_score}")

# Iterative refinement loop (example)
solution_attempts = [(extracted_code, public_result)]
current_best_code = extracted_code

MAX_REFINEMENT_ITERATIONS = 5  # Maximum number of refinement iterations (adjust as needed)
for i in range(MAX_REFINEMENT_ITERATIONS):
    feedback_prompt = my_agent.construct_feedback_prompt(
        problem, current_best_code, public_result
    )
    refined_response = my_agent.get_llm_response(feedback_prompt)
    refined_code = my_agent.parse_code_from_response(refined_response)

    if refined_code: # Agent might not always produce new code
        public_result = session.public_eval(refined_code, code_language=detected_language)
        solution_attempts.append((refined_code, public_result))
        # Update current_best_code based on problem's score type (minimize/maximize)
        # (Implementation depends on your agent's strategy)
        current_best_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)
    else:
        print(f"Iteration {i+1}: No new code generated.")
        break # Or implement other logic like re-prompting

# Select the final submission based on overall public performance
final_submission_code = my_agent.select_best_code(solution_attempts, problem.metadata.score_type)

# --- Your Agent's Logic Ends ---

# Evaluate the final submission against private test cases
# Ensure `lite_version=False` during session start for rank and performance calculation.
private_result, final_rank, final_performance = session.private_eval(
    final_submission_code, code_language=detected_language
)
print(f"Final Private Score: {private_result.overall_absolute_score}")
print(f"Rank: {final_rank}, Performance: {final_performance}")

# Monitor resource consumption
print(f"Current Resource Usage: {session.current_resource_usage}")
print(f"Remaining Resources: {session.remaining_resource_usage}")

# Inspect local Rust tool sources (if applicable)
if session.problem.metadata.problem_type == "reactive": # Example condition
    ale_bench.utils.print_dir_tree(session.rust_src_dir)

# Persist session state for later analysis or resumption
session.save("my_ahc001_session.json")

# Explicitly close the session to release resources
session.close()

# To resume a saved session:
# resumed_session = ale_bench.restart("/path/to/my_ahc001_session.json")

# To clear all cached ALE-Bench data (problem data, toolchains):
# ale_bench.clear_cache()
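
The selection step (`select_best_code`) in the script above is intentionally left to your agent. One possible strategy, sketched below, keeps whichever attempt achieved the best overall public score; it assumes `problem.metadata.score_type` distinguishes minimization from maximization (the literal values "minimize" and "maximize" are illustrative, so check the values on your ALE-Bench version).

def select_best_code(solution_attempts, score_type):
    """Hypothetical helper: pick the attempt with the best public score.

    `solution_attempts` is the list of (code, public_result) pairs built in the
    refinement loop above; `score_type` is assumed to be "minimize" or "maximize".
    """
    pick = min if score_type == "minimize" else max
    best_code, _best_result = pick(
        solution_attempts, key=lambda attempt: attempt[1].overall_absolute_score
    )
    return best_code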

Documentation

For more details about ALE-Bench, please refer to the docs/ directory.

Development and Contributing

Please see the CONTRIBUTING.md file.

Citation

Please cite ALE-Bench as follows:

@article{imajuku2025ale-bench,
    title = {ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering},
    author = {Imajuku, Yuki and Horie, Kohki and Iwata, Yoichi and Aoki, Kensho and Takahashi, Naohiro and Akiba, Takuya},
    journal = {arXiv preprint arXiv:2506.09050},
    year = {2025}
}
