
RLMEval: Evaluating Research-Level Neural Theorem Proving & Autoformalization

This repository contains the code for evaluating neural theorem proving and autoformalization methods on the RLM25 benchmark.

Project setup

Important

Set the environment variables GITHUB_ACCESS_TOKEN (see GitHub documentation) and OPENAI_API_KEY.
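For example, both variables can be exported in your shell before running any of the scripts below (the values here are placeholders, not real tokens):

# GitHub personal access token, used to fetch repository data
export GITHUB_ACCESS_TOKEN="ghp_..."
# OpenAI API key, used by the OpenAI-backed model configs
export OPENAI_API_KEY="sk-..."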

Development inside Docker (recommended)

In VS Code, run the Dev Containers: Open Folder in Container... command from the Command Palette (F1). The .devcontainer folder contains the necessary configuration and will take care of setting up the environment.

Local installation

Requirements:

Install the Python project:

pip install -e .
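If you prefer an isolated environment (a standard Python workflow, not a repository requirement), you can create and activate a virtual environment before installing:

# Create and activate a virtual environment, then install in editable mode
python -m venv .venv
source .venv/bin/activate
pip install -e .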

RLM25

Prepare the RLM25 dataset:

python scripts/extract_benchmark.py --config configs/benchmark/rlm25.yaml

Run statement autoformalization evaluation:

python scripts/eval_statement_autoformalization.py --benchmark-config configs/benchmark/rlm25.yaml --model-config configs/models/gpt-4o_greedy.yaml

Run proof autoformalization evaluation:

python scripts/eval_proof_autoformalization.py --benchmark-config configs/benchmark/rlm25.yaml --model-config configs/models_proof/gpt-4o_greedy.yaml
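Putting the steps together, a minimal end-to-end run looks like the following sketch. It only chains the commands shown above and assumes the environment variables from the project setup have already been set:

# 1. Prepare the RLM25 benchmark
python scripts/extract_benchmark.py --config configs/benchmark/rlm25.yaml

# 2. Evaluate statement autoformalization
python scripts/eval_statement_autoformalization.py \
    --benchmark-config configs/benchmark/rlm25.yaml \
    --model-config configs/models/gpt-4o_greedy.yaml

# 3. Evaluate proof autoformalization
python scripts/eval_proof_autoformalization.py \
    --benchmark-config configs/benchmark/rlm25.yaml \
    --model-config configs/models_proof/gpt-4o_greedy.yaml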

Citation

If you use this code or the RLM25 benchmark in your research, please consider citing one of the following papers. If you use the benchmark for evaluating automated theorem proving or proof autoformalization methods, please cite the first paper. If you use it for evaluating statement autoformalization methods, please cite the second paper.

RLMEval: Evaluating Research-Level Neural Theorem Proving, EMNLP 2025 Findings

@misc{poiroux2025rlmevalevaluatingresearchlevelneural,
    title={RLMEval: Evaluating Research-Level Neural Theorem Proving}, 
    author={Auguste Poiroux and Antoine Bosselut and Viktor Kunčak},
    year={2025},
    eprint={2510.25427},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2510.25427}, 
}

Reliable Evaluation and Benchmarks for Statement Autoformalization, EMNLP 2025

@misc{poiroux2025reliableevaluationbenchmarksstatement,
    title={Reliable Evaluation and Benchmarks for Statement Autoformalization}, 
    author={Auguste Poiroux and Gail Weiss and Viktor Kunčak and Antoine Bosselut},
    year={2025},
    eprint={2406.07222},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2406.07222}, 
}
