Welcome to the Xent Benchmark GitHub repository 🥳
Some quick notes:
See more in-depth results at xent.ai.
| Rank | Player ID | Score |
|---|---|---|
| 1 | gemini-2.5-pro | 65.86 |
| 2 | grok-4-0709 | 63.22 |
| 3 | gpt-5 | 62.77 |
| 4 | deepseek-reasoner | 62.67 |
| 5 | gemini-2.5-flash | 59.08 |
| 6 | claude-opus-4-1-20250805 | 58.65 |
| 7 | claude-opus-4-20250514 | 58.35 |
| 8 | gpt-5-mini | 49.22 |
| 9 | claude-sonnet-4-20250514 | 48.45 |
| 10 | kimi-k2-0905-preview | 42.89 |
| 11 | deepseek-chat | 35.48 |
| 12 | gpt-5-nano | 23.27 |
So how do you run a Xent benchmark? There are two main ways: a local web interface and the CLI. There are instructions for both below.
But first, you'll need to install the package: `pip install xent`
If you want to run from source, then you need to have uv installed. See the installation instructions or just run `curl -LsSf https://astral.sh/uv/install.sh | sh`. If using uv, when following the instructions below you'll need to call `uv run xent` instead of just `xent`.
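For example, a typical local setup might look like the following (a sketch; the exact commands may differ on your system, and `xent --help` is assumed here to print the available subcommands):

```
# Install the released package from PyPI
pip install xent

# Or, to run from source: install uv, then invoke the CLI through it
curl -LsSf https://astral.sh/uv/install.sh | sh
uv run xent --help
```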
To start the web interface, call `xent serve`. This starts a web server running on localhost:8000 and opens a browser window to that page.
At the top of the page, you'll see a "Play Game" button. This allows you to play games directly, rather than benchmarking LLMs/agents.
Once you are on the play page, you can specify the game you want to execute by pasting its code into the game code panel. When you start, the judge model (gpt2) runs locally, and you can try out game execution and see the results. Let us know if you write any fun games and we will happily add them to this repository!
If you want to run a benchmark, the first thing you need to do is set up configuration for that benchmark.
- Basic Settings: configure settings that apply to the entire benchmark, such as the judge model, how many rounds per game, and how many maps per game.
- Players: configure which players you want to benchmark. See "Customized Model Configuration" below for how to specify additional information about LLM players via query parameters on the model specification. You can also specify a human player for interactive mode.
- Games: either use the default games that come packaged with Xent, or write custom games (as well as their presentation functions).
Once ready, click "Create Benchmark Configuration" to generate your config.
Once you have saved a benchmark configuration, you can execute it. From the configuration dashboard page, use the "Start Benchmark" button to begin local execution.
Depending on your configuration, benchmark execution can be fairly time-consuming. Once the benchmark has begun, you will see completed games as they finish, but a full run can take anywhere from 2 to 20 minutes, so be patient!
Please look at "Environment variables" below to find out what information you will need to have available in order for execution to complete successfully.
First, we'll create a configuration for your benchmark run. This configuration will contain things like:
- Games to execute
- Maps for the different games
- Players for those games
- Model to use as a judge
To generate such a configuration, call `xent configure` (or `uv run xent configure` if running from source). For example:

```
# Generate a minimal configuration with a simple game played by gpt-4o
xent configure

# Generate a configuration with a simple game played by gpt-4.1 and o3
xent configure --model gpt-4.1 --model o3

# Generate a configuration from specific game files and/or directories
xent configure --game-dir ./games

# See more CLI configuration options
xent configure --help
```

The configuration is written to a JSON file (defaults to `<user data dir>/xent_config.json`; see "Data Storage & Overrides" below), which can be passed to `xent run` for execution. To better understand the configuration file, take a look at CondensedXentBenchmarkConfig and ExpandedXentBenchmarkConfig in src/xent/common/configuration_types.py.
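If you want to peek at the generated file, the default locations follow the platform conventions described under "Data Storage & Overrides" below; for example (a sketch assuming you have not overridden the data directory):

```
# Pretty-print the generated configuration (default locations)
python -m json.tool "$HOME/Library/Application Support/xent/xent_config.json"   # macOS
python -m json.tool "$HOME/.local/share/xent/xent_config.json"                  # Linux
```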
So, now that we have a configuration, how do we run it?
To run a benchmark from the command line, use xent run. But before you do that, here are some notes:
By default, `xent run` uses `<user data dir>/xent_config.json` as the configuration path. Override this with `xent run --config path/to/config.json`. To see all flags, call `xent run --help`.
During execution, `xent run` writes results and artifacts under `<user data dir>/benchmarks/<benchmark_id>`. Override the base directory via `xent run --results-dir path/to/results`; Xent creates a subdirectory named with the benchmark_id.
To be somewhat robust to failure or interruption, Xent checks the results directory for completed work, so re-running an already completed benchmark is effectively a no-op. To force a fresh run, pass either `--regenerate-id` (which creates a new, timestamped benchmark ID for the run) or `--clean` (which destroys any existing data in the results directory). Be careful using `--clean`: you can totally delete your valuable results! I recommend using `--regenerate-id`.
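Putting those flags together, a typical invocation might look like this (a sketch; adjust the paths to wherever you saved your configuration and want your results):

```
# Run a benchmark with an explicit config and results directory,
# forcing a fresh timestamped benchmark ID
xent run --config path/to/config.json --results-dir path/to/results --regenerate-id
```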
A brief note about environment variables: if you want to run a Xent benchmark using paid APIs (e.g., OpenAI), you will need to export an environment variable holding your API key. Currently, Xent supports:
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- GEMINI_API_KEY
- GROK_API_KEY
- DEEPSEEK_API_KEY
- MOONSHOT_API_KEY
You'll get an exception if you try to call these models without the proper environment variables.
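For example, you might export the keys for the providers referenced in your configuration before starting the run (a sketch; only set the variables you actually need):

```
# Keys for the providers your configured players use
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."

xent run --config path/to/config.json
```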
Now that you have completed a benchmark execution, it's time to examine the results. The easiest way to look at the results is to read the generated markdown report, which will be present at `<results root>/<benchmark_id>/report.md`. It contains a human-readable summary (including some nice charting) of all the games played.
In addition to report.md, there is also the benchmark_<benchmark_id>.json file. This contains the complete data generated by the benchmark and its structure is defined in src/xent/common/configuration_types.py as BenchmarkResult.
You'll also see files named "game_<game_name>_<model_name>.json". These files contain results for that game-player pair (as well as the original configuration of the game). The data structure is the GameMapResults type defined in src/xent/common/configuration_types.py. All of the data in these files will be present in the benchmark json, but you may find it handy to inspect them individually as the output can be quite long.
Finally, there is log.txt which is simply the log output of the benchmark execution. Any errors or issues will be visible here.
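As a quick way to poke around the output, you can list the per-benchmark directory and pretty-print one of the JSON files (a sketch using the Linux default results root; substitute your own results root and benchmark ID):

```
# Replace <benchmark_id> with the ID of your run
RESULTS="$HOME/.local/share/xent/benchmarks/<benchmark_id>"

ls "$RESULTS"                 # report.md, benchmark_<benchmark_id>.json, game_*.json, ...
less "$RESULTS/report.md"     # human-readable summary
python -m json.tool "$RESULTS/benchmark_<benchmark_id>.json" | head -n 40
```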
Xent stores all run data (configs, results, logs) under the platform’s standard user data directory by default:
- macOS: `~/Library/Application Support/xent`
- Linux: `~/.local/share/xent`

Defaults:

- Config file: `<user data dir>/xent_config.json`
- Results root: `<user data dir>/benchmarks`
- Per-benchmark data: `<results root>/<benchmark_id>`
- Logs: `<results root>/<benchmark_id>/logs/log.txt`

Overrides:

- CLI flags take precedence: `--config` and `--results-dir`
- `XENT_DATA_DIR` changes the base user data directory (affects both config and results defaults)
- `XENT_RESULTS_DIR` overrides only the results root
- The web UI uses the same results root; its keystore is stored at `<results root>/.api_keys.json`
Directories are created on first use. If you prefer a custom location (e.g., a project folder), pass it via --results-dir.
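For instance, to keep everything inside a project folder, you could point Xent at a local directory using the overrides above (a sketch; any of the three forms works on its own):

```
# Redirect all Xent data (config and results) to a project-local directory
XENT_DATA_DIR=./xent-data xent run

# Override only the results root, via environment variable or CLI flag
XENT_RESULTS_DIR=./results xent run
xent run --results-dir ./results
```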
The following is a guide for those of you who are interested in evaluating with customized model configuration, or even with a custom agent.
By default, Xent uses a default set of configuration options when calling models. For hosted LLM APIs (e.g., OpenAI), Xent makes very little use of configuration such as temperature parameters; for models called via Hugging Face, however, more configuration is necessary.
When calling `xent configure --model foo`, you can use query parameters to specify params that will be forwarded directly to the API request. For example, `xent configure --model gpt-5?reasoning_effort=high` will pass `reasoning_effort: "high"` into the OpenAI requests made for that model.
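For example (quoting the model spec so the shell does not try to glob the ? character; reasoning_effort is the parameter from the example above, and anything else you pass is forwarded as-is):

```
# Forward reasoning_effort=high to the OpenAI requests made for gpt-5
xent configure --model "gpt-5?reasoning_effort=high"

# Query parameters combine with multiple --model flags
xent configure --model "gpt-5?reasoning_effort=high" --model gpt-4.1
```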
You can view the model configuration options available in src/xent/runtime/player_configuration.py. These options (DefaultXGPOptions and DefaultHFXGPOptions) are specified in the options field of PlayerConfig type defined in src/xent/common/configuration_types.py. You can view the actual usage of these options in the HuggingFaceClient class defined in src/xent/runtime/llm_api_client.py.
Some people are interested in running Xent against their own custom agents. If that's you, then here is some good news: we have endeavored to make this relatively simple, as long as you are familiar with Python development.
Here are the key places you should look at to make those changes:
- `src/xent/runtime/players/base_player.py` - This contains the XGP interface that player agents must implement.
- `src/xent/runtime/players/default_players.py` - This contains the default player agent implementation. You can use this as a reference implementation.
- `src/xent/runtime/players/players.py` - This contains the registry and mapping between player types and player implementations.
By making changes to a few files, you can add your own agent implementation to the Xent system and begin benchmarking it yourself. If you have spent the time to do so, then we encourage you to open a pull request with your changes so that others can benefit from your work.
A short guide to get you started modifying Xent and maybe (hopefully!) contributing.
Before you begin, ensure you have the following installed:
- Python 3.12 or higher
- uv for dependency management
Run the CLI tool:

```
uv run xent
```

Run tests:
Note: Running tests will execute GPT-2 via Hugging Face. If you run integration tests, then you'll need to have Ollama running locally with qwen3:0.6b available.
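If you plan to run the integration tests, a minimal Ollama setup might look like this (assuming Ollama is already installed; you can skip ollama serve if the Ollama app or daemon is already running):

```
# Fetch the model the integration tests expect
ollama pull qwen3:0.6b

# Start the Ollama server if it isn't already running
ollama serve
```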
```
# Run all tests
uv run pytest

# Run only unit tests
uv run pytest -m "not integration"

# Run only integration tests
uv run pytest -m integration
```

The project uses modern Python tooling for consistent code quality:
```
# Format code
uv run ruff format .

# Lint and auto-fix issues
uv run ruff check --fix .

# Type check (source only)
uv run mypy src/

# Run all quality checks
uv run ruff format . && uv run ruff check --fix . && uv run mypy src/
```

Pre-commit hooks are available and will automatically:
- Format code with Ruff
- Fix auto-fixable linting issues
- Run type checking on staged files
To run pre-commit manually:
```
uv run pre-commit run --all-files
```

Project-specific VSCode settings are configured to:
- Use Ruff for formatting and linting
- Format on save
- Organize imports automatically
- Integrate with the project's Python interpreter
This project is licensed under the MIT License - see the LICENSE file for details.
The code in this repository is developed from the paper below. Please cite it if you find the repository helpful.
```
@misc{hongler2025crossentropygameslanguagemodels,
      title={Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures},
      author={Clément Hongler and Andrew Emil},
      year={2025},
      eprint={2506.06832},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.06832},
}
```