TransEvalnia is a prompting-based translation evaluation and ranking system that uses reasoning to perform its evaluations and rankings. The system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the individual dimensions and for the overall translation. TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al., 2024) on our own English-Japanese data as well as on several language pairs from various WMT shared tasks. With Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, the evaluations returned are judged highly acceptable by human raters, and the scores that Sonnet and other LLMs assign to the translations correlate well with the scores assigned by the human raters.
Richard Sproat, Tianyu Zhao and Llion Jones. 2025. "TransEvalnia: Reasoning-based Evaluation and Ranking of Translations." http://arxiv.org/abs/2507.12724.
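To illustrate the general shape of a reasoning-based, MQM-grounded evaluation step, here is a minimal sketch. The dimension names, prompt wording, and the generic `llm` callable are illustrative assumptions, not the repository's actual prompts or API; see the paper and the scripts for those.

```python
from typing import Callable

# Illustrative MQM dimensions; see the paper for the exact subset
# of MQM dimensions that TransEvalnia evaluates.
DIMENSIONS = ["accuracy", "terminology", "fluency", "style"]

PROMPT_TEMPLATE = """You are a professional translation evaluator.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

For each dimension below, explain your reasoning, then give a score
from 1 (worst) to 5 (best):
{dimensions}

Finish with an overall score from 1 to 5, formatted as "OVERALL: <score>"."""

def evaluate_translation(
    llm: Callable[[str], str],  # any text-in/text-out LLM interface
    source: str,
    translation: str,
    src_lang: str = "English",
    tgt_lang: str = "Japanese",
) -> str:
    """Ask the LLM for a reasoned, dimension-by-dimension evaluation."""
    prompt = PROMPT_TEMPLATE.format(
        src_lang=src_lang,
        tgt_lang=tgt_lang,
        source=source,
        translation=translation,
        dimensions="\n".join(f"- {d}" for d in DIMENSIONS),
    )
    return llm(prompt)
```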
```
pip install -r requirements.txt
```
A sample of ten lines from the WMT-2021 English-Japanese dataset can be found in sample_data.
The script scripts/sample_script.sh does a complete run of all variants of the system: one-step evaluation and ranking; two-step evaluation followed by ranking; three-step evaluation, interleaving, and ranking; and LLM-based scoring of the translations (which can then be ranked by score).
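As a concrete illustration of the two-step variant (evaluate each translation first, then rank from the evaluations), the sketch below performs the ranking step. The prompt wording and the "BEST:" answer format are hypothetical stand-ins for the repository's actual prompts.

```python
from typing import Callable, Sequence

RANK_TEMPLATE = """Here are evaluations of {n} candidate translations of
the same source text:

{evaluations}

Based only on these evaluations, which translation is best?
Answer with "BEST: <number>"."""

def rank_from_evaluations(
    llm: Callable[[str], str],
    evaluations: Sequence[str],  # outputs of a per-translation evaluation step
) -> int:
    """Two-step ranking: the LLM picks a winner from prior evaluations."""
    prompt = RANK_TEMPLATE.format(
        n=len(evaluations),
        evaluations="\n\n".join(
            f"Translation {i + 1}:\n{e}" for i, e in enumerate(evaluations)
        ),
    )
    reply = llm(prompt)
    # Parse the 1-based index after "BEST:"; raises if the LLM deviates
    # from the requested answer format.
    return int(reply.rsplit("BEST:", 1)[1].split()[0])
```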
See the script scripts/ordering_bias.py for the ordering-bias analysis.
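A simple way to probe ordering bias, in the spirit of what such an analysis checks, is to present the same pair of translations in both orders and test whether the chosen winner flips. The `rank_pair` callable below is a hypothetical stand-in for any pairwise ranking function:

```python
from typing import Callable

def ordering_bias_check(
    rank_pair: Callable[[str, str, str], int],  # (source, first, second) -> 1 or 2
    source: str,
    trans_a: str,
    trans_b: str,
) -> bool:
    """Return True if the ranker is consistent under order reversal."""
    first_pass = rank_pair(source, trans_a, trans_b)   # 1 means trans_a wins
    second_pass = rank_pair(source, trans_b, trans_a)  # 1 means trans_b wins
    # Map both passes back to the same labeling before comparing.
    winner_first = "A" if first_pass == 1 else "B"
    winner_second = "B" if second_pass == 1 else "A"
    return winner_first == winner_second
```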
The main set of data can be obtained at https://huggingface.co/datasets/SakanaAI/TransEvalnia.
The script scripts/download_dataset.py downloads the datasets used in the paper from HuggingFace to local disk, in the same format as the sample data.
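Alternatively, the data can be loaded directly from the Hub with the Hugging Face datasets library. The split name below is an assumption; check the dataset page for the available configurations and splits.

```python
from datasets import load_dataset

# Load the TransEvalnia data from the Hugging Face Hub. The "train"
# split is an assumption; see the dataset page for actual splits.
ds = load_dataset("SakanaAI/TransEvalnia", split="train")
print(ds[0])
```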