Lighteval v0.11.0
This release introduces major improvements across usability, stability, performance, and documentation.
Highlights include a large refactor to simplify the architecture, automated metric tests, a dependency rework, improved documentation, and new tasks/benchmarks.
Highlights
- Automated tests for metrics and stronger dependency checks
- Continuous batching, caching, and faster CLI with reduced redundancy
- Upgrade to datasets 4.0 and Trackio integration
- Automatic chat template inference and reasoning trace support
- New tasks: GSM-PLUS, TUMLU-mini, IFBench, Filipino benchmarks, MMLU Redux
- Added Bulgarian, Macedonian, Danish, Icelandic, and Estonian literals
- Documentation improvements (Google docstring style, README updates)
What's Changed
New Features
- Automatic inference of chat template usage (no kwargs needed) by @clefourrier (#885); see the sketch after this list
- More versatile dependency rework by @LysandreJik (#951)
- Automatic tests for metrics by @NathanHB (#939)
- Sample-to-sample comparisons for integration tests by @NathanHB (#977)
- Continuous batching support by @NathanHB (#850)
- Refactored code and removed unused parts by @NathanHB (#709)
- Post-processing for reasoning tokens in pipeline by @clefourrier (#882)
- Logging of the system prompt by @clefourrier (#907)
- Added caching of samples by @clefourrier (#909)
- Upgrade to `datasets` 4.0 by @NathanHB (#924)
- Trackio integration when available by @NathanHB (#930)
- Parameterization of sampling evals from CLI by @clefourrier (#926)
- Local GGUF support in VLLM with HF tokenizer by @JIElite (#972)
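As a quick illustration of how these pieces fit together, here is a minimal programmatic run. All module paths and class names follow the pre-0.11 Python API and may have moved in this release; treat each of them as an assumption to check against the v0.11.0 docs.

```python
# Minimal sketch, assuming the pre-0.11 Python API layout; verify all module
# paths and class names against the v0.11.0 documentation.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.vllm.vllm_model import VLLMModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

model_config = VLLMModelConfig(model_name="meta-llama/Llama-3.1-8B-Instruct")
pipeline = Pipeline(
    tasks="lighteval|gsm8k|0|0",
    pipeline_parameters=PipelineParameters(launcher_type=ParallelismManager.VLLM),
    evaluation_tracker=EvaluationTracker(output_dir="./results"),
    model_config=model_config,
    # No chat-template kwarg: usage is now inferred from the model (#885).
)
pipeline.evaluate()
pipeline.show_results()
```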
Enhancements
- Added `bootstrap_iters` as an argument by @pratyushmaini (#697); see the sketch after this list
- Load tasks before models by @clefourrier (#931)
- Save `reasoning_content` from litellm as details by @muupan (#929)
- Fix for TGI endpoint inference and JSON grammar generation by @cpcdoy (#502)
- Reduced redundancy in CLI arguments by @NathanHB (#932)
- Registry refactor by @clefourrier (#937)
- Multilingual extractiveness support by @rolshoven (#956)
- Added `backend_options` parameter to LLM judges by @rolshoven (#963)
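For context on the `bootstrap_iters` argument: it controls how many resamples are drawn when estimating the standard error of a metric. The function below is a generic, self-contained illustration of that computation, not lighteval's implementation.

```python
import random

def bootstrap_stderr(scores, bootstrap_iters=1000, seed=0):
    """Estimate the standard error of the mean score by bootstrap resampling.

    Generic sketch of what a bootstrap_iters-style parameter controls;
    not lighteval's actual code.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(bootstrap_iters):
        # Resample the per-sample scores with replacement and record the mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    mean_of_means = sum(means) / len(means)
    variance = sum((m - mean_of_means) ** 2 for m in means) / (len(means) - 1)
    return variance ** 0.5

# Example: per-sample accuracies from an eval run.
print(bootstrap_stderr([1, 0, 1, 1, 0, 1, 0, 1], bootstrap_iters=2000))
```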
Documentation
- Added `org_to_bill` parameter by @tfrere (#781)
- Updated docs with Google docstring style by @NathanHB (#941)
- Updated README by @NathanHB (#961)
New Tasks
- Added GSM-PLUS by @NathanHB (#780); task spec examples follow this list
- Added TUMLU-mini benchmark (fixes #577) by @ceferisbarov (#811)
- Added Filipino benchmark community tasks by @ljvmiranda921 (#852)
- Added MMLU Redux and fixed caching by @clefourrier (#883)
- Added IFBench by @clefourrier (#944)
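The new benchmarks are addressed through the usual task spec strings. The task and suite names below are assumptions based on this list; check the task registry in your install for the exact registered names.

```python
# Assumed task specs for the new benchmarks; names and suites are guesses,
# and the "suite|task|few_shot|truncate_few_shot" format follows pre-0.11
# conventions — verify both against your install.
NEW_TASK_SPECS = [
    "lighteval|gsm_plus|0|0",    # GSM-PLUS (#780)
    "community|tumlu_mini|0|0",  # TUMLU-mini (#811)
    "extended|ifbench|0|0",      # IFBench (#944)
    "lighteval|mmlu_redux|0|0",  # MMLU Redux (#883)
]
# Any of these strings can be passed as the `tasks` argument of the Pipeline
# sketch shown under New Features.
```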
Task and Metrics Changes
- Added Bulgarian and Macedonian literals by @dianaonutu (#769)
- Added Danish translation literals by @spyysalo (#770)
- Added Icelandic translation literals by @joenaess (#775)
- Completed Estonian translation literals by @spyysalo (#779)
- Updated `translation_literals.py` by @dianaonutu (#923)
Bug Fixes
- Fixed #794: assigned `SummaCZS` instance in Faithfulness metric by @sahilds1 (#795)
- Caught ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 (#812)
- Fixed GPQA and index extractive metric by @clefourrier (#829)
- Updated `extractive_match_utils.py` for cases with `:` by @clefourrier (#831)
- Fixed `from_model` function and added tests by @NathanHB (#921)
- Fixed tasks list by @alielfilali01 (#906)
- Set upper bound on VLLM version by @NathanHB (#964)
- Fixed batching bug in metrics by @rolshoven (#958)
Other Changes
- Fixed typo in attribute name (`CONCURENT_CALLS` → `CONCURRENT_CALLS`) by @muupan (#884)
- Added ability to configure `concurrent_requests` in `litellm_model.py` by @dameikle (#911); see the sketch after this list
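A sketch of the new throttling knob for the litellm backend. The config class, import path, and field names are inferred from the bullet above and the module name, and may differ in the released API; treat them as assumptions.

```python
# Hedged sketch: LiteLLMModelConfig and its fields are assumptions inferred
# from litellm_model.py and the changelog entry; check the v0.11.0 API.
from lighteval.models.litellm_model import LiteLLMModelConfig

config = LiteLLMModelConfig(
    model_name="gpt-4o-mini",  # placeholder provider model id
    concurrent_requests=8,     # new in #911: caps parallel API calls
)
```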
New Contributors
We’re excited to welcome new contributors in this release:
@pratyushmaini, @DeVikingMark, @sahilds1, @dianaonutu, @tfrere, @mcleish7, @leopardracer, @spyysalo, @ceferisbarov, @joenaess, @ryantzr1, @dtung8068, @muupan, @NouamaneTazi, @uralik, @dameikle, @ljvmiranda921, @cpcdoy, @rolshoven, @JIElite, @LysandreJik
Full Changelog: v0.10.0...v0.11.0