Lighteval v0.11.0
This release introduces major improvements across usability, stability, performance, and documentation.
Highlights include a large refactor to simplify the architecture, automated metric tests, a dependency rework, improved documentation, and new tasks/benchmarks.
Highlights
- Automated tests for metrics and stronger dependency checks
- Continuous batching, caching, and faster CLI with reduced redundancy
- Upgrade to datasets 4.0 and Trackio integration
- Automatic chat template inference and reasoning trace support
- New tasks: GSM-PLUS, TUMLU-mini, IFBench, Filipino benchmarks, MMLU Redux
- Added Bulgarian, Macedonian, Danish, Icelandic, and Estonian literals
- Documentation improvements (Google docstring style, README updates)
What's Changed
New Features
- Automatic inference of chat template usage (no kwargs needed) by @clefourrier (#885); see the sketch after this list
- More versatile dependency rework by @LysandreJik (#951)
- Automatic tests for metrics by @NathanHB (#939)
- Sample-to-sample comparisons for integration tests by @NathanHB (#977)
- Continuous batching support by @NathanHB (#850)
- Refactored code and removed unused parts by @NathanHB (#709)
- Post-processing for reasoning tokens in pipeline by @clefourrier (#882)
- Logging of the system prompt by @clefourrier (#907)
- Added caching of samples by @clefourrier (#909)
- Upgrade to `datasets` 4.0 by @NathanHB (#924)
- Trackio integration when available by @NathanHB (#930)
- Parameterization of sampling evals from CLI by @clefourrier (#926)
- Local GGUF support in VLLM with HF tokenizer by @JIElite (#972)
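As a quick illustration of how these pieces fit together, here is a minimal programmatic run. All module paths and class names follow the pre-0.11 Python API and may have moved in this release; treat each of them as an assumption to check against the v0.11.0 docs.

```python
# Minimal sketch, assuming the pre-0.11 Python API layout; verify all module
# paths and class names against the v0.11.0 documentation.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.vllm.vllm_model import VLLMModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

model_config = VLLMModelConfig(model_name="meta-llama/Llama-3.1-8B-Instruct")
pipeline = Pipeline(
    tasks="lighteval|gsm8k|0|0",
    pipeline_parameters=PipelineParameters(launcher_type=ParallelismManager.VLLM),
    evaluation_tracker=EvaluationTracker(output_dir="./results"),
    model_config=model_config,
    # No chat-template kwarg: usage is now inferred from the model (#885).
)
pipeline.evaluate()
pipeline.show_results()
```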
Enhancements
- Added `bootstrap_iters` as an argument by @pratyushmaini (#697); see the sketch after this list
- Load tasks before models by @clefourrier (#931)
- Save `reasoning_content` from litellm as details by @muupan (#929)
- Fix for TGI endpoint inference and JSON grammar generation by @cpcdoy (#502)
- Reduced redundancy in CLI arguments by @NathanHB (#932)
- Registry refactor by @clefourrier (#937)
- Multilingual extractiveness support by @rolshoven (#956)
- Added `backend_options` parameter to LLM judges by @rolshoven (#963)
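For context on the `bootstrap_iters` argument: it controls how many resamples are drawn when estimating the standard error of a metric. The function below is a generic, self-contained illustration of that computation, not lighteval's implementation.

```python
import random

def bootstrap_stderr(scores, bootstrap_iters=1000, seed=0):
    """Estimate the standard error of the mean score by bootstrap resampling.

    Generic sketch of what a bootstrap_iters-style parameter controls;
    not lighteval's actual code.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(bootstrap_iters):
        # Resample the per-sample scores with replacement and record the mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    mean_of_means = sum(means) / len(means)
    variance = sum((m - mean_of_means) ** 2 for m in means) / (len(means) - 1)
    return variance ** 0.5

# Example: per-sample accuracies from an eval run.
print(bootstrap_stderr([1, 0, 1, 1, 0, 1, 0, 1], bootstrap_iters=2000))
```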
Documentation
- Added `org_to_bill` parameter by @tfrere (#781)
- Updated docs with Google docstring style by @NathanHB (#941)
- Updated README by @NathanHB (#961)
New Tasks
- Added GSM-PLUS by @NathanHB (#780); task spec examples follow this list
- Added TUMLU-mini benchmark (fixes #577) by @ceferisbarov (#811)
- Added Filipino benchmark community tasks by @ljvmiranda921 (#852)
- Added MMLU Redux and fixed caching by @clefourrier (#883)
- Added IFBench by @clefourrier (#944)
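The new benchmarks are addressed through the usual task spec strings. The task and suite names below are assumptions based on this list; check the task registry in your install for the exact registered names.

```python
# Assumed task specs for the new benchmarks; names and suites are guesses,
# and the "suite|task|few_shot|truncate_few_shot" format follows pre-0.11
# conventions — verify both against your install.
NEW_TASK_SPECS = [
    "lighteval|gsm_plus|0|0",    # GSM-PLUS (#780)
    "community|tumlu_mini|0|0",  # TUMLU-mini (#811)
    "extended|ifbench|0|0",      # IFBench (#944)
    "lighteval|mmlu_redux|0|0",  # MMLU Redux (#883)
]
# Any of these strings can be passed as the `tasks` argument of the Pipeline
# sketch shown under New Features.
```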
Task and Metrics Changes
- Added Bulgarian and Macedonian literals by @dianaonutu (#769)
- Added Danish translation literals by @spyysalo (#770)
- Added Icelandic translation literals by @joenaess (#775)
- Completed Estonian translation literals by @spyysalo (#779)
- Updated `translation_literals.py` by @dianaonutu (#923)
Bug Fixes
- Fixed #794: assigned `SummaCZS` instance in Faithfulness metric by @sahilds1 (#795)
- Caught ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 (#812)
- Fixed GPQA and index extractive metric by @clefourrier (#829)
- Updated `extractive_match_utils.py` for cases with `:` by @clefourrier (#831)
- Fixed `from_model` function and added tests by @NathanHB (#921)
- Fixed tasks list by @alielfilali01 (#906)
- Set upper bound on VLLM version by @NathanHB (#964)
- Fixed batching bug in metrics by @rolshoven (#958)
Other Changes
- Fixed typo in attribute name (`CONCURENT_CALLS` → `CONCURRENT_CALLS`) by @muupan (#884)
- Added ability to configure `concurrent_requests` in `litellm_model.py` by @dameikle (#911); see the sketch after this list
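A sketch of the new throttling knob for the litellm backend. The config class, import path, and field names are inferred from the bullet above and the module name, and may differ in the released API; treat them as assumptions.

```python
# Hedged sketch: LiteLLMModelConfig and its fields are assumptions inferred
# from litellm_model.py and the changelog entry; check the v0.11.0 API.
from lighteval.models.litellm_model import LiteLLMModelConfig

config = LiteLLMModelConfig(
    model_name="gpt-4o-mini",  # placeholder provider model id
    concurrent_requests=8,     # new in #911: caps parallel API calls
)
```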
New Contributors
We’re excited to welcome new contributors in this release:
@pratyushmaini, @DeVikingMark, @sahilds1, @dianaonutu, @tfrere, @mcleish7, @leopardracer, @spyysalo, @ceferisbarov, @joenaess, @ryantzr1, @dtung8068, @muupan, @NouamaneTazi, @uralik, @dameikle, @ljvmiranda921, @cpcdoy, @rolshoven, @JIElite, @LysandreJik
Full Changelog: v0.10.0...v0.11.0