This repository provides a streamlined tool for evaluating open-source and proprietary Large Language Models (LLMs) on the Massive Multitask Language Understanding (MMLU) benchmark.
Evaluating Large Language Models (LLMs) on standard benchmarks is crucial for understanding their capabilities and limitations. The MMLU benchmark, with its diverse range of subjects, is a widely recognised standard for assessing knowledge and reasoning.
This project provides an easy-to-use, standardised, and reproducible framework for running MMLU evaluations on both open-source and proprietary LLMs, logging performance metrics, and saving results in a structured format. It lets users assess model performance (accuracy) and inference efficiency (execution time and VRAM utilisation) across MMLU subjects, enabling comparative analysis of LLMs' capabilities and resource requirements for diverse applications.
Currently, the following types of models are supported:
- Hugging Face `transformers` open-source models
- Bedrock models
Quantisation and batching are supported for the Hugging Face `transformers` open-source models.
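As a rough illustration of what quantised loading involves, the sketch below uses `transformers` with a `BitsAndBytesConfig`; the model name and quantisation settings are illustrative assumptions, not this project's defaults.

```python
# Illustrative sketch only: loading an open-source model with 4-bit quantisation
# via transformers + bitsandbytes. Model name and settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical example model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantise weights to 4 bits
    bnb_4bit_compute_dtype=torch.float16,   # compute in half precision
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on available GPU(s)
)
```

Batching then typically amounts to tokenising several MMLU questions at once (with padding) and running a single forward pass per batch.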
Evaluation results are saved in the `results/` folder as JSON files recording the accuracy, number of examples, execution time, used VRAM, and total available VRAM for each evaluated MMLU subject.
Each results file has the following structure (illustrative data, not actual results):
```json
{
    "abstract_algebra": {
        "accuracy": 0.8235294117647058,
        "number_examples": 17,
        "execution_time": 364.404,
        "used_VRAM": 7934,
        "total_VRAM": 8188
    },
    "high_school_physics": {
        "accuracy": 0.625,
        "number_examples": 16,
        "execution_time": 840.058,
        "used_VRAM": 2224,
        "total_VRAM": 8188
    }
}
```
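As a quick sanity check, the per-subject entries can be aggregated into an overall score with a few lines of Python; the file name below is a hypothetical example, not a path produced by the script.

```python
# Illustrative sketch: compute overall MMLU accuracy from a saved results file.
# "results/example_model.json" is a hypothetical file name.
import json

with open("results/example_model.json", encoding="utf-8") as f:
    results = json.load(f)

correct = sum(r["accuracy"] * r["number_examples"] for r in results.values())
total = sum(r["number_examples"] for r in results.values())
print(f"Overall accuracy: {correct / total:.3f} over {total} examples")
```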
- MIT License.
- Copyright holder: Renaud Gaudron
This project leverages the incredible work of various communities. We extend our sincere gratitude to:
- Hugging Face: For their invaluable `transformers` and `datasets` libraries, which are central to the LLM loading and data management within this benchmarking suite.
- Amazon Web Services (AWS): For their Bedrock service and the `boto3` library.
- PyTorch: For providing the foundational deep learning framework, essential for the operation of the Hugging Face models.
- MMLU Dataset Creators: For developing and making available the Massive Multitask Language Understanding (MMLU) dataset, which forms the core of these LLM evaluations. The dataset can be found at: https://huggingface.co/datasets/cais/mmlu
- The Python Community: For the rich ecosystem of libraries.