This is an automatically updated, open-source large language model (LLM) leaderboard with data sourced from HuggingFace. It makes it easy to view and compare the performance of a wide range of LLMs.
- 🔄 Automatic Updates: Fetches the latest model evaluation data from HuggingFace daily via GitHub Actions (see the fetch sketch after this list)
- 📊 Complete Data: Provides comprehensive leaderboard data, including model names, parameter counts, and various evaluation scores
- 📱 Responsive Design: Supports viewing leaderboard data on various devices
- 🔍 Search and Sort: Supports searching and sorting by different metrics on the complete leaderboard page
- 📥 Data Download: Provides data in JSON and CSV formats for download
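The daily update essentially pulls the latest evaluation results from HuggingFace and re-exports them as JSON and CSV. Below is a minimal sketch of that step, assuming the aggregated results are published as the `open-llm-leaderboard/contents` dataset and the API token is supplied via an `HF_TOKEN` environment variable; the dataset id, token variable, column names, and output file names are assumptions, not taken from this repository's code.

```python
import os
from datasets import load_dataset

# Pull the aggregated leaderboard results (dataset id is an assumption).
ds = load_dataset(
    "open-llm-leaderboard/contents",
    split="train",
    token=os.environ.get("HF_TOKEN"),  # token variable name is an assumption
)

# Rank by average score, then keep the columns shown in the table below.
# The exact column names in the source dataset are assumptions.
df = ds.to_pandas().sort_values("Average ⬆️", ascending=False)
cols = ["fullname", "Average ⬆️", "#Params (B)", "IFEval", "BBH",
        "MATH Lvl 5", "GPQA", "MUSR", "MMLU-PRO"]
df = df[[c for c in cols if c in df.columns]].reset_index(drop=True)

# Export in both formats offered for download.
df.to_json("leaderboard.json", orient="records", indent=2)
df.to_csv("leaderboard.csv", index=False)
```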
Last updated: 2025-10-27 01:10:03 UTC
| Rank | Model | Average Score | Parameters (B) | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-PRO |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MaziyarPanahi/calme-3.2-instruct-78b 📑 | 52.08 | 78.0 | 80.63 | 62.61 | 40.33 | 20.36 | 38.53 | 70.03 |
| 2 | MaziyarPanahi/calme-3.1-instruct-78b 📑 | 51.29 | 78.0 | 81.36 | 62.41 | 39.27 | 19.46 | 36.50 | 68.72 |
| 3 | dfurman/CalmeRys-78B-Orpo-v0.1 📑 | 51.23 | 78.0 | 81.63 | 61.92 | 40.63 | 20.02 | 36.37 | 66.80 |
| 4 | MaziyarPanahi/calme-2.4-rys-78b 📑 | 50.77 | 78.0 | 80.11 | 62.16 | 40.71 | 20.36 | 34.57 | 66.69 |
| 5 | huihui-ai/Qwen2.5-72B-Instruct-abliterated 📑 | 48.11 | 72.7 | 85.93 | 60.49 | 60.12 | 19.35 | 12.34 | 50.41 |
| 6 | Qwen/Qwen2.5-72B-Instruct 📑 | 47.98 | 72.7 | 86.38 | 61.87 | 59.82 | 16.67 | 11.74 | 51.40 |
| 7 | MaziyarPanahi/calme-2.1-qwen2.5-72b 📑 | 47.86 | 72.7 | 86.62 | 61.66 | 59.14 | 15.10 | 13.30 | 51.32 |
| 8 | newsbang/Homer-v1.0-Qwen2.5-72B 📑 | 47.46 | 72.7 | 76.28 | 62.27 | 49.02 | 22.15 | 17.90 | 57.17 |
| 9 | ehristoforu/qwen2.5-test-32b-it 📑 | 47.37 | 32.8 | 78.89 | 58.28 | 59.74 | 15.21 | 19.13 | 52.95 |
| 10 | Saxo/Linkbricks-Horizon-AI-Avengers-V1-32B 📑 | 47.34 | 32.8 | 79.72 | 57.63 | 60.27 | 14.99 | 18.16 | 53.25 |
| 11 | MaziyarPanahi/calme-2.2-qwen2.5-72b 📑 | 47.22 | 72.7 | 84.77 | 61.80 | 58.91 | 14.54 | 12.02 | 51.31 |
| 12 | fluently-lm/FluentlyLM-Prinum 📑 | 47.22 | 32.8 | 80.90 | 59.48 | 54.00 | 18.23 | 17.26 | 53.42 |
| 13 | JungZoona/T3Q-Qwen2.5-14B-Instruct-1M-e3 📑 | 47.09 | 0.0 | 73.24 | 65.47 | 28.63 | 22.26 | 38.69 | 54.27 |
| 14 | JungZoona/T3Q-qwen2.5-14b-v1.0-e3 📑 | 47.09 | 14.8 | 73.24 | 65.47 | 28.63 | 22.26 | 38.69 | 54.27 |
| 15 | zetasepic/Qwen2.5-32B-Instruct-abliterated-v2 📑 | 46.89 | 32.8 | 83.34 | 56.53 | 59.52 | 15.66 | 14.93 | 51.35 |
| 16 | rubenroy/Gilgamesh-72B 📑 | 46.79 | 72.7 | 84.86 | 61.84 | 43.81 | 19.24 | 17.66 | 53.36 |
| 17 | Sakalti/ultiima-72B 📑 | 46.77 | 72.7 | 71.40 | 61.10 | 53.55 | 21.92 | 18.12 | 54.51 |
| 18 | CombinHorizon/zetasepic-abliteratedV2-Qwen2.5-32B-Inst-BaseMerge-TIES 📑 | 46.76 | 32.8 | 83.28 | 56.83 | 58.53 | 15.66 | 14.22 | 52.05 |
| 19 | maldv/Awqward2.5-32B-Instruct 📑 | 46.75 | 32.8 | 82.55 | 57.21 | 62.31 | 12.08 | 13.87 | 52.48 |
| 20 | raphgg/test-2.5-72B 📑 | 46.74 | 72.7 | 84.37 | 62.15 | 41.09 | 18.57 | 20.52 | 53.74 |
The complete leaderboard data can be viewed on the full leaderboard page (with search and sorting) or downloaded as JSON and CSV files.
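For example, the downloaded CSV can be explored locally with pandas. A minimal sketch, assuming the export is named `leaderboard.csv` and uses the column headers shown in the table above (both assumptions):

```python
import pandas as pd

# Load the exported leaderboard and show the ten highest-scoring models.
df = pd.read_csv("leaderboard.csv")
print(df.sort_values("Average Score", ascending=False).head(10))

# Filter to models at or below 35B parameters, ranked by average score.
small = df[df["Parameters (B)"] <= 35].sort_values("Average Score", ascending=False)
print(small[["Model", "Average Score", "Parameters (B)"]].head(5))
```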
The leaderboard includes the following main evaluation metrics:
- Average ⬆️: Average of the six benchmark scores below (a quick check follows this list)
- IFEval: Instruction following capability evaluation
- BBH: Big-Bench Hard benchmark for large language models
- MATH Lvl 5: Mathematical problem solving, restricted to the hardest (Level 5) problems of the MATH benchmark
- GPQA: Graduate-Level Google-Proof Q&A, a set of difficult graduate-level science questions
- MUSR: Multistep Soft Reasoning evaluation
- MMLU-PRO: A harder, professional-grade version of the Massive Multitask Language Understanding benchmark
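The Average ⬆️ column is consistent with a plain arithmetic mean of the six benchmark scores. Checking the top-ranked model from the table above:

```python
# Recompute the average for MaziyarPanahi/calme-3.2-instruct-78b
# from the six benchmark scores listed in the table.
scores = {"IFEval": 80.63, "BBH": 62.61, "MATH Lvl 5": 40.33,
          "GPQA": 20.36, "MUSR": 38.53, "MMLU-PRO": 70.03}
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 52.08, matching the Average Score column
```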
Running the project locally requires:
- Python 3.10+
- A HuggingFace API token
- Clone the repository
```bash
git clone https://github.com/chenjy16/modelrank_ai.git
cd modelrank_ai
```
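The fetch step needs the HuggingFace API token listed in the requirements above. One common pattern is to read it from an environment variable and log in with `huggingface_hub`; the variable name `HF_TOKEN` and the use of `huggingface_hub.login` are assumptions here, not this project's documented setup.

```python
import os
from huggingface_hub import login

# Authenticate once per session; later HuggingFace API calls reuse the token.
login(token=os.environ["HF_TOKEN"])  # HF_TOKEN is an assumed variable name
```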
This project is open-sourced under the MIT License.
Data is sourced from HuggingFace.