🤗 Evaluation Data | 📑 Paper | 📢 Leaderboard (coming soon)
Note
Our technical report has been released on arXiv.
We are happy to release MMBench-GUI, a hierarchical, multi-platform benchmark framework and toolbox for evaluating GUI agents. MMBench-GUI comprises four evaluation levels: GUI Content Understanding, GUI Element Grounding, GUI Task Automation, and GUI Task Collaboration. We also propose the Efficiency–Quality Area (EQA) metric for agent navigation, which integrates accuracy and efficiency. MMBench-GUI provides a rigorous standard for evaluating GUI agent capabilities and guiding their future development.
MMBench-GUI is developed on top of VLMEvalKit and supports evaluating models either through an API or via local deployment. We hope that MMBench-GUI will enable more researchers to evaluate agents more efficiently and comprehensively. You can refer to the How-to-Use section for detailed usage.
Examples of tasks at each level
- Hierarchical Evaluation: We developed a hierarchical evaluation framework to systematically and comprehensively assess GUI agents' capabilities. In short, we organize the framework into four ascending levels, termed L1–L4.
- Multi-platform evaluation: We establish a robust, multi-platform evaluation dataset spanning Windows, macOS, Linux, iOS, Android, and Web interfaces, ensuring extensive coverage and relevance to real-world applications.
- A more human-aligned evaluation metric for planning: We value both the speed and the quality of the agent. Therefore, we propose the Efficiency–Quality Area (EQA) metric as a replacement for Success Rate (SR): EQA balances accuracy and efficiency, rewarding agents that achieve task objectives in as few operational steps as possible (see the sketch after this list).
- Manually reviewed and optimized online task setup: We conducted a thorough review of existing online tasks and excluded those that could not be completed due to issues such as network or account restrictions.
- More up-to-date evaluation data and more comprehensive task design: We collected, annotated, and processed additional evaluation data through a semi-automated workflow to better assess the agent’s localization and understanding capabilities. Overall, the benchmark comprises over 8,000 tasks spanning various operating platforms.
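To build intuition for what an area-style metric like EQA rewards, here is a minimal sketch under our own assumptions; it is not the exact EQA formula from the paper. In this sketch, quality is taken as the fraction of tasks solved within a given step budget, and the score is the normalized area under the quality-vs-budget curve, so completing tasks in fewer steps earns a higher score.

```python
# Illustrative sketch only -- NOT the official EQA definition from the paper.
from typing import List, Optional


def eqa_sketch(steps_used: List[Optional[int]], max_budget: int) -> float:
    """Return a score in [0, 1].

    steps_used[i] is the number of steps the agent needed to finish task i,
    or None if the task was not completed.
    """
    if not steps_used or max_budget <= 0:
        return 0.0
    area = 0.0
    for budget in range(1, max_budget + 1):
        # Quality at this budget: fraction of tasks solved within `budget` steps.
        solved = sum(1 for s in steps_used if s is not None and s <= budget)
        area += solved / len(steps_used)
    # Normalize by the number of budgets so the score stays in [0, 1].
    return area / max_budget


# Example: three tasks solved in 3, 8, and 12 steps, one task failed.
print(round(eqa_sketch([3, 8, 12, None], max_budget=15), 3))
```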
- Release our technical report, in which we evaluate several GUI agents on our benchmark.
- Support `circular` mode for the evaluation of `GUIContentUnderstanding`.
- Support `GUITaskAutomation` based on Docker for all platforms.
- Support `GUITaskCollaboration` based on Docker for all platforms.
- 2025.07.24 We release our technical report (Paper).
- 2025.06.24 We have released the refactored code for the `L1-GUI Content Understanding` and `L2-GUI Element Grounding` tasks. Tasks for `L3-GUI Task Automation` and `L4-GUI Task Collaboration` will be integrated into this codebase next.
- 2025.06.24 We have released the images and JSON files used in the `L1-GUI Content Understanding` and `L2-GUI Element Grounding` tasks on HuggingFace.
- Build a conda env (we used CUDA 12.4 when developing this project).
conda create -n mmbench-gui python==3.9
conda activate mmbench-gui
- Install torch
pip install tqdm
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
- Install VLMEvalkit
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
- (optional) Flash attention is used to accelerate inference, so we recommend installing it:
pip install flash-attn==2.7.4.post1 --no-build-isolation
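After installation, an optional sanity check like the one below (our own snippet, not part of the repository) confirms that the CUDA build of torch and the optional flash-attn package are importable:

```python
# Optional sanity check (not part of MMBench-GUI): verify the CUDA build of
# torch and the optional flash-attn install before running evaluations.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    print("flash-attn not installed (optional).")
```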
Note
We also provide the full environment package list in requirements/dev_env.txt for reproducing the same environment as ours.
Please download our data in HuggingFace, and organize these files as below:
DATA_ROOT/    // We use LMUData in VLMEvalKit as the default root dir.
|-- MMBench-GUI/
| |-- offline_images/
| | |-- os_windows/
| | | |-- 0b08bd98_a0e7b2a5_68e346390d562be39f55c1aa7db4a5068d16842c0cb29bd1c6e3b49292a242d1.png
| | | |-- ...
| | |-- os_mac/
| | |-- os_linux/
| | |-- os_ios/
| | |-- os_android/
| | `-- os_web/
| |-- L1_annotations.json
|   `-- L2_annotations.json
You can also run download.py to automatically download the data from OpenXLab:
LMUData=/path/of/data python utils/download.py
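Before running an evaluation, you can optionally verify the layout with a small script like the following (our own helper, assuming the folder names shown above and that the LMUData environment variable points at DATA_ROOT):

```python
# Our own helper (not shipped with MMBench-GUI): check that the downloaded data
# matches the layout shown above. Assumes LMUData points at DATA_ROOT.
import os

root = os.path.join(os.environ.get("LMUData", "."), "MMBench-GUI")
expected = ["offline_images", "L1_annotations.json", "L2_annotations.json"]

for name in expected:
    path = os.path.join(root, name)
    print(("[ok]   " if os.path.exists(path) else "[miss] ") + path)
```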
Here, we evaluate the UI-TARS-1.5-7B model as an example, since it is already integrated into our benchmark.
# Single GPU
LMUData=/path/of/data python evaluate.py --config configs/config_local_uitars.py
You can refer to Development Guidance for details about how to integrate and evaluate your model with MMBench-GUI.
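As a rough orientation before reading the Development Guidance, a custom model wrapper typically just needs to turn an interleaved text/image message into a text response. The skeleton below is only our own illustration; the actual base class, registration, and config hooks are described in the Development Guidance:

```python
# Rough sketch of a custom model wrapper (our own illustration; follow the
# Development Guidance for the real base class and registration steps).
# VLMEvalKit-style messages are assumed to be lists of {"type": ..., "value": ...}
# dicts mixing text prompts and image paths.
from typing import Dict, List, Optional


class MyGUIAgent:
    def __init__(self, model_path: str):
        # Load your checkpoint or API client here; `model_path` is a placeholder.
        self.model_path = model_path

    def generate(self, message: List[Dict[str, str]], dataset: Optional[str] = None) -> str:
        """Return the model's text response for one evaluation sample."""
        texts = [m["value"] for m in message if m["type"] == "text"]
        images = [m["value"] for m in message if m["type"] == "image"]
        # Replace this stub with a real forward pass or API call.
        return f"stub answer for {len(images)} image(s): {' '.join(texts)[:80]}"
```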
Caution
We are re-validating the final results, so the model performance shown in this table may change; we will update it as soon as possible.
Results shown in these tables are obtained in an API-based manner, and we use the same parameters for all models.
Please refer to Development Guidance.
Please refer to FAQs.
We would like to thank the following great works, which provided important references for the development of MMBench-GUI.
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)
@article{wang2025mmbenchgui,
title = {MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents},
  author  = {Xuehui Wang and Zhenyu Wu and JingJing Xie and Zichen Ding and Bowen Yang and Zehao Li and Zhaoyang Liu and Qingyun Li and Xuan Dong and Zhe Chen and Weiyun Wang and Xiangyu Zhao and Jixuan Chen and Haodong Duan and Tianbao Xie and Shiqian Su and Chenyu Yang and Yue Yu and Yuan Huang and Yiqian Liu and Xiao Zhang and Xiangyu Yue and Weijie Su and Xizhou Zhu and Wei Shen and Jifeng Dai and Wenhai Wang},
journal = {arXiv preprint arXiv:2507.19478},
year = {2025}
}