
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation (Real-Robot-Questions)

ManipBench

This repo contains the code for generating and evaluating VLMs on real-world robot datasets for the paper "ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation".

If you like our project, please give us a star ⭐ on GitHub for the latest updates.


🔔News

🔥[2025-9-28]: ManipBench was presented at CoRL 2025

🔥[2025-8-1]: ManipBench was accepted to CoRL 2025

For now, we only provide instructions for evaluating VLMs on our benchmark. This codebase also includes the code for generating questions from real-world robotic manipulation datasets and the server-side code (the robot-side code is here) for conducting the real-world experiments presented in our paper. We will release instructions for that code in the near future.

For real-world questions, feel free to contact Enyu Zhao. For fabric-manipulation questions, please check with Vedant Raval for details. For the simulation code and questions, you can check with Hejia Zhang.

Introduction

Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. We propose ManipBench to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation.

Dataset Summary

The complete ManipBench dataset consists of 12,617 VQAs across tasks including pick-and-place, articulated object manipulation, deformable object manipulation, and dynamic manipulation.

In this codebase we provide the code for generating and evaluating the questions from real-world robotic manipulation datasets (9,180 VQAs).

We provide two types of VQAs. In the first type, the VLM must choose a picking point and gripper-movement trajectory combination from four candidates. A sample question is shown below:

Sample_q1

In the second type, the VLM must first choose a picking point from four candidates and then choose which tile is the ideal placement location. A sample question is shown below:

Sample_q2

If you only want to evaluate a VLM's performance on our benchmark, you don't have to worry about the code for question generation or the real-world experiments.

Instructions for Evaluating VLMs

1. Use our pipeline:

  1. Download the dataset: We provide the dataset in this link. Download the zip file and then unzip it in this folder. The file structure should look like:

    |BEN-VLM-MOKA
    --|Dataset_ManipBench_ExData
    ----|bridge_pick_place
    ----|droid_arti
    ----|droid_pick_place
    

    Here BEN-VLM-MOKA refers to this repo's path. You can rename "Dataset_ManipBench_ExData" to anything you like; in the following steps, YOUR_DATASET_ROOT refers to this folder name.

  2. Evaluate the VLM: We provide the code for evaluating the VLMs included in our paper. If you want to test those VLMs, you can run the code in each model's folder. For example, to test gpt-4o on type 1 questions from the bridge dataset, run the following command from the root folder BEN-VLM-MOKA:

    python ./gpt/gpt_test_moka.py --dataset_root_folder YOUR_DATASET_ROOT --dataset_folder bridge_pick_place --dataset_name bridge --question_type Q1 --model_name gpt-4o --total_batches 10 
    

    • If you want to evaluate VLMs not included in our paper, you can use BEN-VLM-MOKA/MODEL_FAMILY/MODEL_FAMILY_test_moka.py (e.g., BEN-VLM-MOKA/gpt/gpt_test_moka.py) as an example to create your own script for evaluating new VLMs. The base evaluation class is in manipbench_question_pipeline.py. A batch-evaluation sketch is shown below this list.
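
    For reference, here is a minimal sketch (not part of the repo) of a wrapper that sweeps the provided evaluation script over both question types by reusing the command-line flags shown above. The "Q2" value for --question_type mirrors the question-type naming and is an assumption; check the script's argument parser for the exact values it expects.

    # hypothetical batch-evaluation wrapper; reuses the CLI flags shown above
    import subprocess

    DATASET_ROOT = "Dataset_ManipBench_ExData"  # i.e., YOUR_DATASET_ROOT

    for question_type in ("Q1", "Q2"):  # "Q2" is assumed to be a valid flag value
        subprocess.run(
            [
                "python", "./gpt/gpt_test_moka.py",
                "--dataset_root_folder", DATASET_ROOT,
                "--dataset_folder", "bridge_pick_place",
                "--dataset_name", "bridge",
                "--question_type", question_type,
                "--model_name", "gpt-4o",
                "--total_batches", "10",
            ],
            check=True,  # raise if an evaluation run fails
        )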

2. Just want to use our questions for evaluating new VLMs:

Alternatively, we also provide an easy-to-use dataset if you don't want to write code following our example. You can download the zip file from here. The file structure will look like this:

|simplified_manipbench_rr
--|Q1
----|bridge_pick_place
------|question_0
--------|question.txt
--------|answer.txt
--------|image.png
------|question_1
------|...
----|droid_arti
----|droid_pick_place
--|Q2
----|bridge_pick_place
------|question_0
--------|picking_point_question.txt
--------|picking_point_answer.txt
--------|ending_tile_question.txt
--------|ending_tile_answer.txt
--------|image.png
------|question_1
------|...
----|droid_arti
----|droid_pick_place

For question type 1, the input prompt to feed into a VLM is stored in question.txt and the expected answer is stored in answer.txt. The input image is image.png.

For question type 2, since the VLM has to answer both the picking-point question and the ending-tile question, we store the input prompts separately in picking_point_question.txt and ending_tile_question.txt, with the corresponding expected answers in picking_point_answer.txt and ending_tile_answer.txt. The input image is image.png.
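
As an example of how these files can be consumed, here is a minimal loader and scoring sketch for type-1 questions. It only assumes the folder layout above; query_vlm is a placeholder for your own model call, and the exact-string match against answer.txt is for illustration only, so adapt it to the actual answer format.

# minimal sketch for iterating over the simplified dataset (Q1 questions)
from pathlib import Path

ROOT = Path("simplified_manipbench_rr")  # unzipped simplified dataset

def load_q1(question_dir: Path):
    """Return (prompt, expected_answer, image_path) for one type-1 question."""
    prompt = (question_dir / "question.txt").read_text()
    expected = (question_dir / "answer.txt").read_text().strip()
    return prompt, expected, question_dir / "image.png"

def evaluate_q1(query_vlm, dataset: str = "bridge_pick_place") -> float:
    """Accuracy of a VLM over all type-1 questions of one dataset split."""
    correct, total = 0, 0
    for question_dir in sorted((ROOT / "Q1" / dataset).glob("question_*")):
        prompt, expected, image_path = load_q1(question_dir)
        prediction = query_vlm(prompt, image_path)  # plug in your VLM call here
        correct += int(prediction.strip() == expected)  # illustrative exact match
        total += 1
    return correct / max(total, 1)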

Citation

If you find our work helpful, please cite it:

BibTeX:

@inproceedings{Manipbench2025,
  title={ManipBench: Benchmarking vision-language models for low-level robot manipulation},
  author={Zhao, Enyu and Raval, Vedant and Zhang, Hejia and Mao, Jiageng and Shangguan, Zeyu and Nikolaidis, Stefanos and Wang, Yue and Seita, Daniel},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2025}
}
