
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation (Real-Robot-Questions)

ManipBench

This repo contains the code for generating and evaluating VLMs on real-world robot datasets for the paper "ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation".

If you like our project, please give us a star ⭐ on GitHub for the latest updates.


🔔News

🔥[2025-9-28]: ManipBench was presented at CoRL 2025

🔥[2025-8-1]: ManipBench was accepted to CoRL 2025

For now, we only provide instructions for evaluating VLMs on our benchmark. This codebase also includes the code for generating questions from real-world robotic manipulation datasets and the server-side code (the robot-side code is here) for conducting the real-world experiments presented in our paper. We will release instructions for that code in the near future.

For real-world questions, feel free to contact Enyu Zhao. For fabric-manipulation questions, please check with Vedant Raval for details. For the simulation code and questions, you can check with Hejia Zhang.

Introduction

Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. We propose ManipBench to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation.

Dataset Summary

The complete ManipBench dataset consists of 12,617 VQAs across tasks including pick-and-place, articulated object manipulation, deformable object manipulation, and dynamic manipulation.

In this codebase we provide the code for generating and evaluating the questions from real-world robotic manipulation datasets (9,180 VQAs).

We provide two types of VQAs. In the first type, the VLM must choose a picking point and gripper-movement trajectory combination from four candidates. A sample question is shown below:

Sample_q1

In the second type, the VLM must first choose a picking point from four candidates and then choose which tile is the ideal placement location. A sample question is shown below:

Sample_q2

If you only want to evaluate a VLM's performance on our benchmark, you don't have to worry about the code for question generation or the real-world experiments.

Instructions for Evaluating VLMs

1. Use our pipeline:

  1. Download the dataset: We provide the dataset in this link. Download the zip file and then unzip it in this folder. The file structure should look like:

    |BEN-VLM-MOKA
    --|Dataset_ManipBench_ExData
    ----|bridge_pick_place
    ----|droid_arti
    ----|droid_pick_place
    

    Here BEN-VLM-MOKA refers to this repo's path. You can rename "Dataset_ManipBench_ExData" to anything you like; in the following steps, YOUR_DATASET_ROOT refers to this folder name.

  2. Evaluate the VLM: We provide the code for evaluating the VLMs included in our paper. If you want to test those VLMs, you can run the code in each model's folder. For example, to test gpt-4o on type 1 questions from the bridge dataset, run the following command from the root folder BEN-VLM-MOKA:

    python ./gpt/gpt_test_moka.py --dataset_root_folder YOUR_DATASET_ROOT --dataset_folder bridge_pick_place --dataset_name bridge --question_type Q1 --model_name gpt-4o --total_batches 10 
    

    • If you want to evaluate VLMs not included in our paper, you can use BEN-VLM-MOKA/MODEL_FAMILY/MODEL_FAMILY_test_moka.py (e.g., BEN-VLM-MOKA/gpt/gpt_test_moka.py) as an example to create your own script for evaluating new VLMs. The base evaluation class is in manipbench_question_pipeline.py. A batch-evaluation sketch is shown below this list.
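
    For reference, here is a minimal sketch (not part of the repo) of a wrapper that sweeps the provided evaluation script over both question types by reusing the command-line flags shown above. The "Q2" value for --question_type mirrors the question-type naming and is an assumption; check the script's argument parser for the exact values it expects.

    # hypothetical batch-evaluation wrapper; reuses the CLI flags shown above
    import subprocess

    DATASET_ROOT = "Dataset_ManipBench_ExData"  # i.e., YOUR_DATASET_ROOT

    for question_type in ("Q1", "Q2"):  # "Q2" is assumed to be a valid flag value
        subprocess.run(
            [
                "python", "./gpt/gpt_test_moka.py",
                "--dataset_root_folder", DATASET_ROOT,
                "--dataset_folder", "bridge_pick_place",
                "--dataset_name", "bridge",
                "--question_type", question_type,
                "--model_name", "gpt-4o",
                "--total_batches", "10",
            ],
            check=True,  # raise if an evaluation run fails
        )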

2. Just want to use our questions for evaluating new VLMs:

Alternatively, we also provide an easy-to-use dataset if you don't want to write code following our example. You can download the zip file from here. The file structure will look like this:

|simplified_manipbench_rr
--|Q1
----|bridge_pick_place
------|question_0
--------|question.txt
--------|answer.txt
--------|image.png
------|question_1
------|...
----|droid_arti
----|droid_pick_place
--|Q2
----|bridge_pick_place
------|question_0
--------|picking_point_question.txt
--------|picking_point_answer.txt
--------|ending_tile_question.txt
--------|ending_tile_answer.txt
--------|image.png
------|question_1
------|...
----|droid_arti
----|droid_pick_place

For question type 1, the input prompt to feed into a VLM is stored in question.txt and the expected answer is stored in answer.txt. The input image is image.png.

For question type 2, since the VLM has to answer both the picking-point question and the ending-tile question, we store the input prompts separately in picking_point_question.txt and ending_tile_question.txt, with the corresponding expected answers in picking_point_answer.txt and ending_tile_answer.txt. The input image is image.png.
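
As an example of how these files can be consumed, here is a minimal loader and scoring sketch for type-1 questions. It only assumes the folder layout above; query_vlm is a placeholder for your own model call, and the exact-string match against answer.txt is for illustration only, so adapt it to the actual answer format.

# minimal sketch for iterating over the simplified dataset (Q1 questions)
from pathlib import Path

ROOT = Path("simplified_manipbench_rr")  # unzipped simplified dataset

def load_q1(question_dir: Path):
    """Return (prompt, expected_answer, image_path) for one type-1 question."""
    prompt = (question_dir / "question.txt").read_text()
    expected = (question_dir / "answer.txt").read_text().strip()
    return prompt, expected, question_dir / "image.png"

def evaluate_q1(query_vlm, dataset: str = "bridge_pick_place") -> float:
    """Accuracy of a VLM over all type-1 questions of one dataset split."""
    correct, total = 0, 0
    for question_dir in sorted((ROOT / "Q1" / dataset).glob("question_*")):
        prompt, expected, image_path = load_q1(question_dir)
        prediction = query_vlm(prompt, image_path)  # plug in your VLM call here
        correct += int(prediction.strip() == expected)  # illustrative exact match
        total += 1
    return correct / max(total, 1)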

Citation

If you find our work helpful, please cite it:

BibTeX:

@inproceedings{Manipbench2025,
  title={ManipBench: Benchmarking vision-language models for low-level robot manipulation},
  author={Zhao, Enyu and Raval, Vedant and Zhang, Hejia and Mao, Jiageng and Shangguan, Zeyu and Nikolaidis, Stefanos and Wang, Yue and Seita, Daniel},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2025}
}
