Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their low-level reasoning ability, i.e., making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark for evaluating how well VLMs can aid low-level reasoning in robotics. We propose ManipBench to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation.
Learn more at https://manipbench.github.io
This repository contains the code for running the fabric manipulation experiments from the ManipBench benchmark. Feel free to check out the other repositories linked on the website for running the experiments on real-robot datasets and in simulation environments.
The scripts in this repo use the multiple-choice questions for evaluating fabric (deformable) manipulation and test the performance of multiple Vision-Language Model (VLM) families in terms of their question-answering accuracy. These accuracies are reported in the paper either on a scale of 0 to 1 (with 1 meaning the model answered all questions correctly) or on a scale of 0% to 100% (with 100% meaning the same).
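The two scales are the same fraction of correctly answered questions, reported either as a ratio or as a percentage. A minimal sketch, using made-up per-question outcomes:

```python
# Minimal sketch: question-answering accuracy as reported in the paper.
# `results` is a hypothetical list of per-question outcomes (True = correct choice).
results = [True, False, True, True]
accuracy = sum(results) / len(results)            # 0-1 scale
print(f"{accuracy:.2f} ({accuracy * 100:.0f}%)")  # equivalently, the 0%-100% scale
```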
We would deeply appreciate it if you file a GitHub issue when you notice any discrepancies in the code. Feel free to reach out to us if you have any trouble running the code.
Download the questions for evaluating fabric manipulation from the project website. The downloaded folder contains all the questions as well as the corresponding images. See manipbench-questions/ for reference; note, however, that the images associated with the questions are not included in this repository due to upload restrictions.
Move the downloaded folder to the project root (the code expects it to be named manipbench-questions), then check the .json files in the downloaded manipbench-questions directory and ensure that the image paths referenced in those JSON files exist.
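A minimal sketch of this sanity check is shown below. It assumes each question entry stores its image location under an "image" key; adjust the key to match the actual structure of the downloaded .json files.

```python
# Minimal sketch: verify that image paths referenced in the question JSONs exist.
# The "image" field name is an assumption; adapt it to the actual .json schema.
import json
import os
from glob import glob

missing = []
for json_path in glob("manipbench-questions/**/*.json", recursive=True):
    with open(json_path) as f:
        data = json.load(f)
    questions = data if isinstance(data, list) else [data]
    for question in questions:
        image_path = question.get("image")
        if image_path and not os.path.exists(image_path):
            missing.append((json_path, image_path))

print(f"{len(missing)} missing image path(s)")
```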
Ensure that all the required dependencies are installed.
Use the main.py script to run the experiments.
The script, in its current state, lets the user choose among the following evaluation modes:
- Default mode: Run `python3 main.py --output_dir {output_dir}`. Evaluates all the default model families and sizes on all ten low-level reasoning tasks. Check the `MODEL_FAMILIES`, `MODEL_SIZES`, and `TASKS` in `main.py`.
- User-specified model: Run `python3 main.py --model_family {model_family} --model_size {model_size} --output_dir {output_dir}`. Evaluates the user-specified model on all ten tasks.
- User-specified model and task: Run `python3 main.py --model_family {model_family} --model_size {model_size} --task {task} --output_dir {output_dir}`. Evaluates the user-specified model on only the user-specified task.
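For example, a concrete invocation of the third mode might look like `python3 main.py --model_family gemini --model_size flash --task folding --output_dir results/`, where `gemini`, `flash`, and `folding` are purely illustrative placeholders; use the exact family, size, and task names listed in `MODEL_FAMILIES`, `MODEL_SIZES`, and `TASKS` in `main.py`.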
Note that every mode takes the --output_dir argument. This is the parent directory where the evaluation logs will be saved. These logs include the input question, the answer chosen by the model, and the reasoning the model used to reach its answer.
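To give a rough idea of their contents, a single log record conceptually looks like the following; the field names here are illustrative assumptions, not the exact schema written by main.py.

```python
# Illustrative only: the actual log format produced by main.py may differ.
log_entry = {
    "question": "Which pick-and-place action best flattens the fabric? (A) ... (D) ...",
    "model_answer": "B",
    "model_reasoning": "Option B pulls the folded corner outward, which ...",
}
```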
To run the default mode, you will also need to provide your API keys for the OpenAI and Google Gemini APIs. Save these keys in files and update the GPT_API_KEY and GEMINI_API_KEY variables in main.py to point to the correct paths. If you are running the script in a different mode, you can instead pass the API key file via the --api_key_file argument.
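As a minimal illustration (the paths below are placeholders, not the repository's actual defaults), the variables in main.py should point at the files holding your keys:

```python
# Illustrative placeholders: set these in main.py to the files containing your keys.
GPT_API_KEY = "/path/to/openai_key.txt"     # file containing your OpenAI API key
GEMINI_API_KEY = "/path/to/gemini_key.txt"  # file containing your Google Gemini API key
```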
The final accuracies for the models will be saved in the corresponding logs.
As described above, main.py is the main script for this sub-project, handling all the model calls and question answering. The helper scripts can be found in utils/. We also include the older version of the code, main_old.py, which was used to run the experiments reported in the paper. However, that version was not well polished and was difficult to maintain, so we release this cleaner version of the code.
The dataset-generation/ directory contains helper scripts used to convert the raw real-world data we collected into the .json question-answer files. If your goal is just to replicate the experiments reported in the paper, you do not need the files under dataset-generation/.
If you find our work helpful, please consider citing it:
BibTeX:
@inproceedings{Manipbench2025,
title={ManipBench: Benchmarking vision-language models for low-level robot manipulation},
author={Zhao, Enyu and Raval, Vedant and Zhang, Hejia and Mao, Jiageng and Shangguan, Zeyu and Nikolaidis, Stefanos and Wang, Yue and Seita, Daniel},
booktitle={Conference on Robot Learning (CoRL)},
year={2025}
}