Official code release for Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models.
Fangrui Zhu*, Hanhui Wang*, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang
*Equal Contribution
[📑 Paper (arXiv)](https://arxiv.org/abs/2506.04220)
Dataset and Models
- We propose a perception-guided 2D prompting strategy, Struct2D Prompting, and conduct a detailed zero-shot analysis showing that LMMs can perform 3D spatial reasoning from structured 2D inputs alone (see the illustrative sketch after this list).
- We introduce Struct2D-Set, a large-scale instruction-tuning dataset with automatically generated, fine-grained QA pairs covering eight spatial reasoning categories grounded in 3D scenes.
- We fine-tune an open-source LMM to achieve competitive performance across several spatial reasoning benchmarks, validating the real-world applicability of our framework.
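To make the prompting idea concrete, below is a minimal, hypothetical sketch of how structured perception outputs (object labels and 3D centers) could be rendered into a marked bird's-eye-view image plus a coordinate listing for an LMM. The function and variable names (`make_bev_prompt`, `objects`, etc.) are illustrative only and are not the repo's actual API; please refer to the paper and code for the real Struct2D Prompting pipeline.

```python
# Illustrative sketch only: one way to turn 3D scene perception into a structured
# 2D prompt (a marked bird's-eye-view image plus object coordinates in text).
import numpy as np
from PIL import Image, ImageDraw

def make_bev_prompt(objects, canvas=512, margin=24):
    """objects: list of (label, (x, y, z) center in meters). Returns (PIL image, prompt text)."""
    xy = np.array([c[:2] for _, c in objects], dtype=float)   # drop height, keep x/y
    lo, hi = xy.min(0), xy.max(0)
    scale = (canvas - 2 * margin) / max((hi - lo).max(), 1e-6)
    px = (xy - lo) * scale + margin                            # world -> pixel coordinates

    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)
    lines = []
    for i, ((label, center), (u, v)) in enumerate(zip(objects, px), start=1):
        draw.ellipse([u - 6, v - 6, u + 6, v + 6], outline="red", width=2)
        draw.text((u + 8, v - 8), str(i), fill="red")          # numbered mark on the map
        lines.append(f"{i}. {label} at (x={center[0]:.2f}, y={center[1]:.2f}) m")
    prompt = "Top-down map of the scene with numbered object marks:\n" + "\n".join(lines)
    return img, prompt

# Toy usage with made-up detections
objs = [("chair", (1.2, 0.5, 0.4)), ("table", (2.0, 1.1, 0.7)), ("sofa", (0.3, 2.4, 0.5))]
bev, text = make_bev_prompt(objs)
bev.save("bev_prompt.png")
print(text)
```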
conda create -n struct2d python=3.10 -y
conda activate struct2d
git clone [email protected]:neu-vi/struct2d.git
cd struct2d
pip install -e ".[torch,metrics]" --no-build-isolation
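After installation, a quick sanity check (not part of the repo, just an assumption that PyTorch is installed via the `torch` extra) can confirm the environment resolved correctly:

```python
# Verify that PyTorch imports and report whether a CUDA device is visible.
import torch
print(torch.__version__, torch.cuda.is_available())
```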
If you find Struct2D helpful in your research, please consider citing:
@article{zhu2025struct2d,
  title={Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models},
  author={Zhu, Fangrui and Wang, Hanhui and Xie, Yiming and Gu, Jing and Ding, Tianye and Yang, Jianwei and Jiang, Huaizu},
  journal={arXiv preprint arXiv:2506.04220},
  year={2025}
}
We thank the authors of GPT4Scene and LLaMA-Factory for inspiring discussions and for open-sourcing their codebases.