This repository is the official implementation of "Visual Instruction Pretraining for Domain-Specific Foundation Models".
Abstract | Performance | Usage
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at GitHub.
Figure 1: A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with a domain-specific instruction-following objective and Visual Robustness Learning (VRL). This process instils high-level semantic understanding into the ViT. The resulting weights are then used to initialize models for various downstream perception tasks.
Figure 2: (Left) The synergistic relationship between perception, generation, and reasoning in modern CV. Our proposed ViTP forges a novel link from high-level reasoning to low-level perception, a previously underexplored connection. (Right) Comparison of pretraining paradigms for ViT foundation models. ViTP employs an instruction-following objective to directly instil domain-specific perception capabilities into the vision backbone.
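The VRL objective described above constrains the ViT to produce useful features even when only a sparse subset of its visual tokens reaches the language model. As a minimal, illustrative sketch (not the paper's exact formulation: the function name, keep ratio, and uniform random sampling are assumptions), sparse token selection can look like this:

```python
import torch

def sample_sparse_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Randomly keep a sparse subset of visual tokens.

    tokens: (B, N, C) patch embeddings from the ViT backbone.
    Returns a (B, n_keep, C) tensor; illustrative stand-in for the
    sparse set of visual tokens that VRL trains the backbone to be robust to.
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Per-sample random permutation of token indices; keep the first n_keep.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
```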
| Model | Parameters | Pretrain Dataset | Weights |
|---|---|---|---|
| ViTP_ViT_L_rs | 300M | ModelScope / Hugging Face | ViTP_ViT_L_300M_rs |
| ViTP_ViT_L_med | 300M | | ViTP_ViT_L_300M_med |
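After downloading a checkpoint, it can be inspected with plain PyTorch before plugging it into a downstream config. The file name below is a placeholder for wherever you saved the ModelScope / Hugging Face download, and the `state_dict` wrapper is an assumption about the checkpoint layout:

```python
import torch

# Placeholder path: point this at the downloaded ViTP checkpoint.
ckpt = torch.load('ViTP_ViT_L_300M_rs.pth', map_location='cpu')
# Some checkpoints wrap the weights under a 'state_dict' key; fall back to the raw dict.
state_dict = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
print(f'{len(state_dict)} parameter tensors')
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```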
Performance is reported on the following downstream tasks:
- Remote Sensing Object Detection
- Remote Sensing Semantic Segmentation
- Remote Sensing Change Detection
- Medical Imaging Semantic Segmentation
Remote Sensing Object Detection: weights and logs are available at ModelScope and Hugging Face.
| Dataset | Modality | Annotation | Method | mAP | Config |
|---|---|---|---|---|---|
| DIOR | RGB | Hori. Box | Cascade-RCNN | 79.80 | Config |
| DIOR-R | RGB | Ori. Box | Oriented-RCNN | 75.08 | Config |
| DOTA-v2.0 | RGB | Ori. Box | Oriented-RCNN | 60.23 | Config |
| SARDet-100K | SAR | Hori. Box | Cascade-RCNN | 57.9 | Config |
| SSDD | SAR | Hori. Box | Mask-RCNN | 70.80 | Config |
| RSAR | SAR | Ori. Box | Oriented-RCNN | 72.31 | Config |
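The released ViTP_configs already point their backbones at the pretrained weights. If you adapt these configs to your own data, the usual MMDetection/MMRotate pattern is to set the backbone's `init_cfg` to a `Pretrained` checkpoint; the fragment below is a hypothetical sketch (the checkpoint path and surrounding field names may differ from the shipped configs):

```python
# Hypothetical config fragment (MMDetection/MMRotate style).
model = dict(
    backbone=dict(
        init_cfg=dict(type='Pretrained', checkpoint='path/to/ViTP_ViT_L_300M_rs.pth'),
    ),
)
```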
Remote Sensing Semantic Segmentation: weights and logs are available at ModelScope and Hugging Face.
| Dataset | Modality | Annotation | Method | mIoU | Config |
|---|---|---|---|---|---|
| iSAID | RGB | Mask | UNet | 71.14 | Config |
| LoveDA | RGB | Mask | UperNet | 54.28 | Config |
| UAVid | RGB | Mask | UperNet | 73.39 | Config |
| SSDD | SAR | Polygons | UperNet | 65.90 (AP) | Config |
Remote Sensing Change Detection: weights and logs are available at ModelScope and Hugging Face.
| Dataset | Modality | Annotation | Method | F1 | Config |
|---|---|---|---|---|---|
| SVCD | RGB | Mask | UperNet | 98.63 | Config |
| WHU | RGB | Mask | UNet | 94.98 | Config |
| LEVIR-CD | RGB | Mask | UNet | 92.67 | Config |
| S2Looking | RGB | Mask | UNet | 69.89 | Config |
- Clone this repository:
git clone https://github.com/zcablii/ViTP.git
- Create a conda environment for remote sensing object detection (mmrotate):
cd ViTP/mmrotate
conda create -n vitp-det python==3.10
conda activate vitp-det
- Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html
pip install -r requirements.txt
- Install flash-attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
- Install mmcv:
cd ../mmcv
python setup.py install
cd ../mmrotate
- Install mmrotate:
pip install -e .
- Compile the deformable attention ops:
cd ops
sh make.sh
cd ..
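Before launching training, a quick sanity check confirms the environment is consistent (the same check works for the segmentation and change detection environments by swapping the last import):

```python
# Environment sanity check for the detection setup.
import torch
import mmcv
import mmdet
import mmrotate

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmcv:', mmcv.__version__)
print('mmdet:', mmdet.__version__)
print('mmrotate:', mmrotate.__version__)
```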
Train on DOTA-v2.0 with 8 GPUs:
sh ./tools/dist_train.sh ViTP_configs/vitp_dotav2_orcnn.py 8
Generate the DOTA-v2.0 test-set submission:
sh ./tools/dist_test.sh ./ViTP_configs/vitp_dotav2_orcnn.py ./work_dirs/vitp_dotav2_orcnn/latest.pth 8 --format-only --eval-options submission_dir=./results/vitp_dotav2_orcnn
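For quick qualitative checks, a trained model can also be run on a single image through the standard MMDetection inference API (importing mmrotate registers the rotated detectors); the checkpoint and image paths below are placeholders:

```python
from mmdet.apis import init_detector, inference_detector
import mmrotate  # noqa: F401  # registers MMRotate models with the MMDetection registry

# Placeholder paths: use your trained checkpoint and a test image.
config_file = 'ViTP_configs/vitp_dotav2_orcnn.py'
checkpoint_file = 'work_dirs/vitp_dotav2_orcnn/latest.pth'

model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'demo.jpg')  # list of per-class detection arrays
print(len(result), 'classes')
```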
- Create a conda environment for semantic segmentation (mmseg):
cd ViTP/mmseg
conda create -n vitp-seg python==3.10
conda activate vitp-seg
- Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html
pip install -r requirements.txt
- Install flash attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
- Install mmcv:
cd ../mmcv
python setup.py install
cd ../mmseg
- Install mmsegmentation:
pip install -e .
- Compile the deformable attention ops:
cd ops
sh make.sh
cd ..
Train on iSAID with 8 GPUs:
sh ./tools/dist_train.sh ViTP_configs/vitp_isaid_upernet.py 8
Evaluate mIoU:
sh ./tools/dist_test.sh ./ViTP_configs/vitp_isaid_upernet.py ./work_dirs/vitp_isaid_upernet/latest.pth 8 --eval mIoU
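Single-image inference works the same way through the MMSegmentation 0.x API; the checkpoint and image paths below are placeholders:

```python
from mmseg.apis import init_segmentor, inference_segmentor

# Placeholder paths: use your trained checkpoint and a test image.
config_file = 'ViTP_configs/vitp_isaid_upernet.py'
checkpoint_file = 'work_dirs/vitp_isaid_upernet/latest.pth'

model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'demo.png')  # list with one (H, W) label map
print(result[0].shape)
```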
- Create a conda environment for change detection (open-cd):
cd ViTP/opencd
conda create -n vitp-cd python==3.10
conda activate vitp-cd
- Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -U openmim
mim install mmcv==2.0.0
mim install mmpretrain==1.2.0
pip install -r requirements.txt
- Install flash attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
- Install open-cd:
pip install -e .
Train on S2Looking with 8 GPUs:
sh ./tools/dist_train.sh ./ViTP_configs/vitp_s2looking_upernet.py 8
Test:
sh ./tools/dist_test.sh ./ViTP_configs/vitp_s2looking_upernet.py ./work_dirs/vitp_s2looking_upernet/iter_120000.pth 8
If you use this toolbox or benchmark in your research, please cite this project:

@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}

Licensed under a Creative Commons Attribution-NonCommercial 4.0 International license for non-commercial use only. Any commercial use requires formal permission first.

Related project: SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection


