This repository is the official implementation of "Visual Instruction Pretraining for Domain-Specific Foundation Models".
Abstract | Performance | Usage
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at GitHub.
Figure 1: A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with a domain-specific instruction-following objective and Visual Robustness Learning (VRL). This process instils high-level semantic understanding into the ViT. The resulting weights are then used to initialize models for various downstream perception tasks.
Figure 2: (Left) The synergistic relationship between perception, generation, and reasoning in modern CV. Our proposed ViTP forges a novel link from high-level reasoning to low-level perception, a previously underexplored connection. (Right) Comparison of pretraining paradigms for ViT foundation models. ViTP employs an instruction-following objective to directly instil domain-specific perception capabilities into the vision backbone.
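The VRL objective described above constrains the ViT to produce useful features even when only a sparse subset of its visual tokens reaches the language model. As a minimal, illustrative sketch (not the paper's exact formulation: the function name, keep ratio, and uniform random sampling are assumptions), sparse token selection can look like this:

```python
import torch

def sample_sparse_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Randomly keep a sparse subset of visual tokens.

    tokens: (B, N, C) patch embeddings from the ViT backbone.
    Returns a (B, n_keep, C) tensor; illustrative stand-in for the
    sparse set of visual tokens that VRL trains the backbone to be robust to.
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Per-sample random permutation of token indices; keep the first n_keep.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
```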
| Model | Parameters | Pretrain Dataset | Weights |
|---|---|---|---|
| ViTP_ViT_L_rs | 300M | ModelScope / Hugging Face | ViTP_ViT_L_300M_rs |
| ViTP_ViT_L_med | 300M | | ViTP_ViT_L_300M_med |
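After downloading a checkpoint, it can be inspected with plain PyTorch before plugging it into a downstream config. The file name below is a placeholder for wherever you saved the ModelScope / Hugging Face download, and the `state_dict` wrapper is an assumption about the checkpoint layout:

```python
import torch

# Placeholder path: point this at the downloaded ViTP checkpoint.
ckpt = torch.load('ViTP_ViT_L_300M_rs.pth', map_location='cpu')
# Some checkpoints wrap the weights under a 'state_dict' key; fall back to the raw dict.
state_dict = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
print(f'{len(state_dict)} parameter tensors')
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```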
Performance is reported on the following downstream tasks:
- Remote Sensing Object Detection
- Remote Sensing Semantic Segmentation
- Remote Sensing Change Detection
- Medical Imaging Semantic Segmentation
Remote Sensing Object Detection: weights and logs are available at ModelScope and Hugging Face.
| Dataset | Modality | Annotation | Method | mAP | Config |
|---|---|---|---|---|---|
| DIOR | RGB | Hori. Box | Cascade-RCNN | 79.80 | Config |
| DIOR-R | RGB | Ori. Box | Oriented-RCNN | 75.08 | Config |
| DOTA-v2.0 | RGB | Ori. Box | Oriented-RCNN | 60.23 | Config |
| SARDet-100K | SAR | Hori. Box | Cascade-RCNN | 57.9 | Config |
| SSDD | SAR | Hori. Box | Mask-RCNN | 70.80 | Config |
| RSAR | SAR | Ori. Box | Oriented-RCNN | 72.31 | Config |
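The released ViTP_configs already point their backbones at the pretrained weights. If you adapt these configs to your own data, the usual MMDetection/MMRotate pattern is to set the backbone's `init_cfg` to a `Pretrained` checkpoint; the fragment below is a hypothetical sketch (the checkpoint path and surrounding field names may differ from the shipped configs):

```python
# Hypothetical config fragment (MMDetection/MMRotate style).
model = dict(
    backbone=dict(
        init_cfg=dict(type='Pretrained', checkpoint='path/to/ViTP_ViT_L_300M_rs.pth'),
    ),
)
```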
Remote Sensing Semantic Segmentation: weights and logs are available at ModelScope and Hugging Face.
| Dataset | Modality | Annotation | Method | mIoU | Config |
|---|---|---|---|---|---|
| iSAID | RGB | Mask | UNet | 71.14 | Config |
| LoveDA | RGB | Mask | UperNet | 54.28 | Config |
| UAVid | RGB | Mask | UperNet | 73.39 | Config |
| SSDD | SAR | Polygons | UperNet | 65.90 (AP) | Config |
Remote Sensing Change Detection: weights and logs are available at ModelScope and Hugging Face.
| Dataset | Modality | Annotation | Method | F1 | Config |
|---|---|---|---|---|---|
| SVCD | RGB | Mask | UperNet | 98.63 | Config |
| WHU | RGB | Mask | UNet | 94.98 | Config |
| LEVIR-CD | RGB | Mask | UNet | 92.67 | Config |
| S2Looking | RGB | Mask | UNet | 69.89 | Config |
- Clone this repository:
git clone https://github.com/zcablii/ViTP.git
- Create a conda environment for remote sensing object detection (mmrotate):
cd ViTP/mmrotate
conda create -n vitp-det python==3.10
conda activate vitp-det
- Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html
pip install -r requirements.txt
- Install flash-attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
- Install mmcv:
cd ../mmcv
python setup.py install
cd ../mmrotate
- Install mmrotate:
pip install -e .
- Compile the deformable attention ops:
cd ops
sh make.sh
cd ..
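Before launching training, a quick sanity check confirms the environment is consistent (the same check works for the segmentation and change detection environments by swapping the last import):

```python
# Environment sanity check for the detection setup.
import torch
import mmcv
import mmdet
import mmrotate

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmcv:', mmcv.__version__)
print('mmdet:', mmdet.__version__)
print('mmrotate:', mmrotate.__version__)
```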
Train on DOTA-v2.0 with 8 GPUs:
sh ./tools/dist_train.sh ViTP_configs/vitp_dotav2_orcnn.py 8
Generate the DOTA-v2.0 test-set submission:
sh ./tools/dist_test.sh ./ViTP_configs/vitp_dotav2_orcnn.py ./work_dirs/vitp_dotav2_orcnn/latest.pth 8 --format-only --eval-options submission_dir=./results/vitp_dotav2_orcnn
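For quick qualitative checks, a trained model can also be run on a single image through the standard MMDetection inference API (importing mmrotate registers the rotated detectors); the checkpoint and image paths below are placeholders:

```python
from mmdet.apis import init_detector, inference_detector
import mmrotate  # noqa: F401  # registers MMRotate models with the MMDetection registry

# Placeholder paths: use your trained checkpoint and a test image.
config_file = 'ViTP_configs/vitp_dotav2_orcnn.py'
checkpoint_file = 'work_dirs/vitp_dotav2_orcnn/latest.pth'

model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'demo.jpg')  # list of per-class detection arrays
print(len(result), 'classes')
```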
- Create a conda environment for semantic segmentation (mmseg):
cd ViTP/mmseg
conda create -n vitp-seg python==3.10
conda activate vitp-seg
- Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html
pip install -r requirements.txt
- Install flash attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
- Install mmcv:
cd ../mmcv
python setup.py install
cd ../mmseg
- Install mmsegmentation:
pip install -e .
- Compile the deformable attention ops:
cd ops
sh make.sh
cd ..
Train on iSAID with 8 GPUs:
sh ./tools/dist_train.sh ViTP_configs/vitp_isaid_upernet.py 8
Evaluate mIoU:
sh ./tools/dist_test.sh ./ViTP_configs/vitp_isaid_upernet.py ./work_dirs/vitp_isaid_upernet/latest.pth 8 --eval mIoU
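Single-image inference works the same way through the MMSegmentation 0.x API; the checkpoint and image paths below are placeholders:

```python
from mmseg.apis import init_segmentor, inference_segmentor

# Placeholder paths: use your trained checkpoint and a test image.
config_file = 'ViTP_configs/vitp_isaid_upernet.py'
checkpoint_file = 'work_dirs/vitp_isaid_upernet/latest.pth'

model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'demo.png')  # list with one (H, W) label map
print(result[0].shape)
```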
- Create a conda environment for change detection (open-cd):
cd ViTP/opencd
conda create -n vitp-cd python==3.10
conda activate vitp-cd
- Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -U openmim
mim install mmcv==2.0.0
mim install mmpretrain==1.2.0
pip install -r requirements.txt
- Install flash attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
- Install open-cd:
pip install -e .
Train on S2Looking with 8 GPUs:
sh ./tools/dist_train.sh ./ViTP_configs/vitp_s2looking_upernet.py 8
Test:
sh ./tools/dist_test.sh ./ViTP_configs/vitp_s2looking_upernet.py ./work_dirs/vitp_s2looking_upernet/iter_120000.pth 8
If you use this toolbox or benchmark in your research, please cite this project:

@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}

Licensed under a Creative Commons Attribution-NonCommercial 4.0 International license for non-commercial use only. Any commercial use requires formal permission first.

Related project: SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection


