
Visual Instruction Pretraining for Domain-Specific Foundation Models

Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, and Jian Yang

This repository is the official implementation of "Visual Instruction Pretraining for Domain-Specific Foundation Models".

Abstract | Performance | Usage

Abstract

Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features has so far been underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at GitHub.



Figure 1: A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and then pretrained with a domain-specific instruction-following objective and Visual Robustness Learning (VRL). This process instils high-level semantic understanding into the ViT. The resulting weights are then used to initialize models for various downstream perception tasks.


Figure 2: (Left) The synergistic relationship between perception, generation, and reasoning in modern CV. Our proposed ViTP forges a novel link from high-level reasoning to low-level perception, a previously underexplored connection. (Right) Comparison of pretraining paradigms for ViT foundation models. ViTP employs an instruction-following objective to directly instil domain-specific perception capabilities into the vision backbone.


Pretrained Model

| Model | Parameters | Pretrain Dataset | Weights |
| --- | --- | --- | --- |
| ViTP_ViT_L_rs | 300M | modelscope / huggingface | ViTP_ViT_L_300M_rs |
| ViTP_ViT_L_med | 300M | | ViTP_ViT_L_300M_med |
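
The checkpoints can also be fetched from the command line. The following is a minimal sketch using the Hugging Face CLI; the repository ID is a placeholder, substitute the one behind the huggingface link in the table above:

pip install -U "huggingface_hub[cli]"
# <hf-repo-id> is a placeholder for the actual Hugging Face repository ID of the checkpoint
huggingface-cli download <hf-repo-id> --local-dir ./pretrained/ViTP_ViT_L_300M_rs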

Performance



Domain-Specific Finetuning

Prepare downstream task datasets

Remote Sensing Object Detection

Remote Sensing Semantic Segmentation

Remote Sensing Change Detection

Medical Imaging Semantic Segmentation
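
The finetuning configs follow the usual OpenMMLab convention of reading datasets from a local data/ directory. The exact paths are defined inside each config, so the layout below is only a sketch with placeholder paths, not a required structure:

mkdir -p data
# Symlink (or copy) each prepared dataset into data/; the names here are illustrative,
# check the dataset settings in the corresponding config for the expected path.
ln -s /path/to/DOTA-v2.0 data/DOTAv2
ln -s /path/to/iSAID data/iSAID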

Object Detection

Weights and logs are available at modelscope and HuggingFace.

| Dataset | Modality | Annotation Format | Method | mAP | Config |
| --- | --- | --- | --- | --- | --- |
| DIOR | RGB | Horizontal Box | Cascade-RCNN | 79.80 | Config |
| DIOR-R | RGB | Oriented Box | Oriented-RCNN | 75.08 | Config |
| DOTA-v2.0 | RGB | Oriented Box | Oriented-RCNN | 60.23 | Config |
| SARDet-100K | SAR | Horizontal Box | Cascade-RCNN | 57.9 | Config |
| SSDD | SAR | Horizontal Box | Mask-RCNN | 70.80 | Config |
| RSAR | SAR | Oriented Box | Oriented-RCNN | 72.31 | Config |

Semantic Segmentation

Weights and logs are available at modelscope and HuggingFace.

| Dataset | Modality | Annotation Format | Method | mIoU | Config |
| --- | --- | --- | --- | --- | --- |
| iSAID | RGB | Mask | UNet | 71.14 | Config |
| LoveDA | RGB | Mask | UperNet | 54.28 | Config |
| UAVid | RGB | Mask | UperNet | 73.39 | Config |
| SSDD | SAR | Polygons | UperNet | 65.90 (AP) | Config |

Change Detection

Weights and logs are available at modelscope and HuggingFace.

| Dataset | Modality | Annotation Format | Method | F1 | Config |
| --- | --- | --- | --- | --- | --- |
| SVCD | RGB | Mask | UperNet | 98.63 | Config |
| WHU | RGB | Mask | UNet | 94.98 | Config |
| LEVIR | RGB | Mask | UNet | 92.67 | Config |
| S2Looking | RGB | Mask | UNet | 69.89 | Config |

Usage

  • Clone this repository:
git clone https://github.com/zcablii/ViTP.git

Object Detection

Installation

  • Create a conda environment:
cd ViTP/mmrotate
conda create -n vitp-det python==3.10
conda activate vitp-det
  • Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html
pip install -r requirements.txt
  • Install flash attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
  • Install mmcv:
cd ../mmcv
python setup.py install
cd ../mmrotate
  • Install mmrotate:
pip install -e .
  • Compile deformable attention:
cd ops
sh make.sh
cd ..
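
Once the steps above complete, a quick import check helps confirm the environment before launching training. This is only a sanity-check sketch; mmdet is pulled in by mmrotate's requirements:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import mmcv, mmdet, mmrotate; print(mmcv.__version__, mmrotate.__version__)"
# verifies the flash-attention build
python -c "import flash_attn"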

Train

sh ./tools/dist_train.sh ViTP_configs/vitp_dotav2_orcnn.py 8 
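
For quick debugging on a single GPU, the non-distributed entry point can be used instead; this assumes the repository keeps the standard MMRotate tools/train.py script:

python ./tools/train.py ViTP_configs/vitp_dotav2_orcnn.py --work-dir ./work_dirs/vitp_dotav2_orcnn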

Test

sh ./tools/dist_test.sh ./ViTP_configs/vitp_dotav2_orcnn.py ./work_dirs/vitp_dotav2_orcnn/latest.pth 8 --format-only --eval-options submission_dir=./results/vitp_dotav2_orcnn
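
The command above writes DOTA-v2.0 submission files, since the DOTA test annotations are not public. For benchmarks whose test labels are available locally (e.g. DIOR), the standard MMRotate evaluation flag can be used instead; the config and checkpoint paths below are placeholders:

sh ./tools/dist_test.sh ./ViTP_configs/<config>.py ./work_dirs/<run>/latest.pth 8 --eval mAP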

Segmentation

Installation

  • Create a conda environment:
cd ViTP/mmseg
conda create -n vitp-seg python==3.10
conda activate vitp-seg
  • Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html
pip install -r requirements.txt
  • Install flash attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
  • Install mmcv:
cd ../mmcv
python setup.py install
cd ../mmseg
  • Install mmsegmentation:
pip install -e .
  • Compile deformable attention:
cd ops
sh make.sh
cd ..

Train

sh ./tools/dist_train.sh ViTP_configs/vitp_isaid_upernet.py 8 
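
The launcher takes the GPU count as its second argument. Which devices are used, and the rendezvous port for running several jobs on one machine, can be controlled with standard environment variables (a sketch, assuming the usual OpenMMLab dist_train.sh that reads PORT):

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29501 sh ./tools/dist_train.sh ViTP_configs/vitp_isaid_upernet.py 4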

Test

sh ./tools/dist_test.sh ./ViTP_configs/vitp_isaid_upernet.py ./work_dirs/vitp_isaid_upernet/latest.pth 8 --eval mIoU
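
To additionally dump qualitative segmentation results, the standard MMSegmentation --show-dir option can be appended, assuming the bundled dist_test.sh forwards extra arguments to tools/test.py as usual:

sh ./tools/dist_test.sh ./ViTP_configs/vitp_isaid_upernet.py ./work_dirs/vitp_isaid_upernet/latest.pth 8 --eval mIoU --show-dir ./results/vitp_isaid_upernet_vis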

Change Detection

Installation

  • Create a conda environment:
cd ViTP/opencd
conda create -n vitp-cd python==3.10
conda activate vitp-cd
  • Install the required packages:
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -U openmim
mim install mmcv==2.0.0
mim install mmpretrain==1.2.0
pip install -r requirements.txt
  • Install flash attention:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
pip install ninja
python setup.py install
cd ..
  • Install open-cd:
pip install -e .
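
A short import check can verify the change detection environment; the opencd module name follows the upstream Open-CD package, which is an assumption about the bundled copy:

python -c "import torch, mmcv, mmpretrain; print(mmcv.__version__, mmpretrain.__version__)"
python -c "import opencd"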

Train

sh ./tools/dist_train.sh ./ViTP_configs/vitp_s2looking_upernet.py 8
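
Interrupted runs can usually be resumed from the latest checkpoint in the work directory via the standard MMEngine resume flag (an assumption based on the mmcv 2.x / mmpretrain 1.x stack used here):

sh ./tools/dist_train.sh ./ViTP_configs/vitp_s2looking_upernet.py 8 --resume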

Test

sh ./tools/dist_test.sh ./ViTP_configs/vitp_s2looking_upernet.py ./work_dirs/vitp_s2looking_upernet/iter_120000.pth 8

Citation

If you use this toolbox or benchmark in your research, please cite this project.

@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}

License

Licensed under a Creative Commons Attribution-NonCommercial 4.0 International License for non-commercial use only. Any commercial use requires formal permission in advance.
