WACV 2025
Anh-Quan Cao, Maximilian Jaritz, Matthieu Guillaumin, Raoul de Charette, Loris Bazzani
If you find this work or code useful, please cite our paper and give this repo a star:
```
@InProceedings{cao2024latteclip,
  title     = {LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts},
  author    = {Anh-Quan Cao and Maximilian Jaritz and Matthieu Guillaumin and Raoul de Charette and Loris Bazzani},
  year      = {2024},
  booktitle = {arXiv}
}
```
- 17/12/2024: code is released.
- 14/10/2024: code will be available soon.
Follow these steps to install the necessary dependencies:
Create a new conda environment and install the dependencies:

```
conda create -n latteclip python=3.10
conda activate latteclip
```

Navigate to the latteclip directory and run the following commands:

```
make install
make install-training
```

To install LLaVA, follow the official instructions here:

```
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
```

Create a folder to store the data and set its path in the bash variable $LATTECLIP_DATA_DIR:

```
mkdir -p /path/to/data
export LATTECLIP_DATA_DIR=/path/to/data
```

Download the data from this link and extract all files into $LATTECLIP_DATA_DIR.
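Before preprocessing, it can help to confirm that the variable actually points at an existing directory. A minimal sketch — the helper name `check_latteclip_data_dir` is ours, not part of the repository:

```shell
# Hypothetical helper (not a repository script): verify that
# LATTECLIP_DATA_DIR is set and points to an existing directory
# before running any preprocessing or training scripts.
check_latteclip_data_dir() {
  if [ -z "${LATTECLIP_DATA_DIR:-}" ]; then
    echo "LATTECLIP_DATA_DIR is not set" >&2
    return 1
  fi
  if [ ! -d "$LATTECLIP_DATA_DIR" ]; then
    echo "missing directory: $LATTECLIP_DATA_DIR" >&2
    return 1
  fi
  echo "data directory OK: $LATTECLIP_DATA_DIR"
}
```

After the `export` above, running `check_latteclip_data_dir` should print the OK line.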
Navigate to the latteclip directory and run the preprocessing script to create the webdataset tar files and extract the CLIP features:

```
cd latteclip
bash scripts/preprocess/preprocess.sh
```

To generate image descriptions, run the following command:

```
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh $MACHINE_ID $NUM_MACHINE classname_dtd dtd $NUM_PROCESSES_PER_GPU $NUM_GPUS
```

For example, assume you have 2 machines, 1 GPU per machine, and 5 generation processes per Tesla V100 32 GB GPU.

Machine 0:

```
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 2 classname_dtd dtd 5 1
```

Machine 1:

```
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 1 2 classname_dtd dtd 5 1
```
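The single-machine, per-dataset invocations that follow differ only in the prompt name and the dataset, so they can also be generated with a small loop. A dry-run sketch — the helper name `print_caption_cmds` is ours, not a repository script — that prints each command; pipe its output to `bash` to actually launch the jobs:

```shell
# Dry-run sketch: print the caption-extraction command for every
# prompt/dataset pair (machine 0 of 1, 5 processes per GPU, 1 GPU).
print_caption_cmds() {
  for pair in classname_dtd:dtd classname_eurosat:eurosat \
              classname_scene:sun397 classname_flower:flower102 \
              classname_food101:food101 classname_pets:oxford_pets \
              classname_car:stanford_cars classname_ufc:ucf101 \
              classname_caltech:caltech101; do
    prompt=${pair%%:*}    # text before the colon
    dataset=${pair#*:}    # text after the colon
    echo "bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 $prompt $dataset 5 1"
  done
}

print_caption_cmds
```

The same pattern applies to `extract_captions_llava_compare.sh` by swapping the script name and prompt names.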
To generate descriptions for all datasets on a single machine, use the following commands:

```
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_dtd dtd 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_eurosat eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_scene sun397 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_flower flower102 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_food101 food101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_pets oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_car stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_ufc ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_caltech caltech101 5 1
```

Generating comparative descriptions with extract_captions_llava_compare.sh follows a similar process. Use the following commands:
```
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 dtd_describe_common_v3 dtd 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 eurosat_describe_common_v3 eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 sun397_describe_common_v3 sun397 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 flower102_describe_common_v3 flower102 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 food101_describe_common_v3 food101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 pets_describe_common_v3 oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 car_describe_common_v3 stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 ufc_describe_common_v3 ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 caltech_describe_common_v3 caltech101 5 1
```

To train the model on dtd, run:
```
bash scripts/unsupervised/dtd/dtd_fine_tune_multiclass.sh $lr $class_per_image $device $port $seed $exp_name
```

- `$lr`: learning rate
- `$class_per_image`: number of classes per image (always set to 1)
- `$device`: device ID
- `$port`: port for the job (not used)
- `$seed`: random seed
- `$exp_name`: experiment name
For example, to train with learning rate 1e-7, on device 0, with port 25680, random seed 1, and experiment name exp_dtd:

```
bash scripts/unsupervised/dtd_fine_tune_multiclass.sh 1e-7 1 0 25680 1 exp_dtd
```

The commands for the other datasets are:

```
bash scripts/unsupervised/eurosat_fine_tune_multiclass.sh 1e-7 1 0 25666 1 exp_eurosat
bash scripts/unsupervised/caltech101_fine_tune_multiclass.sh 1e-7 1 0 25665 1 exp_caltech101
bash scripts/unsupervised/fgvc_aircraft/fgvc_aircraft_fine_tune_multiclass.sh 1e-7 1 0 25667 1 exp_fgvc_aircraft
bash scripts/unsupervised/flower102_fine_tune_multiclass.sh 1e-7 1 0 25668 1 exp_flower102
bash scripts/unsupervised/food101_fine_tune_multiclass.sh 1e-7 1 0 25669 1 exp_food101
bash scripts/unsupervised/oxford_pets_fine_tune_multiclass.sh 1e-7 1 0 25670 1 exp_oxford_pets
bash scripts/unsupervised/stanford_cars/stanford_cars_fine_tune_multiclass.sh 1e-7 1 0 25671 1 exp_stanford_cars
bash scripts/unsupervised/sun397_fine_tune_multiclass.sh 1e-7 1 0 25672 1 exp_sun397
bash scripts/unsupervised/ucf101_fine_tune_multiclass.sh 1e-7 1 0 25673 1 exp_ucf101
```

Note: logs will be stored in the `logs` folder.
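The per-dataset fine-tuning commands above share one pattern — a script path, fixed hyperparameters, and a distinct port per job — so they too can be generated. A dry-run sketch under that assumption (the helper name is ours, not a repository script); pipe its output to `bash` to launch the runs sequentially:

```shell
# Dry-run sketch: print one fine-tuning command per dataset, mirroring
# the script paths and ports listed above. Ports are assigned
# consecutively starting from 25665.
print_finetune_cmds() {
  port=25665
  for script in caltech101 eurosat fgvc_aircraft/fgvc_aircraft flower102 \
                food101 oxford_pets stanford_cars/stanford_cars sun397 ucf101; do
    # exp name is the last path component, prefixed with "exp_"
    echo "bash scripts/unsupervised/${script}_fine_tune_multiclass.sh 1e-7 1 0 $port 1 exp_${script##*/}"
    port=$((port + 1))
  done
}

print_finetune_cmds
```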
This repository is built upon OpenCLIP and LLaVA.
The research was conducted mainly during Quan’s internship at Amazon. The research was also supported by the ANR project SIGHT (ANR-20-CE23-0016) and SAMBA collaborative project co-funded by BpiFrance in the Investissement d’Avenir Program. Computation was performed partly using HPC resources from GENCI–IDRIS (AD011012808R2, AD011014102R1). We thank Ajanthan Thalaiyasingam and Mohammad Fahes for their insightful suggestions. We also extend our gratitude to Mohammad Fahes and Ivan Lopes for their thorough proofreading.