GenTron: Diffusion Transformers for Image and Video Generation
Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua
The University of Hong Kong, Meta
This repository contains:
- 🪐 A simple PyTorch implementation of Text-to-Image GenTron
- 🪐 A simple PyTorch implementation of Text-to-Video GenTron
- ⚡️ An ImageNet feature extraction script
- 🛸 A GenTron training script
- 🛸 A GenTron training script using stored features.
conda create -n gentron python=3.10
conda activate gentron
pip install -r requirements.txt
python sample.py --image_size 512 --seed 1
python sample.py --model GenTron-T2I-XL/2 --image_size 256 --ckpt /path/to/model.pt
python sample_t2v.py --model GenTron-T2V-XL/2 --image_size 256 --ckpt /path/to/model.pt

| GenTron Model | Train Steps | Image Resolution |
|---|---|---|
| B/2 | 150000 | 256x256 | 
torchrun --nnodes=1 --nproc_per_node=1 extract_features.py --data_path /path/to/ImageNet/train --features_path /path/to/ImageNet/features
Train GenTron-T2I model directly.
accelerate launch --mixed_precision fp16 train.py --model GenTron-T2I-XL/2 --data_path /path/to/ImageNet/train
accelerate launch --multi_gpu --num_processes N --mixed_precision fp16 train.py --model GenTron-T2I-XL/2 --data_path /path/to/ImageNet/train
Train GenTron-T2I model with extracted features.
accelerate launch --mixed_precision fp16 train_v2.py --model GenTron-T2I-XL/2 --features_path /path/to/ImageNet/features
accelerate launch --multi_gpu --num_processes N --mixed_precision fp16 train_v2.py --model GenTron-T2I-XL/2 --features_path /path/to/ImageNet/features
WebVid-10M Dataset.
The code assumes the WebVid data is structured as follows:
Webvid/
    videos/
        000001_000050/      ($page_dir)
            1.mp4           (videoid.mp4)
            ...
            5000.mp4
        ...
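The layout above implies that each video lives at `<data_dir>/videos/<page_dir>/<videoid>.mp4`. The following is a minimal sketch of that mapping, useful for checking a download before training. The helper names `webvid_video_path` and `list_missing` are hypothetical and not part of this repository, and the assumption that each metadata row supplies a `page_dir` and a `videoid` is inferred from the annotations in the tree, not from the training code.

```python
from pathlib import Path


def webvid_video_path(data_dir, page_dir, videoid):
    """Build <data_dir>/videos/<page_dir>/<videoid>.mp4 (hypothetical helper)."""
    return Path(data_dir) / "videos" / str(page_dir) / f"{videoid}.mp4"


def list_missing(data_dir, rows):
    """Return the (page_dir, videoid) pairs whose video file is not on disk.

    `rows` is any iterable of (page_dir, videoid) pairs; how these are read
    from the metadata file is left to the caller (assumption).
    """
    return [
        (page_dir, videoid)
        for page_dir, videoid in rows
        if not webvid_video_path(data_dir, page_dir, videoid).is_file()
    ]


if __name__ == "__main__":
    # Example with the placeholder path and the first entry of the tree above.
    print(webvid_video_path("/path/to/webvid", "000001_000050", 1).as_posix())
    print(list_missing("/path/to/webvid", [("000001_000050", 1)]))
```

Running such a check on a small sample of rows first is a cheap way to catch a wrong `--data_dir` before launching a multi-GPU job.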
MSR-VTT Dataset.
The official data and video links can be found in link.
For convenience, you can also download the splits and captions with:
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip
The raw videos are also shared by the Frozen in Time project:
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
Train GenTron-T2V model directly.
accelerate launch --multi_gpu --num_processes N --mixed_precision fp16 train_t2v.py --model GenTron-T2V-XL/2 --meta_path /path/to/webvid/results_10M_train.csv --data_dir /path/to/webvid
accelerate launch --multi_gpu --num_processes N --mixed_precision fp16 train_t2v.py --model GenTron-T2V-XL/2 --meta_path /path/to/msrvtt_data/MSRVTT_data.json --data_dir /path/to/MSRVTT

@article{chen2023gentron,
  title={Gentron: Delving deep into diffusion transformers for image and video generation},
  author={Chen, Shoufa and Xu, Mengmeng and Ren, Jiawei and Cong, Yuren and He, Sen and Xie, Yanping and Sinha, Animesh and Luo, Ping and Xiang, Tao and Perez-Rua, Juan-Manuel},
  journal={arXiv preprint arXiv:2312.04557},
  year={2023}
}
