CamCloneMaster: Enabling Reference-based Camera Control for Video Generation

SIGGRAPH Asia 2025

Yawen Luo¹ Jianhong Bai² Xiaoyu Shi³✉ Menghan Xia³ Xintao Wang³ Pengfei Wan³
Di Zhang³ Kun Gai³ Tianfan Xue¹✉

¹The Chinese University of Hong Kong ²Zhejiang University
³Kuaishou Technology ✉Corresponding author

Note: This open-source repository is intended to provide a reference implementation. Due to the difference in the underlying I2V model's performance, the open-source version may not achieve the same performance as the model in our paper.

🔥 Updates

[2025.10.09]: Training and Inference Code, Model Checkpoints are available.
[2025.09.25]: CamCloneMaster has been accepted by SIGGRAPH Asia 2025.
[2025.09.08]: CameraClone Dataset is available.
[2025.06.03]: Release the Project Page and the arXiv version.

📷 Introduction

TL;DR: We propose CamCloneMaster, a framework that enables users to replicate camera movements from reference videos without requiring camera parameters or test-time fine-tuning. CamCloneMaster seamlessly supports reference-based camera control for both I2V and V2V tasks within a unified framework. We also release our CameraClone Dataset rendered with Unreal Engine 5.

⚙️ Code: CamCloneMaster + Wan2.1 (Inference & Training)

The model utilized in our paper is an internally developed T2V model, not Wan2.1. Due to company policy restrictions, we are unable to open-source the model used in the paper.

Due to training cost limitations, we adapted the Wan2.1-T2V-1.3B model for Image-to-Video (I2V) generation. This was achieved by conditioning the first frame through channel concatenation, a method proposed in the Wan technical report, rather than using the larger Wan2.1-I2V-14B model. We then integrated CamCloneMaster with this adapted 1.3B model to validate our method's effectiveness. Please note that results may differ from the demo due to this difference in the underlying I2V model.

Inference

Step 1: Set up the environment

DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Install DiffSynth-Studio:

git clone https://github.com/KwaiVGI/CamCloneMaster
cd CamCloneMaster
pip install -e .

Step 2: Download the pretrained checkpoints

Download the pre-trained Wan2.1-T2V-1.3B models

cd CamCloneMaster
python download_wan2.1.py

Download the adapted Wan2.1-I2V-1.3B and CamCloneMaster models

Please download the checkpoints from Hugging Face and place them in models/.

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/KwaiVGI/CamCloneMaster-Wan2.1

Step 3: Test Adapted Wan-1.3B-I2V & CamCloneMaster on example videos

Test Image-to-Video Generation with Adapted Wan-1.3B-I2V Checkpoints

python inference_i2v.py --dataset_path demo/example_csv/infer/example_i2v_testset.csv --ckpt_path models/CamCloneMaster-Wan2.1/Wan-I2V-1.3B-Step8000.ckpt --output_dir demo/i2v_output

Test Camera Controlled Image-to-Video Generation with CamCloneMaster Checkpoints

python inference_camclone.py --cameraclone_type i2v --dataset_path demo/example_csv/infer/example_camclone_testset.csv --ckpt_path models/CamCloneMaster-Wan2.1/CamCloneMaster-Step9500.ckpt --output_dir demo/camclone_i2v_output

Test Camera Controlled Video-to-Video Re-Generation with CamCloneMaster Checkpoints

python inference_camclone.py --cameraclone_type v2v --dataset_path demo/example_csv/infer/example_camclone_testset.csv --ckpt_path models/CamCloneMaster-Wan2.1/CamCloneMaster-Step9500.ckpt --output_dir demo/camclone_v2v_output

Step 4: Test your own videos

To test your own videos, structure your test data according to the demo/example_csv/infer/example_camclone_testset.csv file. The required data will vary based on the generation mode:

For Camera Controlled Image-to-Video Generation, you will need to provide:
- ref_video_path: The reference video for camera motion.
- first_frame_path: The initial frame of the target video.
- caption: A description of the target video.

For Camera Controlled Video-to-Video Re-generation, you will need to provide:
- ref_video_path: The reference video for camera motion.
- content_video_path: The reference video for the content.
- caption: A description of the target video.
- The first_frame_path is not needed, as the system defaults to using the first frame of the content reference video.
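
As a minimal sketch of preparing such a file, the snippet below writes a test CSV with Python's standard csv module. The column names come from the lists above; the file paths, the captions, the column order, and the choice to leave unused columns empty are assumptions, so check them against demo/example_csv/infer/example_camclone_testset.csv before use.

```python
import csv
import io

# Column names taken from the field lists above. Whether the loader accepts
# empty cells or this column order is an assumption.
FIELDS = ["ref_video_path", "content_video_path", "first_frame_path", "caption"]


def write_testset(rows, stream):
    """Write test rows to a text stream, leaving missing fields empty."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical paths and captions, for illustration only.
rows = [
    {   # I2V: camera reference + first frame + caption
        "ref_video_path": "my_data/camera_ref.mp4",
        "first_frame_path": "my_data/first_frame.png",
        "caption": "A woman walks through a quiet street.",
    },
    {   # V2V: camera reference + content reference + caption
        "ref_video_path": "my_data/camera_ref.mp4",
        "content_video_path": "my_data/content_ref.mp4",
        "caption": "A man dances in a cafe.",
    },
]

buffer = io.StringIO()
write_testset(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open("my_testset.csv", "w", newline="")` as the stream.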

Note: If your camera reference video is not at a 480x832 resolution, it will be automatically resized and cropped. Because camera motion is highly dependent on resolution, this can affect comparisons. For details on the resizing process, please refer to the CamCloneDataset class.

To accurately compare the camera motion of the generated video with your reference video, you have two options:

- Pre-process the reference video: Before inference, use the resize_and_crop_videos.py script to resize your camera motion reference video to 480x832.
- Use the visualization script: The vis_camclone_results.py script automatically samples, resizes, and crops your reference video, in the same way as the CamCloneDataset class, when it concatenates the reference and target videos for comparison.
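
To give a feel for what a resize-and-crop step does to the frame, here is a small sketch of the geometry for a "cover" resize followed by a center crop to 480x832 (height x width). This is an assumption about the general technique, not a copy of the logic in the CamCloneDataset class or resize_and_crop_videos.py; the function name is hypothetical.

```python
def resize_and_center_crop_box(src_h, src_w, dst_h=480, dst_w=832):
    """Return (resized_h, resized_w, top, left) for a cover-resize + center crop.

    The frame is first scaled so that it fully covers the target size, then
    the overflow is trimmed equally from both sides.
    """
    # Use the larger scale factor so neither dimension ends up too small.
    scale = max(dst_h / src_h, dst_w / src_w)
    resized_h = max(dst_h, round(src_h * scale))
    resized_w = max(dst_w, round(src_w * scale))
    # Offsets of the crop window inside the resized frame.
    top = (resized_h - dst_h) // 2
    left = (resized_w - dst_w) // 2
    return resized_h, resized_w, top, left


# A 576x1008 dataset video is scaled to 480x840, then 4 columns are
# trimmed from each side to reach 480x832.
print(resize_and_center_crop_box(576, 1008))
```

Because the crop discards part of the field of view, the same physical camera move can look slightly different before and after this step, which is why comparing against the processed reference is recommended.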

Training

Step 1: Set up the environment

pip install lightning pandas websockets

Step 2: Prepare the training dataset

Download the CameraClone Dataset.

sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/datasets/KwaiVGI/CameraClone-Dataset
cd CameraClone-Dataset
cat CamCloneDataset.part* > CamCloneDataset.tar.gz
tar --zstd -xvf CamCloneDataset.tar.gz
cd ..

Step 3: Training CamCloneMaster

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_camclone.py --dataset_path CameraClone-Dataset/CamCloneDataset.csv --output_path models/train

📷 Dataset: Camera Clone Dataset

1. Dataset Introduction

TL;DR: The Camera Clone Dataset, introduced in CamCloneMaster, is a large-scale synthetic dataset designed for camera clone learning, encompassing diverse scenes, subjects, and camera movements. It consists of triple video sets: a camera motion reference video $V_{cam}$, a content reference video $V_{cont}$, and a target video $V$, which recaptures the scene in $V_{cont}$ with the same camera movement as $V_{cam}$.

dataset.mp4

The Camera Clone Dataset is rendered using Unreal Engine 5. We collect 40 3D scenes as backgrounds and 66 characters, which we place into the 3D scenes as main subjects. Each character is combined with one random animation, such as running or dancing.

To construct the triple sets, camera trajectories must satisfy two key requirements: 1) Simultaneous Multi-View Capture: multiple cameras must film the same scene concurrently, each following a distinct trajectory. 2) Paired Trajectories: shots taken at different locations must share the same camera trajectory. Our implementation addresses these needs as follows. Within any single location, 10 synchronized cameras operate simultaneously, each following one of 10 unique, pre-defined trajectories to capture diverse views. To create paired trajectories, we group 3D locations in scenes into sets of four and replicate the same 10 camera trajectories across all locations within each set. The camera trajectories themselves are generated automatically from a set of designed rules covering basic movements, circular arcs, and more complex camera paths.

In total, the Camera Clone Dataset comprises 391K visually authentic videos shot from 39.1K different locations in 40 scenes with 97.75K diverse camera trajectories, from which 1,155K triple video sets are constructed. Each video has a resolution of 576 x 1,008 and 77 frames.
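
The headline counts follow from the capture setup described above. The short check below reproduces them; it assumes that every location uses exactly 10 cameras and that each set of four locations gets its own 10 trajectories, which is our reading of the description rather than something the dataset files state explicitly.

```python
# Figures quoted in the dataset description.
LOCATIONS = 39_100
CAMERAS_PER_LOCATION = 10
LOCATIONS_PER_SET = 4

# One video per camera per location.
total_videos = LOCATIONS * CAMERAS_PER_LOCATION

# Locations are grouped into sets of four that share the same trajectories.
location_sets = LOCATIONS // LOCATIONS_PER_SET

# Assumption: each set of locations has its own 10 trajectories.
distinct_trajectories = location_sets * CAMERAS_PER_LOCATION

print(total_videos, location_sets, distinct_trajectories)
```

The results, 391,000 videos and 97,750 trajectories, match the 391K and 97.75K figures quoted above.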

3D Environment: We collect 40 high-quality 3D environment assets from Fab. To minimize the domain gap between rendered data and real-world videos, we primarily select visually realistic 3D scenes, while choosing a few stylized or surreal 3D scenes as a supplement. To ensure data diversity, the selected scenes cover a variety of indoor and outdoor settings, such as city streets, shopping malls, cafes, office rooms, and the countryside.

Character: We collect 66 different human 3D models as characters from Fab and Mixamo.

Animation: We collect 93 different animations from Fab and Mixamo, including common actions such as waving, dancing, and cheering. We use these animations to drive the collected characters and create diverse datasets through various combinations.

Camera Trajectories: To prevent clipping, trajectories are constrained by a maximum movement distance $d_{max}$, determined by the initial shot position in the scene. Trajectory types include:

- Basic: Simple pans/tilts (5°-75°), rolls (20°-340°), and translations along cardinal axes.
- Arc: Orbital paths, combining a primary rotation (10°-75°) with smaller, secondary rotations (5°-15°).
- Random: Smooth splines interpolated between 2-4 random keypoints. Half of these splines also incorporate multi-axis rotations.
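
As a rough illustration of what a rule for the "Basic" category could look like, the sketch below samples a pan within the 5°-75° range and spreads it evenly over the 77 frames of a video. The constant angular speed, the random direction, and the function name are assumptions made for this example; the actual generation rules used to build the dataset are not published in this README.

```python
import random

NUM_FRAMES = 77  # frames per video, from the dataset description


def sample_basic_pan(rng, num_frames=NUM_FRAMES):
    """Return per-frame yaw angles in degrees for a constant-speed pan."""
    total_angle = rng.uniform(5.0, 75.0)  # pan range quoted above
    direction = rng.choice([-1.0, 1.0])   # left or right (assumption)
    step = direction * total_angle / (num_frames - 1)
    # Frame 0 is the initial shot orientation; later frames rotate away from it.
    return [step * i for i in range(num_frames)]


yaw = sample_basic_pan(random.Random(0))
print(len(yaw), yaw[0])
```

A translation rule would need an extra step that caps the total displacement at $d_{max}$ for the chosen start position.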

2. Statistics and Configurations

Dataset Statistics:

Number of Dynamic Scenes	Cameras per Scene	Total Videos	Number of Triple Sets
39,100	10	391,000	1,154,819

Video Configurations:

Resolution	Frame Number	FPS
1344x768	77	15
1008x576	77	15

Note: You can use 'center crop' to adjust the video's aspect ratio to fit your video generation model, such as 16:9, 9:16, 4:3, or 3:4.
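
A center crop to a given aspect ratio can be computed as in the sketch below. The function name and return layout are hypothetical; the resulting box can be passed to whatever video tool you use for the actual cropping.

```python
def center_crop_box(width, height, aspect_w, aspect_h):
    """Return (left, top, crop_w, crop_h) of the largest centered crop
    whose aspect ratio is aspect_w:aspect_h."""
    # Compare width/height with aspect_w/aspect_h without floating point.
    if width * aspect_h > height * aspect_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * aspect_w // aspect_h
    else:
        # Source is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * aspect_h // aspect_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, crop_w, crop_h


# Crop a 1344x768 video to 4:3 and to 9:16.
print(center_crop_box(1344, 768, 4, 3))
print(center_crop_box(1344, 768, 9, 16))
```

Integer division can leave the crop one pixel short of the exact ratio; round the dimensions to even values if your encoder requires them.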

3. File Structure

Camera-Clone-Dataset
└── data
    ├── 0316
    │   └── traj_1_01
    │       ├── scene1_01.mp4
    │       ├── scene550_01.mp4
    │       ├── scene935_01.mp4
    │       └── scene1224_01.mp4
    ├── 0317
    ├── 0401
    ├── 0402
    ├── 0404
    ├── 0407
    └── 0410

4. Use Dataset

sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/datasets/KwaiVGI/CameraClone-Dataset
cd CameraClone-Dataset
cat CamCloneDataset.part* > CamCloneDataset.tar.gz
tar --zstd -xvf CamCloneDataset.tar.gz

The "Triple Sets" information is located in the CamCloneDataset.csv file, which contains the following columns:

video_path: The path to the target video.
caption: A description of the target video.
ref_video_path: The path to the camera reference video.
content_video_path: The path to the content reference video.
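
The following sketch shows one way to read these columns into triples with Python's standard csv module. The column names are the ones listed above; the `Triple` type, the function name, and the sample row are hypothetical and only serve to keep the example self-contained.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Triple(NamedTuple):
    target: str       # video_path
    camera_ref: str   # ref_video_path
    content_ref: str  # content_video_path
    caption: str


def iter_triples(stream: TextIO) -> Iterator[Triple]:
    """Yield one Triple per CSV row, keyed by the column names listed above."""
    for row in csv.DictReader(stream):
        yield Triple(
            target=row["video_path"],
            camera_ref=row["ref_video_path"],
            content_ref=row["content_video_path"],
            caption=row["caption"],
        )


# Tiny in-memory stand-in for CamCloneDataset.csv; the paths are made up.
sample = io.StringIO(
    "video_path,caption,ref_video_path,content_video_path\n"
    "data/a.mp4,A person waving,data/b.mp4,data/c.mp4\n"
)
triples = list(iter_triples(sample))
print(triples[0])
```

For the real file, replace the in-memory sample with `open("CamCloneDataset.csv", newline="")`.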

🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

@misc{luo2025camclonemaster,
      title={CamCloneMaster: Enabling Reference-based Camera Control for Video Generation}, 
      author={Yawen Luo and Jianhong Bai and Xiaoyu Shi and Menghan Xia and Xintao Wang and Pengfei Wan and Di Zhang and Kun Gai and Tianfan Xue},
      year={2025},
      eprint={2506.03140},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.03140}, 
}
