Yawen Luo1
Jianhong Bai2
Xiaoyu Shi3†
Menghan Xia3
Xintao Wang3
Pengfei Wan3
Di Zhang3
Kun Gai3
Tianfan Xue1†
1The Chinese University of Hong Kong
2Zhejiang University
3Kuaishou Technology
†Corresponding author
Note: This open-source repository is intended to provide a reference implementation. Because the underlying I2V model differs from the one used in our paper, the open-source version may not match the performance reported there.
- [2025.10.09]: Training and Inference Code, Model Checkpoints are available.
- [2025.09.25]: CamCloneMaster has been accepted to SIGGRAPH Asia 2025.
- [2025.09.08]: The CameraClone Dataset is available.
- [2025.06.03]: Released the project page and the arXiv version.
TL;DR: We propose CamCloneMaster, a framework that enables users to replicate camera movements from reference videos without requiring camera parameters or test-time fine-tuning. CamCloneMaster seamlessly supports reference-based camera control for both I2V and V2V tasks within a unified framework. We also release our CameraClone Dataset rendered with Unreal Engine 5.
The model utilized in our paper is an internally developed T2V model, not Wan2.1. Due to company policy restrictions, we are unable to open-source the model used in the paper.
Due to training cost limitations, we adapted the Wan2.1-T2V-1.3B model for Image-to-Video (I2V) generation. This was achieved by conditioning the first frame through channel concatenation, a method proposed in the Wan technical report, rather than using the larger Wan2.1-I2V-14B model. We then integrated CamCloneMaster with this adapted 1.3B model to validate our method's effectiveness. Please note that results may differ from the demo due to this difference in the underlying I2V model.
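As a rough illustration of the channel-concatenation conditioning described above, the sketch below broadcasts the first-frame latent across time and appends it, together with a frame mask, to the noisy latents. The helper name and tensor shapes are hypothetical, not the repository's actual API:

```python
import torch

def concat_first_frame_condition(noisy_latents, first_frame_latent):
    """Sketch of first-frame conditioning by channel concatenation.

    noisy_latents:      (B, C, T, H, W) latents being denoised
    first_frame_latent: (B, C, 1, H, W) VAE latent of the conditioning frame
    """
    B, C, T, H, W = noisy_latents.shape
    # Broadcast the first-frame latent across the time axis.
    cond = first_frame_latent.expand(B, C, T, H, W)
    # Binary mask marking which frames are conditioned (here: frame 0 only).
    mask = torch.zeros(B, 1, T, H, W, device=noisy_latents.device)
    mask[:, :, 0] = 1.0
    # Input channels grow from C to 2C + 1.
    return torch.cat([noisy_latents, cond, mask], dim=1)

x = concat_first_frame_condition(torch.randn(2, 16, 8, 30, 52),
                                 torch.randn(2, 16, 1, 30, 52))
print(x.shape)  # torch.Size([2, 33, 8, 30, 52])
```

Because the input channel count grows under this scheme, the DiT's patch-embedding layer has to be re-initialized accordingly before fine-tuning.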
DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Install DiffSynth-Studio:
git clone https://github.com/KwaiVGI/CamCloneMaster
cd CamCloneMaster
pip install -e .

- Download the pre-trained Wan2.1-T2V-1.3B models
cd CamCloneMaster
python download_wan2.1.py

- Download the adapted Wan2.1-I2V-1.3B and CamCloneMaster models
Please download the checkpoints from Hugging Face and place them in models/.
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/KwaiVGI/CamCloneMaster-Wan2.1

- Test Image-to-Video Generation with the Adapted Wan-1.3B-I2V Checkpoints
python inference_i2v.py --dataset_path demo/example_csv/infer/example_i2v_testset.csv --ckpt_path models/CamCloneMaster-Wan2.1/Wan-I2V-1.3B-Step8000.ckpt --output_dir demo/i2v_output

- Test Camera Controlled Image-to-Video Generation with CamCloneMaster Checkpoints
python inference_camclone.py --cameraclone_type i2v --dataset_path demo/example_csv/infer/example_camclone_testset.csv --ckpt_path models/CamCloneMaster-Wan2.1/CamCloneMaster-Step9500.ckpt --output_dir demo/camclone_i2v_output

- Test Camera Controlled Video-to-Video Re-Generation with CamCloneMaster Checkpoints
python inference_camclone.py --cameraclone_type v2v --dataset_path demo/example_csv/infer/example_camclone_testset.csv --ckpt_path models/CamCloneMaster-Wan2.1/CamCloneMaster-Step9500.ckpt --output_dir demo/camclone_v2v_output

To test your own videos, structure your test data according to the demo/example_csv/infer/example_camclone_testset.csv file. The required data will vary based on the generation mode:
- For Camera Controlled Image-to-Video Generation, you will need to provide:
  - ref_video_path: the reference video for camera motion.
  - first_frame_path: the initial frame of the target video.
  - caption: a description of the target video.
- For Camera Controlled Video-to-Video Re-generation, you will need to provide:
  - ref_video_path: the reference video for camera motion.
  - content_video_path: the reference video for the content.
  - caption: a description of the target video.
  - first_frame_path is not needed, as the system defaults to using the first frame of the content reference video.
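As a sketch of how such a test CSV could be assembled with pandas (paths and caption are placeholders; verify the exact column set against demo/example_csv/infer/example_camclone_testset.csv):

```python
import pandas as pd

# Illustrative rows only; file paths and captions are placeholders.
rows = [
    {   # Image-to-Video: camera reference + first frame + caption
        "ref_video_path": "my_data/camera_ref.mp4",
        "first_frame_path": "my_data/first_frame.png",
        "caption": "A man walks through a sunlit park.",
    },
    {   # Video-to-Video: camera reference + content video + caption
        "ref_video_path": "my_data/camera_ref.mp4",
        "content_video_path": "my_data/content.mp4",
        "caption": "A man walks through a sunlit park.",
    },
]
pd.DataFrame(rows).to_csv("my_testset.csv", index=False)
```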
Note: If your camera reference video is not at a 480x832 resolution, it will be automatically resized and cropped. Because camera motion is highly dependent on resolution, this can affect comparisons. For details on the resizing process, please refer to the CamCloneDataset class.
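The preprocessing described in this note can be sketched as a scale-to-cover followed by a center crop. This is only an approximation of what the CamCloneDataset class does (which would normally use a proper image-resize routine such as cv2.resize); the sketch below is kept dependency-free with nearest-neighbor sampling:

```python
import numpy as np

def resize_and_center_crop(frame, target_h=480, target_w=832):
    """Scale so the frame covers the target size, then center-crop.

    frame: (H, W, C) image array.
    """
    h, w = frame.shape[:2]
    scale = max(target_h / h, target_w / w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbor resize via index sampling.
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = frame[ys][:, xs]
    top = (new_h - target_h) // 2
    left = (new_w - target_w) // 2
    return resized[top:top + target_h, left:left + target_w]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(resize_and_center_crop(frame).shape)  # (480, 832, 3)
```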
To accurately compare the camera motion of the generated video with your reference video, you have two options:
- Pre-process the reference video: before inference, use the resize_and_crop_videos.py script to resize your camera motion reference video to 480x832.
- Use the visualization script: the vis_camclone_results.py script will automatically sample, resize, and crop your reference video (in the same way as the CamCloneDataset class) when it concatenates the reference and target videos for comparison.
pip install lightning pandas websockets

Download the CameraClone Dataset:
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/datasets/KwaiVGI/CameraClone-Dataset
cat CamCloneDataset.part* > CamCloneDataset.tar.gz
tar --zstd -xvf CamCloneDataset.tar.gz

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_camclone.py --dataset_path CameraClone-Dataset/CamCloneDataset.csv/ --output_path models/train

TL;DR: The Camera Clone Dataset, introduced in CamCloneMaster, is a large-scale synthetic dataset designed for camera clone learning, encompassing diverse scenes, subjects, and camera movements. It consists of triple video sets: a camera motion reference video, a content reference video, and a target video.
dataset.mp4
The Camera Clone Dataset is rendered using Unreal Engine 5. We collect 40 3D scenes as backgrounds and 66 characters as main subjects; each character is placed into a scene and paired with one random animation, such as running or dancing.
To construct the triple sets, camera trajectories must satisfy two key requirements: 1) Simultaneous Multi-View Capture: multiple cameras must film the same scene concurrently, each following a distinct trajectory. 2) Paired Trajectories: shots with the same camera trajectories must exist across different locations. Our implementation addresses these needs as follows: within any single location, 10 synchronized cameras operate simultaneously, each following one of ten unique, pre-defined trajectories to capture diverse views. To create paired trajectories, we group the 3D locations into sets of four, ensuring that the same ten camera trajectories are replicated across all locations within each set. The camera trajectories themselves are automatically generated using hand-designed rules covering various types, including basic movements, circular arcs, and more complex camera paths.
In total, the Camera Clone Dataset comprises 391K visually authentic videos shot from 39.1K different locations in 40 scenes with 97.75K diverse camera trajectories, and 1,155K triple video sets are constructed from these videos. Each video has a resolution of 576 x 1,008 and 77 frames.
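The figures above are mutually consistent, as a quick arithmetic check shows:

```python
# Back-of-the-envelope check of the dataset statistics.
locations = 39_100          # distinct 3D locations across 40 scenes
cameras_per_location = 10   # synchronized cameras per location
videos = locations * cameras_per_location
print(videos)               # 391000

# Locations are grouped into sets of four sharing the same 10 trajectories,
# so unique trajectories = (locations / 4) * 10.
unique_trajectories = locations // 4 * 10
print(unique_trajectories)  # 97750, i.e. the "97.75K" figure
```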
3D Environment: We collect 40 high-quality 3D environments assets from Fab. To minimize the domain gap between rendered data and real-world videos, we primarily select visually realistic 3D scenes, while choosing a few stylized or surreal 3D scenes as a supplement. To ensure data diversity, the selected scenes cover a variety of indoor and outdoor settings, such as city streets, shopping malls, cafes, office rooms, and the countryside.
Character: We collect 66 different human 3D models as characters from Fab and Mixamo.
Animation: We collect 93 different animations from Fab and Mixamo, including common actions such as waving, dancing, and cheering. We use these animations to drive the collected characters and create diverse datasets through various combinations.
Camera Trajectories: To prevent clipping, trajectories are constrained by a maximum movement distance. They fall into three categories:
- Basic: Simple pans/tilts (5Β°-75Β°), rolls (20Β°-340Β°), and translations along cardinal axes.
- Arc: Orbital paths, combining a primary rotation (10Β°-75Β°) with smaller, secondary rotations (5Β°-15Β°).
- Random: Smooth splines interpolated between 2-4 random keypoints. Half of these splines also incorporate multi-axis rotations.
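As a toy sketch of the "Random" rule above: interpolate camera positions smoothly between a few random keypoints. The actual generation rules are more elaborate, and max_dist here is a hypothetical bound standing in for the clipping constraint:

```python
import numpy as np

def random_spline_trajectory(n_frames=77, n_keypoints=3, max_dist=2.0, seed=0):
    """Smoothly interpolate camera positions between random keypoints.

    Returns an (n_frames, 3) array of XYZ positions starting at the origin.
    Uses smoothstep easing per segment as a stand-in for a true spline.
    """
    rng = np.random.default_rng(seed)
    keypoints = rng.uniform(-max_dist, max_dist, size=(n_keypoints, 3))
    keypoints[0] = 0.0  # start at the origin
    t = np.linspace(0, n_keypoints - 1, n_frames)
    seg = np.minimum(t.astype(int), n_keypoints - 2)  # segment index per frame
    u = t - seg                                        # position within segment
    u = u * u * (3 - 2 * u)                            # smoothstep easing
    return keypoints[seg] * (1 - u)[:, None] + keypoints[seg + 1] * u[:, None]

traj = random_spline_trajectory()
print(traj.shape)  # (77, 3)
```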
Dataset Statistics:
| Number of Dynamic Scenes | Camera per Scene | Total Videos | Number of Triple Sets |
|---|---|---|---|
| 39,100 | 10 | 391,000 | 1,154,819 |
Video Configurations:
| Resolution | Frame Number | FPS |
|---|---|---|
| 1344x768 | 77 | 15 |
| 1008x576 | 77 | 15 |
Note: You can center-crop the videos to match the aspect ratio your video generation model expects, such as 16:9, 9:16, 4:3, or 3:4.
Camera-Clone-Dataset
└── data
    ├── 0316
    │   └── traj_1_01
    │       ├── scene1_01.mp4
    │       ├── scene550_01.mp4
    │       ├── scene935_01.mp4
    │       └── scene1224_01.mp4
    ├── 0317
    ├── 0401
    ├── 0402
    ├── 0404
    ├── 0407
    └── 0410
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/datasets/KwaiVGI/CameraClone-Dataset
cd CameraClone-Dataset
cat CamCloneDataset.part* > CamCloneDataset.tar.gz
tar --zstd -xvf CamCloneDataset.tar.gz

The "Triple Sets" information is located in the CamCloneDataset.csv file, which contains the following columns:
- video_path: The path to the target video.
- caption: A description of the target video.
- ref_video_path: The path to the camera reference video.
- content_video_path: The path to the content reference video.
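A minimal sketch of consuming these columns with pandas. The row below is an illustrative stand-in with a made-up caption; in practice, load CameraClone-Dataset/CamCloneDataset.csv directly:

```python
import pandas as pd

# Illustrative in-memory stand-in for CamCloneDataset.csv; in practice use
# pd.read_csv("CameraClone-Dataset/CamCloneDataset.csv").
df = pd.DataFrame([{
    "video_path": "data/0316/traj_1_01/scene1_01.mp4",
    "caption": "A character runs through a city street.",  # placeholder
    "ref_video_path": "data/0316/traj_1_01/scene550_01.mp4",
    "content_video_path": "data/0316/traj_1_01/scene935_01.mp4",
}])

# Each row is one triple set: camera reference, content reference, target.
for row in df.itertuples():
    print(row.ref_video_path, "->", row.video_path)
```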
Please leave us a star ⭐ and cite our paper if you find our work helpful.
@misc{luo2025camclonemaster,
title={CamCloneMaster: Enabling Reference-based Camera Control for Video Generation},
author={Yawen Luo and Jianhong Bai and Xiaoyu Shi and Menghan Xia and Xintao Wang and Pengfei Wan and Di Zhang and Kun Gai and Tianfan Xue},
year={2025},
eprint={2506.03140},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.03140},
}

