Xincheng Shuai1 · Henghui Ding1 · Zhenyuan Qin1 · Hao Luo2,3 · Xingjun Ma1 · Dacheng Tao4
1Fudan University · 2DAMO Academy, Alibaba Group · 3Hupan Lab · 4Nanyang Technological University, Singapore
Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both the camera and objects in a 3D-aware manner. We therefore introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). SynFMC covers diverse object and environment categories and includes varied motion patterns generated according to specific rules, simulating common and complex real-world scenarios. Its complete 6D pose information enables models to learn to disentangle the motion effects of objects and the camera in a video. To provide precise 3D-aware motion control, we further propose Free-Form Motion Control (FMC), a method trained on SynFMC. FMC can control the 6D poses of objects and the camera independently or simultaneously, producing high-fidelity videos.
Figure 1. The rule-based video generation pipeline of the proposed Synthetic Dataset for Free-Form Motion Control (SynFMC). This example generates a synthetic video with three objects: (1) An environment asset and its matching object assets are selected as the scene elements. (2) Motion types for the objects and the camera are randomly selected for trajectory generation. (3) The center region shows the resulting 3D animation sequence used for rendering. The rendered video and its annotations are shown in the last row.
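For intuition, here is a minimal sketch of the rule-based selection steps described in the caption; the asset pools, motion types, and function names are hypothetical placeholders, not the actual SynFMC generation code.

```python
import random

# Hypothetical asset/motion pools; the real SynFMC pools are far larger.
ENVIRONMENTS = {
    "city_street": ["car", "bus", "pedestrian"],
    "ocean": ["boat", "whale", "seagull"],
}
MOTION_TYPES = ["static", "straight_line", "curve", "circle"]

def sample_scene(num_objects=3, seed=None):
    """(1) Pick an environment and matching object assets;
    (2) randomly pick a motion type for each object and the camera."""
    rng = random.Random(seed)
    env = rng.choice(sorted(ENVIRONMENTS))
    objects = rng.sample(ENVIRONMENTS[env], k=num_objects)
    motions = {name: rng.choice(MOTION_TYPES) for name in objects}
    motions["camera"] = rng.choice(MOTION_TYPES)
    return env, objects, motions

env, objects, motions = sample_scene(seed=0)
print(env, objects, motions)  # which assets move along which trajectory type
```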
Figure 2. The architecture of FMC. In the first stage, we randomly sample images from the synthetic videos and update the parameters of the injected Domain LoRA. Next, the CMC modules are learned; CMC consists of two parts, a Camera Encoder and a Camera Adapter, where the Camera Adapter is introduced into the temporal modules. Finally, we train the Object Encoder of the OMC. It receives the 6D object pose features, which are repeated over the corresponding object region. A Gaussian blur kernel centered at the object centroid removes the need for precise masks; the output is then multiplied by the coarse masks to modulate the features in the main branch.
```bash
conda env create -f environment.yaml
conda activate fmc
```

The training process of FMC consists of three stages.
In the first stage, we randomly sample images from the synthetic videos and update the parameters of the injected Domain LoRA.
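As a rough illustration of what the injected Domain LoRA looks like, here is a minimal sketch assuming standard low-rank adapters on frozen linear layers; the rank, scale, and initialization are illustrative assumptions, not FMC's actual settings. The stage-one training command follows below.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A pretrained linear layer kept frozen, plus a trainable low-rank
    update (B @ A). In stage one, only A and B would receive gradients."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```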
```bash
bash dist_run_lora.bash
```

Next, the CMC modules are learned. Inspired by CameraCtrl, CMC consists of two parts, a Camera Encoder and a Camera Adapter, where the Camera Adapter is introduced into the temporal modules.
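A hedged sketch of this idea, assuming per-frame camera poses are first rasterized into pose feature maps (e.g., Plücker-style ray embeddings, as in CameraCtrl) and then injected residually into the temporal modules; the channel counts and module shapes are illustrative assumptions. The stage-two command follows below.

```python
import torch
import torch.nn as nn

class CameraEncoder(nn.Module):
    """Maps per-frame camera pose maps (assumed 6-channel, e.g. Plücker-style
    ray embeddings) to features matching the temporal module width."""
    def __init__(self, in_ch: int = 6, dim: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, pose_maps):              # (B*F, 6, H, W)
        return self.net(pose_maps)             # (B*F, dim, H, W)

class CameraAdapter(nn.Module):
    """Zero-initialized projection that adds camera features to a temporal
    module's hidden states, so training starts from the unmodified model."""
    def __init__(self, dim: int = 320):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, 1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden, cam_feat):       # both (B*F, dim, H, W)
        return hidden + self.proj(cam_feat)
```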
```bash
bash dist_run_cam.bash
```

Finally, we train the Object Encoder of the OMC. It receives the 6D object pose features, which are repeated over the corresponding object region. A Gaussian blur kernel centered at the object centroid removes the need for precise masks; the output is then multiplied by the coarse masks to modulate the features in the main branch.
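A minimal sketch of this modulation, assuming a per-object pose feature vector, a normalized centroid, and a soft Gaussian region standing in for a precise mask; the sigma value and the residual injection are assumptions for illustration. The stage-three command follows below.

```python
import torch

def gaussian_mask(h, w, centroid, sigma=0.15):
    """Coarse soft mask: a 2D Gaussian centered at the object's normalized
    centroid, standing in for a precise segmentation mask."""
    ys = torch.linspace(0, 1, h).view(h, 1)
    xs = torch.linspace(0, 1, w).view(1, w)
    cy, cx = centroid
    return torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def modulate(features, pose_feat, centroid):
    """Repeat one object's 6D-pose feature over its coarse region and use
    the masked result to modulate the main-branch features."""
    c, h, w = features.shape
    mask = gaussian_mask(h, w, centroid)        # (H, W), soft object region
    injected = pose_feat.view(c, 1, 1) * mask   # pose feature repeated, then masked
    return features + injected

feats = torch.randn(320, 32, 32)                # main-branch features for one frame
out = modulate(feats, torch.randn(320), centroid=(0.4, 0.6))
```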
```bash
bash dist_run_obj.bash
```

If you find our work useful for your research or applications, please cite it using this BibTeX:
```bibtex
@inproceedings{SynFMC,
  title={{Free-Form Motion Control}: Controlling the 6D Poses of Camera and Objects in Video Generation},
  author={Shuai, Xincheng and Ding, Henghui and Qin, Zhenyuan and Luo, Hao and Ma, Xingjun and Tao, Dacheng},
  booktitle={ICCV},
  year={2025}
}
```