Authors
Mariam Hassan★1, Sebastian Stapf★2, Ahmad Rahimi★1, Pedro M. B. Rezende★2, Yasaman Haghighi♦1
David Brüggemann♦3, Isinsu Katircioglu♦3, Lin Zhang♦3, Xiaoran Chen♦3, Suman Saha♦3
Marco Cannici♦4, Elie Aljalbout♦4, Botao Ye♦5, Xi Wang♦5, Aram Davtyan2
Mathieu Salzmann1,3, Davide Scaramuzza4, Marc Pollefeys5, Paolo Favaro2, Alexandre Alahi1
1École Polytechnique Fédérale de Lausanne (EPFL), 2University of Bern, 3Swiss Data Science Center, 4University of Zurich, 5ETH Zurich
★ Main Contributors ♦ Data Contributors
Welcome to the official repository for GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control. This repository contains the codebase, data processing pipeline, and instructions for training, sampling, and evaluating our world model.
GEM is a multimodal world model designed for:
- Fine-grained ego-motion modeling.
- Capturing object dynamics in complex environments.
- Flexible scene composition control in ego-centric vision tasks.
This repository allows researchers and developers to reproduce our results and further explore the capabilities of our world model.
Ensure you have the following installed:
- Python >= 3.8
- PyTorch >= 1.12 with CUDA support
- Other dependencies listed in requirements.txt
- Clone the repository:
git clone https://github.com/your_username/gem-world-model.git
cd gem-world-model
- Install dependencies:
pip install -r requirements.txt
- (Optional) Set up a conda environment:
conda create -n gem_env python=3.8
conda activate gem_env
pip install -r requirements.txt
We have released the model weights of GEM v1.0 on HuggingFace.
To get started quickly, we provide sample data you can use to run training, sampling, or evaluation.
Alternatively, you can follow the steps below to prepare your own dataset.
- Obtain the dataset:
  - Download the datasets from their respective websites (e.g. BDD100K).
  - Convert the videos to individual frames using data_processing/fast_ffmpeg.py. Example usage:
python preprocess_opendv/fast_ffmpeg.py --video_dir VIDEOS_PATH --output_dir IMAGES_PATH
  - Gather the frames into .h5 files using data_processing/convert_to_h5.py. Example usage:
python preprocess_opendv/fast_h5.py --data_dir IMAGES_PATH --out_dir H5_PATH
- To generate the pseudo-labels (depth, trajectories and calibration), see the instructions in the README.
- The final directory structure should look like:
data/
├── bdd100k/
│   ├── h5/
│   │   ├── file1.h5
│   │   ├── curation_file1.csv
│   │   ├── trajectory_file1.h5
│   │   └── depth_file1.h5
- Make sure ./data is in the same folder as the cloned repository.
- Prepare the metadata:
  - Run:
python data_processing/prepare_data.py
  - This will generate a CSV file with all of the curation information and validate the files you downloaded.
- Your data is now ready! You can now proceed with training (an optional sanity check is sketched below).
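Optionally, you can sanity-check the prepared files before training. The snippet below is only an illustrative check, not part of the official pipeline; it assumes h5py and pandas are installed and that the layout matches the tree above, and the dataset keys it prints depend on how the .h5 files were written.

# Illustrative sanity check of the prepared data (not part of the official pipeline).
import glob
import h5py
import pandas as pd

h5_dir = "data/bdd100k/h5"

# Inspect the first frame file: list every dataset with its shape and dtype.
first_h5 = sorted(glob.glob(f"{h5_dir}/file*.h5"))[0]
with h5py.File(first_h5, "r") as f:
    print(first_h5)
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"  {name}: shape={obj.shape}, dtype={obj.dtype}")
    f.visititems(show)

# Peek at the curation metadata produced by prepare_data.py.
first_csv = sorted(glob.glob(f"{h5_dir}/curation_*.csv"))[0]
df = pd.read_csv(first_csv)
print(first_csv, "->", len(df), "rows, columns:", list(df.columns))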
Train GEM from scratch or fine-tune a pre-trained model as follows:
- Update the configuration in configs/train_config.yaml as needed.
- Run the training script:
python train.py --config configs/train_config.yaml
- In your training config, you can specify the control signals and modalities you want to use for training. For example:
data:
  target: gem.data.curated_dataset.CuratedSampler
  params:
    data_dict:
      - depth_img
      - trajectory
      - rendered_poses
will enable training with depth images, trajectories, and keyposes as input modalities. Only the modalities and controls specified in data_dict will be used for training.
- Batch size: Adjust batch_size in the config file based on your GPU memory.
- Distributed training: For multi-GPU training, use:
torchrun --nproc_per_node=4 train.py --config configs/train_config.yaml
or use SLURM if on a cluster.
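If you want to double-check which modalities and controls a given config enables before launching a run, you can load the YAML and print its data_dict entry. This is only a convenience sketch; it assumes PyYAML is available and that configs/train_config.yaml follows the structure shown above.

# Convenience check: print which modalities/controls the training config enables.
# Assumes PyYAML is installed and the config follows the YAML structure shown above.
import yaml

with open("configs/train_config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

data_cfg = cfg["data"]
print("Data sampler:", data_cfg.get("target"))
print("Enabled modalities/controls:", data_cfg.get("params", {}).get("data_dict", []))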
Make sure the checkpoint is placed as follows:
checkpoints/
├── gem.safetensors
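To verify that the downloaded checkpoint is readable before sampling, you can load it with the generic safetensors API. This is an optional sanity check, not GEM's own loading code; the printed tensor names are simply whatever the released weights contain.

# Optional sanity check that checkpoints/gem.safetensors exists and can be read.
# Uses the generic safetensors API; GEM's own scripts handle the actual model loading.
from safetensors.torch import load_file

state_dict = load_file("checkpoints/gem.safetensors")
print(f"Loaded {len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(f"  {name}: {tuple(tensor.shape)}")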
Before running the script, double-check that the dataset paths are correctly set in sample.py.
Generate unconditional samples or condition on specific inputs:
- Generate unconditional samples:
python sample.py --save path/to/save/generated_samples --condition_type unconditional
- Generate conditional samples (with object_manipulation, skeleton_manipulation, and ego_motion as control signal options):
python sample.py --save path/to/save/generated_samples --condition_type object_manipulation
After following the Sampling steps and generating videos, either conditional or unconditional, you can evaluate the model using the following command:
python evaluate.py --dir path/to/generated_samples --condition_type object_manipulation
Use the appropriate condition_type based on the control signal used during sampling. Passing unconditional will assess the visual quality of the generated samples and report FID and FVD metrics.
The evaluation reports the following metrics:
- FID/FVD: Fréchet Inception/Video Distance for generative quality.
- ADE: Average Displacement Error, comparing the ego-car trajectory in the generated video with the ground-truth one.
- COM: Control Object Manipulation metric, measuring the pixel misplacement of the manipulated object.
- AP: Average Precision, evaluating the quality of the generated pedestrian skeletons based on the COCO evaluation metrics.
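For reference, ADE is conceptually simple: the mean Euclidean distance between corresponding points of the generated and ground-truth ego trajectories. The sketch below illustrates that definition; it is not the exact implementation used in evaluate.py, which may include additional alignment or normalization.

# Illustrative ADE (Average Displacement Error) computation; the actual
# implementation in evaluate.py may differ in details such as trajectory alignment.
import numpy as np

def ade(pred_traj: np.ndarray, gt_traj: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth trajectory points.

    Both arrays have shape (T, 2): T timesteps of (x, y) ego positions.
    """
    assert pred_traj.shape == gt_traj.shape
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())

# Example: the generated trajectory is offset from the ground truth by 0.5 m laterally.
pred = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])
gt = np.array([[0.0, 0.5], [1.0, 1.0], [2.0, 1.5]])
print(ade(pred, gt))  # 0.5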
The architecture of GEM consists of:
- Ego-Motion Encoder: Processes input ego-motion sequences.
- Object Dynamics Module: Models interactions between dynamic entities.
- Scene Decoder: Reconstructs and generates realistic scenes.
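To make the described data flow concrete, here is a purely illustrative skeleton of how the three components could be wired together. The module interfaces, layer choices, and dimensions below are placeholders inferred from the bullet points above, not GEM's actual implementation; refer to the paper and the code in this repository for the real architecture.

# Purely illustrative skeleton of the data flow described above; module interfaces,
# layer choices, and dimensions are placeholders, not GEM's actual implementation.
import torch
import torch.nn as nn

class EgoMotionEncoder(nn.Module):
    """Encodes an input ego-motion sequence into per-step latent features."""
    def __init__(self, motion_dim: int = 6, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(motion_dim, hidden_dim, batch_first=True)

    def forward(self, ego_motion: torch.Tensor) -> torch.Tensor:
        # ego_motion: (B, T, motion_dim) -> (B, T, hidden_dim)
        out, _ = self.rnn(ego_motion)
        return out

class ObjectDynamicsModule(nn.Module):
    """Models interactions between dynamic entities in the latent space."""
    def __init__(self, hidden_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(tokens, tokens, tokens)
        return out

class SceneDecoder(nn.Module):
    """Maps latent scene tokens back to (tiny) image frames."""
    def __init__(self, hidden_dim: int = 256, out_channels: int = 3, size: int = 16):
        super().__init__()
        self.size = size
        self.out_channels = out_channels
        self.proj = nn.Linear(hidden_dim, out_channels * size * size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, t, _ = tokens.shape
        return self.proj(tokens).view(b, t, self.out_channels, self.size, self.size)

# Toy forward pass: 2 sequences of 8 ego-motion steps.
ego = torch.randn(2, 8, 6)
features = EgoMotionEncoder()(ego)
features = ObjectDynamicsModule()(features)
frames = SceneDecoder()(features)
print(frames.shape)  # torch.Size([2, 8, 3, 16, 16])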
For a detailed explanation, please refer to the paper: https://arxiv.org/abs/2412.11198.
If GEM contributes to your research, please cite our paper:
@misc{hassan2024gemgeneralizableegovisionmultimodal,
title={GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control},
author={Mariam Hassan and Sebastian Stapf and Ahmad Rahimi and Pedro M. B. Rezende and Yasaman Haghighi and David Brüggemann and Isinsu Katircioglu and Lin Zhang and Xiaoran Chen and Suman Saha and Marco Cannici and Elie Aljalbout and Botao Ye and Xi Wang and Aram Davtyan and Mathieu Salzmann and Davide Scaramuzza and Marc Pollefeys and Paolo Favaro and Alexandre Alahi},
year={2024},
eprint={2412.11198},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.11198},
}
For questions or issues, feel free to open an issue.
Happy researching! 🚗🎥
Our implementation is built on generative-models and vista. Thanks to their developers for the open-source work!