Authors
Mariam Hassan★1, Sebastian Stapf★2, Ahmad Rahimi★1, Pedro M. B. Rezende★2, Yasaman Haghighi♦1
David Brüggemann♦3, Isinsu Katircioglu♦3, Lin Zhang♦3, Xiaoran Chen♦3, Suman Saha♦3
Marco Cannici♦4, Elie Aljalbout♦4, Botao Ye♦5, Xi Wang♦5, Aram Davtyan2
Mathieu Salzmann1,3, Davide Scaramuzza4, Marc Pollefeys5, Paolo Favaro2, Alexandre Alahi1
1École Polytechnique Fédérale de Lausanne (EPFL), 2University of Bern, 3Swiss Data Science Center, 4University of Zurich, 5ETH Zurich
★ Main Contributors ♦ Data Contributors
Welcome to the official repository for GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control. This repository contains the codebase, data processing pipeline, and instructions for training, sampling, and evaluating our world model.
GEM is a multimodal world model designed for:
- Fine-grained ego-motion modeling.
- Capturing object dynamics in complex environments.
- Flexible scene composition control in ego-centric vision tasks.
This repository allows researchers and developers to reproduce our results and further explore the capabilities of our world model.
Ensure you have the following installed:
- Python >= 3.8
- PyTorch >= 1.12 with CUDA support
- Other dependencies listed in requirements.txt
- Clone the repository:
git clone https://github.com/your_username/gem-world-model.git
cd gem-world-model
- Install dependencies:
pip install -r requirements.txt
- (Optional) Set up a conda environment:
conda create -n gem_env python=3.8
conda activate gem_env
pip install -r requirements.txt
We have released the model weights of GEM v1.0 on HuggingFace.
To get started quickly, we provide sample data you can use to run training, sampling, or evaluation.
Alternatively, you can follow the steps below to prepare your own dataset.
- Obtain the dataset:
  - Download the datasets from their respective websites (e.g. BDD100K).
  - Convert the videos to individual frames using data_processing/fast_ffmpeg.py. Example usage:
python preprocess_opendv/fast_ffmpeg.py --video_dir VIDEOS_PATH --output_dir IMAGES_PATH
  - Gather the frames into .h5 files using data_processing/convert_to_h5.py. Example usage:
python preprocess_opendv/fast_h5.py --data_dir IMAGES_PATH --out_dir H5_PATH
- To generate the pseudo-labels (depth, trajectories and calibration), see the instructions in the README.
- The final directory structure should look like:
data/
├── bdd100k/
│   ├── h5/
│   │   ├── file1.h5
│   │   ├── curation_file1.csv
│   │   ├── trajectory_file1.h5
│   │   └── depth_file1.h5
- Make sure ./data is in the same folder as the cloned repository.
- Prepare the metadata:
  - Run:
python data_processing/prepare_data.py
  - This will generate a CSV file with all of the curation information and validate the files you downloaded.
- Your data is now ready! You can now proceed with training (an optional sanity check is sketched below).
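Optionally, you can sanity-check the prepared files before training. The snippet below is only an illustrative check, not part of the official pipeline; it assumes h5py and pandas are installed and that the layout matches the tree above, and the dataset keys it prints depend on how the .h5 files were written.

# Illustrative sanity check of the prepared data (not part of the official pipeline).
import glob
import h5py
import pandas as pd

h5_dir = "data/bdd100k/h5"

# Inspect the first frame file: list every dataset with its shape and dtype.
first_h5 = sorted(glob.glob(f"{h5_dir}/file*.h5"))[0]
with h5py.File(first_h5, "r") as f:
    print(first_h5)
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"  {name}: shape={obj.shape}, dtype={obj.dtype}")
    f.visititems(show)

# Peek at the curation metadata produced by prepare_data.py.
first_csv = sorted(glob.glob(f"{h5_dir}/curation_*.csv"))[0]
df = pd.read_csv(first_csv)
print(first_csv, "->", len(df), "rows, columns:", list(df.columns))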
Train GEM from scratch or fine-tune a pre-trained model as follows:
- Update the configuration in configs/train_config.yaml as needed.
- Run the training script:
python train.py --config configs/train_config.yaml
- In your training config, you can specify the control signals and modalities you want to use for training. For example:
data:
  target: gem.data.curated_dataset.CuratedSampler
  params:
    data_dict:
      - depth_img
      - trajectory
      - rendered_poses
will enable training with depth images, trajectories, and keyposes as input modalities. Only the modalities and controls specified in data_dict will be used for training.
- Batch size: Adjust batch_size in the config file based on your GPU memory.
- Distributed training: For multi-GPU training, use:
torchrun --nproc_per_node=4 train.py --config configs/train_config.yaml
or use SLURM if on a cluster.
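If you want to double-check which modalities and controls a given config enables before launching a run, you can load the YAML and print its data_dict entry. This is only a convenience sketch; it assumes PyYAML is available and that configs/train_config.yaml follows the structure shown above.

# Convenience check: print which modalities/controls the training config enables.
# Assumes PyYAML is installed and the config follows the YAML structure shown above.
import yaml

with open("configs/train_config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

data_cfg = cfg["data"]
print("Data sampler:", data_cfg.get("target"))
print("Enabled modalities/controls:", data_cfg.get("params", {}).get("data_dict", []))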
Make sure the checkpoint is placed as follows:
checkpoints/
├── gem.safetensors
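To verify that the downloaded checkpoint is readable before sampling, you can load it with the generic safetensors API. This is an optional sanity check, not GEM's own loading code; the printed tensor names are simply whatever the released weights contain.

# Optional sanity check that checkpoints/gem.safetensors exists and can be read.
# Uses the generic safetensors API; GEM's own scripts handle the actual model loading.
from safetensors.torch import load_file

state_dict = load_file("checkpoints/gem.safetensors")
print(f"Loaded {len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(f"  {name}: {tuple(tensor.shape)}")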
Before running the script, double-check that the dataset paths are correctly set in sample.py.
Generate unconditional samples or condition on specific inputs:
- Generate unconditional samples:
python sample.py --save path/to/save/generated_samples --condition_type unconditional
- Generate conditional samples (with object_manipulation, skeleton_manipulation, and ego_motion as control signal options):
python sample.py --save path/to/save/generated_samples --condition_type object_manipulation
After following the Sampling steps and generating videos, either conditional or unconditional, you can evaluate the model using the following command:
python evaluate.py --dir path/to/generated_samples --condition_type object_manipulation
Use the appropriate condition_type based on the control signal used during sampling. Passing unconditional will assess the visual quality of the generated samples and report FID and FVD metrics.
The evaluation reports the following metrics:
- FID/FVD: Fréchet Inception/Video Distance for generative quality.
- ADE: Average Displacement Error, comparing the ego-car trajectory in the generated video with the ground-truth one.
- COM: Control Object Manipulation metric, measuring the pixel misplacement of the manipulated object.
- AP: Average Precision, evaluating the quality of the generated pedestrian skeletons based on the COCO evaluation metrics.
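For reference, ADE is conceptually simple: the mean Euclidean distance between corresponding points of the generated and ground-truth ego trajectories. The sketch below illustrates that definition; it is not the exact implementation used in evaluate.py, which may include additional alignment or normalization.

# Illustrative ADE (Average Displacement Error) computation; the actual
# implementation in evaluate.py may differ in details such as trajectory alignment.
import numpy as np

def ade(pred_traj: np.ndarray, gt_traj: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth trajectory points.

    Both arrays have shape (T, 2): T timesteps of (x, y) ego positions.
    """
    assert pred_traj.shape == gt_traj.shape
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())

# Example: the generated trajectory is offset from the ground truth by 0.5 m laterally.
pred = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])
gt = np.array([[0.0, 0.5], [1.0, 1.0], [2.0, 1.5]])
print(ade(pred, gt))  # 0.5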
The architecture of GEM consists of:
- Ego-Motion Encoder: Processes input ego-motion sequences.
- Object Dynamics Module: Models interactions between dynamic entities.
- Scene Decoder: Reconstructs and generates realistic scenes.
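To make the described data flow concrete, here is a purely illustrative skeleton of how the three components could be wired together. The module interfaces, layer choices, and dimensions below are placeholders inferred from the bullet points above, not GEM's actual implementation; refer to the paper and the code in this repository for the real architecture.

# Purely illustrative skeleton of the data flow described above; module interfaces,
# layer choices, and dimensions are placeholders, not GEM's actual implementation.
import torch
import torch.nn as nn

class EgoMotionEncoder(nn.Module):
    """Encodes an input ego-motion sequence into per-step latent features."""
    def __init__(self, motion_dim: int = 6, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(motion_dim, hidden_dim, batch_first=True)

    def forward(self, ego_motion: torch.Tensor) -> torch.Tensor:
        # ego_motion: (B, T, motion_dim) -> (B, T, hidden_dim)
        out, _ = self.rnn(ego_motion)
        return out

class ObjectDynamicsModule(nn.Module):
    """Models interactions between dynamic entities in the latent space."""
    def __init__(self, hidden_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(tokens, tokens, tokens)
        return out

class SceneDecoder(nn.Module):
    """Maps latent scene tokens back to (tiny) image frames."""
    def __init__(self, hidden_dim: int = 256, out_channels: int = 3, size: int = 16):
        super().__init__()
        self.size = size
        self.out_channels = out_channels
        self.proj = nn.Linear(hidden_dim, out_channels * size * size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, t, _ = tokens.shape
        return self.proj(tokens).view(b, t, self.out_channels, self.size, self.size)

# Toy forward pass: 2 sequences of 8 ego-motion steps.
ego = torch.randn(2, 8, 6)
features = EgoMotionEncoder()(ego)
features = ObjectDynamicsModule()(features)
frames = SceneDecoder()(features)
print(frames.shape)  # torch.Size([2, 8, 3, 16, 16])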
For a detailed explanation, please refer to the paper: https://arxiv.org/abs/2412.11198.
If GEM contributes to your research, please cite our paper:
@misc{hassan2024gemgeneralizableegovisionmultimodal,
title={GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control},
author={Mariam Hassan and Sebastian Stapf and Ahmad Rahimi and Pedro M. B. Rezende and Yasaman Haghighi and David Brüggemann and Isinsu Katircioglu and Lin Zhang and Xiaoran Chen and Suman Saha and Marco Cannici and Elie Aljalbout and Botao Ye and Xi Wang and Aram Davtyan and Mathieu Salzmann and Davide Scaramuzza and Marc Pollefeys and Paolo Favaro and Alexandre Alahi},
year={2024},
eprint={2412.11198},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.11198},
}
For questions or issues, feel free to open an issue.
Happy researching! 🚗🎥
Our implementation is built on generative-models and vista. Thanks to their developers for the open-source work!