Nitro-E is a family of text-to-image diffusion models focused on highly efficient training. With just 304M parameters, Nitro-E is designed to be resource-friendly for both training and inference. Training takes only 1.5 days on a single node with 8 AMD Instinct™ MI300X GPUs. On the inference side, Nitro-E delivers a throughput of 18.8 samples per second (batch size 32, 512px images) on a single AMD Instinct™ MI300X GPU, and the distilled version further increases the throughput to 39.3 samples per second. On a consumer device with a Strix Halo iGPU, our model can generate a 512px image in only 0.16 seconds.
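As a rough illustration of how such batched throughput numbers can be measured, here is a minimal sketch (not the official benchmark script). It reuses the `init_pipe` helper shown in the inference examples later in this README, and it assumes the pipeline accepts a list of prompts as a batch, which is standard for diffusers-style pipelines but should be verified against this repo:

```python
import time
import torch
from core.tools.inference_pipe import init_pipe

# Rough throughput sketch: generate one batch of 32 images at 512px and
# report samples per second. Assumes a diffusers-style pipeline that takes
# a list of prompts; adjust if the repo's pipeline differs.
device = torch.device('cuda:0')
pipe = init_pipe(device, torch.bfloat16, 512, repo_name="amd/Nitro-E",
                 ckpt_name='Nitro-E-512px.safetensors')

batch_size = 32
prompts = ["A hot air balloon in the shape of a heart grand canyon"] * batch_size

torch.cuda.synchronize()
start = time.time()
images = pipe(prompt=prompts, width=512, height=512,
              num_inference_steps=20, guidance_scale=4.5).images
torch.cuda.synchronize()
print(f"throughput: {batch_size / (time.time() - start):.1f} samples/s")
```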
This repository provides training and data preparation scripts to reproduce our results. We hope this codebase for efficient diffusion model training enables researchers to iterate faster on ideas and lowers the barrier for independent developers to build custom models.
- [2025.10.24]: 🔥 Released the Nitro-E-512px model, the Nitro-E-512px-GRPO post-trained (GRPO) model, the Nitro-E-512px-dist distilled model, and the training and inference code!
When running on AMD Instinct™ GPUs, we recommend using the public ROCm PyTorch Docker images to get optimized performance out of the box.
docker pull rocm/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_release_2.3.0

pip install diffusers==0.32.2 transformers==4.49.0 accelerate==1.7.0 wandb torchmetrics pycocotools torchmetrics[image] mosaicml-streaming==0.11.0 beautifulsoup4 tabulate timm==0.9.1 pyarrow einops omegaconf sentencepiece==0.2.0 pandas==2.2.3 alive-progress

git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
MAX_JOBS=`nproc` python setup.py install

The E-MMDiT models were trained on a dataset of ~25M images consisting of both real and synthetic data that are openly available on the internet, including Segment-Anything-1B, JourneyDB, and FLUX-generated images using prompts from DiffusionDB and DataComp.
We provide a full pipeline to create the data.
Please go to the datasets folder and run:
cd datasets
bash scripts/get_data_all.sh

to create the complete version of the dataset.
Or run:
cd datasets
bash scripts/get_data_partial.sh

to create a tiny version for testing purposes.
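Since mosaicml-streaming is among the dependencies, the prepared shards can typically be read back with a StreamingDataset for a quick sanity check. A minimal sketch, assuming the pipeline writes MDS shards to a local directory; the path and field names below are hypothetical, so adjust them to the actual output of the get_data scripts:

```python
from streaming import StreamingDataset

# Read-back sketch (hypothetical path and keys): point 'local' at the directory
# of MDS shards produced by the data scripts and inspect one sample.
dataset = StreamingDataset(local="datasets/output/mds", shuffle=False, batch_size=1)

sample = dataset[0]
print(len(dataset))      # number of samples in the prepared dataset
print(sample.keys())     # inspect the stored fields, e.g. image / caption keys
```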
Launch a training session using this script:
bash scripts/train_512.sh

Please modify configs/accelerate.yaml for multi-GPU / multi-node distributed training, torch compile, etc., and the specific yaml files in configs for experiment settings.
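Because omegaconf is among the dependencies, one convenient way to adjust the experiment yaml files is to override a few settings programmatically before launching training. The sketch below uses hypothetical file names and keys, not the repo's actual schema, so check the files in configs for the real structure:

```python
from omegaconf import OmegaConf

# Hypothetical example: load an experiment config, apply a couple of overrides
# from a dotlist, and save a custom copy to train with.
cfg = OmegaConf.load("configs/train_512.yaml")                       # placeholder path
overrides = OmegaConf.from_dotlist(["train.global_batch_size=256",   # placeholder keys
                                    "train.lr=1e-4"])
cfg = OmegaConf.merge(cfg, overrides)
OmegaConf.save(cfg, "configs/train_512_custom.yaml")
print(OmegaConf.to_yaml(cfg))
```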
import torch
from core.tools.inference_pipe import init_pipe
device = torch.device('cuda:0')
dtype = torch.bfloat16
repo_name = "amd/Nitro-E"
resolution = 512
ckpt_name = 'Nitro-E-512px.safetensors'
# for 1024px model
# resolution = 1024
# ckpt_name = 'Nitro-E-1024px.safetensors'
pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)
prompt = 'A hot air balloon in the shape of a heart grand canyon'
images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=20, guidance_scale=4.5).images

For the distilled model Nitro-E-512px-dist, which generates images in 4 steps without classifier-free guidance:

import torch
from core.tools.inference_pipe import init_pipe
device = torch.device('cuda:0')
dtype = torch.bfloat16
resolution = 512
repo_name = "amd/Nitro-E"
ckpt_name = 'Nitro-E-512px-dist.safetensors'
pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)
prompt = 'A hot air balloon in the shape of a heart grand canyon'
images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0).images

- Nitro-T: Efficient training of diffusion models.
- Nitro-1: One-step distillation of diffusion models.
Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved.
This project is licensed under the MIT License.
