
โšก๏ธ- Image
An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Official Site · Hugging Face Model · Hugging Face Space · ModelScope Model · ModelScope Space · Art Gallery PDF · Web Art Gallery

Welcome to the official repository for the Z-Image (造相) project!

✨ Z-Image

Z-Image is a powerful and highly efficient image generation model with 6B parameters. It currently comes in three variants:

  • 🚀 Z-Image-Turbo – A distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers ⚡️ sub-second inference latency ⚡️ on enterprise-grade H800 GPUs and fits comfortably within the 16 GB of VRAM on consumer devices. It excels at photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.

  • 🧱 Z-Image-Base – The non-distilled foundation model. By releasing this checkpoint, we aim to unlock the full potential of community-driven fine-tuning and custom development.

  • ✍️ Z-Image-Edit – A variant fine-tuned from Z-Image specifically for image editing tasks. It supports creative image-to-image generation with impressive instruction-following capabilities, allowing precise edits based on natural language prompts.

📣 News

  • [2025-12-08] 🏆 Z-Image-Turbo ranked 8th overall on the Artificial Analysis Text-to-Image Leaderboard, making it the 🥇 #1 open-source model! Check out the full leaderboard.
  • [2025-12-01] 🎉 Our technical report for Z-Image is now available on arXiv.
  • [2025-11-26] 🔥 Z-Image-Turbo is released! We have released the model checkpoint on Hugging Face and ModelScope. Try our online demo!

📥 Model Zoo

| Model | Hugging Face | ModelScope |
| --- | --- | --- |
| Z-Image-Turbo | Hugging Face Model · Hugging Face Space | ModelScope Model · ModelScope Space |
| Z-Image-Base | To be released | To be released |
| Z-Image-Edit | To be released | To be released |

๐Ÿ–ผ๏ธ Showcase

๐Ÿ“ธ Photorealistic Quality: Z-Image-Turbo delivers strong photorealistic image generation while maintaining excellent aesthetic quality.

Showcase of Z-Image on Photo-realistic image Generation

๐Ÿ“– Accurate Bilingual Text Rendering: Z-Image-Turbo excels at accurately rendering complex Chinese and English text.

Showcase of Z-Image on Bilingual Text Rendering

๐Ÿ’ก Prompt Enhancing & Reasoning: Prompt Enhancer empowers the model with reasoning capabilities, enabling it to transcend surface-level descriptions and tap into underlying world knowledge.

reasoning.jpg

๐Ÿง  Creative Image Editing: Z-Image-Edit shows a strong understanding of bilingual editing instructions, enabling imaginative and flexible image transformations.

Showcase of Z-Image-Edit on Image Editing

๐Ÿ—๏ธ Model Architecture

We adopt a Scalable Single-Stream DiT (S3-DiT) architecture. In this setup, text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches.
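
To make the single-stream idea concrete, here is a minimal PyTorch sketch: one shared transformer stack consumes the concatenated token sequence, and only the image (VAE) positions are read back out. The module, dimensions, and readout below are illustrative assumptions for exposition, not the actual Z-Image implementation.

import torch
import torch.nn as nn

class SingleStreamSketch(nn.Module):
    def __init__(self, dim=1024, depth=4, heads=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, text_tokens, semantic_tokens, vae_tokens):
        # Single stream: concatenate all modalities along the sequence axis,
        # so one set of transformer weights serves every token type
        # (dual-stream designs instead keep separate per-modality branches).
        x = torch.cat([text_tokens, semantic_tokens, vae_tokens], dim=1)
        x = self.blocks(x)
        # Only the image (VAE) positions are decoded back to latents.
        return x[:, -vae_tokens.shape[1]:]

# Toy shapes: 77 text, 32 semantic, 256 VAE tokens at width 1024.
out = SingleStreamSketch()(
    torch.randn(2, 77, 1024), torch.randn(2, 32, 1024), torch.randn(2, 256, 1024)
)
print(out.shape)  # torch.Size([2, 256, 1024])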

Architecture of Z-Image and Z-Image-Edit

📈 Performance

Z-Image-Turbo's performance has been validated on multiple independent benchmarks, where it consistently demonstrates state-of-the-art results and stands out as the leading open-source model.

Artificial Analysis Text-to-Image Leaderboard

On the highly competitive Artificial Analysis Leaderboard, Z-Image-Turbo ranked 8th overall and secured the top position as the 🥇 #1 Open-Source Model, outperforming all other open-source alternatives.

Z-Image Rank on the Artificial Analysis Leaderboard

Z-Image Rank on the Artificial Analysis Leaderboard (Open-Source Models Only)

Alibaba AI Arena Text-to-Image Leaderboard

According to the Elo-based Human Preference Evaluation on Alibaba AI Arena, Z-Image-Turbo also achieves state-of-the-art results among open-source models and shows highly competitive performance against leading proprietary models.

Z-Image Elo Rating on the Alibaba AI Arena Text-to-Image Leaderboard

🚀 Quick Start

(1) PyTorch Native Inference

Create a virtual environment of your choice, then install the dependencies:

pip install -e .

Then run the following command to generate an image:

python inference.py

(2) Diffusers Inference


To install the latest version of diffusers, use the following command:

Why you need to install diffusers from source: we have submitted two pull requests (#12703 and #12715) to the 🤗 diffusers repository to add support for Z-Image. Both PRs have been merged into the main branch but are not yet included in an official diffusers release, so you need to install diffusers from source for the latest features and Z-Image support.

pip install git+https://github.com/huggingface/diffusers

Then, try the following code to generate an image:

import torch
from diffusers import ZImagePipeline

# 1. Load the pipeline
# Use bfloat16 for optimal performance on supported GPUs
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# [Optional] Attention Backend
# Diffusers uses SDPA by default. Switch to Flash Attention for better efficiency if supported:
# pipe.transformer.set_attention_backend("flash")    # Enable Flash-Attention-2
# pipe.transformer.set_attention_backend("_flash_3") # Enable Flash-Attention-3

# [Optional] Model Compilation
# Compiling the DiT model accelerates inference, but the first run will take longer to compile.
# pipe.transformer.compile()

# [Optional] CPU Offloading
# Enable CPU offloading for memory-constrained devices.
# pipe.enable_model_cpu_offload()

prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (โšก๏ธ), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (่ฅฟๅฎ‰ๅคง้›ๅก”), blurred colorful distant lights."

# 2. Generate Image
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("example.png")
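
Since the pipeline object is reusable, a small seed sweep is an easy way to explore variations of the same prompt. The loop below builds directly on the snippet above (`pipe` and `prompt` are unchanged; only the output file names are new) and keeps num_inference_steps=9, so each image still costs 8 DiT forwards.

# Optional: generate several variations by sweeping the seed.
for seed in (0, 1, 2):
    image = pipe(
        prompt=prompt,
        height=1024,
        width=1024,
        num_inference_steps=9,  # still 8 DiT forwards per image
        guidance_scale=0.0,     # keep guidance at 0 for the Turbo model
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    image.save(f"example_seed{seed}.png")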

🔬 Decoupled-DMD: The Acceleration Magic Behind Z-Image

arXiv

Decoupled-DMD is the core few-step distillation algorithm that empowers the 8-step Z-Image model.

Our core insight in Decoupled-DMD is that the success of existing DMD (Distribution Matching Distillation) methods is the result of two independent, collaborating mechanisms:

  • CFG Augmentation (CA): The primary engine 🚀 driving the distillation process, a factor largely overlooked in previous work.
  • Distribution Matching (DM): Acts more as a regularizer ⚖️, ensuring the stability and quality of the generated output.

By recognizing and decoupling these two mechanisms, we were able to study and optimize them in isolation. This ultimately motivated us to develop an improved distillation process that significantly enhances the performance of few-step generation.
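
The decoupling can be made concrete with a schematic training step: one loss distills toward a CFG-augmented teacher prediction (CA), while a separate distribution-matching surrogate (DM) regularizes the student. Everything below is a toy stand-in, with linear layers in place of real diffusion networks; the CFG scale of 4.0, the 0.1 weighting, and the single-step setup are illustrative assumptions, not the paper's settings.

import torch
import torch.nn as nn

student = nn.Linear(16, 16)                # few-step generator being distilled
teacher_cond = nn.Linear(16, 16).eval()    # frozen conditional teacher (stand-in)
teacher_uncond = nn.Linear(16, 16).eval()  # frozen unconditional teacher (stand-in)
fake_score = nn.Linear(16, 16)             # score model for the student's outputs
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

z = torch.randn(32, 16)
x = student(z)

# CA ("the spear"): distill toward a CFG-augmented teacher prediction.
with torch.no_grad():
    u, c = teacher_uncond(z), teacher_cond(z)
    cfg_target = u + 4.0 * (c - u)
loss_ca = ((x - cfg_target) ** 2).mean()

# DM ("the shield"): surrogate whose gradient w.r.t. x is proportional to
# (fake - real) score. In practice the fake score model is trained
# alternately with the student; that step is omitted here.
with torch.no_grad():
    grad = fake_score(x) - teacher_cond(x)
loss_dm = ((x - (x - grad).detach()) ** 2).mean()

(loss_ca + 0.1 * loss_dm).backward()
opt.step()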

Diagram of Decoupled-DMD

🤖 DMDR: Fusing DMD with Reinforcement Learning

arXiv

Building upon the strong foundation of Decoupled-DMD, our 8-step Z-Image model has already demonstrated exceptional capabilities. To achieve further improvements in semantic alignment, aesthetic quality, and structural coherence, while producing images with richer high-frequency details, we present DMDR.

Our core insight behind DMDR is that Reinforcement Learning (RL) and Distribution Matching Distillation (DMD) can be synergistically integrated during the post-training of few-step models. We demonstrate that:

  • RL Unlocks the Performance of DMD 🚀
  • DMD Effectively Regularizes RL ⚖️
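
In the same toy style as the sketch above, a schematic DMDR step adds an RL reward term on top of the DMD surrogate, so the reward drives quality while distribution matching keeps the generator from drifting off-distribution. The reward model, loss forms, and 0.5 weighting are illustrative assumptions, not the paper's exact method.

import torch
import torch.nn as nn

generator = nn.Linear(16, 16)           # few-step student
real_score = nn.Linear(16, 16).eval()   # frozen teacher score (stand-in)
fake_score = nn.Linear(16, 16)          # learned score of the student's outputs
reward_model = nn.Linear(16, 1).eval()  # frozen reward model (stand-in)
opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

z = torch.randn(32, 16)
x = generator(z)

# RL term: raise the expected reward of generated samples.
loss_rl = -reward_model(x).mean()

# DMD term: gradient proportional to (fake - real) score, regularizing RL.
with torch.no_grad():
    grad = fake_score(x) - real_score(x)
loss_dmd = ((x - (x - grad).detach()) ** 2).mean()

(loss_rl + 0.5 * loss_dmd).backward()
opt.step()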

Diagram of DMDR

🎉 Community Works

  • Cache-DiT offers inference acceleration support for Z-Image with DBCache, Context Parallelism and Tensor Parallelism. Visit their example for more details.
  • stable-diffusion.cpp is a pure C++ diffusion model inference engine that supports fast and memory-efficient Z-Image inference across multiple platforms (CUDA, Vulkan, etc.). You can use stable-diffusion.cpp to generate images with Z-Image on machines with as little as 4 GB of VRAM. For more information, please refer to How to Use Z-Image on a GPU with Only 4GB VRAM.
  • LeMiCa provides a training-free, timestep-level acceleration method that conveniently speeds up Z-Image inference. For more details, see LeMiCa4Z-Image.
  • ComfyUI ZImageLatent provides easy-to-use latents at the official Z-Image resolutions.
  • DiffSynth-Studio has provided more support for Z-Image, including LoRA training, full training, distillation training, and low-VRAM inference. Please refer to the document of DiffSynth-Studio.
  • vllm-omni, a framework for fast inference and serving of omni-modality models, now supports Z-Image.
  • SGLang-Diffusion brings SGLang's state-of-the-art performance to accelerate image and video generation for diffusion models, now supporting Z-Image.

🚀 Star History

Star History Chart

📜 Citation

If you find our work useful in your research, please consider citing:

@article{team2025zimage,
  title={Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer},
  author={Z-Image Team},
  journal={arXiv preprint arXiv:2511.22699},
  year={2025}
}

@article{liu2025decoupled,
  title={Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield},
  author={Dongyang Liu and Peng Gao and David Liu and Ruoyi Du and Zhen Li and Qilong Wu and Xin Jin and Sihan Cao and Shifeng Zhang and Hongsheng Li and Steven Hoi},
  journal={arXiv preprint arXiv:2511.22677},
  year={2025}
}

@article{jiang2025distribution,
  title={Distribution Matching Distillation Meets Reinforcement Learning},
  author={Jiang, Dengyang and Liu, Dongyang and Wang, Zanyi and Wu, Qilong and Jin, Xin and Liu, David and Li, Zhen and Wang, Mengmeng and Gao, Peng and Yang, Harry},
  journal={arXiv preprint arXiv:2511.13649},
  year={2025}
}

๐Ÿค We're Hiring!

We're actively looking for Research Scientists, Engineers, and Interns to work on foundational generative models and their applications. Interested candidates, please send your resume to: [email protected]
