This repository contains the code and resources for the paper "LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation".
Accepted at the ICCV 2025 AIGENS Workshop.
TL;DR: LSSGen accelerates text-to-image generation by running the early denoising timesteps of flow and diffusion models at a lower resolution in latent space, achieving better image quality than conventional pixel-space scaling at comparable speed.
Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose Latent Space Scaling Generation (LSSGen), a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating
$1024^2$ images at similar speeds, it achieves up to a 246% TOPIQ score improvement.
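To make the idea concrete, here is a minimal sketch of the LSSGen loop. It is illustrative only and does not mirror the repository's actual API: `denoise_step` and `latent_upsampler` are placeholder callables, and a linear sigma schedule with a single low-to-full-resolution transition is assumed.

```python
import torch

def lss_generate(denoise_step, latent_upsampler, text_emb,
                 num_steps=50, start_sigma=0.75,
                 channels=16, height=1024, width=1024):
    """Illustrative LSSGen loop (placeholder API, not the repository's)."""
    # Start at half resolution: with an 8x VAE, a 512^2 start for a 1024^2
    # target corresponds to a (height // 16, width // 16) latent grid.
    latent = torch.randn(1, channels, height // 16, width // 16)

    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)[:-1]  # simplified schedule
    upscaled = False
    for sigma in sigmas:  # sigma decreases from 1.0 toward 0.0
        if not upscaled and sigma <= start_sigma:
            # Switch to full resolution by upsampling the latent directly,
            # skipping the decode/re-encode round trip through pixel space
            # that introduces artifacts in pixel-space scaling approaches.
            latent = latent_upsampler(latent)
            upscaled = True
        latent = denoise_step(latent, float(sigma), text_emb)
    return latent  # decode with the VAE to obtain the final image
```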
- Python 3.8+
- PyTorch 2.0+
- Diffusers 0.33.*
To use the LSSGen framework in the Diffusers pipeline style, follow the examples below.
```python
import torch

from network.models.upsampler import LatentUpSampler

dtype = torch.bfloat16
device = "cuda"
```

The latent upsampler model is provided under the `weights` folder, e.g.:

```python
UPSAMPLER_PATH = "weights/FLUX-VAE-scaler"
upsampler = LatentUpSampler.from_pretrained(UPSAMPLER_PATH, torch_dtype=dtype).to(device)
```

The FLUX pipeline supports all FLUX.1 text-to-image models, including FLUX.1-dev, FLUX.1-schnell, and FLUX.1-Krea-dev.
```python
from network.pipelines import LSSFluxPipeline

pipeline = LSSFluxPipeline.from_pretrained(MODEL_PATH, latent_upsampler=upsampler, torch_dtype=dtype).to(device)
image = pipeline(
    prompt, height=1024, width=1024, num_inference_steps=50, guidance_scale=3.5,
    start_sigma=0.75, shorten_intermediate_steps=True
).images[0]
```

The Stable Diffusion 3 pipeline supports all Stable Diffusion 3 models, including SD3 and SD3.5.
```python
from network.pipelines import LSSStableDiffusion3Pipeline

pipeline = LSSStableDiffusion3Pipeline.from_pretrained(MODEL_PATH, latent_upsampler=upsampler, torch_dtype=dtype).to(device)
image = pipeline(
    prompt, height=1024, width=1024, num_inference_steps=40, guidance_scale=4.5,
    start_sigma=0.75, shorten_intermediate_steps=True
).images[0]
```

The Stable Diffusion XL pipeline supports all Stable Diffusion XL model variants, including SDXL and Playground v2.5.
```python
from network.pipelines import LSSStableDiffusionXLPipeline

pipeline = LSSStableDiffusionXLPipeline.from_pretrained(MODEL_PATH, latent_upsampler=upsampler, torch_dtype=dtype).to(device)
image = pipeline(
    prompt, height=1024, width=1024, num_inference_steps=50, guidance_scale=5,
    start_sigma=0.75, shorten_intermediate_steps=False
).images[0]
```

The Stable Diffusion pipeline follows the same pattern:
```python
from network.pipelines import LSSStableDiffusionPipeline

pipeline = LSSStableDiffusionPipeline.from_pretrained(MODEL_PATH, latent_upsampler=upsampler, torch_dtype=dtype).to(device)
image = pipeline(
    prompt, height=1024, width=1024, num_inference_steps=50, guidance_scale=7.5,
    start_sigma=0.75, shorten_intermediate_steps=False
).images[0]
```

So does the CogView4 pipeline:
```python
from network.pipelines import LSSCogView4Pipeline

pipeline = LSSCogView4Pipeline.from_pretrained(MODEL_PATH, latent_upsampler=upsampler, torch_dtype=dtype).to(device)
image = pipeline(
    prompt, height=1024, width=1024, num_inference_steps=50, guidance_scale=3.5,
    start_sigma=0.75, shorten_intermediate_steps=True
).images[0]
```

For batch text-to-image inference (e.g., on GenEval prompts), use the provided script:

```bash
python inference_dyn_scaling_t2i.py <GENEVAL_PROMPTS_JSONL> \
    --model <MODEL_PATH> \
    --upscale_model <UPSAMPLER_PATH> \
    --pipeline_type <PIPELINE_TYPE> \
    --output_dir <OUTPUT_DIR> \
    --num_inference_steps <NUM_STEPS> \
    --guidance_scale <GUIDANCE_SCALE> \
    --start_sigma 0.75 \
    --shorten_intermediate_steps True \
    --min_size 512
```

- `--pipeline_type`: the pipeline to use, e.g., `flux`, `lssflux`, `sd`, `lsssd`, etc.
- `--min_size`: the minimal image size in the initial stage of LSSGen.
- `--shorten_intermediate_steps`: if set to `True`, LSSGen shortens the intermediate steps while the resolution is below the target resolution. This speeds up generation while maintaining quality in flow-based models.
- `--start_sigma`: the initial sigma value for the denoising process, which controls the trade-off between speed and quality. 0.75 is optimal for best quality; values such as 0.667 trade some quality for faster generation (see the sketch below).
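As a rough feel for how `start_sigma` shifts work between the two stages, the snippet below estimates the low-resolution step count under the simplifying assumption of a linear sigma schedule from 1.0 to 0.0 (the actual schedule depends on the model and scheduler):

```python
# Back-of-the-envelope estimate only; assumes a linear 1.0 -> 0.0 sigma
# schedule, so steps with sigma above start_sigma run at low resolution.
num_inference_steps = 50
for start_sigma in (0.75, 0.667):
    low_res_steps = int(num_inference_steps * (1.0 - start_sigma))
    print(f"start_sigma={start_sigma}: ~{low_res_steps} of "
          f"{num_inference_steps} steps at low resolution")
# Lower start_sigma -> more steps spent at low resolution -> faster, but
# with some quality trade-off (0.75 is the quality-optimal setting above).
```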
Please cite our work if you use this repository in your research:
```bibtex
@InProceedings{Tang_2025_ICCV,
    author    = {Tang, Jyun-Ze and Hsu, Chih-Fan and Li, Jeng-Lin and Chang, Ming-Ching and Chen, Wei-Chao},
    title     = {LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {5048-5057}
}
```
This project is intended for research and educational purposes only and is not freely available for commercial use or redistribution. The intellectual property belongs to Inventec Corporation; licensing for commercial use is possible on request. For scientific use, please cite this repository together with the relevant publication.