[docs] Quantization + torch.compile + offloading #11703

stevhliu · 2025-06-12T22:55:00Z

Follows up on #11670 and #11672 to document combinations of quantization, torch.compile, and offloading.

HuggingFaceDocBuilderDev · 2025-06-12T23:01:52Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

docs/source/en/optimization/speed-memory-optims.md

sayakpaul

Thanks for starting this. Will get you the numbers.

docs/source/en/optimization/speed-memory-optims.md

sayakpaul · 2025-06-14T02:01:12Z

@stevhliu

| combination | latency | memory usage |
|---|---|---|
| quantization | 32.602 | 14.9453 |
| quantization, torch.compile | 25.847 | 14.9448 |
| quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
| quantization, torch.compile, group offloading | 60.235 | 12.2369 |
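
To make the trade-offs in these numbers easier to read, here is a small sketch that expresses each row relative to the quantization-only baseline. The values are copied from the table; the helper name `relative_to_baseline` is only illustrative, and no units are assumed because the thread does not state them.

```python
# Benchmark rows copied from the table: (latency, memory usage).
# Units are not stated in the thread, so none are assumed here.
RESULTS = {
    "quantization": (32.602, 14.9453),
    "quantization, torch.compile": (25.847, 14.9448),
    "quantization, torch.compile, model CPU offloading": (32.312, 12.2369),
    "quantization, torch.compile, group offloading": (60.235, 12.2369),
}
BASELINE = "quantization"


def relative_to_baseline(results, baseline):
    """Return the latency ratio and memory saved for each row versus the baseline."""
    base_latency, base_memory = results[baseline]
    summary = {}
    for name, (latency, memory) in results.items():
        summary[name] = {
            # Below 1.0 means faster than the baseline, above 1.0 means slower.
            "latency_ratio": latency / base_latency,
            # Positive means less memory than the baseline.
            "memory_saved": base_memory - memory,
        }
    return summary


summary = relative_to_baseline(RESULTS, BASELINE)
for name, row in summary.items():
    print(f"{name}: {row['latency_ratio']:.2f}x latency, "
          f"{row['memory_saved']:.2f} memory saved")
```

Read this way, torch.compile lowers latency at almost no memory cost. Both offloading rows save the same amount of memory, but group offloading is noticeably slower than the baseline in this Flux run.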

Code: https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d

Worth mentioning:

- We are applying quantization to `transformer` and `text_encoder_2`.
- GPU used: RTX 4090.
- Using PyTorch nightlies is better.
- https://huggingface.co/docs/diffusers/main/en/optimization/memory#group-offloading mentions why the speed-memory trade-off with group offloading in Flux isn't as expected.
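
For readers without access to the gist, the setup described above might look roughly like the following sketch. It is not the benchmark script: the model id, the 4-bit bitsandbytes settings, and the helper name `load_quantized_flux` are assumptions. Only the quantized components and the `torch._dynamo` cache setting come from this thread. Running it requires a CUDA GPU with `diffusers`, `bitsandbytes`, and access to the model weights.

```python
# Components named in the thread as the ones being quantized.
COMPONENTS_TO_QUANTIZE = ["transformer", "text_encoder_2"]


def load_quantized_flux(model_id="black-forest-labs/FLUX.1-dev", cpu_offload=False):
    """Load a quantized pipeline and compile its transformer (illustrative sketch)."""
    # Heavy imports are kept inside the function so the module imports cheaply.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    # Setting taken from the diff under review in this thread.
    torch._dynamo.config.cache_size_limit = 1000

    # Assumed settings: 4-bit NF4 via bitsandbytes with bfloat16 compute.
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=COMPONENTS_TO_QUANTIZE,
    )
    pipeline = DiffusionPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    if cpu_offload:
        # Model CPU offloading moves whole components to the GPU only when needed.
        pipeline.enable_model_cpu_offload()
    else:
        pipeline.to("cuda")

    # Compile the transformer in place so any offloading hooks stay attached.
    pipeline.transformer.compile()
    return pipeline
```

Calling `load_quantized_flux(cpu_offload=True)` would correspond to the "quantization, torch.compile, model CPU offloading" row, again only under the assumptions listed above.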

docs/source/en/optimization/speed-memory-optims.md

sayakpaul · 2025-06-17T03:36:08Z

docs/source/en/optimization/speed-memory-optims.md

+```
+
+</hfoption>
+<hfoption id="group offloading">


Do you think this would be better demonstrated with a more compute-heavy model like Wan? That way, we can show the actual benefits of group offloading.

Sounds good, could you get me the updated numbers for Wan with quantization/group offloading/torch.compile please?

I think it's okay to have the Flux numbers but for the sake of code and discussions, we could have Wan.

Ah ok, don't worry about getting the Wan numbers then!

stevhliu · 2025-06-17T20:41:06Z

docs/source/en/optimization/memory.md

+
+Offloading strategies move not currently active layers or models to the CPU to avoid increasing GPU memory. These strategies can be combined with quantization and torch.compile to balance inference speed and memory usage.
+
+Refer to the [Compile and offloading quantized models](./speed-memory-optims) guide for more details.


I think #11731 can be resolved in this PR where I make a note that offloading can be combined with quantization and torch.compile

Added your layerwise casting note in here as well :)

Yeah feel free to close those :)


sayakpaul · 2025-06-18T01:15:16Z

docs/source/en/optimization/speed-memory-optims.md

+pipeline.transformer.enable_group_offload(
+    onload_device=onload_device,
+    offload_device=offload_device,
+    offload_type="block_level",
+    num_blocks_per_group=4
+)
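
As a rough mental model of `offload_type="block_level"` with `num_blocks_per_group=4`, consecutive blocks are grouped and each group moves between the offload and onload devices together. The toy sketch below only illustrates that grouping; it is not the diffusers implementation, and the block names and the block count of 10 are made up.

```python
from typing import List, Sequence


def group_blocks(block_names: Sequence[str], num_blocks_per_group: int) -> List[List[str]]:
    """Split consecutive blocks into groups that would be moved between devices together."""
    if num_blocks_per_group < 1:
        raise ValueError("num_blocks_per_group must be at least 1")
    return [
        list(block_names[start:start + num_blocks_per_group])
        for start in range(0, len(block_names), num_blocks_per_group)
    ]


# Hypothetical model with 10 blocks; the count is made up for illustration.
blocks = [f"block_{i}" for i in range(10)]
groups = group_blocks(blocks, num_blocks_per_group=4)

for index, group in enumerate(groups):
    print(f"group {index}: {group[0]} .. {group[-1]} ({len(group)} blocks)")
```

Larger groups mean fewer transfers but more blocks resident on the GPU at once, which is the speed-memory knob this argument exposes.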


We should use these args when a component is quantized with bitsandbytes to mitigate device mismatch issues:

https://github.com/huggingface/diffusers/blob/1bc6f3dc0f21779480db70a4928d14282c0198ed/tests/quantization/test_torch_compile_utils.py#L71C9-L77C10

But I am curious. Were you able to run the code?

It did not haha 🙃

sayakpaul

Left some more comments. LMK if they make sense.

sayakpaul

Let's go!

sayakpaul · 2025-06-19T02:24:20Z

docs/source/en/optimization/speed-memory-optims.md

+from diffusers import DiffusionPipeline
+from diffusers.quantizers import PipelineQuantizationConfig
+
+torch._dynamo.config.cache_size_limit = 1000


Suggested change

-torch._dynamo.config.cache_size_limit = 1000
+torch._dynamo.config.cache_size_limit = 1000
+torch._dynamo.config.capture_dynamic_output_shape_ops = True

sayakpaul · 2025-06-19T02:24:53Z

docs/source/en/optimization/speed-memory-optims.md

+from diffusers.quantizers import PipelineQuantizationConfig
+from transformers import UMT5EncoderModel
+
+torch._dynamo.config.cache_size_limit = 1000


Same suggestion as above.

stevhliu commented Jun 12, 2025

stevhliu requested a review from sayakpaul June 12, 2025 23:12

sayakpaul reviewed Jun 13, 2025

sayakpaul reviewed Jun 13, 2025

stevhliu force-pushed the combine-optims branch from cdfd845 to 7d7f274 Compare June 13, 2025 22:07

stevhliu marked this pull request as ready for review June 16, 2025 19:23

sayakpaul reviewed Jun 17, 2025

sayakpaul reviewed Jun 17, 2025

stevhliu added 5 commits June 17, 2025 13:38

draft

1519ece

feedback

4a35656

update

d834db7

feedback

971411d

fix

b483f24

stevhliu force-pushed the combine-optims branch from 4feb4d6 to b483f24 Compare June 17, 2025 20:39

stevhliu commented Jun 17, 2025

sayakpaul reviewed Jun 18, 2025

sayakpaul reviewed Jun 18, 2025

sayakpaul reviewed Jun 18, 2025

sayakpaul reviewed Jun 18, 2025

sayakpaul reviewed Jun 18, 2025

sayakpaul reviewed Jun 18, 2025

stevhliu added 2 commits June 18, 2025 10:09

feedback

b5d5e99

feedback

dc32d45

sayakpaul approved these changes Jun 19, 2025

Merge branch 'main' into combine-optims

f78f0f5

