v0.8.0

@dhuangnm released this 03 Oct 12:24
LLM Compressor v0.8.0 release notes

This LLM Compressor v0.8.0 release introduces the following new features and enhancements:

  • Support for multiple modifiers in oneshot compression runs
  • Quantization and calibration support for Qwen3 models including FP8 quantization support for Qwen3 VL MoE models
  • Transforms support for non-full-size rotation sizes
  • Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default

Support for multiple modifiers in oneshot compression runs ✨

LLM Compressor now supports using multiple modifiers in oneshot compression runs.

You can apply multiple modifiers across model layers. This includes applying different modifiers, such as AWQ and GPTQ, to specific submodules for W4A16 quantization, all within a single oneshot call and with only pass-through calibration data.

Using multiple modifiers improves non-uniform model quantization, addressing issues such as varying layer sensitivity.

For more information, see Non-uniform quantization.
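The following is a minimal sketch of a multi-modifier recipe. The target and ignore patterns, model ID, dataset, and calibration settings are illustrative assumptions rather than values from the shipped examples:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

# Illustrative recipe: quantize most Linear layers with AWQ, but hand the
# down_proj submodules to GPTQ instead -- both as W4A16, in one oneshot run.
recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16",
        ignore=["lm_head", "re:.*mlp.down_proj$"],
    ),
    GPTQModifier(
        targets=["re:.*mlp.down_proj$"],
        scheme="W4A16",
        ignore=["lm_head"],
    ),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
    dataset="open_platypus",                   # calibration data is passed through once
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```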

Quantization and calibration support for Qwen3 models

Quantization and calibration support for Qwen3 models has been added to LLM Compressor.

An updated Qwen3NextSparseMoeBlock modeling definition has been added to temporarily replace the MoE block during calibration, ensuring that every expert sees data and is calibrated appropriately. All experts therefore receive calibrated scales, while only the gated activation values are used.

FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model; see the examples directory in the repository for details.
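A minimal sketch of a calibrated NVFP4 run for this model follows. The calibration dataset, sample counts, and ignore patterns are illustrative assumptions, not values taken from the shipped examples:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 needs calibration data so that scales can be computed; the MoE gate
# layers are left unquantized (the ignore pattern is illustrative).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(
    model=model,
    dataset="open_platypus",   # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen3-Next-80B-A3B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-Next-80B-A3B-Instruct-NVFP4")
```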

FP8 quantization support for Qwen3 VL MoE models

LLM Compressor now supports quantization for Qwen3 VL MoE models. You can use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways requiring data, such as W4A16 and NVFP4, are planned for a future release.

Examples have been added for FP8 quantization of the Qwen/Qwen3-VL-235B-A22B-Instruct model; see the examples directory in the repository for details.
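A minimal data-free sketch, assuming the model loads through transformers' AutoModelForImageTextToText class and that the ignore patterns shown are appropriate for this architecture (both are assumptions, not details from the release):

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8 dynamic quantization (channel-wise weights, dynamic per-token
# activations) is data-free, so no calibration dataset is needed.
# The vision stack and MoE gates are skipped; patterns are illustrative.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$"],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("Qwen3-VL-235B-A22B-Instruct-FP8-Dynamic", save_compressed=True)
processor.save_pretrained("Qwen3-VL-235B-A22B-Instruct-FP8-Dynamic")
```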

An updated definition has been added for Qwen3VLMoeTextSparseMoeBlock, which replaces each MoE block with a linearized definition that uses a list of expert layers instead of a single 3D parameter. This model definition enables quantization, and the resulting model is runnable in vLLM.

Transforms support for non-full-size rotation sizes

You can now set a transform_block_size field on the transform-based modifier classes SpinQuantModifier and QuIPModifier. This field lets you configure transforms of variable size, so you no longer need to restrict Hadamard matrices to match the size of the weight.

It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying the QuIPModifier, as sketched below.
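A minimal sketch of pairing the new field with group-wise W4A16 quantization; the transform type and model ID are illustrative choices, not prescribed defaults:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

# Rotate weights with 128x128 Hadamard blocks so the transform block size
# matches the W4A16 quantization group size of 128.
recipe = [
    QuIPModifier(
        transform_type="random-hadamard",  # illustrative transform type
        transform_block_size=128,
    ),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
    recipe=recipe,
)
```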

For more information on efficiently running QuIP-style rotations with the hadacore kernels in vLLM, see examples/transform/README.md.

Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default

The GPTQModifier class now uses "weight" activation ordering by default. Both "weight" and "static" activation ordering have been shown to significantly improve accuracy recovery at no additional runtime cost.
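As a minimal sketch, the default can be confirmed or overridden through the actorder argument; the model ID, dataset, and scheme here are illustrative assumptions:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# "weight" is now the default activation ordering; pass actorder explicitly
# to make it visible in the recipe or to choose another mode such as "group".
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    actorder="weight",
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
    dataset="open_platypus",                   # GPTQ needs calibration data
    recipe=recipe,
    num_calibration_samples=256,
)
```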

For more information and benchmarks, see vllm/pull/8135.

Updates and deprecations

Support for R4 SpinQuant-style transforms

Support for R4 SpinQuant-style transforms has been added, which allows the down_proj layer to be quantized with increased accuracy recovery. You can apply this transform by specifying SpinQuantModifier(rotations=["R4"]) in the oneshot recipe.
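A minimal sketch of such a recipe; the quantization scheme, ignore list, and model ID are illustrative assumptions:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

# The R4 rotation targets the down_proj inputs so that layer quantizes with
# better accuracy recovery; follow the transform with a quantization modifier.
recipe = [
    SpinQuantModifier(rotations=["R4"]),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
    recipe=recipe,
)
```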

Re-enabled support for W8A8 INT8 decompression

W8A8 INT8 decompression and model generation have been re-enabled in LLM Compressor.

The following changes have been made:

  • The ModelCompressor class has been updated to support compressing models initialized on the meta device.
  • The SparseCompressor and QuantizationCompressor classes have been modified to be compatible with meta devices.
  • The compress_weight() function has been modified across sparse compressors to accept module input, enabling correct behavior for meta-initialized modules.
  • Decompression and offload device detection has been updated to handle meta modules and empty modules gracefully.

Updated ignore lists in example recipes to capture all vision components

Ignore lists in example recipes were updated to correctly capture all vision components. Previously, some vision components, such as model.vision_tower, were not captured, causing downstream issues when serving models with vLLM.
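For illustration, an ignore list of this shape keeps the vision stack and the output head out of quantization; the exact patterns depend on the model architecture, so the ones below are assumptions rather than the shipped recipes:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Illustrative ignore patterns for a vision-language model: skip the LM head,
# the vision tower, and the multimodal projector so vLLM can serve the result.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "re:.*lm_head",
        "re:.*vision_tower.*",
        "re:.*multi_modal_projector.*",
    ],
)
```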

Deprecated and removed unittest.TestCase

Tests based on unittest.TestCase have been deprecated and removed, and have been replaced with standardized pytest test definitions.