
Releases: vllm-project/llm-compressor

v0.8.1

08 Oct 02:13
bcad892


What's Changed

  • Pick up compressed-tensors 0.12.2 for patch release by @dhuangnm in #1904

Full Changelog: 0.8.0...0.8.1

v0.8.0

03 Oct 12:24
33ef5f4



LLM Compressor v0.8.0 release notes

This LLM Compressor v0.8.0 release introduces the following new features and enhancements:

  • Support for multiple modifiers in oneshot compression runs
  • Quantization and calibration support for Qwen3 models including FP8 quantization support for Qwen3 VL MoE models
  • Transforms support for non-full-size rotation sizes
  • Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default

Support for multiple modifiers in oneshot compression runs ✨

LLM Compressor now supports using multiple modifiers in oneshot compression runs.

You can apply multiple modifiers across model layers. This includes applying different modifiers, such as AWQ and GPTQ, to specific submodules for W4A16 quantization, all within a single oneshot call and using only pass-through calibration data.

Using multiple modifiers improves non-uniform model quantization, addressing issues such as varying layer sensitivity.

For more information, see Non-uniform quantization.
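The following is a minimal sketch of a multi-modifier recipe, assuming a Llama-style model and hypothetical target regexes; it is not the shipped example, and the dataset and sample counts are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Apply different W4A16 modifiers to different submodules in one oneshot call.
# The target regexes are hypothetical; adapt them to your model's module names.
recipe = [
    AWQModifier(targets=["re:.*self_attn.*"], scheme="W4A16", ignore=["lm_head"]),
    GPTQModifier(targets=["re:.*mlp.*"], scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("Llama-3.1-8B-W4A16-multi-modifier", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-W4A16-multi-modifier")
```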

Quantization and calibration support for Qwen3 models

Quantization and calibration support for Qwen3 models has been added to LLM Compressor.

An updated Qwen3NextSparseMoeBlock modeling definition has been added. It temporarily replaces the MoE block during calibration so that every expert sees calibration data and receives calibrated scales, while only the gated activation values are used in the forward pass.

FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model; see the examples directory in the repository for more information.
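As a rough illustration (not the shipped example), a data-free FP8 dynamic run for this model could look like the following; the ignore patterns for router/gate layers are assumptions and should be matched to the actual module names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 per-channel weights with dynamic per-token activations (no calibration data needed).
# The ignore list is illustrative; expert router/gate layers are typically left unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp[.]gate$"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen3-Next-80B-A3B-Instruct-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```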

FP8 quantization support for Qwen3 VL MoE models

LLM Compressor now supports quantization for Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways requiring data, such as W4A16 and NVFP4, are planned for a future release.

Examples have been added for FP8 quantization of the Qwen/Qwen3-VL-235B-A22B-Instruct model; see the examples directory in the repository for more information.

An updated definition has been added for Qwen3VLMoeTextSparseMoeBlock, which replaces each MoE block with a linearized definition that uses a list of expert layers instead of a single 3D parameter. This definition enables quantization, and the resulting model is runnable in vLLM.
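A minimal data-free sketch is shown below; the model class, ignore patterns, and save path are assumptions rather than the exact shipped example.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"
# The auto class is an assumption; use whichever class your transformers version maps this checkpoint to.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Data-free FP8 channel-wise quantization of the language-model Linear layers.
# Ignore patterns are illustrative: vision components and router gates stay unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp[.]gate$"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen3-VL-235B-A22B-Instruct-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```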

Transforms support for non-full-size rotation sizes

You can now set a transform_block_size field on the transform-based modifier classes SpinQuantModifier and QuIPModifier. This field lets you configure transforms of variable size, so Hadamard matrices no longer need to match the size of the weight.

It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying the QuIPModifier.

For more information, see the updated transform examples.

To efficiently run QuIP-style rotations using the hadacore kernels in vLLM, see examples/transform/README.md.
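A minimal sketch of pairing the block size with the quantization group size is shown below; the model choice and transform_type value are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Match the Hadamard block size (128) to the W4A16 group size (128 by default).
recipe = [
    QuIPModifier(targets="Linear", transform_type="random-hadamard", transform_block_size=128),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-8B-Instruct-quip-w4a16", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-Instruct-quip-w4a16")
```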

Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default

The GPTQModifier class now uses "weight" activation ordering by default. "Weight" (or "static") activation ordering has been shown to significantly improve accuracy recovery with no additional cost at runtime.

For more information and benchmarks, see vllm/pull/8135
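The default can also be set or overridden explicitly; the sketch below assumes actorder is exposed as a GPTQModifier field, and the targets and ignore values are illustrative.

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# W4A16 presets now default to actorder="weight"; setting it explicitly makes the choice visible.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    actorder="weight",
)
```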

Updates and deprecations

Support for R4 SpinQuant-style transforms

Support for R4 SpinQuant-style transforms has been added, allowing the down_proj layer to be quantized with increased accuracy recovery. You can use this transform by specifying SpinQuantModifier(rotations=["R4"]) in the oneshot recipe.
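A minimal sketch of combining the R4 rotation with quantization is shown below; the quantization scheme and ignore list are illustrative assumptions.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

# Apply the R4 rotation, then quantize; per these notes this improves
# accuracy recovery when quantizing the down_proj layers.
recipe = [
    SpinQuantModifier(rotations=["R4"], transform_type="hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```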

Re-enabled support for W8A8 INT8 decompression

W8A8 INT8 decompression and model generation have been re-enabled in LLM Compressor; a minimal generation check is sketched after the list of changes below.

The following changes have been made:

  • The ModelCompressor class has been updated to support compressing models initialized on the meta device.
  • The SparseCompressor and QuantizationCompressor classes have been modified to be compatible with meta devices.
  • The compress_weight() function has been modified across sparse compressors to accept module input, enabling correct behavior for meta-initialized modules.
  • Decompression and offload device detection has been updated to handle meta modules and empty modules gracefully.
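As a quick sanity check, a previously compressed W8A8 INT8 checkpoint can again be loaded and run with plain transformers; the checkpoint path below is a placeholder, and decompression on load is handled by the compressed-tensors integration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to a W8A8 INT8 compressed-tensors checkpoint.
MODEL_ID = "path/to/your-w8a8-int8-model"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```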

Updated ignore lists in example recipes to capture all vision components

Ignore lists in example recipes were updated to correctly capture all vision components. Previously, some vision components, such as model.vision_tower, were not captured, causing downstream issues when serving the models with vLLM.
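For reference, a hedged sketch of such an ignore list is shown below; the exact regexes vary by model architecture and are illustrative only.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Skip the language head and all vision components so vLLM can serve the model cleanly.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*multi_modal_projector.*",
    ],
)
```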

Deprecated and removed unittest.TestCase

Tests based on unittest.TestCase have been deprecated and removed, and replaced with standardized pytest test definitions.

v0.7.1

21 Aug 21:37
6304ecf


What's Changed

New Contributors

Full Changelog: 0.7.0...0.7.1

v0.7.0

20 Aug 21:40
679a704



LLM Compressor v0.7.0 release notes

This LLM Compressor v0.7.0 release introduces the following new features and enhancements:

  • Transforms support, including QuIP and SpinQuant algorithms
  • Apply multiple compressors to a single model for mixed-precision quantization
  • Support for DeepSeekV3-style block FP8 quantization
  • Expanded Mixture of Experts (MoE) calibration support, including support with NVFP4 quantization
  • Llama4 quantization support with vLLM compatibility
  • Configurable observer arguments
  • Simplified and unified Recipe classes for easier usage and debugging

Introducing Transforms ✨

LLM Compressor now supports transforms. With transforms, you can inject additional matrix operations into a model to increase accuracy recovery after quantization. Transforms rotate weights or activations into spaces with smaller dynamic ranges, reducing quantization error.

Two algorithms are supported in this release:

  • QuIP injects transforms before and after weights to assist with weight-only quantization.
  • SpinQuant transforms inject transforms whose inverses span across multiple weights, assisting in both weight and activation quantization. In this release, fused R1 and R2 (i.e. offline) transforms are available. The full lifecycle has been validated to confirm that the models produced by LLM Compressor match the performance outlined in the original SpinQuant paper. Learned rotations and online R3 and R4 rotations will be added in a future release.

The functionality for both algorithms is available through the new QuIPModifier and SpinQuantModifier classes.
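A minimal sketch of a fused R1/R2 SpinQuant run followed by data-free W4A16 quantization is shown below; the model choice and scheme are illustrative assumptions, not the validated configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Fuse the offline R1/R2 rotations into the weights, then quantize data-free.
recipe = [
    SpinQuantModifier(rotations=["R1", "R2"], transform_type="hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.2-1B-Instruct-spinquant-w4a16", save_compressed=True)
tokenizer.save_pretrained("Llama-3.2-1B-Instruct-spinquant-w4a16")
```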

Applying multiple compressors to a single model

LLM Compressor now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization. This provides finer control over per-layer quantization, allowing more precise handling of layers that are especially sensitive to certain quantization types.

Models with more than one compressor applied have their format set to mixed-precision in the config.json file. Additionally, each config_group now includes a format key that specifies the format used for the layers targeted by that group.
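A heavily hedged sketch of a non-uniform configuration is shown below; for simplicity it mixes FP8 and INT4 (W4A16-style) groups rather than NVFP4, and the target regexes and argument values are illustrative assumptions.

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

from llmcompressor.modifiers.quantization import QuantizationModifier

# Hypothetical per-layer split: FP8 for attention projections, INT4 groupwise for MLP weights.
config_groups = {
    "group_fp8": QuantizationScheme(
        targets=["re:.*self_attn.*"],
        weights=QuantizationArgs(num_bits=8, type="float", strategy="channel", symmetric=True),
        input_activations=QuantizationArgs(num_bits=8, type="float", strategy="token", symmetric=True, dynamic=True),
    ),
    "group_int4": QuantizationScheme(
        targets=["re:.*mlp.*"],
        weights=QuantizationArgs(num_bits=4, type="int", strategy="group", group_size=128, symmetric=True),
    ),
}

recipe = QuantizationModifier(config_groups=config_groups, ignore=["lm_head"])
```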

Support for DeepSeekV3-style block FP8 quantization

You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference. The changes encompass the fundamental implementation of block-wise quantization, robust handling of quantization parameters, updated documentation, and a practical example to guide users in applying this new compression scheme.
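A minimal sketch is shown below, assuming a block-wise preset scheme named FP8_BLOCK; check the shipped example for the exact scheme name and targets.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# DeepSeekV3-style block FP8: block-wise weight scales with dynamic activation quantization.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",  # assumed preset name
    ignore=["lm_head"],
)
```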

Mixture of Experts support

LLM Compressor now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization. Forward passes of MoE models can be controlled during calibration by adding custom modules to either the replace_modules_for_calibration function, which permanently replaces the MoE module, or the moe_calibration_context function, which temporarily updates modules during calibration.
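A rough sketch of the permanent-replacement path is below; the import path, checkpoint, scheme, and dataset are assumptions, with only the replace_modules_for_calibration name taken from these notes.

```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration  # assumed import path
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "path/to/moe-model"  # placeholder MoE checkpoint
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Permanently swap MoE blocks for calibration-friendly definitions before calibration.
model = replace_modules_for_calibration(model)

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=256,
)
```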

Llama4 quantization

Llama4 quantization is now supported in LLM Compressor. To make the model quantizable and runnable in vLLM, Llama4TextMoe modules are permanently replaced using the replace_modules_for_calibration method, which linearizes them. This allows the model to be quantized to schemes including WN16 with GPTQ and NVFP4.

Simplified and updated Recipe classes

Recipe classes have been updated with the following features:

  • Merged multiple recipe-related classes into a single, unified Recipe class
  • Simplified modifier creation, lifecycle management, and parsing logic
  • Improved serialization and deserialization for clarity and maintainability
  • Reduced redundant stages and arguments handling for easier debugging and usage

Configurable Observer arguments

Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
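A hedged sketch is shown below; the observer name ("mse") and its maxshrink argument are illustrative assumptions about available observers.

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

from llmcompressor.modifiers.quantization import QuantizationModifier

# Pass observer settings through observer_kwargs on the quantization arguments.
scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type="int",
        strategy="channel",
        symmetric=True,
        observer="mse",                      # illustrative observer choice
        observer_kwargs={"maxshrink": 0.2},  # illustrative kwarg
    ),
)

recipe = QuantizationModifier(config_groups={"group_0": scheme}, ignore=["lm_head"])
```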

v0.6.0.1

28 Jul 19:05
0461bf9


What's Changed

Full Changelog: 0.6.0...0.6.0.1

v0.6.0

24 Jun 15:22
c052d2c


What's Changed

New Contributors

Full Changelog: 0.5.2...0.6.0

v0.5.2

24 Jun 01:47
c1c8541


What's Changed


v0.5.1

29 Apr 01:34
ef175d7


What's Changed

New Contributors

Full Changelog: 0.5.0...0.5.1

v0.5.0

03 Apr 13:23
25b1138


What's Changed

New Contributors

Full Changelog: 0.4.1...0.5.0

v0.4.1

20 Feb 13:21
6a1ba3c


What's Changed

New Contributors

Full Changelog: 0.4.0...0.4.1