
Releases: vllm-project/llm-compressor

v0.8.1

08 Oct 02:13
bcad892


What's Changed

  • Pick up compressed-tensors 0.12.2 for patch release by @dhuangnm in #1904

Full Changelog: 0.8.0...0.8.1

v0.8.0

03 Oct 12:24
33ef5f4



LLM Compressor v0.8.0 release notes

This LLM Compressor v0.8.0 release introduces the following new features and enhancements:

  • Support for multiple modifiers in oneshot compression runs
  • Quantization and calibration support for Qwen3 models including FP8 quantization support for Qwen3 VL MoE models
  • Transforms support for non-full-size rotation sizes
  • Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default

Support for multiple modifiers in oneshot compression runs ✨

LLM Compressor now supports using multiple modifiers in oneshot compression runs.

You can apply multiple modifiers across model layers. This includes applying different modifiers, such as AWQ and GPTQ, to specific submodules for W4A16 quantization, all within a single oneshot call and using only pass-through calibration data.

Using multiple modifiers improves non-uniform model quantization, addressing issues such as varying layer sensitivity.

For more information, see Non-uniform quantization.
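The following is a minimal sketch of a multi-modifier recipe, assuming a Llama-style model and hypothetical target regexes; it is not the shipped example, and the dataset and sample counts are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Apply different W4A16 modifiers to different submodules in one oneshot call.
# The target regexes are hypothetical; adapt them to your model's module names.
recipe = [
    AWQModifier(targets=["re:.*self_attn.*"], scheme="W4A16", ignore=["lm_head"]),
    GPTQModifier(targets=["re:.*mlp.*"], scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("Llama-3.1-8B-W4A16-multi-modifier", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-W4A16-multi-modifier")
```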

Quantization and calibration support for Qwen3 models

Quantization and calibration support for Qwen3 models has been added to LLM Compressor.

An updated Qwen3NextSparseMoeBlock modeling definition has been added. It temporarily replaces the MoE block during calibration so that every expert sees calibration data and receives calibrated scales, while only the gated activation values are used in the forward pass.

FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model; see the examples directory in the repository for more information.
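As a rough illustration (not the shipped example), a data-free FP8 dynamic run for this model could look like the following; the ignore patterns for router/gate layers are assumptions and should be matched to the actual module names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 per-channel weights with dynamic per-token activations (no calibration data needed).
# The ignore list is illustrative; expert router/gate layers are typically left unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp[.]gate$"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen3-Next-80B-A3B-Instruct-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```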

FP8 quantization support for Qwen3 VL MoE models

LLM Compressor now supports quantization for Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways requiring data, such as W4A16 and NVFP4, are planned for a future release.

Examples have been added for FP8 quantization of the Qwen/Qwen3-VL-235B-A22B-Instruct model; see the examples directory in the repository for more information.

An updated definition has been added for Qwen3VLMoeTextSparseMoeBlock, which replaces each MoE block with a linearized definition that uses a list of expert layers instead of a single 3D parameter. This definition enables quantization, and the resulting model is runnable in vLLM.
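A minimal data-free sketch is shown below; the model class, ignore patterns, and save path are assumptions rather than the exact shipped example.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"
# The auto class is an assumption; use whichever class your transformers version maps this checkpoint to.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Data-free FP8 channel-wise quantization of the language-model Linear layers.
# Ignore patterns are illustrative: vision components and router gates stay unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp[.]gate$"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen3-VL-235B-A22B-Instruct-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```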

Transforms support for non-full-size rotation sizes

You can now set a transform_block_size field on the transform-based modifier classes SpinQuantModifier and QuIPModifier. This field lets you configure transforms of variable size, so Hadamard matrices no longer need to match the size of the weight.

It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying the QuIPModifier.

For more information, see the updated transform examples.

To efficiently run QuIP-style rotations using the hadacore kernels in vLLM, see examples/transform/README.md.
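A minimal sketch of pairing the block size with the quantization group size is shown below; the model choice and transform_type value are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Match the Hadamard block size (128) to the W4A16 group size (128 by default).
recipe = [
    QuIPModifier(targets="Linear", transform_type="random-hadamard", transform_block_size=128),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-8B-Instruct-quip-w4a16", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-Instruct-quip-w4a16")
```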

Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default

The GPTQModifier class now uses "weight" activation ordering by default. "Weight" (or "static") activation ordering has been shown to significantly improve accuracy recovery with no additional cost at runtime.

For more information and benchmarks, see vllm/pull/8135
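The default can also be set or overridden explicitly; the sketch below assumes actorder is exposed as a GPTQModifier field, and the targets and ignore values are illustrative.

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# W4A16 presets now default to actorder="weight"; setting it explicitly makes the choice visible.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    actorder="weight",
)
```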

Updates and deprecations

Support for R4 SpinQuant-style transforms

Support for R4 SpinQuant-style transforms has been added, allowing the down_proj layer to be quantized with increased accuracy recovery. You can use this transform by specifying SpinQuantModifier(rotations=["R4"]) in the oneshot recipe.
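A minimal sketch of combining the R4 rotation with quantization is shown below; the quantization scheme and ignore list are illustrative assumptions.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

# Apply the R4 rotation, then quantize; per these notes this improves
# accuracy recovery when quantizing the down_proj layers.
recipe = [
    SpinQuantModifier(rotations=["R4"], transform_type="hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```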

Re-enabled support for W8A8 INT8 decompression

W8A8 INT8 decompression and model generation have been re-enabled in LLM Compressor; a minimal generation check is sketched after the list of changes below.

The following changes have been made:

  • The ModelCompressor class has been updated to support compressing models initialized on the meta device.
  • The SparseCompressor and QuantizationCompressor classes have been modified to be compatible with meta devices.
  • The compress_weight() function has been modified across sparse compressors to accept module input, enabling correct behavior for meta-initialized modules.
  • Decompression and offload device detection has been updated to handle meta modules and empty modules gracefully.
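As a quick sanity check, a previously compressed W8A8 INT8 checkpoint can again be loaded and run with plain transformers; the checkpoint path below is a placeholder, and decompression on load is handled by the compressed-tensors integration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to a W8A8 INT8 compressed-tensors checkpoint.
MODEL_ID = "path/to/your-w8a8-int8-model"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```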

Updated ignore lists in example recipes to capture all vision components

Ignore lists in example recipes were updated to correctly capture all vision components. Previously, some vision components, such as model.vision_tower, were not captured, causing downstream issues when serving the models with vLLM.
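For reference, a hedged sketch of such an ignore list is shown below; the exact regexes vary by model architecture and are illustrative only.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Skip the language head and all vision components so vLLM can serve the model cleanly.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*multi_modal_projector.*",
    ],
)
```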

Deprecated and removed unittest.TestCase

Tests based on unittest.TestCase have been deprecated and removed, and replaced with standardized pytest test definitions.

v0.7.1

21 Aug 21:37
6304ecf


What's Changed

New Contributors

Full Changelog: 0.7.0...0.7.1

v0.7.0

20 Aug 21:40
679a704



LLM Compressor v0.7.0 release notes

This LLM Compressor v0.7.0 release introduces the following new features and enhancements:

  • Transforms support, including QuIP and SpinQuant algorithms
  • Apply multiple compressors to a single model for mixed-precision quantization
  • Support for DeepSeekV3-style block FP8 quantization
  • Expanded Mixture of Experts (MoE) calibration support, including support with NVFP4 quantization
  • Llama4 quantization support with vLLM compatibility
  • Configurable observer arguments
  • Simplified and unified Recipe classes for easier usage and debugging

Introducing Transforms ✨

LLM Compressor now supports transforms. With transforms, you can inject additional matrix operations into a model to increase accuracy recovery after quantization. Transforms rotate weights or activations into spaces with smaller dynamic ranges, reducing quantization error.

Two algorithms are supported in this release:

  • QuIP injects transforms before and after weights to assist with weight-only quantization.
  • SpinQuant transforms inject transforms whose inverses span across multiple weights, assisting in both weight and activation quantization. In this release, fused R1 and R2 (i.e. offline) transforms are available. The full lifecycle has been validated to confirm that the models produced by LLM Compressor match the performance outlined in the original SpinQuant paper. Learned rotations and online R3 and R4 rotations will be added in a future release.

The functionality for both algorithms is available through the new QuIPModifier and SpinQuantModifier classes.
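A minimal sketch of a fused R1/R2 SpinQuant run followed by data-free W4A16 quantization is shown below; the model choice and scheme are illustrative assumptions, not the validated configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Fuse the offline R1/R2 rotations into the weights, then quantize data-free.
recipe = [
    SpinQuantModifier(rotations=["R1", "R2"], transform_type="hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.2-1B-Instruct-spinquant-w4a16", save_compressed=True)
tokenizer.save_pretrained("Llama-3.2-1B-Instruct-spinquant-w4a16")
```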

Applying multiple compressors to a single model

LLM Compressor now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization. This provides finer control over per-layer quantization, allowing more precise handling of layers that are especially sensitive to certain quantization types.

Models with more than one compressor applied have their format set to mixed-precision in the config.json file. Additionally, each config_group now includes a format key that specifies the format used for the layers targeted by that group.
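A heavily hedged sketch of a non-uniform configuration is shown below; for simplicity it mixes FP8 and INT4 (W4A16-style) groups rather than NVFP4, and the target regexes and argument values are illustrative assumptions.

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

from llmcompressor.modifiers.quantization import QuantizationModifier

# Hypothetical per-layer split: FP8 for attention projections, INT4 groupwise for MLP weights.
config_groups = {
    "group_fp8": QuantizationScheme(
        targets=["re:.*self_attn.*"],
        weights=QuantizationArgs(num_bits=8, type="float", strategy="channel", symmetric=True),
        input_activations=QuantizationArgs(num_bits=8, type="float", strategy="token", symmetric=True, dynamic=True),
    ),
    "group_int4": QuantizationScheme(
        targets=["re:.*mlp.*"],
        weights=QuantizationArgs(num_bits=4, type="int", strategy="group", group_size=128, symmetric=True),
    ),
}

recipe = QuantizationModifier(config_groups=config_groups, ignore=["lm_head"])
```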

Support for DeepSeekV3-style block FP8 quantization

You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference. The changes encompass the fundamental implementation of block-wise quantization, robust handling of quantization parameters, updated documentation, and a practical example to guide users in applying this new compression scheme.
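A minimal sketch is shown below, assuming a block-wise preset scheme named FP8_BLOCK; check the shipped example for the exact scheme name and targets.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# DeepSeekV3-style block FP8: block-wise weight scales with dynamic activation quantization.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",  # assumed preset name
    ignore=["lm_head"],
)
```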

Mixture of Experts support

LLM Compressor now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization. Forward passes of MoE models can be controlled during calibration by adding custom modules to either the replace_modules_for_calibration function, which permanently replaces the MoE module, or the moe_calibration_context function, which temporarily updates modules during calibration.
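A rough sketch of the permanent-replacement path is below; the import path, checkpoint, scheme, and dataset are assumptions, with only the replace_modules_for_calibration name taken from these notes.

```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration  # assumed import path
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "path/to/moe-model"  # placeholder MoE checkpoint
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Permanently swap MoE blocks for calibration-friendly definitions before calibration.
model = replace_modules_for_calibration(model)

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=256,
)
```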

Llama4 quantization

Llama4 quantization is now supported in LLM Compressor. To make the model quantizable and runnable in vLLM, Llama4TextMoe modules are permanently replaced using the replace_modules_for_calibration method, which linearizes them. This allows the model to be quantized to schemes including WN16 with GPTQ and NVFP4.

Simplified and updated Recipe classes

Recipe classes have been updated with the following features:

  • Merged multiple recipe-related classes into a single, unified Recipe class
  • Simplified modifier creation, lifecycle management, and parsing logic
  • Improved serialization and deserialization for clarity and maintainability
  • Reduced redundant stages and arguments handling for easier debugging and usage

Configurable Observer arguments

Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
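A hedged sketch is shown below; the observer name ("mse") and its maxshrink argument are illustrative assumptions about available observers.

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

from llmcompressor.modifiers.quantization import QuantizationModifier

# Pass observer settings through observer_kwargs on the quantization arguments.
scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type="int",
        strategy="channel",
        symmetric=True,
        observer="mse",                      # illustrative observer choice
        observer_kwargs={"maxshrink": 0.2},  # illustrative kwarg
    ),
)

recipe = QuantizationModifier(config_groups={"group_0": scheme}, ignore=["lm_head"])
```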

v0.6.0.1

28 Jul 19:05
0461bf9


What's Changed

Full Changelog: 0.6.0...0.6.0.1

v0.6.0

24 Jun 15:22
c052d2c


What's Changed

New Contributors

Full Changelog: 0.5.2...0.6.0

v0.5.2

24 Jun 01:47
c1c8541


What's Changed


v0.5.1

29 Apr 01:34
ef175d7


What's Changed

New Contributors

Full Changelog: 0.5.0...0.5.1

v0.5.0

03 Apr 13:23
25b1138


What's Changed

New Contributors

Full Changelog: 0.4.1...0.5.0

v0.4.1

20 Feb 13:21
6a1ba3c


What's Changed

New Contributors

Full Changelog: 0.4.0...0.4.1