Releases: vllm-project/llm-compressor
v0.8.1
v0.8.0

LLM Compressor v0.8.0 release notes
This LLM Compressor v0.8.0 release introduces the following new features and enhancements:
- Support for multiple modifiers in oneshot compression runs
- Quantization and calibration support for Qwen3 models including FP8 quantization support for Qwen3 VL MoE models
- Transforms support for non-full-size rotation sizes
- Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default
Support for multiple modifiers in oneshot compression runs ✨
LLM Compressor now supports using multiple modifiers in oneshot compression runs.
You can apply multiple modifiers across model layers. This includes applying different modifiers, such as AWQ and GPTQ, to specific submodules for W4A16 quantization, all within a single oneshot call and a single pass through the calibration data.
Using multiple modifiers improves non-uniform model quantization, addressing issues such as varying layer sensitivity.
For more information, see Non-uniform quantization.
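As a rough sketch of what this looks like (the model stub, regex targets, and ignore lists below are illustrative assumptions, not the exact recipe from the examples):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

# Hypothetical split: AWQ handles attention Linears, GPTQ handles MLP Linears.
recipe = [
    AWQModifier(targets=["re:.*self_attn.*"], scheme="W4A16", ignore=["lm_head"]),
    GPTQModifier(targets=["re:.*mlp.*"], scheme="W4A16", ignore=["lm_head"]),
]

# A single oneshot call applies both modifiers with one pass of calibration data.
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model stub
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```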
Quantization and calibration support for Qwen3 models
Quantization and calibration support for Qwen3 models has been added to LLM Compressor.
An updated `Qwen3NextSparseMoeBlock` modeling definition has been added to temporarily update the MoE block during calibration, ensuring that all experts see data and are calibrated appropriately. This allows all experts to have calibrated scales while ensuring only the gated activation values are used.
FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model. For more information, see:
- examples/quantization_w8a8_fp8/qwen3_next_example.py
- examples/quantization_w4a4_fp4/qwen3_next_example.py
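A condensed sketch of the NVFP4 pathway, assuming the structure of the linked examples (the dataset and calibration settings here are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 weights and activations for all Linear layers except the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Calibration data flows through the updated MoE block so all experts get scales.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen3-Next-80B-A3B-Instruct-NVFP4")
tokenizer.save_pretrained("Qwen3-Next-80B-A3B-Instruct-NVFP4")
```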
FP8 quantization support for Qwen3 VL MoE models
LLM Compressor now supports quantization for Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways that require data, such as W4A16 and NVFP4, are planned for a future release.
Examples have been added for FP8 quantization of the Qwen/Qwen3-VL-235B-A22B-Instruct model.
An updated definition has been added for `Qwen3VLMoeTextSparseMoeBlock`, which replaces all the MoE blocks with a linearized model definition that uses a list of layers as opposed to a 3D parameter. This model definition enables quantization and is runnable in vLLM.
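A minimal data-free FP8 sketch; the ignore patterns (for example, excluding the vision tower) are assumptions for illustration:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Data-free FP8: channel-wise weight scales and dynamic per-token activation
# scales, so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*"],  # assumed: keep the vision tower unquantized
)

oneshot(model="Qwen/Qwen3-VL-235B-A22B-Instruct", recipe=recipe)
```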
Transforms support for non-full-size rotation sizes
You can now set a `transform_block_size` field in the transform-based modifier classes `SpinQuantModifier` and `QuIPModifier`. With this field you can configure transforms of variable size, and you no longer need to restrict Hadamard sizes to match the size of the weight.
It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying the QuIPModifier.
To efficiently run QuIP-style rotations using the hadacore kernels in vLLM, see examples/transform/README.md.
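For example, a hedged sketch pairing the new field with a group-size-128 W4A16 scheme (the transform type shown is one of the supported options; pass the recipe to oneshot as usual):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

# Match the Hadamard block size to the W4A16 group size (128), per the guidance above.
recipe = [
    QuIPModifier(transform_type="random-hadamard", transform_block_size=128),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```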
Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default
The `GPTQModifier` class now uses "weight" activation ordering by default. Weight (also called "static") activation ordering has been shown to significantly improve accuracy recovery at no additional runtime cost.
For more information and benchmarks, see vllm/pull/8135.
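A short sketch making the new default explicit; the scheme and ignore list are illustrative:

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# "weight" activation ordering is now the default; shown explicitly for clarity.
# Pass actorder="group" instead to opt into group-wise activation ordering.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    actorder="weight",
    ignore=["lm_head"],
)
```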
Updates and deprecations
Support for R4 spinquant-style transforms
Support for R4 SpinQuant-style transforms has been added, which allows quantization of the `down_proj` layer with increased accuracy recovery. You can use this transform by specifying `SpinQuantModifier(rotations=["R4"])` in the oneshot recipe.
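A minimal sketch pairing the R4 rotation with a quantization modifier in the same recipe (the scheme choice here is an illustrative assumption):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

# R4 rotates down_proj into a space with a smaller dynamic range so it
# quantizes with less error; pair it with a quantization modifier.
recipe = [
    SpinQuantModifier(rotations=["R4"]),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```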
Re-enabled support for W8A8 INT8 decompression
W8A8 INT8 decompression and model generation have been re-enabled in LLM Compressor.
The following changes have been made:
- The `ModelCompressor` class has been updated to support compressing models initialized on the meta device.
- The `SparseCompressor` and `QuantizationCompressor` classes have been modified to be compatible with meta devices.
- The `compress_weight()` function has been modified across sparse compressors to accept module input, enabling correct behavior for meta-initialized shells.
- Decompression and offload device detection has been updated to handle meta modules and empty modules gracefully.
Updated ignore lists in example recipes to capture all vision components
Ignore lists in example recipes were updated to correctly capture all vision components. Previously, some vision components, such as `model.vision_tower`, were not being caught, causing downstream issues when serving models with vLLM.
Deprecated and removed unittest.TestCase
Tests based on `unittest.TestCase` have been deprecated and removed, replaced with standardized `pytest` test definitions.
v0.7.1
What's Changed
- [Examples] Create qwen_2_5_vl_example.py by @Zhao-Dongyu in #1752
- [fix] Fix visual layer ignore pattern for Qwen2.5-VL models by @Zhao-Dongyu in #1766
- [Transform] Fix QuIP targets by @kylesayrs in #1770
New Contributors
- @Zhao-Dongyu made their first contribution in #1752
Full Changelog: 0.7.0...0.7.1
v0.7.0

LLM Compressor v0.7.0 release notes
This LLM Compressor v0.7.0 release introduces the following new features and enhancements:
- Transforms support, including QuIP and SpinQuant algorithms
- Apply multiple compressors to a single model for mixed-precision quantization
- Support for DeepSeekV3-style block FP8 quantization
- Expanded Mixture of Experts (MoE) calibration support, including support with NVFP4 quantization
- Llama4 quantization support with vLLM compatibility
- Configurable observer arguments
- Simplified and unified Recipe classes for easier usage and debugging
Introducing Transforms ✨
LLM Compressor now supports transforms. With transforms, you can inject additional matrix operations into a model to increase accuracy recovery under quantization. Transforms rotate weights or activations into spaces with smaller dynamic ranges, reducing quantization error.
Two algorithms are supported in this release:
- QuIP transforms inject transforms before and after weights to assist with weight-only quantization
- SpinQuant transforms inject transforms whose inverses span across multiple weights, assisting in both weight and activation quantization. In this release, fused R1 and R2 (i.e. offline) transforms are available. The full lifecycle has been validated to confirm that the models produced by LLM Compressor match the performance outlined in the original SpinQuant paper. Learned rotations and online R3 and R4 rotations will be added in a future release.
The functionality for both algorithms is available through the new `QuIPModifier` and `SpinQuantModifier` classes.
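A minimal sketch of wiring a transform modifier into a oneshot run (the model stub, dataset, and scheme are illustrative):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

# Fused (offline) R1/R2 rotations, followed by weight quantization.
recipe = [
    SpinQuantModifier(rotations=["R1", "R2"]),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model stub
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```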
Applying multiple compressors to a single model
LLM Compressor now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization. This provides finer control over per-layer quantization, allowing more precise handling of layers that are especially sensitive to certain quantization types.
Models with more than one compressor applied have their format set to `mixed-precision` in the `config.json` file. Additionally, each `config_group` now includes a `format` key that specifies the format used for the layers targeted by that group.
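A hedged sketch of such a non-uniform recipe; the `preset_name_to_scheme` helper from compressed-tensors and the layer split are assumptions for illustration:

```python
from compressed_tensors.quantization import preset_name_to_scheme
from llmcompressor.modifiers.quantization import QuantizationModifier

# Hypothetical split: NVFP4 for attention Linears, FP8 for MLP Linears.
# Saving a model quantized this way sets format="mixed-precision" in
# config.json, with a per-config_group format key.
recipe = QuantizationModifier(
    config_groups={
        "group_0": preset_name_to_scheme("NVFP4", ["re:.*self_attn.*"]),
        "group_1": preset_name_to_scheme("FP8_DYNAMIC", ["re:.*mlp.*"]),
    },
    ignore=["lm_head"],
)
```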
Support for DeepSeekV3-style block FP8 quantization
You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference. This release includes the core block-wise quantization implementation, robust handling of quantization parameters, updated documentation, and a practical example of applying the new scheme.
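A minimal sketch, assuming the block FP8 preset is exposed under the `FP8_BLOCK` scheme name (data-free, so no calibration set is passed):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# DeepSeekV3-style block FP8: weights quantized in square blocks (e.g. 128x128)
# with dynamically quantized activations; no calibration data required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])

oneshot(model="path/to/model", recipe=recipe)
```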
Mixture of Experts support
LLM Compressor now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization. Forward passes of MoE models can be controlled during calibration by adding custom modules to the `replace_modules_for_calibration` function, which permanently changes the MoE module, or to the `moe_calibration_context` function, which temporarily updates modules during calibration.
Llama4 quantization
Llama4 quantization is now supported in LLM Compressor. To be quantized and runnable in vLLM, `Llama4TextMoe` modules are permanently replaced using the `replace_modules_for_calibration` method, which linearizes the modules. This allows the model to be quantized to schemes including WNA16 with GPTQ and NVFP4.
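A hedged sketch of that flow; the model stub and `Llama4ForConditionalGeneration` class are assumptions based on the public Llama4 release:

```python
from transformers import Llama4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # placeholder model stub
model = Llama4ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")

# Permanently linearize Llama4TextMoe modules so they can be quantized and run in vLLM.
model = replace_modules_for_calibration(model)

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(model=model, dataset="open_platypus", recipe=recipe, num_calibration_samples=256)
```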
Simplified and updated Recipe classes
Recipe classes have been updated with the following features:
- Merged multiple recipe-related classes into a single, unified `Recipe` class
- Simplified modifier creation, lifecycle management, and parsing logic
- Improved serialization and deserialization for clarity and maintainability
- Reduced redundant stages and arguments handling for easier debugging and usage
Configurable Observer arguments
Observer arguments can now be configured as a dict through the `observer_kwargs` quantization argument, which can be set through oneshot recipes.
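A minimal sketch, assuming an MSE observer whose constructor accepts the kwargs shown (names and values are illustrative):

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the observer via observer_kwargs; the kwargs must match the
# chosen observer's parameters (the values below are hypothetical).
scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type="int",
        symmetric=True,
        strategy="channel",
        observer="mse",
        observer_kwargs={"maxshrink": 0.2, "grid": 100},
    ),
)
recipe = QuantizationModifier(config_groups={"group_0": scheme}, ignore=["lm_head"])
```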
v0.6.0.1
v0.6.0
What's Changed
- [Experimental] Mistral-format FP8 quantization by @mgoin in #1359
- [Examples] [Bugfix] skip sparsity stats when saving checkpoints by @kylesayrs in #1528
- [Examples] [Bugfix] Fix debug message by @kylesayrs in #1529
- [Tests][NVFP4] No longer skip NVFP4A16 e2e test by @dsikka in #1538
- [AWQ] Support for Calibration Datasets of varying feature dimension by @brian-dellabetta in #1536
- fix qwen 2.5 VL multimodal example by @brian-dellabetta in #1541
- [Example] [Bugfix] Fix Gemma ignore list by @kylesayrs in #1531
- [Tests][NVFP4] Add e2e nvfp4 test by @dsikka in #1543
- [Examples] Use more robust splits by @kylesayrs in #1544
- [Bugfix] [Autowrapper] Fix visit_Delete by @kylesayrs in #1532
- [Example] Fix Qwen VL ignore list by @arunmadhusud in #1545
- [Tests] Fix `Qwen2.5-VL-7B-Instruct` Recipe by @dsikka in #1548
- [Bugfix] Fix gemma2 generation by @kylesayrs in #1552
- fix skipif check on tests involving gated HF models by @brian-dellabetta in #1553
- [NVFP4] Fix global scale update when dealing with offloaded layers by @dsikka in #1554
- oneshot entrypoint update by @ved1beta in #1445
- LM Eval tests -- ignore vision tower for VL fp8 test by @brian-dellabetta in #1562
- [Performance] Sequential onloading by @kylesayrs in #1263
- [BugFix] Explicitly set gpu_memory_utilization by @rahul-tuli in #1560
- Add Axolotl blog link by @rahul-tuli in #1563
- [Bugfix] Fix multigpu `dispatch_for_generation` by @kylesayrs in #1567
- [Testing] Set `VLLM_WORKER_MULTIPROC_METHOD` for e2e testing by @dsikka in #1569
- [BugFix] Fix quantizaiton_2of4_sparse_w4a16 example by @shanjiaz in #1565
- [Pipelines] infer model device with optional override by @kylesayrs in #1572
- bump up requirement for compressed-tensors to 0.10.2 by @dhuangnm in #1581
New Contributors
- @arunmadhusud made their first contribution in #1545
Full Changelog: 0.5.2...0.6.0
v0.5.2
What's Changed
- Exclude images from package by @kylesayrs in #1397
- [Tracing] Skip non-ancestors of sequential targets by @kylesayrs in #1389
- Consolidate build config by @dbarbuzzi in #1398
- [Tests] Disable silently failing kv cache test by @kylesayrs in #1371
- Drop `flash_attn` skip for quantizing_moe example tests by @dbarbuzzi in #1396
- [VLM] Fix mllama targets by @kylesayrs in #1402
- [Tests] Use requires_gpu, fix missing gpu test skip, add explicit test for gpu from gha by @kylesayrs in #1264
- Implement `QuantizationMixin` by @kylesayrs in #1351
- Add new-features section by @rahul-tuli in #1408
- [Tracing] Support tracing of Gemma3 [#1248] by @kelkelcheng in #1373
- bugfix kv cache quantization with ignored layers by @brian-dellabetta in #1312
- AWQ sanitize_kwargs minor cleanup by @brian-dellabetta in #1405
- [Tracing][Testing] Add tracing tests by @kylesayrs in #1335
- fix lm eval test reproducibility issues by @brian-dellabetta in #1260
- Pipeline Extraction by @kylesayrs in #1279
- Add `pull_request` trigger to base tests workflow by @dbarbuzzi in #1417
- removing RecipeMetadata and references by @shanjiaz in #1414
- Update examples to only load required number of samples from dataset by @kylesayrs in #1118
- [Tracing] Reinstate ignore functionality by @kylesayrs in #1423
- [Typo] overriden by @kylesayrs in #1420
- Rename SparsityModifierMixin to SparsityModifierBase by @kylesayrs in #1416
- Remove RecipeArgs class & its references by @shanjiaz in #1429
- [Examples] Standardize AWQ example by @kylesayrs in #1412
- [Logging] Support logging once by @kylesayrs in #1431
- Add: deepseekv2 smoothquant mappings by @rahul-tuli in #1433
- AWQ QuantizationMixin + SequentialPipeline by @brian-dellabetta in #1426
- patch awq tests/readme after QuantizationMixin refactor by @brian-dellabetta in #1439
- Added more tests for Quantization24SparseW4A16 by @shanjiaz in #1434
- [GPTQ] Add `actorder` option to modifier by @kylesayrs in #1424
- [Bugfix][Tracing] Fix qwen2_5_vl by @kylesayrs in #1448
- [Tests] Use proper offloading utils in `test_compress_tensor_utils` by @kylesayrs in #1449
- [Tracing] Fix Traceable Imports by @kylesayrs in #1452
- [NVFP4] Enable FP4 Weight-Only Quantization by @dsikka in #1309
- Pin transformers to <4.52.0 by @brian-dellabetta in #1459
- AWQ Apply Scales Bugfix when smooth layer output length doesn't match balance layer input length by @brian-dellabetta in #1451
- Fix #1344 Extend e2e tests to add asym support for W8A8-Int8 by @ved1beta in #1345
- [Tests] Fix activation recipe for w8a8 asym by @dsikka in #1461
- AWQ Qwen and Phi mappings by @brian-dellabetta in #1440
- [Observer] Optimize mse observer by @shanjiaz in #1450
- Fix: Improve `SmoothQuant` Support for Mixture of Experts (MoE) Models by @rahul-tuli in #1455
- [Tests] Add nvfp4a16 e2e test case by @dsikka in #1463
- [Docs] Update README to list fp4 by @dsikka in #1462
- Remove duplicate model id var from awq example recipe by @AndrewMead10 in #1467
- Added observer type for test_min_max by @shanjiaz in #1466
- Disable kernels during calibration (and tracing) by @kylesayrs in #1454
- [GPTQ] Fix actorder resolution, add sentinel by @kylesayrs in #1453
- Set `show_progress` to True by @dsikka in #1471
- Remove `compress` by @dsikka in #1470
- raise error if block quantization is used, as it is not yet supported by @brian-dellabetta in #1476
- [Tests] Increase max seq length for tracing tests by @kylesayrs in #1478
- [Tests] Fix dynamic field to be a bool, not string by @dsikka in #1480
- [Examples] Fix qwen vision examples by @kylesayrs in #1481
- [NVFP4] Update to use `tensor_group` strategy; update observers by @dsikka in #1484
- loosen lmeval assertions to upper or lower bound by @brian-dellabetta in #1477
- Revert "expand observers to calculate gparams, add example for activa… by @dsikka in #1486
- fix rest of the minmax tests by @shanjiaz in #1469
- Add warning for non-divisible group quantization by @kylesayrs in #1401
- [AWQ] Support accumulation for reduced memory usage by @kylesayrs in #1435
- [Tracing] Code AutoWrapper by @kylesayrs in #1411
- Removed RecipeTuple & RecipeContainer class by @shanjiaz in #1460
- Unpin to support `transformers==4.52.3` by @kylesayrs in #1479
- [Tests] GPTQ Actorder Resolution Tests by @kylesayrs in #1468
- [Testing] Skip FP4 Test by @dsikka in #1499
- [Bugfix] Remove tracing imports from tests by @kylesayrs in #1498
- [Testing] Use a slightly larger model that works with group_size 128 by @dsikka in #1502
- skip tracing tests if token unavailable by @brian-dellabetta in #1493
- Fix missing logs when calling oneshot by @kelkelcheng in #1446
- [NVFP4] Expand observers to calculate gparam, support NVFP4 Activations by @dsikka in #1487
- [Tests] Remove duplicate test by @kylesayrs in #1500
- [Model] Mistral3 example and test by @kylesayrs in #1490
- [NVFP4] Use observers to generate global weight scales by @dsikka in #1504
- Revert "[NVFP4] Use observers to generate global weight scales " by @dsikka in #1507
- [NVFP4] Update global scale generation by @dsikka in #1508
- [NVFP4] Fix onloading of fused layers by @dsikka in #1512
- Pin pandas to <2.3 by @dbarbuzzi in #1515
- AWQModifier fast resolve mappings, better logging, MoE support by @brian-dellabetta in #1444
- Update setup.py by @dsikka in #1516
- Use model compression pathways by @kylesayrs in #1419
- [Example] [Bugfix] Fix Gemma3 Generation by @kylesayrs in #1517
- [Docs] Update ReadME details for FP4 by @dsikka in #1519
- [Examples] [Bugfix] Perform sample generation before saving as compressed by @kylesayrs in #1530
- Add citation information both in README as well as native GitHub file support by @markurtz in #1527
- update compress...
v0.5.1
What's Changed
- Update nm-actions/changed-files to v1.16.0 by @dbarbuzzi in #1311
- docs: fix missing git clone command and repo name typos in DEVELOPING.md by @gattshjott in #1325
- Update e2e/lm-eval test infrastructure by @dbarbuzzi in #1323
- fix(logger): normalize log_file_level input for consistency by @gattshjott in #1324
- [Utils] Replace `preserve_attr` with `patch_attr` by @kylesayrs in #1187
- Fix cut off log in entrypoints/utils.py `post_process()` by @mgoin in #1336
- [Tests] Update condition for sparsity check to be more robust by @dsikka in #1337
- [Utils] Add `skip_weights_download` for developers and testing by @kylesayrs in #1334
- replace custom version handling with setuptools-scm by @dhellmann in #1322
- [Compression] Update sparsity calculation lifecycle when fetching the compressor by @dsikka in #1332
- [Sequential] Support models with nested `_no_split_modules` by @kylesayrs in #1329
- [Tracing] Remove `TraceableWhisperForConditionalGeneration` by @kylesayrs in #1310
- Add torch device to list of offloadable types by @kylesayrs in #1348
- Reduce SmoothQuant Repr by @kylesayrs in #1289
- Use `align_module_device` util by @kylesayrs in #1298
- Fix project URL in setup.py by @tiran in #1353
- Update trigger on PR comment workflow by @dbarbuzzi in #1357
- Add timing functionality to lm-eval tests by @ved1beta in #1346
- [Callbacks][Docs] Add docstrings to saving functions by @kylesayrs in #1201
- Move: recipe parsing test from `e2e/` to main test suite by @rahul-tuli in #1360
- Smoothquant typehinting by @kylesayrs in #1285
- AWQ Modifier by @brian-dellabetta in #1177
- [Tests] Update transformers tests to run kv_cache tests by @dsikka in #1364
- [Transformers] Support latest transformers by @dsikka in #1352
- Update test_consecutive_runs.py by @dsikka in #1366
- [Docs] Mention AWQ, some clean-up by @dsikka in #1367
- Fix versioning for source installs by @dbarbuzzi in #1370
- [Testing] Reduce error verbosity of cleanup by @kylesayrs in #1365
- Update test_oneshot_and_finetune.py to use pytest.approx by @markurtz in #1339
- [Tracing] Better runtime error messages by @kylesayrs in #1307
- [Tests] Fix test case; update structure by @dsikka in #1375
- fix: Make Recipe.model_dump() output compatible with model_validate() by @ved1beta in #1328
- Add: documentation for enhanced `save_pretrained` parameters by @rahul-tuli in #1377
- Revert "fix: Make Recipe.model_dump() output compatible .... by @rahul-tuli in #1378
- AWQ resolved mappings -- ensure shapes align by @brian-dellabetta in #1372
- Update w4a16_actorder_weight.yaml lmeval config by @dbarbuzzi in #1380
- [WIP] Add AWQ Asym e2e test case by @dsikka in #1374
- Bump version; set ct version by @dsikka in #1381
- bugfix AWQ with Llama models and python 3.9 by @brian-dellabetta in #1384
- awq -- hotfix to missing kwargs by @brian-dellabetta in #1395
New Contributors
- @gattshjott made their first contribution in #1325
- @dhellmann made their first contribution in #1322
- @tiran made their first contribution in #1353
- @ved1beta made their first contribution in #1346
Full Changelog: 0.5.0...0.5.1
v0.5.0
What's Changed
- re-add vllm e2e test now that bug is fixed by @brian-dellabetta in #1162
- Fix Readme Imports by @kylesayrs in #1165
- Remove event_called by @kylesayrs in #1155
- Update: Test name by @rahul-tuli in #1172
- Remove lifecycle initialized_structure attribute by @kylesayrs in #1156
- [VLM] Qwen 2.5 VL by @kylesayrs in #1113
- Revert bump by @dsikka in #1178
- Remove CLI by @dsikka in #1144
- Add group act order case to lm_eval test by @dsikka in #1080
- Update e2e test timings outputs by @dsikka in #1179
- [Oneshot Refactor] Main refactor by @horheynm in #1110
- [StageRunner Removal] Remove Evaluate / validate pathway by @horheynm in #1145
- [StageRemoval] Remove Predict pathway by @horheynm in #1146
- Fix 2of4 Apply Example by @dsikka in #1181
- Fix Sparse2of4 Example by @dsikka in #1182
- Add qwen moe w4a16 example by @mgoin in #1186
- [Callbacks] Consolidate Saving Methods by @kylesayrs in #1168
- lmeval tests multimodal by @brian-dellabetta in #1150
- [Dataset Performance] Add num workers on dataset processing - labels, tokenization by @horheynm in #1189
- Fix a minor typo by @eldarkurtic in #1191
- [Callbacks] Remove pre_initialize_structure by @kylesayrs in #1160
- Make `transformers-tests` job conditional on files changed by @dbarbuzzi in #1197
- Update finetune tests to decrease execution time by @dsikka in #1208
- Update transformers tests to speed-up execution by @dsikka in #1211
- Fix logging bug in oneshot.py by @aman2304 in #1213
- [Training] Decouple Argument parser by @horheynm in #1207
- Remove MonkeyPatch for GPUs by @dsikka in #1227
- [Cosmetic] Rename data_args to dataset_args by @horheynm in #1206
- [Training] Datasets - update Module by @horheynm in #1209
- [BugFix] Fix logging disabling bug and add tests by @aman2304 in #1218
- [Training] Unifying Preprocess + Postprocessing logic for Train/Oneshot by @horheynm in #1212
- [Docs] Add info on when to use which PTQ/Sparsification by @horheynm in #1157
- [Callbacks] Remove `MagnitudePruningModifier.leave_enabled` by @kylesayrs in #1198
- Replace Xenova model stub with nm-testing model stub by @kylesayrs in #1239
- Offload Cache Support torch.dtype by @kylesayrs in #1141
- Remove unused/duplicated/non-applicable utils from pytorch/utils/helpers by @kylesayrs in #1174
- [Bugfix] Staged 2of4 example by @kylesayrs in #1238
- wandb/tensorboard loggers set default init to False by @brian-dellabetta in #1235
- fixing reproducibility of lmeval tests by @brian-dellabetta in #1220
- [Audio] People's Speech dataset and tracer tool by @kylesayrs in #1086
- Use KV cache constant names provided by compressed tensors by @kylesayrs in #1200
- [Bugfix] Raise error for processor remote code by @kylesayrs in #1184
- Remove missing weights silencers in favor of HFQuantizer solution by @kylesayrs in #1017
- Fix run_compressed tests by @dsikka in #1246
- [Train] Training Pipeline by @horheynm in #1214
- [Tests] Increase maximum quantization error by @kylesayrs in #1245
- [Callbacks] Remove EventLifecycle and on_start event by @kylesayrs in #1170
- [Bugfix] Disable generation of deepseek models with transformers>=4.48 by @kylesayrs in #1259
- Remove clear_ml by @dsikka in #1261
- [Tests] Remove clear_ml test from GHA by @kylesayrs in #1265
- Remove click by @dsikka in #1262
- [Bugfix] Remove constant pruning from 2of4 examples by @kylesayrs in #1267
- Addback: ConstantPruningModifier for finetuning cases by @rahul-tuli in #1272
- Remove docker by @kylesayrs in #1255
- move failing mulitmodal lmeval tests to skipped folder by @brian-dellabetta in #1273
- Replace tj-action/changed-files by @dbarbuzzi in #1270
- [BugFix]: Sparse2of4 example sparsity-only case by @rahul-tuli in #1282
- Revert "update" by @dsikka in #1296
- Fix Multi-Context Manager Syntax for Python 3.9 Compatibility by @rahul-tuli in #1287
- Revert "Fix Multi-Context Manager Syntax for Python 3.9 Compatibility… by @dsikka in #1300
- [StageRunner] Stage Runner entrypoint and pipeline by @horheynm in #1202
- Bump: Min python version to 3.9 by @rahul-tuli in #1288
- Keep quantization enabled during calibration by @kylesayrs in #1299
- [BugFix] TRL distillation bug fix by @horheynm in #1278
- Update: Readme for fp8 support by @rahul-tuli in #1304
- [GPTQ] Add inversion fallback by @kylesayrs in #1283
- fix typo by @eldarkurtic in #1290
- [Tests] Fix oneshot + finetune test by passing splits to oneshot by @kylesayrs in #1316
- [Tests] Remove the `compress` entrypoint by @dsikka in #1317
- Fix Multi-Context Manager Syntax for Python 3.9 Compatibility by @rahul-tuli in #1313
- [BugFix] Directly Convert Modifiers to Recipe Instance by @rahul-tuli in #1271
- bump version, tag ct by @dsikka in #1318
Full Changelog: 0.4.1...0.5.0
v0.4.1
What's Changed
- Remove version by @dsikka in #1077
- Require 'ready' label for transformers tests by @dbarbuzzi in #1079
- GPTQModifier Nits and Code Clarity by @kylesayrs in #1068
- Also run on pushes to `main` by @dbarbuzzi in #1083
- VLM: Phi3 Vision Example by @kylesayrs in #1032
- VLM: Qwen2_VL Example by @kylesayrs in #1027
- Composability with sparse and quantization compressors by @rahul-tuli in #948
- Remove `TraceableMistralForCausalLM` by @kylesayrs in #1052
- [Fix Test Failure]: Propagate name change to test by @rahul-tuli in #1088
- [Audio] Support Audio Datasets by @kylesayrs in #1085
- [Test Fix] Add Quantization then finetune tests by @horheynm in #964
- [Smoothquant] Phi3 Vision Mappings by @kylesayrs in #1089
- [VLM] Multimodal Data Collator by @kylesayrs in #1087
- VLM: Model Tracing Guide by @kylesayrs in #1030
- Turn off 2:4 sparse compression until supported in vllm by @rahul-tuli in #1092
- [Test Fix] Fix Consecutive oneshot by @horheynm in #971
- [Bug Fix] Fix test that require GPU by @horheynm in #1096
- Add Idefics3/SmolVLM quant support via traceable class by @leon-seidel in #1095
- Traceability Guide: Clarity and typo by @kylesayrs in #1099
- [VLM] Examples README by @kylesayrs in #1057
- Raise warning for 24 compressed sparse-only models by @rahul-tuli in #1107
- Remove log_model_load by @kylesayrs in #1016
- Return empty sparsity config if targets and ignores are empty by @rahul-tuli in #1115
- Remove uses of get_observer by @kylesayrs in #939
- FSDP utils cleanup by @kylesayrs in #854
- Update maintainers, add notice by @kylesayrs in #1091
- Replace readme paths with urls by @kylesayrs in #1097
- GPTQ add Arkiv link, move file location by @kylesayrs in #1100
- Extend `remove_hooks` to remove subsets by @kylesayrs in #1021
- [Audio] Whisper Example and Readme by @kylesayrs in #1106
- [Audio] Add whisper fp8 dynamic example by @kylesayrs in #1111
- [VLM] Update pixtral data collator to reflect latest transformers changes by @kylesayrs in #1116
- Use unique test names in `TestvLLM` by @dbarbuzzi in #1124
- Remove smoothquant from examples by @kylesayrs in #1121
- Extend `disable_hooks` to keep subsets by @kylesayrs in #1023
- Unpin `pynvml` to fix e2e test failures with vLLM by @dsikka in #1125
- Replace LayerCompressor with HooksMixin by @kylesayrs in #1038
- [Oneshot Refactor] Rename get_shared_processor_src to get_processor_name_from_model by @horheynm in #1108
- Allow Shortcutting Min-max Observer by @kylesayrs in #887
- [Polish] Remove unused code by @horheynm in #1128
- Properly restore training mode with `eval_context` by @kylesayrs in #1126
- SQ and QM: Remove `torch.cuda.empty_cache`, use `calibration_forward_context` by @kylesayrs in #1114
- [Oneshot Refactor] dataclass Arguments by @horheynm in #1103
- [Bugfix] SparseGPT, Pipelines by @kylesayrs in #1130
- [Oneshot refactor] Refactor initialize_model_from_path by @horheynm in #1109
- [e2e] Update vllm tests with additional datasets by @brian-dellabetta in #1131
- Update: SparseGPT recipes by @rahul-tuli in #1142
- Add timer support for testing by @dsikka in #1137
- [Audio] Support Whisper V3 by @kylesayrs in #1147
- Fix: Re-enable Sparse Compression for 2of4 Examples by @rahul-tuli in #1153
- [VLM] Add caption to flickr dataset by @kylesayrs in #1138
- [VLM] Update mllama traceable definition by @kylesayrs in #1140
- Fix CPU Offloading by @dsikka in #1159
- [TRL_SFT_Trainer] Fix and Update Examples code by @horheynm in #1161
- [TRL_SFT_Trainer] Fix TRL-SFT Distillation Training by @horheynm in #1163
- Bump version for patch release by @dsikka in #1166
- Update DeepSeek Examples by @dsikka in #1175
- Update gemma2 examples with a note about sample generation by @dsikka in #1176
New Contributors
- @leon-seidel made their first contribution in #1095
Full Changelog: 0.4.0...0.4.1