Releases: ModelCloud/GPTQModel
GPT-QModel v5.4.2
Notable Changes:
- Fix double fwd regression by @Qubitium in #2198
- Add cli: gptqmodel env by @ZX-ModelCloud in #2192
- [CI] compile wheel with python -m build by @CSY-ModelCloud in #2193
What's Changed
- Start v5.5.0 devel branch (odd version) by @Qubitium in #2191
- Update version from 5.5.0 to 5.4.2 patch release by @Qubitium in #2199
- [CI] copy wheel to local dir instead of using http server by @CSY-ModelCloud in #2200
Full Changelog: v5.4.0...v5.4.2
GPT-QModel v5.4.0
Notable Changes:
- AWQ Torch Fused Kernel by @Qubitium in #2190
- Make torch fused op compilable by @jiqing-feng in #2182
- [FIX] AWQ MoE by @ZX-ModelCloud in #2171
- add :? capture only syntax by @Qubitium in #2173
What's Changed
- Update latest news section in README.md by @Qubitium in #2166
- run forward pass even for empty subset to produce correct layer outputs by @avtc in #2161
- Reduce AWQ memory usage by @Qubitium in #2167
- Awq update by @Qubitium in #2168
- Retry partial.to to fix accelerate invalid argument for first moe layer (reapply) by @avtc in #2169
- Awq update by @Qubitium in #2172
- adjust retry partial.to by @avtc in #2175
- cleanup awq_get_modules_for_scaling() by @ZX-ModelCloud in #2179
- [FIX] qwen3 moe sparse moe block by @ZX-ModelCloud in #2184
- Add module convert by @LRL2-ModelCloud in #2183
- Cleanup by @Qubitium in #2185
- Update pypcre version to 0.2.5 by @LRL2-ModelCloud in #2186
- Update pypcre version to 0.2.5 by @Qubitium in #2189
- [FIX] version("triton") crash on torch+xpu by @ZX-ModelCloud in #2188
Full Changelog: v5.2.0...v5.4.0
GPT-QModel v5.2.0
Notable Changes:
- Minimax M2, Granite Nano, Qwen3-VL, Brumby model support
- AWQ quantization is now out of beta and fully integrated into the quantization life cycle
- New `VramStrategy.Balanced` property to spread MoE modules across different GPUs (see the sketch after this list)
- New pure Torch AWQ kernel
- New `calibration_concat_separator` property (see the sketch after this list)
- Fixed HF bug that did not save `mtp` layers for GLM 4.5/4.6 (Air) models
- Fixed multi-gpu CUDA asserts due to stream/sync
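A minimal sketch of how the two new options above might be wired together. The exact placement of `calibration_concat_separator` (shown here as a `quantize()` kwarg) and the import path and spelling of a `VramStrategy` knob are assumptions drawn from the notes, not confirmed API:

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(
    "Qwen/Qwen3-VL-8B-Instruct",  # placeholder: any newly supported model id
    quant_config,
    # Assumption: a load-time knob selects VramStrategy.Balanced to spread
    # MoE modules across GPUs, e.g. vram_strategy=VramStrategy.Balanced
)

calibration = ["first calibration sample", "second calibration sample"]

model.quantize(
    calibration,
    # Assumption: separator string used when concatenating calibration rows.
    calibration_concat_separator="\n\n",
)
model.save("Qwen3-VL-8B-GPTQ-4bit")
```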
What's Changed
- try not adding mem guards for marlin kernel launch protection by @Qubitium in #2108
- MoE vram by @Qubitium in #2110
- Fix GLM 4.5/4.6 and Air not saving mtp layer after save (HF bug) by @LRL2-ModelCloud in #2109
- torchao 0.14.1 update by @Qubitium in #2111
- Test refactor by @Qubitium in #2113
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2120
- [FIX] xpu unit test by @ZX-ModelCloud in #2122
- modular by @Qubitium in #2123
- update scores by @Qubitium in #2124
- Fp8 dequant by @Qubitium in #2125
- Model dequant by @Qubitium in #2126
- Fp4 e2m1 by @Qubitium in #2127
- [FIX] ovis2, compatible with transformers v4.57.1 by @ZX-ModelCloud in #2129
- fix cols padding by @LRL2-ModelCloud in #2130
- [FIX] ovis_1_6 quantization by @ZX-ModelCloud in #2131
- Minimax m2 by @Qubitium in #2128
- Fix awq marlin kernel for bf16 by @Qubitium in #2135
- [FIX] incorrect AWQ NODES by @ZX-ModelCloud in #2133
- add support_offload_to_disk check by @LRL2-ModelCloud in #2134
- Add Awq torch kernel by @Qubitium in #2137
- Marin by @Qubitium in #2139
- Marin scores by @Qubitium in #2141
- Fix triton version detection in nogil patcher by @amd-vlarakic in #2144
- Fix qwen2 omni by @LRL2-ModelCloud in #2140
- [MODEL] Add GraniteMoEHybrid by @ZX-ModelCloud in #2142
- Fold AWQ into proper Looper/Layer/Subset Lifecycle by @Qubitium in #2138
- Refine GPT-QModel description in README by @Qubitium in #2145
- fix device_map by @LRL2-ModelCloud in #2146
- [MODEL] Add Qwen3-VL by @techshoww in #2136
- Add calibration_concat_separator by @Qubitium in #2148
- add test_qwen3_vl.py by @LRL2-ModelCloud in #2147
- Fix triton monkeypatch by @Qubitium in #2149
- [MODEL] Add Brumby by @Qubitium in #2150
- Dedup/Cleanup by @Qubitium in #2151
- Prep for 5.2 release by @Qubitium in #2152
- Dedup3 by @Qubitium in #2153
- add missing file by @Qubitium in #2154
- GPTAQ rename by @Qubitium in #2155
- fix ci test by @Qubitium in #2158
- fix setup license by @Qubitium in #2160
- Fix snapshot_download receiving unsupported kwargs by @Qubitium in #2162
- Retry partial.to to fix accelerate invalid argument error for first moe layer for >4 GPU setups by @avtc in #2163
- Comments + Sync by @Qubitium in #2164
- Stats/Logs by @Qubitium in #2165
New Contributors
- @amd-vlarakic made their first contribution in #2144
- @techshoww made their first contribution in #2136
Full Changelog: v5.0.0...v5.2.0
GPT-QModel v5.0.0
Notable Changes:
- New data-parallel quant support for MoE models on multi-gpu using `nogil` Python (Python >= 3.13t with `PYTHON_GIL=0` env) (see the sketch after this list).
- New `offload_to_disk` support, enabled by default, to massively reduce CPU RAM usage.
- New Intel-optimized and AMD-compatible CPU hardware-accelerated `TorchFused` kernel.
- Packing stage is now 4x faster and inlined with quantization.
- VRAM pressure for large models reduced during quantization.
- `act_group_aware` is now 16k+ times faster and is the default when `desc_act=False`, giving higher quality recovery without the inference penalty of `desc_act=True`.
- New beta-quality AWQ support with full GEMM, GEMM_Fast, and Marlin kernel support.
- New LFM, Ling, Qwen3 Omni model support.
- Bitblas kernel updated to support the Bitblas 0.1.0.post1 release.
- Quantization is now faster with reduced VRAM usage. Enhanced logging support with LogBar.
- And much much more...
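A hedged sketch of a data-parallel quant run under free-threaded Python. `offload_to_disk` is described as enabled by default and `act_group_aware` as the default when `desc_act=False`, so the config below only makes those defaults explicit; kwarg names and placement are assumptions:

```python
# Launch with a free-threaded interpreter (Python >= 3.13t):
#   PYTHON_GIL=0 python quantize_moe.py
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # per the notes, act_group_aware is the default in this mode
)

model = GPTQModel.load("mistralai/Mixtral-8x7B-v0.1", quant_config)  # placeholder MoE id
model.quantize(["first calibration sample", "second calibration sample"])
model.save("Mixtral-8x7B-GPTQ-4bit")
```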
What's Changed
- rename `torch_dtype` to `dtype` to sync with hf transformers by @Qubitium in #1804
- drop support for python < 3.11 by @CSY-ModelCloud in #1805
- hard deprecated ipex in favor of torch_fused by @Qubitium in #1807
- update pyproject.toml by @CSY-ModelCloud in #1808
- [CI] release with 3.13t by @CSY-ModelCloud in #1811
- [QUANTIZATION] Add AWQ support by @ZX-ModelCloud in #1703
- find mapping by @LRL-ModelCloud in #1812
- Update README.md by @Qubitium in #1813
- Update version.py by @Qubitium in #1814
- Turtle in a half shell by @Qubitium in #1809
- note about memory saving by @Qubitium in #1817
- move fail_safe by @LRL-ModelCloud in #1818
- rename turtle method by @Qubitium in #1820
- add threads by @Qubitium in #1821
- remove AWQ mod defs by @ZX-ModelCloud in #1822
- [CI] use new docker by @CSY-ModelCloud in #1823
- Fix awq quantize by @LRL-ModelCloud in #1824
- [CI] use new docker for release source by @CSY-ModelCloud in #1825
- fix awq pack by @LRL-ModelCloud in #1826
- fix loading autoawq models and hf/vllm/sglang loading of newly awq qu… by @Qubitium in #1827
- wrong arg check by @Qubitium in #1828
- fix thread task var scoping by @Qubitium in #1829
- fix call param by @Qubitium in #1830
- fix threads > 1 not considered (unsafe) by @Qubitium in #1832
- cleanup by @Qubitium in #1833
- fix gptqmodel offload paths conflict by @Qubitium in #1834
- Ci test by @Qubitium in #1835
- eora: always diff in fp32 + cleanup by @Qubitium in #1836
- add register_buffer/parameter to NamedModule class by @Qubitium in #1837
- typo by @Qubitium in #1839
- add thread safety to all classes by @Qubitium in #1840
- fix fail_safe by @LRL-ModelCloud in #1844
- update marlin kernel by @ZX-ModelCloud in #1838
- fix fp32 reduce on/off by @Qubitium in #1845
- bypass marlin kernel bias issue by @Qubitium in #1846
- disable marlin atomics by default as it failed ci accuracy test by @Qubitium in #1847
- [FIX] awq marlin by @ZX-ModelCloud in #1816
- cleanup var names by @Qubitium in #1849
- pack per module by @LRL-ModelCloud in #1842
- [CI] use new docker by @CSY-ModelCloud in #1850
- tweak eora test by @Qubitium in #1851
- wait for thread tasks only when every module has completed. by @Qubitium in #1852
- [FIX] Compatible with vllm v0.10.2 by @ZX-ModelCloud in #1855
- move req.txt into toml by @CSY-ModelCloud in #1858
- do not create buffers only to overwrite them by @Qubitium in #1857
- pop states after use by @Qubitium in #1859
- [FIX] multiple "register_buffers" parameters by @ZX-ModelCloud in #1860
- Low memory pack by @Qubitium in #1861
- fix packing ci test by @Qubitium in #1862
- simplify by @Qubitium in #1853
- Fix 3bit packing regression in previous commit by @Qubitium in #1863
- remove deprecated `parallel_packing` property by @Qubitium in #1864
- Fix qqq quant/offloading by @Qubitium in #1866
- temp disable awq gemm kernel due to failing ci by @Qubitium in #1867
- update vllm compat by @Qubitium in #1869
- fix regression by @Qubitium in #1870
- fix setup.py crashed because torch may not support float8_e8m0fnu by @CSY-ModelCloud in #1871
- [FIX] AwqGEMMQuantLinear skip gptq_v1 convert to v2 by @ZX-ModelCloud in #1872
- Fix awq gemm auto kernel selection order by @Qubitium in #1873
- Update README.md by @Qubitium in #1874
- reduce forwarding to minimal by @Qubitium in #1876
- Update README.md by @Qubitium in #1877
- fix exllama tests by @Qubitium in #1879
- debug print all params/buffers by @Qubitium in #1880
- skip internal loading of non-pkg compatible quantization models, i.e.… by @Qubitium in #1881
- Loader by @Qubitium in #1882
- Cleanup awq by @Qubitium in #1883
- remove broken test by @Qubitium in #1884
- [CI] remove old cuda/torch support for release by @CSY-ModelCloud in #1885
- fix loader by @LRL-ModelCloud in #1886
- fix nvcc warnings about pending cuda > 13.x compat by @Qubitium in #1887
- fix packing speed test by @Qubitium in #1889
- fix licenses warning by @CSY-ModelCloud in #1888
- set licenses to apache by @CSY-ModelCloud in #1890
- [FIX] AwqGEMMQuantLinear should be PackableQuantLinear by @ZX-ModelCloud in #1891
- skip modules that have no parameters and no buffers since they can't be offloaded by @LRL-ModelCloud in #1892
- skip modules that have no parameters and no buffers since they can't offload by @LRL-ModelCloud in #1894
- Fix device check by @Qubitium in #1896
- [CI] disable test install by @CSY-ModelCloud in #1895
- remove hash feature by @Qubitium in #1897
- fix cuda ext cannot be loaded by @Qubitium in #1898
- lock numpy to 2.2.6 by @CSY-ModelCloud in #1899
- [FIX] test_lm_eval.py by @ZX-ModelCloud in #1900
- Patch fix model save by @Qubitium in #1901
- Ugly patch save 2 by @Qubitium in #1902
- fix potential leak by @Qubitium in #1904
- [FIX] test_integration by @ZX-ModelCloud in #1903
- fix build uploading an empty wheel by @CSY-ModelCloud in #1905
- fix lm_head quant by @LRL-ModelCloud in #1906
- batch tweaks by @Qubitium in #1907
- [FIX] test_kernel_output_torch_fused by @ZX-ModelCloud in ...
GPT-QModel v4.2.5
What's Changed
- Cleanup hyb_act by @Qubitium in #1791
- Remove torch import in setup.py by @Qubitium in #1729
- Refactor: rename `hyb_act` to `act_group_aware` by @Qubitium in #1794
- Cleanup by @Qubitium in #1795, #1796
- [CI] Add torch 2.8.0 by @CSY-ModelCloud in #1797
- [CI] torch-2.6.0+cu128-python-3.9 does not exist by @CSY-ModelCloud in #1798
- Fix wf_unsqueeze_zero and wf_unsqueeze_neg_one by @LRL-ModelCloud in #1799
- GAR field save to meta on quant save by @Qubitium in #1800
- Add pyproject.toml by @CSY-ModelCloud in #1801
- [CI] Don't detect arch list when it has already been set & fix build-system requirements by @CSY-ModelCloud in #1802
Full Changelog: v4.2.0...v4.2.5
GPT-QModel v4.2.0
Notable Changes
- Add Qwen3-Next by @Qubitium and @LRL-ModelCloud in #1787
- Add Apertus support by @LRL-ModelCloud in #1767
- Add Kimi k2 support by @LRL-ModelCloud in #1768
- Add Klear support by @LRL-ModelCloud in #1769
- Add FastLLM support by @LRL-ModelCloud in #1771
- Add Nemotron H support by @LRL-ModelCloud in #1773
- Add `fail_safe` option by @LRL-ModelCloud in #1775 (see the sketch after this list)
- Use threading lock to protect unsafe tensor moves in multi-gpu by @Qubitium in #1778
- Avoid building experimental extensions to reduce wheel size by @Qubitium in #1763
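The `fail_safe` option above is only named in the notes; the sketch below guesses it is a `QuantizeConfig` flag, purely for illustration:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Assumption: fail_safe is exposed on QuantizeConfig; the notes name the
# option but do not pin down where it is passed or its exact semantics.
quant_config = QuantizeConfig(bits=4, group_size=128, fail_safe=True)
model = GPTQModel.load("Qwen/Qwen3-Next-80B-A3B-Instruct", quant_config)  # placeholder id
```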
What's Changed
- Fix LlavaQwen2GPTQ by @LRL-ModelCloud in #1772
- Fix Q.to on multi-gpu gptq when proceeding fast and has many experts and gpus by @avtc in #1774
- Bump actions/setup-python from 5 to 6 in the github-actions group by @dependabot[bot] in #1758
- [CI] fix release jobs were skipped by @CSY-ModelCloud in #1759
- ignore compile warns about var declared but not used by @Qubitium in #1760
- allow prebuilt wheel path to be customized via env by @Qubitium in #1761
- add build toggles for all cpp kernels by @Qubitium in #1764
- fix multi gpu inference by @LRL-ModelCloud in #1762
- [CI] reduce wheel download size by @CSY-ModelCloud in #1765
- start 4.2.0-dev cycle by @Qubitium in #1766
- fix klear by @LRL-ModelCloud in #1770
- FIX transformers >= 4.56.1 force-changed `torch.default_dtype` by @Qubitium in #1779
- fix multi gpu fail_safe by @LRL-ModelCloud in #1780
- fix device instance by @LRL-ModelCloud in #1783
- prepare for 4.2 release by @Qubitium in #1785
Full Changelog: v4.1.0...v4.2.0
GPT-QModel v4.1.0
Notable Changes:
- Add a config option: mock_quantization to simplify heavy computations… by @avtc in #1731 (sketched after this list)
- Add GLM-4.5-Air support by @avtc in #1730
- Add GPT-OSS support by @LRL2-ModelCloud in #1737
- Add LongCatFlashGPTQ by @LRL-ModelCloud in #1751
- Add Llama 4 Support by @Qubitium in #1508
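A sketch of the new `mock_quantization` option for dry-running a pipeline; treating it as a `QuantizeConfig` flag is an assumption:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Assumption: mock_quantization is a QuantizeConfig flag (placement unverified).
# Per the PR title it simplifies heavy computations, which suggests it is
# useful for fast end-to-end pipeline dry runs.
quant_config = QuantizeConfig(bits=4, group_size=128, mock_quantization=True)
model = GPTQModel.load("zai-org/GLM-4.5-Air", quant_config)  # placeholder model id
model.quantize(["tiny calibration sample"])
```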
What's Changed
- Minor Cleanup by @Qubitium in #1718
- disable some compilation on torch 2.8 due to compat issues by @Qubitium in #1727
- add glm4 moe test by @LRL2-ModelCloud in #1734
- deprecate autoround by @Qubitium in #1735
- [FIX] test_kernel_output with XPU by @ZX-ModelCloud in #1741
- cleanup checks for GIL control, GIL=0, and python >= 3.13.3t by @Qubitium in #1743
- update torch/transformer depends by @Qubitium in #1749
- reduce pkg depend by @Qubitium in #1750
- fix triton compat check for 3.13.3t by @Qubitium in #1752
- Bump torch from 2.7.1 to 2.8.0 in /gptqmodel_ext/exllama_eora by @dependabot[bot] in #1755
- pkg update: tokenicer 0.0.5 by @Qubitium in #1756
Full Changelog: v4.0.0...v4.1.0
GPT-QModel v4.0.0
Notable Changes
- Support glm4 by @glide-the in #1559
- Add Xiaomi MiMo model by @Qubitium in #1571
- Free-threading (GIL-free) quantization with linear NxGPU scaling by @Qubitium in #1581
- feat: add Qwen-Omni support. by @tiger-of-shawn in #1613
- add Qwen 2.5 Omni support by @Qubitium in #1615
- [MODEL] ERNIE4.5 by @LRL-ModelCloud in #1645
- [MODEL]support pangu_alpha model by @ZX-ModelCloud in #1646
- new baidu ernie & huawei pangu model support by @Qubitium in #1647
- [MODEL] Add falcon h1 support by @LRL-ModelCloud in #1621
- feat(gemma3): also support larger gemma3 models and not only small te… by @joennlae in #1627
- Add Group Aware Reordering (GAR) for Efficient Activation Reordering by @tgafni in #1656
- Enable pytorch fused op on XPU by @jiqing-feng in #1660
- [MODEL] Add Seed-OSS support by @LRL2-ModelCloud in #1702
Other Changes
- [CI] add release source with github's vm by @CSY-ModelCloud in #1543
- Fix rotation for tied embedding models by @smpanaro in #1550
- Fix input processing for convolution by @Cecilwang in #1554
- [FIX] moe model quant division by zero issue by @LRL-ModelCloud in #1565
- [FIX] remove too short calib data by @LRL-ModelCloud in #1566
- [FIX] hook_module and qwen3_moe by @LRL-ModelCloud in #1569
- [FIX] hook linear and triton by @LRL-ModelCloud in #1570
- [MISC] simplify model definition by @LRL-ModelCloud in #1572
- [FIX] qwen2 moe loop module by @LRL-ModelCloud in #1574
- [CI] fix unit test was unable to run by @CSY-ModelCloud in #1580
- fix has_gil was not imported & device-smi api wrong by @CSY-ModelCloud in #1586
- fix older python didn't have EnumType by @CSY-ModelCloud in #1590
- [FIX] get_module_by_name_prefix by @LRL-ModelCloud in #1591
- [CI] update release CI, add torch 2.7.0 by @CSY-ModelCloud in #1592
- [FIX] Qwen2.5 vl quant by @LRL-ModelCloud in #1623
- Bump torch from 2.6.0 to 2.7.1 in /gptqmodel_ext/exllama_eora by @dependabot[bot] in #1628
- fix bug for device error by @kaixuanliu in #1631
- [FIX] config seq len by @LRL-ModelCloud in #1640
- register buffer for `wf_unsqueeze_zero` and `wf_unsqueeze_neg_one` to… by @kaixuanliu in #1642
- set_postfix is a tqdm function, no need anymore by @CSY-ModelCloud in #1643
- fix exception to avoid memory issue by @jiqing-feng in #1679
- lm_head hooked by @Chunfei-He in #1673
- Bump the github-actions group across 1 directory with 2 updates by @dependabot[bot] in #1677
- Model config.use_cache not correctly used during inference for some models by @LRL2-ModelCloud in #1686
- [FIX] transformers compat by @LRL2-ModelCloud in #1687
- Update module_looper.py by @LRL2-ModelCloud in #1690
- Update requirements.txt by @LRL2-ModelCloud in #1689
- add ACCEPT_USE_FLASH_ATTEN2_ARG by @LRL2-ModelCloud in #1693
- Fix kwarg vs pos arg hidden states by @LRL2-ModelCloud in #1694
- fix import Perplexity failed by @CSY-ModelCloud in #1695
- [CI] fix CI installed wrong libs' version by @CSY-ModelCloud in #1696
- [FIX] GIL Check by @ZX-ModelCloud in #1697
- [FIX] minicpm test by @LRL2-ModelCloud in #1698
- [FIX] use AutoModelForImageTextToText instead of AutoModelForVision2Seq by @ZX-ModelCloud in #1699
- [CI] change qwen2.5-omni model path by @ZX-ModelCloud in #1701
- [CI] install jieba for test_pangu_alpha by @CSY-ModelCloud in #1706
- disable torch.compile by @LRL2-ModelCloud in #1707
- FIX minicpm CI test by @LRL2-ModelCloud in #1708
- [CI] update torch for build by @CSY-ModelCloud in #1709
- [CI] update release matrix by @CSY-ModelCloud in #1710
- [CI] install torch compiled with cuda 126 by @CSY-ModelCloud in #1711
- use "attn_implementation" by @LRL2-ModelCloud in #1712
- [CI] add 5090 support & install latest intel_extension_for_pytorch by @CSY-ModelCloud in #1713
- [CI] don't compile 5090 for cuda < 12.8 by @CSY-ModelCloud in #1714
- [CI] Update unit test docker by @CSY-ModelCloud in #1715
- [CI] fix release ci by @CSY-ModelCloud in #1716
- fix model path is not public by @CSY-ModelCloud in #1720
- [CI] don't exit when package doesn't exist by @CSY-ModelCloud in #1719
- [CI] no need install logbar manually by @CSY-ModelCloud in #1721
- [CI] remove legacy tests & skip intel tests & disable flash_attn for some models by @CSY-ModelCloud in #1722
- [CI] no need install uv by @CSY-ModelCloud in #1723
- [CI] use new docker with uv binary to fix shim/uv didn't exist by @CSY-ModelCloud in #1724
New Contributors
- @Cecilwang made their first contribution in #1554
- @glide-the made their first contribution in #1559
- @tiger-of-shawn made their first contribution in https://github.com/ModelClo...
GPT-QModel v3.0.0
🎉 New ground-breaking GPTQ v2 quantization option for improved quantization accuracy, validated by GSM8K_PLATINUM benchmarks vs the original GPTQ (sketched below).
✨ New Phi4-MultiModal model support.
✨ New Nvidia Nemotron Ultra model support.
✨ New Dream model support.
✨ New experimental multi-gpu quantization support.
✨ Reduced vram usage and faster quantization.
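A sketch of opting into the new GPTQ v2 path; the `v2=True` flag name is an assumption made for illustration:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Assumption: GPTQ v2 is selected via a config flag, shown here as v2=True.
quant_config = QuantizeConfig(bits=4, group_size=128, v2=True)
model = GPTQModel.load("microsoft/Phi-4-multimodal-instruct", quant_config)  # placeholder id
model.quantize(["calibration sample"])
model.save("Phi-4-GPTQv2-4bit")
```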
What's Changed
- Multi GPU Quantization by @Qubitium in #1502
- experimental multi-gpu quantization by @Qubitium in #1503
- reduce allocation by @Qubitium in #1504
- revert add_ by @Qubitium in #1506
- Switch to non-deprecated mlx.core.clear_cache() by @smpanaro in #1510
- Dream Model Support by @Qubitium in #1512
- fix disabling batch/mask for dream by @Qubitium in #1514
- reduce tensor device movement by @Qubitium in #1516
- fix deepseek v3 module order by @Qubitium in #1517
- Nemotron Ultra Support by @Qubitium in #1518
- faster process_batch by @Qubitium in #1519
- Fix missing arg due to recent `Processor` api changes by @Qubitium in #1523
- Fix gpt2 columns calculation by @Qubitium in #1524
- temp damper should not overwrite damp cfg by @Qubitium in #1526
- Replace module hooking with tree-defined targeting by @Qubitium in #1527
- Fix compat with XPU by @Qubitium in #1535
- Phi4 MultiModal by @Qubitium in #1511
- disable selection of ExllamaV2 kernel for group_size=16 for now by @Qubitium in #1537
- Add Gptqv2 by @yhhhli and @Qubitium in #1533
Full Changelog: v2.2.0...v3.0.0
GPTQModel v2.2.0
What's Changed
✨ New Qwen 2.5 VL model support. Preliminary Qwen 3 model support.
✨ New samples log column during quantization to track module activation in MoE models.
✨ Loss log column now color-coded to highlight modules that are friendly/resistant to quantization.
✨ Progress (per-step) stats during quantization now streamed to log file.
✨ Auto bfloat16 dtype loading for models based on model config.
✨ Fix kernel compile for PyTorch/ROCm.
✨ Slightly faster quantization, with auto-resolution of some low-level OOM issues on smaller-vram GPUs.
- Enable ipex tests for CPU/XPU by @jiqing-feng in #1460
- test kernel accuracies with more shapes on cuda by @Qubitium in #1461
- Fix rocm flags by @Qubitium in #1467
- use table like logging format by @Qubitium in #1471
- stream process log entries to persistent file by @Qubitium in #1472
- fix some models need trust-remote-code arg by @Qubitium in #1474
- Fix wq dtype by @Qubitium in #1475
- add colors to quant loss column by @Qubitium in #1477
- add prelim qwen3 support by @Qubitium in #1478
- Update eora.py for further optimization by @nbasyl in #1488
- faster cholesky inverse and avoid oom when possible by @Qubitium in #1494
- [MODEL] supports qwen2_5_vl by @ZX-ModelCloud in #1493
Full Changelog: v2.1.0...v2.2.0