GPT-QModel v5.0.0
Notable Changes:
- New data-parallel quantization support for MoE models on multi-GPU using nogil Python (Python >= 3.13t with `PYTHON_GIL=0` env).
- New `offload_to_disk` support, enabled by default, to massively reduce CPU RAM usage.
- New Intel-optimized and AMD-compatible CPU hw-accelerated `TorchFused` kernel.
- Packing stage is now 4x faster and inlined with quantization.
- VRAM pressure for large models reduced during quantization.
- `act_group_aware` is now 16k+ times faster and the default when `desc_act=False`, for higher-quality recovery without the inference penalty of `desc_act=True`.
- New beta-quality AWQ support with full GEMM, GEMM_Fast, and Marlin kernel support.
- New LFM, Ling, and Qwen3 Omni model support.
- BitBLAS kernel updated to support the BitBLAS 0.1.0.post1 release.
- Quantization is now faster with reduced VRAM usage.
- Enhanced logging support with LogBar.
- And much, much more...
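The nogil requirement above can be verified at runtime. A minimal sketch, assuming only standard CPython build introspection (the `Py_GIL_DISABLED` config var is a CPython 3.13+ build flag, not part of GPT-QModel itself):

```python
import sysconfig

# A free-threaded ("nogil") CPython build (e.g. 3.13t) sets the
# Py_GIL_DISABLED build flag. The data-parallel MoE quantization path
# additionally expects the GIL to be switched off at launch time
# via the PYTHON_GIL=0 environment variable.
free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
print("free-threaded build:", free_threaded)
```

On a regular (GIL) build this prints `False`; a quantization run would then be launched as `PYTHON_GIL=0 python your_quant_script.py` on a 3.13t interpreter (script name illustrative).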
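For intuition on what a group-aware activation ordering does, here is a hypothetical, stdlib-only sketch; the function name and heuristic are illustrative and not GPT-QModel's actual implementation. Groups of columns are reordered by mean activation magnitude with a stable sort, so columns inside each quantization group stay contiguous and ties keep their original order (the stable-sort property is what the accuracy fix in #1973 relies on):

```python
def group_aware_order(act_magnitudes, group_size):
    """Return a column permutation that reorders whole groups by descending
    mean activation magnitude, keeping each group's columns contiguous.
    Illustrative sketch only -- not the GPT-QModel implementation."""
    n = len(act_magnitudes)
    # Split column indices into contiguous groups of `group_size`.
    groups = [list(range(i, min(i + group_size, n)))
              for i in range(0, n, group_size)]
    # Mean activation magnitude per group.
    means = [sum(act_magnitudes[c] for c in g) / len(g) for g in groups]
    # Python's sorted() is stable: groups with equal means keep their
    # original relative order instead of being permuted arbitrarily.
    order = sorted(range(len(groups)), key=lambda gi: -means[gi])
    return [c for gi in order for c in groups[gi]]

print(group_aware_order([0.1, 0.9, 0.5, 0.5, 2.0, 0.0], 2))
# → [4, 5, 0, 1, 2, 3]
```

The high-activation group (columns 4, 5) moves to the front, while the two tied groups retain their original order, which is the stability property an unstable argsort would break.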
What's Changed
- rename `torch_dtype` to `dtype` to sync with hf transformers by @Qubitium in #1804
- drop support for python < 3.11 by @CSY-ModelCloud in #1805
- hard deprecated ipex in favor of torch_fused by @Qubitium in #1807
- update pyproject.toml by @CSY-ModelCloud in #1808
- [CI] release with 3.13t by @CSY-ModelCloud in #1811
- [QUANTIZATION] Add AWQ support by @ZX-ModelCloud in #1703
- find mapping by @LRL-ModelCloud in #1812
- Update README.md by @Qubitium in #1813
- Update version.py by @Qubitium in #1814
- Turtle in a half shell by @Qubitium in #1809
- note about memory saving by @Qubitium in #1817
- move fail_safe by @LRL-ModelCloud in #1818
- rename turtle method by @Qubitium in #1820
- add threads by @Qubitium in #1821
- remove AWQ mod defs by @ZX-ModelCloud in #1822
- [CI] use new docker by @CSY-ModelCloud in #1823
- Fix awq quantize by @LRL-ModelCloud in #1824
- [CI] use new docker for release source by @CSY-ModelCloud in #1825
- fix awq pack by @LRL-ModelCloud in #1826
- fix loading autoawq models and hf/vllm/sglang loading of newly awq qu… by @Qubitium in #1827
- wrong arg check by @Qubitium in #1828
- fix thread task var scoping by @Qubitium in #1829
- fix call param by @Qubitium in #1830
- fix threads > 1 not considered (unsafe) by @Qubitium in #1832
- cleanup by @Qubitium in #1833
- fix gptqmodel offload paths conflict by @Qubitium in #1834
- Ci test by @Qubitium in #1835
- eora: always diff in fp32 + cleanup by @Qubitium in #1836
- add register_buffer/parameter to NamedModule class by @Qubitium in #1837
- typo by @Qubitium in #1839
- add thread safety to all classes by @Qubitium in #1840
- fix fail_safe by @LRL-ModelCloud in #1844
- update marlin kernel by @ZX-ModelCloud in #1838
- fix fp32 reduce on/off by @Qubitium in #1845
- bypass marlin kernel bias issue by @Qubitium in #1846
- disable marlin atomics by default as it failed ci accuracy test by @Qubitium in #1847
- [FIX] awq marlin by @ZX-ModelCloud in #1816
- cleanup var names by @Qubitium in #1849
- pack per module by @LRL-ModelCloud in #1842
- [CI] use new docker by @CSY-ModelCloud in #1850
- tweak eora test by @Qubitium in #1851
- wait for thread tasks only when every module has completed. by @Qubitium in #1852
- [FIX] Compatible with vllm v0.10.2 by @ZX-ModelCloud in #1855
- move req.txt into toml by @CSY-ModelCloud in #1858
- do not create buffers only to overwrite them by @Qubitium in #1857
- pop states after use by @Qubitium in #1859
- [FIX] multiple "register_buffers" parameters by @ZX-ModelCloud in #1860
- Low memory pack by @Qubitium in #1861
- fix packing ci test by @Qubitium in #1862
- simplify by @Qubitium in #1853
- Fix 3bit packing regression in previous commit by @Qubitium in #1863
- remove deprecated `parallel_packing` property by @Qubitium in #1864
- Fix qqq quant/offloading by @Qubitium in #1866
- temp disable awq gemm kernel due to failing ci by @Qubitium in #1867
- update vllm compat by @Qubitium in #1869
- fix regression by @Qubitium in #1870
- fix setup.py crashed because torch may not support float8_e8m0fnu by @CSY-ModelCloud in #1871
- [FIX] AwqGEMMQuantLinear skip gptq_v1 convert to v2 by @ZX-ModelCloud in #1872
- Fix awq gemm auto kernel selection order by @Qubitium in #1873
- Update README.md by @Qubitium in #1874
- reduce forwarding to minimal by @Qubitium in #1876
- Update README.md by @Qubitium in #1877
- fix exllama tests by @Qubitium in #1879
- debug print all params/buffers by @Qubitium in #1880
- skip internal loading of non-pkg compatible quantization models, i.e.… by @Qubitium in #1881
- Loader by @Qubitium in #1882
- Cleanup awq by @Qubitium in #1883
- remove broken test by @Qubitium in #1884
- [CI] remove old cuda/torch support for release by @CSY-ModelCloud in #1885
- fix loader by @LRL-ModelCloud in #1886
- fix nvcc warnings about pending cuda > 13.x compat by @Qubitium in #1887
- fix packing speed test by @Qubitium in #1889
- fix licenses warning by @CSY-ModelCloud in #1888
- set licenses to apache by @CSY-ModelCloud in #1890
- [FIX] AwqGEMMQuantLinear should be PackableQuantLinear by @ZX-ModelCloud in #1891
- skip modules that have no parameters and no buffers since they can't be offloaded by @LRL-ModelCloud in #1892
- skip modules that have no parameters and no buffers since they can't offload by @LRL-ModelCloud in #1894
- Fix device check by @Qubitium in #1896
- [CI] disable test install by @CSY-ModelCloud in #1895
- remove hash feature by @Qubitium in #1897
- fix cuda ext cannot be loaded by @Qubitium in #1898
- lock numpy to 2.2.6 by @CSY-ModelCloud in #1899
- [FIX] test_lm_eval.py by @ZX-ModelCloud in #1900
- Patch fix model save by @Qubitium in #1901
- Ugly patch save 2 by @Qubitium in #1902
- fix potential leak by @Qubitium in #1904
- [FIX] test_integration by @ZX-ModelCloud in #1903
- fix build uploading an empty wheel by @CSY-ModelCloud in #1905
- fix lm_head quant by @LRL-ModelCloud in #1906
- batch tweaks by @Qubitium in #1907
- [FIX] test_kernel_output_torch_fused by @ZX-ModelCloud in #1908
- sync shell model with turtle before save by @Qubitium in #1910
- enable fail-safe mode for ci by @Qubitium in #1911
- fix batch code to ignore masked output by @Qubitium in #1912
- reduce memory with boolean masks by @Qubitium in #1913
- Potential fix for code scanning alert no. 17: Workflow does not contain permissions by @Qubitium in #1914
- fix cuda thread ctx by @Qubitium in #1915
- remove prev thread fix, replaced by main changes by @Qubitium in #1916
- diff colors per dtype/device by @Qubitium in #1917
- sync qqq init with super changes by @Qubitium in #1918
- remove auto-gc by @Qubitium in #1919
- remove buffered-fwd toggle by @Qubitium in #1920
- remove calibration_enable_gpu_cache toggle by @Qubitium in #1921
- fix use m_device derived since weight might not exists if module only… by @Qubitium in #1923
- [FIX] test_qqq with groupsize=-1 by @ZX-ModelCloud in #1922
- fix test_ppl by @LRL2-ModelCloud in #1926
- fix offload threading bug by @Qubitium in #1927
- [FIX] test_qqq with group_size=128 by @ZX-ModelCloud in #1925
- fix loading of og qqq quantized models on hf by @Qubitium in #1928
- Update README.md by @Qubitium in #1929
- fix inference mode not applied to threads by @Qubitium in #1930
- fix more thread tasks without proper ctx by @Qubitium in #1931
- fix eora compat by @Qubitium in #1932
- [FIX] calibration_dataset is empty by @ZX-ModelCloud in #1935
- add eora toggle to ci test by @Qubitium in #1934
- [FIX] test_serialization by @ZX-ModelCloud in #1936
- missing offload dealloc tracking by @Qubitium in #1937
- Update pyproject.toml by @Qubitium in #1938
- [FIX] QQQ quantize by @ZX-ModelCloud in #1940
- [FIX] gptqv2 by @ZX-ModelCloud in #1942
- Qwen3 omni moe support by @LRL2-ModelCloud in #1939
- Update README.md by @Qubitium in #1944
- Threadx by @Qubitium in #1945
- Fix thread pool bugs by @Qubitium in #1948
- update logbar depend version by @Qubitium in #1949
- [FIX] qwen2_5_vl by @ZX-ModelCloud in #1946
- Data Parallel by @Qubitium in #1950
- Fix auto gc thread not blocking on idle by @Qubitium in #1952
- Fix qwen3 omni by @LRL2-ModelCloud in #1951
- [FIX] bitblas by @ZX-ModelCloud in #1953
- Directly save meta files to disk on model save by @Qubitium in #1954
- Bypass accelerate by @Qubitium in #1955
- use tf32 ctx by @Qubitium in #1956
- awq fixes by @Qubitium in #1957
- fix turtle model not ready when looper starts by @Qubitium in #1958
- remove unused named_module.target_device_stream and preprocess_stream… by @Qubitium in #1959
- revert: stream properties cannot be pickled by @Qubitium in #1960
- Replicate + Turtle state fix by @Qubitium in #1961
- fix ctx for convert v1-v2 v2-v1 using in_place tensor mutation by @Qubitium in #1962
- Fix q to by @Qubitium in #1963
- Fix device memory usage: use device-smi `metrics` api by @Qubitium in #1964
- marlin atomics should be disabled by default by @Qubitium in #1965
- Update pyproject.toml by @Qubitium in #1967
- Fix pad tokens passed to quantization capture by @Qubitium in #1968
- Fix Python 3.14t compat and Marlin with Cuda 13.0 by @Qubitium in #1969
- auto switch cuda toolkit script for multiple venv with multiple torch… by @Qubitium in #1970
- FIX v2 to v1 conversion regression in latest refactors by @Qubitium in #1972
- Fix act_group_aware accuracy by using stable argsort by @Qubitium in #1973
- AWQ register_buffers: bool incorrectly passed by @Qubitium in #1974
- update bitblas kernel by @Qubitium in #1975
- fix tf32 on/off regression by @Qubitium in #1978
- fix thread safety for all torch.linalg and reuse ThreadSafe by @Qubitium in #1979
- fix opt/qwen2.5/3 omni compat by @Qubitium in #1980
- Add OVIS 2.5 Support by @Qubitium in #1981
- Fix missing file by @Qubitium in #1983
- Fix replicate is not thread safe by @Qubitium in #1984
- Better progress bar by @Qubitium in #1985
- Fix missing memory release on cuda:0 by @Qubitium in #1986
- Use merged safetensors instead of multiple .dat files by @Qubitium in #1987
- fix transformers 4.57.0 compat by @Qubitium in #1989
- Replicate safety by @Qubitium in #1988
- missing upload by @Qubitium in #1990
- Fix transformer backward compat by @Qubitium in #1991
- Fix Ovis regression by @Qubitium in #1992
- Logbar 0.1.0 by @Qubitium in #1993
- use logbar 0.1.1 with external log/pb conflict resolution by @Qubitium in #1994
- Bug fixes by @Qubitium in #1995
- Experimental groupsize 256, 512, 1024 for torch/triton kernel by @Qubitium in #1996
- Cleanup by @Qubitium in #1997
- add ling/ring support by @LRL2-ModelCloud in #2001
- [FIX] bitblas inference by @ZX-ModelCloud in #2002
- enable cpu torch fused op by @jiqing-feng in #2000
- Fix LING by @LRL2-ModelCloud in #2003
- [FIX] qqq quantize by @ZX-ModelCloud in #2005
- [FIX] gptq v2 quantize by @ZX-ModelCloud in #2006
- Make Torch kernel optimistically use triton dequant for 3.3x improvement by @Qubitium in #2007
- Fix model test model loading by @Qubitium in #2008
- use pypcre by @Qubitium in #2010
- [FIX] Add triton code for awq gemm by @ZX-ModelCloud in #2011
- Add lfm2 support by @LRL2-ModelCloud in #2012
- Update requirements.txt by @Qubitium in #2013
- Turtle pool by @Qubitium in #2009
- Fix race in threadpoolctl, bug in torch_sync helper by @Qubitium in #2014
- Torch replicate segfaults so let's default to simple copy by @Qubitium in #2015
- correctly report the actual kernel names by @Qubitium in #2016
- fix missing git add by @Qubitium in #2017
- [FIX] test with transformers/optimum/peft by @ZX-ModelCloud in #2018
- Fix multi gpu regression by @Qubitium in #2019
- Fix marlin kernel compiler warnings by @Qubitium in #2020
- Replicate by @Qubitium in #2021
- replicate still unsafe to use by @Qubitium in #2022
- Fix pack block by @Qubitium in #2023
- update scores by @Qubitium in #2024
- make act_group_aware default true by @Qubitium in #2025
- Make sure packer is part of dist by @Qubitium in #2026
- fix test by @LRL2-ModelCloud in #2028
- fix test_qzero_offsets by @LRL2-ModelCloud in #2029
- fix test_ppl by @LRL2-ModelCloud in #2030
- [FIX] unit test and Adapted to mlx-lm 0.28.2 by @ZX-ModelCloud in #2031
- Fix attn mask ci test by @Qubitium in #2034
- Eora fix by @Qubitium in #2033
- remove logger board by @Qubitium in #2035
- Add Tensor Parallel optimized weight processor by @Qubitium in #2036
- safety check before cpp call by @Qubitium in #2037
- Use replicate by @Qubitium in #2038
- Tests by @Qubitium in #2039
- Fix qwen3 moe by @Qubitium in #2040
- fix test_dynamic by @LRL2-ModelCloud in #2041
- [FIX] unittest test_packable / test_multi_gpu_inference / test_parameter_count by @ZX-ModelCloud in #2043
- Use spinner by @Qubitium in #2042
- Memory fix by @Qubitium in #2045
- don't log batch values if batch is not enabled by @LRL2-ModelCloud in #2044
- optimize eora for multi-gpu and memory usage by @Qubitium in #2046
- Remove broken streams by @Qubitium in #2048
- Scores by @Qubitium in #2049
- [FIX] awq_moe by @ZX-ModelCloud in #2047
- Fix packing module scaling due to incorrect shared locking by @Qubitium in #2050
- [FIX] test_gptqv2, use original process_batch by @ZX-ModelCloud in #2051
- XIELUActivation will use some weights when activation init, so can't … by @LRL2-ModelCloud in #2052
- Stream quantized tensors by @Qubitium in #2053
- Fix offload false by @LRL2-ModelCloud in #2054
- Fix multi-gpu accumulation drift vs single gpu by @Qubitium in #2055
- nogil patch safetensors/triton by @Qubitium in #2056
- Update Tests by @Qubitium in #2059
- Fix streaming mem corruption by @Qubitium in #2060
- Fix streaming vram regression by @Qubitium in #2061
- Fix glm def by @Qubitium in #2062
- build whls for last 2 versions of pytorch and from py 3.10 to 3.14 by @Qubitium in #2063
- Fix setup by @Qubitium in #2064
- fix setup warnings by @Qubitium in #2065
- cleanup by @Qubitium in #2066
- [CI] remove py 3.14t by @CSY-ModelCloud in #2067
- [CI] update CI runner's ip by @CSY-ModelCloud in #2068
- lock parent by @Qubitium in #2069
- merge locking code by @Qubitium in #2070
- [CI] fold long logs & install tabulate by @CSY-ModelCloud in #2071
- Cleanup4 by @Qubitium in #2073
- [CI] fold release job's long logs by @CSY-ModelCloud in #2072
- [FIX] qwen vl quantize by @ZX-ModelCloud in #2075
- Fix multi-gpu loading of quantized model by @Qubitium in #2076
- [CI] remove unsupported versions by @CSY-ModelCloud in #2077
- [CI] use new docker image for CI by @CSY-ModelCloud in #2078
- remove unused 'debug' by @LRL2-ModelCloud in #2080
- [FIX] test_benchmark_gar / test_bits / test_dynamic by @ZX-ModelCloud in #2083
- Machete by @Qubitium in #2082
- allow py3.10 by @Qubitium in #2084
- fix nm-calibration dataset path by @CSY-ModelCloud in #2079
- Fix setup/cutlass was not installed & checkout dir doesn't exist by @CSY-ModelCloud in #2085
- device_map need {device_type:device_index} by @LRL2-ModelCloud in #2086
- [FIX] test_eval by @ZX-ModelCloud in #2087
- fix cutlass path by @Qubitium in #2088
- fix nvcc spawning too many threads by @Qubitium in #2090
- fix nvcc compiler warning by @Qubitium in #2091
- [CI] add python 3.14t to release by @CSY-ModelCloud in #2089
- Temporarily fix .so files not included in wheel by @CSY-ModelCloud in #2093
- Update README with python3-dev and ninja requirements by @Qubitium in #2094
- disable machete by @Qubitium in #2095
- [FIX] torch_fused on CPU by @ZX-ModelCloud in #2096
- [FIX] test_inference_speed by @ZX-ModelCloud in #2097
- Notes by @Qubitium in #2098
- fix setup license prop by @Qubitium in #2099
- fix transformers 4.57.1 breaking torchao compat by @Qubitium in #2100
- bitblas has strange compat issues with pip nvidia libs by @Qubitium in #2101
- note bitblas update by @Qubitium in #2102
- it is normal for some awq layers to not have err_loss info by @Qubitium in #2103
- fix threadx failing ci by @Qubitium in #2104
- Omni test by @Qubitium in #2105
- [CI] check wheel size by @CSY-ModelCloud in #2106
- Bump version from 5.0.0-dev0 to 5.0.0 by @Qubitium in #2107
Full Changelog: v4.2.5...v5.0.0