GPTQModel

Production ready LLM model compression/quantization toolkit with accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.

News

Archived News

* 12/23/2024 [1.5.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.0): Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized. * 12/19/2024 [1.4.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.5): Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage. * 12/15/2024 [1.4.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.2): MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added. * 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` api added to allow Transformers/Optimum/PEFT and vLLM to correctly loaded GPTQModel quantized models while upstream PRs are in pending status. * 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:`:default, and `-:` negative matching which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kerenl selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refractor and cleanup in-preparation for transformers/optimum/peft upstream PR merge. Deprecated the saving of `Marlin` weight format since `Marlin` supports auto conversion of `gptq` format to `Marlin` during runtime.

What is GPTQModel?

GPTQModel originated as major refractor of AutoGPTQ but is now a full-stand-in replacement with cleaner api, up-to-date model support, faster inference, higher quality quants.

Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on-par and/or exceeds other 4bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and rps. GPTQ has the optimal blend of quality and inference speed you need in a real-world production deployment.

Features

✨ Native integration with HF Transformers (main), Optimum (main), and Peft (main)
🚀 vLLM and SGLang inference integration for quantized model with format = FORMAT.GPTQ
🚀 Extensive model support for: Ovis VL, Llama 1-3.3, Qwen2-VL, Olmo2, Hymba, GLM, IBM Granite, Llama 3.2 Vision, MiniCPM3, GRIN-Moe, Phi 1-4, EXAONE 3.0, InternLM 2.5, Gemma 2, DeepSeek-V2, DeepSeek-V2-Lite, ChatGLM, MiniCPM, Qwen2MoE, DBRX.
✨ Linux, MacOS, Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), CPU (Intel/AMD/Apple Silicon).
💯 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
✨ Dynamic mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization all together.
🚀 Intel/IPEX hardware accelerated quantization/inference for CPU [avx, amx, xmx] and Intel GPU [Arc + Datacenter Max].
🚀 Microsoft/BITBLAS format + dynamically compiled inference.
✨ Intel/AutoRound alternative gptq-inference compatible quantization method.
✨ Asymmetric Sym=False support. Model weights sharding support with optional hash check of model weights on load.
✨ lm_head module quant inference support for further VRAM reduction.
🚀 45% faster packing stage in quantization (Llama 3.1 8B). 50% faster PPL calculations (OPT).

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

🤗 ModelCloud quantized Vortex models on HF

Model Support

Model
Baichuan	✅	Falcon	✅	Llama 1-3.3	✅	OLMo2	✅	Yi	✅
Bloom	✅	Gemma 2	✅	Llama 3.2 VL	✅	Ovis 1.6	✅	XVERSE	✅
ChatGLM	✅	GPTBigCod	✅	LongLLaMA	✅	Phi 1-4	✅
CodeGen	✅	GPTNeoX	✅	MiniCPM3	✅	Qwen	✅
Cohere 1-2	✅	GPT-2	✅	Mistral	✅	Qwen2 MoE	✅
DBRX Converted	✅	GPT-J	✅	Mixtral	✅	Qwen2 VL	✅
Deci	✅	Granite	✅	MobileLLM	✅	RefinedWeb	✅
DeepSeek-V2/V3/R1	✅	GRIN-MoE	✅	MOSS	✅	StableLM	✅
DeepSeek-V2-Lite	✅	Hymba	✅	MPT	✅	StarCoder2	✅
EXAONE 3.0	✅	InternLM 1/2.5	✅	OPT	✅	TeleChat2	✅

Platform and HW Support

GPTQModel is validated for Linux, MacOS, and Windows 11:

Platform	Device		Optimized Arch	Kernels
🐧 Linux	Nvidia GPU	✅	`Ampere+`	Marlin, Exllama V2, Exallma V1, Triton, DyanamicCuda, Torch
🐧 Linux	Intel XPU	✅	`Arc`, `Datacenter Max`	IPEX, Torch, Triton
🐧 Linux	AMD GPU	✅	`7900XT+`, `ROCm 6.2+`	Exllama V2, Exallma V1, DyanamicCuda, Torch
🐧 Linux	Intel/AMD CPU	✅	`avx`, `amx`, `xmx`	IPEX, Torch
🍎 MacOS	GPU (Metal) / CPU	✅	`Apple Silicon`, `M1+`	Torch, MLX via conversion
🪟 Windows	GPU (Nvidia) / CPU	✅	`Nvidia`	DynamicCuda, Torch

Install

PIP/UV

# You can install optional modules like autoround, ipex, vllm, sglang, bitblas, and ipex.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation

Install from source

# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# pip: compile and install
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas, and ipex.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation

Inference

Three line api to use GPTQModel for gptq model inference:

from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

OpenAI API compatible end-point

# load model using above inference guide first
model.serve(host="0.0.0.0",port="12345")

Quantization

Basic example of using GPTQModel to quantize a llm model:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)

model.save(quant_path)

# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

For more advanced features of model quantization, please reference to this script

How to Add Support for a New Model

Read the gptqmodel/models/llama.py code which explains in detail via comments how the model support is defined. Use it as guide to PR for to new models. Most models follow the same pattern.

Evaluation and Quality Benchmarks

GPTQModel inference is integrated into both lm-eval and evalplus
We highly recommend avoid using ppl and use lm-eval/evalplus to validate post-quantization model quality. ppl should only be used for regression tests and is not a good indicator of model output quality.

# gptqmodel is integrated into lm-eval >= v0.4.7
pip install lm-eval>=0.4.7

# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"

Below is a basic sample using GPTQModel.eval API

from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"

# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE], output_file='lm-eval_result.json')

# Use `evalplus` as framework to evaluate the model
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')

Dynamic Quantization (Per Module QuantizeConfig Override)

QuantizeConfig.dynamic is dynamic control which allows specific matching modules to be skipped for quantization (negative matching) or have a unique [bits, group_size, sym, desc_act, mse, pack_dtype] property override per matching module vs base QuantizeConfig (postive match with override).

Sample QuantizerConfig.dynamic usage:

dynamic = { 
    # `.*\.` matches the layers_node prefix 
    # layer index start at 0 
    
    # positive match: layer 19, gate module 
    r"+:.*\.18\..*gate.*": {"bits": 8, "group_size": 64},  
    
    # positgive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},  
    
    # negative match: skip layer 21, gate module
    r"-:.*\.20\..*gate.*": {"bits": 8, "group_size": 64}, 
    
    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},  
 }

Citation

@misc{gptqmodel,
    author = {ModelCloud.ai and [email protected]},
    title = {GPTQModel},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
    note = {Contact: [email protected]}
}

@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, 
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}

@article{frantar2024marlin,
  title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
  author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2408.11743},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 1,934 Commits
.buildkite		.buildkite
.github		.github
chat		chat
examples		examples
format		format
gptqmodel		gptqmodel
gptqmodel_ext		gptqmodel_ext
licenses		licenses
tests		tests
.gitignore		.gitignore
CREDITS.md		CREDITS.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

GPTQModel

News

What is GPTQModel?

Features

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

Model Support

Platform and HW Support

Install

PIP/UV

Install from source

Inference

OpenAI API compatible end-point

Quantization

How to Add Support for a New Model

Evaluation and Quality Benchmarks

Dynamic Quantization (Per Module QuantizeConfig Override)

Citation

About

Uh oh!

Releases

Packages

Languages

Uh oh!

License

Uh oh!

ZX-ModelCloud/GPTQModel

Folders and files

Latest commit

History

Repository files navigation

GPTQModel

News

What is GPTQModel?

Features

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

Model Support

Platform and HW Support

Install

PIP/UV

Install from source

Inference

OpenAI API compatible end-point

Quantization

How to Add Support for a New Model

Evaluation and Quality Benchmarks

Dynamic Quantization (Per Module QuantizeConfig Override)

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages