
Commit a834641

feat: GPTQv2 enablement

Signed-off-by: omobayode.fagbohungbe <[email protected]>

1 parent 7467f68

File tree: 3 files changed, +31 −13 lines changed


examples/GPTQ/README.md

Lines changed: 27 additions & 13 deletions
@@ -7,6 +7,7 @@ For generative LLMs, very often the bottleneck of inference is no longer the com
 
 - [FMS Model Optimizer requirements](../../README.md#requirements)
 - `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
+- It is advised to install from source if you plan to use GPTQv2
 - Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
 ```
 pip install lm-eval
@@ -41,7 +42,10 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 --quant_method gptq \
 --output_dir Meta-Llama-3-8B-GPTQ \
 --bits 4 \
---group_size 128
+--group_size 128 \
+--use_version2 False \
+--v2_mem_device cpu \
+
 ```
 The model that can be found in the specified output directory (`Meta-Llama-3-8B-GPTQ` in our case) can be deployed and inferenced via `vLLM`.
 
@@ -89,26 +93,34 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 | | | |none | 5|perplexity|↓ |3.7915|± |0.0727|
 
 - Quantized model with the settings showed above (`desc_act` default to False.)
-
-|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
-| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
-| | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|
+- GPTQv1
+
+|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
+| | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|
+
+- GPTQv2
+
+|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6817 |± |0.0065|
+| | | |none | 5|perplexity|↓ |4.3994 |± |0.0995|
 
 - Quantized model with `desc_act` set to `True` (could improve the model quality, but at the cost of inference speed.)
-
-|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
-| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
-| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|
+- GPTQv1
+|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
+| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|
 
 > [!NOTE]
 > There is some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.
 
 
 ## Code Walk-through
 
-1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
+1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py). GPTQv1 is supported by default. To use GPTQv2, set the parameter `v2` to `True` and `v2_memory_device` to `cpu`.
 
 ```python
 from gptqmodel import GPTQModel, QuantizeConfig
@@ -118,6 +130,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 group_size=gptq_args.group_size,
 desc_act=gptq_args.desc_act,
 damp_percent=gptq_args.damp_percent,
+v2=gptq_args.use_version2,
+v2_memory_device=gptq_args.v2_mem_device,
 )
 
 ```
@@ -158,4 +172,4 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 tokenizer.save_pretrained(output_dir) # optional
 ```
 > [!NOTE]
-> 1. GPTQ of a 70B model usually takes ~4-10 hours on A100.
+> 1. GPTQ of a 70B model usually takes ~4-10 hours on A100 with GPTQv1.
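
For orientation, here is a minimal sketch of what the README's GPTQv2 path amounts to when driven from Python directly. The `QuantizeConfig` keyword names (`v2`, `v2_memory_device`) and the `bits`/`group_size`/`desc_act` values come from this commit; the model id, calibration text, and the `GPTQModel.load`/`quantize`/`save` flow are assumptions based on upstream `gptqmodel` usage, not part of this change.

```python
# Illustrative sketch only: GPTQv2 quantization with the settings from the
# README example. Model id and calibration data are placeholders.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,                  # --bits 4
    group_size=128,          # --group_size 128
    desc_act=False,          # README default
    v2=True,                 # enable GPTQv2 (the README example passes --use_version2 False)
    v2_memory_device="cpu",  # keep GPTQv2's extra working memory on CPU
)

# Assumed model id and a token calibration sample; a real run needs a proper
# calibration dataset.
model = GPTQModel.load("meta-llama/Meta-Llama-3-8B", quant_config)
model.quantize(["GPTQ calibrates per-layer quantization on sample text."])
model.save("Meta-Llama-3-8B-GPTQ")
```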

fms_mo/run_quant.py

Lines changed: 2 additions & 0 deletions
@@ -140,6 +140,8 @@ def run_gptq(model_args, data_args, opt_args, gptq_args):
 group_size=gptq_args.group_size,
 desc_act=gptq_args.desc_act,
 damp_percent=gptq_args.damp_percent,
+v2=gptq_args.use_version2,
+v2_memory_device=gptq_args.v2_mem_device,
 )
 
 # Add custom model_type mapping to gptqmodel LUT so GPTQModel can recognize them.
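
As a quick illustration of how the two new `gptq_args` fields feed the quantization config inside `run_gptq`, here is a self-contained, hypothetical stand-in for the parsed arguments; only fields visible in this diff are modeled, and the values are placeholders rather than fms-mo defaults.

```python
# Hypothetical stand-in for the parsed GPTQArguments; real runs get this object
# from fms-mo's argument parsing, not from a SimpleNamespace.
from types import SimpleNamespace

from gptqmodel import QuantizeConfig

gptq_args = SimpleNamespace(
    bits=4,
    group_size=128,
    desc_act=False,
    damp_percent=0.01,      # placeholder value
    use_version2=True,      # new field added by this commit
    v2_mem_device="cpu",    # new field added by this commit
)

# Mirrors the QuantizeConfig call in run_gptq after this change.
quantize_config = QuantizeConfig(
    bits=gptq_args.bits,
    group_size=gptq_args.group_size,
    desc_act=gptq_args.desc_act,
    damp_percent=gptq_args.damp_percent,
    v2=gptq_args.use_version2,
    v2_memory_device=gptq_args.v2_mem_device,
)
```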

fms_mo/training_args.py

Lines changed: 2 additions & 0 deletions
@@ -206,6 +206,8 @@ class GPTQArguments(TypeChecker):
 use_cuda_fp16: bool = True
 autotune_warmup_after_quantized: bool = False
 cache_examples_on_gpu: bool = True
+use_version2: bool = False
+v2_mem_device: Optional[str] = field(default="cpu", metadata={"choices": ["auto", "cpu", "cuda"]})
 
 
 @dataclass
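
Viewed in isolation, the new configuration surface is just two dataclass fields. Below is a trimmed, hypothetical mirror of them (the real `GPTQArguments` carries many more fields and inherits from `TypeChecker`), with a one-line example of enabling GPTQv2.

```python
# Trimmed, illustrative mirror of the two fields added to GPTQArguments.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GPTQv2Args:
    use_version2: bool = False  # False keeps the existing GPTQv1 behaviour
    v2_mem_device: Optional[str] = field(
        default="cpu",
        metadata={"choices": ["auto", "cpu", "cuda"]},  # devices accepted for v2 working memory
    )


# Example: request GPTQv2 with its working memory held on CPU.
args = GPTQv2Args(use_version2=True, v2_mem_device="cpu")
```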
