examples/GPTQ/README.md
- [FMS Model Optimizer requirements](../../README.md#requirements)
- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
  - It is advised to install from source if you plan to use GPTQv2.
- Optionally, for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness):

  ```
  pip install lm-eval
  ```
```
    --quant_method gptq \
    --output_dir Meta-Llama-3-8B-GPTQ \
    --bits 4 \
    --group_size 128 \
    --use_version2 False \
    --v2_mem_device cpu
```

The model written to the specified output directory (`Meta-Llama-3-8B-GPTQ` in our case) can then be deployed and served for inference via `vLLM`.
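As a quick sanity check, the checkpoint can be loaded through vLLM's offline Python API; a minimal sketch, where the prompt and sampling settings are arbitrary choices rather than part of this example:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the GPTQ checkpoint; the quantization method is usually
# auto-detected from the checkpoint config, but can be stated explicitly.
llm = LLM(model="Meta-Llama-3-8B-GPTQ", quantization="gptq")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Briefly explain weight-only quantization."], params)
print(outputs[0].outputs[0].text)
```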
| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|

- Quantized model with the settings shown above (`desc_act` defaults to `False`):

> There is some randomness in generating the model and data, so the resulting accuracy may vary by ~$\pm$ 0.05.
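Results like these can also be reproduced through lm-eval's Python API; a minimal sketch, where the task list and few-shot count are illustrative rather than prescribed by this example:

```python
import lm_eval

# Evaluate the quantized checkpoint with the Hugging Face backend.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Meta-Llama-3-8B-GPTQ",
    tasks=["wikitext"],
    num_fewshot=5,
)
print(results["results"])
```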
## Code Walk-through
1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py). GPTQv1 is supported by default. To use GPTQv2, set the parameter `v2` to `True` and `v2_memory_device` to `cpu`.

    ```python
    from gptqmodel import GPTQModel, QuantizeConfig

    # Assemble the quantization config from the parsed command-line arguments.
    quant_config = QuantizeConfig(
        bits=gptq_args.bits,
        group_size=gptq_args.group_size,
        desc_act=gptq_args.desc_act,
        damp_percent=gptq_args.damp_percent,
        v2=gptq_args.use_version2,
        v2_memory_device=gptq_args.v2_mem_device,
    )
    ```
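The intermediate walk-through steps (loading the base model, preparing calibration data, and quantizing) are not reproduced here. As a rough sketch of the usual `gptqmodel` flow under that config, where the model ID and calibration samples are illustrative assumptions rather than this example's actual code:

```python
from gptqmodel import GPTQModel

# Toy calibration set for illustration only; real runs should use a few
# hundred representative samples.
calibration_dataset = [
    "FMS Model Optimizer reduces serving costs.",
    "GPTQ compresses LLM weights to 4 bits.",
]

# Load the base model with the quantization config, quantize against the
# calibration data, then save the result.
model = GPTQModel.load("meta-llama/Meta-Llama-3-8B", quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save("Meta-Llama-3-8B-GPTQ")
```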
```python
tokenizer.save_pretrained(output_dir)  # optional
```
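Once saved, the checkpoint can be reloaded for a quick smoke test. A minimal sketch using `gptqmodel`'s `GPTQModel.load`, with an arbitrary prompt:

```python
from gptqmodel import GPTQModel

# Reload the quantized checkpoint and generate a few tokens as a sanity check.
model = GPTQModel.load("Meta-Llama-3-8B-GPTQ")
tokens = model.generate("The capital of France is")[0]
print(model.tokenizer.decode(tokens))
```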
> [!NOTE]
> 1. GPTQ of a 70B model usually takes ~4-10 hours on an A100 with GPTQv1.