GPTQ example refinements (#1145)
Signed-off-by: YIYANGCAI <[email protected]>
YIYANGCAI authored Aug 18, 2023
1 parent 6ee8466 commit 66f7c10
Showing 10 changed files with 378 additions and 349 deletions.
3 changes: 3 additions & 0 deletions .azure-pipelines/scripts/codeScan/pyspelling/inc_dict.txt
@@ -15,6 +15,7 @@ acc
Acc
accuracies
acdc
actorder
ACDC
Acknowledgement
activations
@@ -1253,6 +1254,8 @@ npz
nq
nrix
ns
nsample
nsamples
nsdf
nSsKchNAySU
nthreads
17 changes: 13 additions & 4 deletions docs/source/quantization_weight_only.md
@@ -38,7 +38,7 @@ There are many excellent works for weight only quantization to improve its accur
| bits | [1-8] |
| group_size | [-1, 1-N] |
| scheme | ['asym', 'sym'] |
| algorithm | ['RTN', 'AWQ', 'GPTQ'] |

**RTN arguments**:
| rtn_args | default value | comments |
@@ -53,8 +53,18 @@ There are many excellent works for weight only quantization to improve its accur
| mse_range | True | Whether to search for the best clip ratio over the range [0.89, 1.0] with step 0.01 |
| folding | False | If False, a mul op can be inserted before a linear layer when the scale cannot be absorbed by the preceding layer; if True, it cannot |

**GPTQ arguments**:
| gptq_args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| actorder | False | Whether to sort the Hessian's diagonal values to rearrange the channel-wise quantization order |
| percdamp | 0.01 | Percentage of the average of the Hessian's diagonal values, added to the diagonal for numerical stability |
| nsamples | 128 | Number of calibration samples |
| pad_max_length | 2048 | Length to which calibration data is aligned. This value should not exceed the model's maximum sequence length; refer to the model's config JSON for it. |
| use_max_length | False | Whether to align all calibration data to a fixed length, which equals pad_max_length |
| block_size | 128 | Number of channels in one block for a single GPTQ quantization iteration |


**Note**: `group_size=-1` indicates per-channel quantization over each output channel. `group_size=[1-N]` indicates splitting the input channels into groups of group_size elements. In GPTQ, **group_size** refers to the number of channels that share the same quantization parameters; for example, with `group_size=128`, a layer with 4096 input channels is split into 32 groups, each with its own scale and zero point.
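As a sketch of how these options plug in, the following illustrative configuration enables GPTQ through the `recipes` argument; it assumes the Neural Compressor 2.x `PostTrainingQuantConfig` API shown later in this document, and the argument values are examples, not recommendations:

```python
from neural_compressor import PostTrainingQuantConfig

conf = PostTrainingQuantConfig(
    approach='weight_only',
    op_type_dict={
        '.*': {  # regular expression matching operator names
            'weight': {
                'bits': 4,
                'group_size': 128,
                'scheme': 'sym',
                'algorithm': 'GPTQ',
            },
        },
    },
    recipes={
        'gptq_args': {'percdamp': 0.01, 'actorder': True, 'block_size': 128,
                      'nsamples': 128, 'use_max_length': False, 'pad_max_length': 2048},
    },
)
```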

### **Export Compressed Model**
To support low-memory inference, Neural Compressor implements WeightOnlyLinear, a torch.nn.Module, to compress the fake-quantized fp32 model. Since torch does not provide flexible data-type storage, WeightOnlyLinear packs the low-bit data into a wider data type, such as torch.int8 or torch.int32. The low-bit data includes weights and zero points. When WeightOnlyLinear is used for inference, it restores the compressed data to float32 and runs the torch linear function.
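The packing idea can be pictured with a minimal, hypothetical sketch (the actual WeightOnlyLinear storage layout may differ): eight 4-bit values fit into one int32.

```python
import torch

def pack_int4(qweight: torch.Tensor) -> torch.Tensor:
    # Pack unsigned 4-bit values (integers in [0, 15]) along the last
    # dimension: eight nibbles per int32.
    assert qweight.shape[-1] % 8 == 0
    q = qweight.to(torch.int32).reshape(*qweight.shape[:-1], -1, 8)
    packed = torch.zeros(q.shape[:-1], dtype=torch.int32)
    for i in range(8):
        packed |= q[..., i] << (4 * i)  # place nibble i into bits [4i, 4i+4)
    return packed

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    # Recover the eight 4-bit values stored in each int32.
    nibbles = [(packed >> (4 * i)) & 0xF for i in range(8)]
    return torch.stack(nibbles, dim=-1).reshape(*packed.shape[:-1], -1)

qw = torch.randint(0, 16, (4, 16))  # fake 4-bit quantized weights
assert torch.equal(unpack_int4(pack_int4(qw)), qw.to(torch.int32))
```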
Expand Down Expand Up @@ -82,9 +92,8 @@ conf = PostTrainingQuantConfig(
},
},
},
recipes={
# 'gptq_args':{'percdamp': 0.01, 'actorder':True, 'block_size': 128, 'nsamples': 128, 'use_full_length': False},
'awq_args':{'auto_scale': True, 'mse_range': True, 'n_blocks': 5},
},
)
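
With a config like the one above, the quantization call would look roughly as follows; the model and calibration dataloader are assumed to be user-provided:

```python
from neural_compressor import quantization

# GPTQ (and AWQ) derive statistics from calibration data, so a dataloader is required.
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
q_model.save('./saved_results')
```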

