PyTorch 3x document #1824

Merged · 44 commits · Jun 4, 2024

Commits
a655138
draft for document
xin3he May 28, 2024
ebc260b
Update PT_MXQuant.md
mengniwang95 May 31, 2024
004af16
Update PT_MXQuant.md
mengniwang95 May 31, 2024
b2056de
add doc for sq
violetch24 May 31, 2024
c73145e
update PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
ab0dde4
update PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
36282b3
add static quant doc
violetch24 May 31, 2024
fd4aeed
minor fix
violetch24 May 31, 2024
dc68e5d
upload PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
1a0cfad
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2024
bdecc7c
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel May 31, 2024
ed055b3
update Torch.md
xin3he May 31, 2024
c62c6f0
add quantization.md
xin3he May 31, 2024
7eb0574
add mix-precision docs
zehao-intel May 31, 2024
884c100
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2024
14d0c54
Merge remote-tracking branch 'origin/master' into pt_doc
xin3he Jun 3, 2024
ea6849c
merge TF docs and move folder
xin3he Jun 3, 2024
1cc9afd
docs refine
violetch24 Jun 3, 2024
fd649bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
bb2111f
Add pt dynamic quant (#1833)
yiliu30 Jun 3, 2024
213c60f
add save
xin3he Jun 3, 2024
c51a39c
remove style
xin3he Jun 3, 2024
0b0ee32
correct filename
yiliu30 Jun 3, 2024
036838a
remove file
yiliu30 Jun 3, 2024
990b576
update per review
xin3he Jun 3, 2024
c8bdcca
update PT_WeightOnlyQuant.md
Kaihui-intel Jun 3, 2024
4d0e9d8
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel Jun 3, 2024
6a50b93
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
b8d0d28
add pt2e static
yiliu30 Jun 3, 2024
68e436c
fix for comment
violetch24 Jun 3, 2024
f7836dd
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
violetch24 Jun 3, 2024
b869599
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
86f9957
update PT_WeightOnlyQuant.md
Kaihui-intel Jun 3, 2024
191f191
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel Jun 3, 2024
633e522
update PT_WeightOnlyQuant.md layer-wise
Kaihui-intel Jun 3, 2024
b518efe
Update PT_MXQuant.md
mengniwang95 Jun 3, 2024
a66456f
dix typo
yiliu30 Jun 3, 2024
e581516
update imgs for 3x pt
xin3he Jun 4, 2024
a7429f4
update per review
xin3he Jun 4, 2024
bdca2be
Update PT_MXQuant.md
mengniwang95 Jun 4, 2024
dac08cf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 4, 2024
28b3fdc
Merge branch 'master' into pt_doc
xin3he Jun 4, 2024
006879c
refine mix-precision
zehao-intel Jun 4, 2024
40ddaed
add note for ipex
yiliu30 Jun 4, 2024
132 changes: 123 additions & 9 deletions docs/source/3x/PT_WeightOnlyQuant.md
@@ -2,8 +2,10 @@
PyTorch Weight Only Quantization
===============
- [Introduction](#introduction)
- [Supported Matrix](#supported-matrix)
- [Usage](#usage)
- [Get Started](#get-started)
- [Common arguments](#common-arguments)
- [RTN](#rtn)
- [GPTQ](#gptq)
- [AutoRound](#autoround)
@@ -14,21 +16,68 @@ PyTorch Weight Only Quantization
- [Saving and Loading](#saving-and-loading)
- [Examples](#examples)


## Introduction

Intel(R) Neural Compressor's 3.x new API provides support for quantizing PyTorch models with weight-only quantization (WeightOnlyQuant).
As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that can meet the computational demands of these architectures while maintaining accuracy. Compared with normal quantization such as W8A8, weight-only quantization is usually a better trade-off between performance and accuracy: as discussed below, the bottleneck of deploying LLMs is memory bandwidth, and weight-only quantization typically preserves accuracy better.

Model inference: Roughly speaking, two key steps are required to get the model's result. The first is moving the model from memory to the cache piece by piece, where memory bandwidth $B$ and parameter count $P$ are the key factors; theoretically, the time cost is $P \times 4 / B$ (assuming 4 bytes per fp32 parameter). The second is computation, where the device's computation capacity $C$ measured in FLOPS and the forward FLOPs $F$ play the key roles; theoretically, the cost is $F / C$.

Text generation: The most famous application of LLMs is text generation, which predicts the next token/word based on the inputs/context. To generate a sequence of texts, the model predicts tokens one by one. In this scenario, $F \approx P$ if some operations like bmm are ignored and past key values have been cached. However, the $C/B$ ratio of modern devices can be on the order of 100×, which makes memory bandwidth the bottleneck for text generation.
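To make this concrete, here is a back-of-the-envelope sketch; the parameter count, bandwidth, and FLOPS numbers below are illustrative assumptions, not measurements:

```python
# Rough per-token cost estimate for decoding with a hypothetical 7B-parameter LLM.
P = 7e9        # parameter count (illustrative)
B = 2e12       # memory bandwidth in bytes/s, ~2 TB/s (illustrative)
C = 300e12     # compute capacity in FLOPS, ~300 TFLOPS (illustrative)
F = P          # forward FLOPs per generated token, using the approximation F ≈ P above

t_mem_fp32 = P * 4 / B    # time to stream fp32 weights (4 bytes/parameter)
t_mem_int4 = P * 0.5 / B  # time to stream 4-bit weights (0.5 bytes/parameter)
t_compute = F / C         # time spent on arithmetic

print(f"fp32 weight streaming: {t_mem_fp32 * 1e3:.2f} ms/token")  # ~14 ms
print(f"int4 weight streaming: {t_mem_int4 * 1e3:.2f} ms/token")  # ~1.75 ms
print(f"compute:               {t_compute * 1e3:.3f} ms/token")   # ~0.023 ms
# Streaming the weights dominates, so shrinking them (weight-only quantization)
# directly reduces per-token latency.
```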

Besides, as mentioned in many papers [1][2], activation quantization is the main cause of the accuracy drop. So for text-generation tasks, weight-only quantization is the preferred option in most cases.

Theoretically, round-to-nearest (RTN) is the most straightforward way to quantize weights using scales. However, when the number of bits is small (e.g. 3), the MSE loss is larger than expected. A group size is introduced so that fewer elements share the same scale, which improves accuracy.
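As a minimal sketch of group-wise RTN on a single weight tensor (illustrative only; the actual implementation in Neural Compressor differs in details such as data layout, zero points, and supported dtypes):

```python
import torch


def rtn_quantize(weight: torch.Tensor, bits: int = 4, group_size: int = 32) -> torch.Tensor:
    """Symmetric round-to-nearest fake quantization with one scale per group."""
    out_ch, in_ch = weight.shape
    w = weight.reshape(out_ch, in_ch // group_size, group_size)
    max_int = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit symmetric
    scale = w.abs().amax(dim=-1, keepdim=True) / max_int
    q = torch.clamp(torch.round(w / scale), -max_int - 1, max_int)
    return (q * scale).reshape(out_ch, in_ch)            # dequantized ("fake quant") weight


w = torch.randn(16, 256)
for gs in (256, 64, 16):
    err = (w - rtn_quantize(w, bits=4, group_size=gs)).abs().mean()
    print(f"group_size={gs}: mean abs error {err:.4f}")  # smaller groups -> lower error, more scales to store
```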


## Supported Matrix

For detailed information on quantization fundamentals, please refer to the [Quantization](./quantization.md) document.
| Algorithms/Backend | PyTorch eager mode |
|--------------|----------|
| RTN | ✔ |
| GPTQ | ✔ |
| AutoRound| ✔ |
| AWQ | ✔ |
| TEQ | ✔ |
| HQQ | ✔ |
> **RTN:** The most intuitive quantization method. It does not require an additional dataset and is a very fast quantization method. Generally speaking, RTN converts the weight into a uniformly distributed integer data type, although some algorithms, such as QLoRA, propose a non-uniform NF4 data type and prove its theoretical optimality.

> **GPTQ:** A one-shot weight quantization method based on approximate second-order information, which is both highly accurate and highly efficient [4]. The weights of each column are updated based on the fixed-scale pseudo-quantization error and the inverse of the Hessian matrix calculated from the activations. The updated columns sharing the same scale may produce a new max/min value, so the scale needs to be saved for restoration.

> **AutoRound:** AutoRound is an advanced weight-only quantization algorithm for low-bit LLM inference. It is tailored for a wide range of models and consistently delivers noticeable improvements, often significantly outperforming SignRound [6], at the cost of more tuning time for quantization.

> **AWQ:** Proves that protecting only 1% of the salient weights can greatly reduce quantization error. The salient weight channels are selected by observing the per-channel distribution of activations and weights. The salient weights are multiplied by a large scale factor before quantization so that they are preserved.

> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights.

> **HQQ:** The HQQ[7] method focuses specifically on minimizing errors in the weights rather than the layer activation. Additionally, by incorporating a sparsity-promoting loss, such as the $l_{p<1}$-norm, we effectively model outliers through a hyper-Laplacian distribution. This distribution more accurately captures the heavy-tailed nature of outlier errors compared to the squared error, resulting in a more nuanced representation of error distribution.

## Usage

### Get Started

PyTorch models are quantized for WeightOnlyQuant with the prepare and convert [APIs](./PyTorch.md#quantization-apis).

#### Common arguments
| Config | Capability |
|---|---|
| dtype (str)| ['int', 'nf4', 'fp4'] |
| bits (int)| [1, ..., 8] |
| group_size (int)| [-1, 1, ..., $C_{in}$] |
| use_sym (bool)| [True, False] |

Notes:
- *group_size = -1* refers to **per output channel quantization**. Taking a linear layer (input channel = $C_{in}$, output channel = $C_{out}$) as an example, when *group_size = -1*, quantization calculates $C_{out}$ quantization parameters in total. Otherwise, when *group_size = gs*, quantization parameters are calculated for every $gs$ elements along the input channel, leading to $C_{out} \times (C_{in} / gs)$ quantization parameters in total.
- 4-bit NormalFloat(NF4) is proposed in QLoRA[5]. 'fp4' includes [fp4_e2m1](../../neural_compressor/adaptor/torch_utils/weight_only.py#L37) and [fp4_e2m1_bnb](https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L735). By default, fp4 refers to fp4_e2m1_bnb.
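These common fields map directly onto each algorithm's config object. A minimal sketch using `RTNConfig` (the argument values chosen here are illustrative):

```python
from neural_compressor.torch.quantization import RTNConfig

# 4-bit symmetric integer weights with one scale per 128 input-channel elements.
quant_config = RTNConfig(dtype="int", bits=4, use_sym=True, group_size=128)
```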


#### RTN
| rtn_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| group_dim (int) | Dimension for grouping | Default is 1 |
| use_full_range (bool) | Enables full range for activations | Default is False |
| use_mse_search (bool) | Enables mean squared error (MSE) search | Default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| model_path (str) | Model path that is used to load state_dict per layer | |

``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, RTNConfig

quant_config = RTNConfig()
model = prepare(model, quant_config)
model = convert(model)
```

#### GPTQ
| gptq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| use_mse_search (bool) | Enables mean squared error (MSE) search | Default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| model_path (str) | Model path that is used to load state_dict per layer | |
| use_double_quant (bool) | Enables double quantization | Default is False |
| act_order (bool) | Whether to sort Hessian's diagonal values to rearrange channel-wise quantization order | Default is False |
| percdamp (float) | Percentage of Hessian's diagonal values' average, which will be added to Hessian's diagonal to increase numerical stability | Default is 0.01. |
| block_size (int) | Execute GPTQ quantization per block, block shape = [C_out, block_size] | Default is 128 |
| static_groups (bool) | Whether to calculate group wise quantization parameters in advance. This option mitigate actorder's extra computational requirements. | Default is False. |

``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, GPTQConfig

quant_config = GPTQConfig()
model = prepare(model, quant_config)
run_fn(model)  # run_fn: user-defined calibration function
model = convert(model)
```
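Note that `act_order` and `static_groups` interact: sorting channels by the Hessian diagonal changes the quantization order, while precomputing group-wise parameters keeps that reordering cheap at runtime. A sketch with illustrative values:

```python
from neural_compressor.torch.quantization import GPTQConfig

# Hessian-ordered quantization with precomputed group parameters (illustrative values).
quant_config = GPTQConfig(act_order=True, static_groups=True, block_size=128, percdamp=0.01)
```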

#### AutoRound

| autoround_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| enable_full_range (bool) | Whether to enable full range quantization | Default is False |
| batch_size (int) | Batch size for training | Default is 8 |
| lr_scheduler | The learning rate scheduler to be used | None |
| enable_quanted_input (bool) | Whether to use quantized input data | Default is True |
| enable_minmax_tuning (bool) | Whether to enable min-max tuning | Default is True |
| lr (float) | The learning rate | Default is 0 |
| minmax_lr (float) | The learning rate for min-max tuning | Default is None |
| low_gpu_mem_usage (bool) | Whether to use low GPU memory | Default is True |
| iters (int) | Number of iterations | Default is 200 |
| seqlen (int) | Length of the sequence | Default is 2048 |
| n_samples (int) | Number of samples | Default is 512 |
| sampler (str) | The sampling method | Default is "rand" |
| seed (int) | The random seed | Default is 42 |
| n_blocks (int) | Number of blocks | Default is 1 |
| gradient_accumulate_steps (int) | Number of gradient accumulation steps | Default is 1 |
| not_use_best_mse (bool) | Whether to use mean squared error | Default is False |
| dynamic_max_gap (int) | The dynamic maximum gap | Default is -1 |
| scale_dtype (str) | The data type of quantization scale to be used, different kernels have different choices | Default is "float16" |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig

quant_config = AutoRoundConfig()
model = prepare(model, quant_config)
run_fn(model)  # run_fn: user-defined calibration/tuning function
model = convert(model)
```

#### AWQ

| awq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| group_dim (int) | Dimension for grouping | default is 1 |
| use_full_range (bool) | Enables full range for activations | default is False |
| use_mse_search (bool) | Enables mean squared error (MSE) search | default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| use_auto_scale (bool) | Enables best scales search based on activation distribution | default is True |
| use_auto_clip (bool) | Enables clip range search | default is True |
| folding (bool) | Allows inserting a mul before a linear layer when the scale cannot be absorbed by the previous layer | Default is False |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, AWQConfig

quant_config = AWQConfig()
model = prepare(model, quant_config, example_inputs=example_inputs)  # example_inputs: user-provided sample batch
run_fn(model)  # run_fn: user-defined calibration function
model = convert(model)
```

#### TEQ

| teq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| group_dim (int) | Dimension for grouping | default is 1 |
| use_full_range (bool) | Enables full range for activations | default is False |
| use_mse_search (bool) | Enables mean squared error (MSE) search | default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| use_double_quant (bool) | Enables double quantization | default is False |
| folding (bool) | Allows inserting a mul before a linear layer when the scale cannot be absorbed by the previous layer | Default is False |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, TEQConfig

quant_config = TEQConfig()
model = prepare(model, quant_config, example_inputs=example_inputs)  # example_inputs: user-provided sample batch
train_fn(model)  # train_fn: user-defined function that trains the equivalent transformation
model = convert(model)
```

#### HQQ

| hqq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| quant_zero (bool) | Whether to quantize zero point | Default is True |
| quant_scale (bool) | Whether to quantize the scale | Default is False |
| scale_quant_group_size (int) | The group size for quantizing scale | Default is 128 |
| skip_lm_head (bool) | Whether to skip quantizing lm_head | Default is True |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, HQQConfig

quant_config = HQQConfig()
model = prepare(model, quant_config)
model = convert(model)
```
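Quantization rules can also be scoped to individual layers with `set_local`, for example to keep the `lm_head` layer unquantized. A minimal sketch (the `RTNConfig` choice and the `dtype="fp32"` fallback are illustrative):

```python
from neural_compressor.torch.quantization import prepare, convert, RTNConfig

quant_config = RTNConfig()                  # global weight-only settings
lm_head_config = RTNConfig(dtype="fp32")    # keep lm_head in fp32 (illustrative fallback)
quant_config.set_local("lm_head", lm_head_config)

model = prepare(model, quant_config)
model = convert(model)
```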

### Saving and Loading
The saved_results folder contains two files: quantized_model.pt and qconfig.json, and the generated model is a quantized model. The quantized model contains WeightOnlyLinear modules. To support low-memory inference, Intel(R) Neural Compressor implements WeightOnlyLinear, a torch.nn.Module, to compress the fake-quantized fp32 model. Since PyTorch does not provide flexible storage for low-bit data types, WeightOnlyLinear packs the low-bit data (weights and zero points) into a wider integer data type such as torch.int8 or torch.int32. When WeightOnlyLinear is used for inference, it restores the compressed data to float32 and runs the torch linear function.
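To illustrate the packing idea only (this is not Neural Compressor's exact storage layout), the sketch below packs two 4-bit values into each uint8 element and unpacks them again before computation:

```python
import torch

qweight = torch.randint(0, 16, (8,), dtype=torch.uint8)  # eight fake 4-bit values
packed = qweight[0::2] | (qweight[1::2] << 4)             # pack two nibbles per byte

# At inference time the compressed data is unpacked and dequantized back to float.
low = packed & 0xF
high = (packed >> 4) & 0xF
unpacked = torch.stack([low, high], dim=1).flatten()
assert torch.equal(unpacked, qweight)
```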
```python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, load, RTNConfig

model = convert(prepare(model, RTNConfig()))
model.save("saved_results")  # the save/load argument names shown here are illustrative
loaded_model = load("saved_results", original_model=model)
```
## Examples

Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm) on how to quantize a model with WeightOnlyQuant.

## Reference

[1]. Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022).

[2]. Wei, Xiuying, et al. "Outlier suppression: Pushing the limit of low-bit transformer language models." arXiv preprint arXiv:2209.13325 (2022).

[3]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).

[4]. Frantar, Elias, et al. "Gptq: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).

[5]. Dettmers, Tim, et al. "Qlora: Efficient finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023).

[6]. Cheng, Wenhua, et al. "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs." arXiv preprint arXiv:2309.05516 (2023).

[7]. Badri, Hicham and Shaji, Appu. "Half-Quadratic Quantization of Large Machine Learning Models." [Online] Available: https://mobiusml.github.io/hqq_blog/ (2023).