PyTorch 3x document #1824

Merged · 44 commits · Jun 4, 2024

Commits
a655138
draft for document
xin3he May 28, 2024
ebc260b
Update PT_MXQuant.md
mengniwang95 May 31, 2024
004af16
Update PT_MXQuant.md
mengniwang95 May 31, 2024
b2056de
add doc for sq
violetch24 May 31, 2024
c73145e
update PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
ab0dde4
update PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
36282b3
add static quant doc
violetch24 May 31, 2024
fd4aeed
minor fix
violetch24 May 31, 2024
dc68e5d
upload PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
1a0cfad
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2024
bdecc7c
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel May 31, 2024
ed055b3
update Torch.md
xin3he May 31, 2024
c62c6f0
add quantization.md
xin3he May 31, 2024
7eb0574
add mix-precision docs
zehao-intel May 31, 2024
884c100
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2024
14d0c54
Merge remote-tracking branch 'origin/master' into pt_doc
xin3he Jun 3, 2024
ea6849c
merge TF docs and move folder
xin3he Jun 3, 2024
1cc9afd
docs refine
violetch24 Jun 3, 2024
fd649bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
bb2111f
Add pt dynamic quant (#1833)
yiliu30 Jun 3, 2024
213c60f
add save
xin3he Jun 3, 2024
c51a39c
remove style
xin3he Jun 3, 2024
0b0ee32
correct filename
yiliu30 Jun 3, 2024
036838a
remove file
yiliu30 Jun 3, 2024
990b576
update per review
xin3he Jun 3, 2024
c8bdcca
update PT_WeightOnlyQuant.md
Kaihui-intel Jun 3, 2024
4d0e9d8
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel Jun 3, 2024
6a50b93
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
b8d0d28
add pt2e static
yiliu30 Jun 3, 2024
68e436c
fix for comment
violetch24 Jun 3, 2024
f7836dd
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
violetch24 Jun 3, 2024
b869599
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
86f9957
update PT_WeightOnlyQuant.md
Kaihui-intel Jun 3, 2024
191f191
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel Jun 3, 2024
633e522
update PT_WeightOnlyQuant.md layer-wise
Kaihui-intel Jun 3, 2024
b518efe
Update PT_MXQuant.md
mengniwang95 Jun 3, 2024
a66456f
dix typo
yiliu30 Jun 3, 2024
e581516
update imgs for 3x pt
xin3he Jun 4, 2024
a7429f4
update per review
xin3he Jun 4, 2024
bdca2be
Update PT_MXQuant.md
mengniwang95 Jun 4, 2024
dac08cf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 4, 2024
28b3fdc
Merge branch 'master' into pt_doc
xin3he Jun 4, 2024
006879c
refine mix-precision
zehao-intel Jun 4, 2024
40ddaed
add note for ipex
yiliu30 Jun 4, 2024
132 changes: 123 additions & 9 deletions docs/source/3x/PT_WeightOnlyQuant.md
@@ -2,8 +2,10 @@
PyTorch Weight Only Quantization
===============
- [Introduction](#introduction)
- [Supported Matrix](#supported-matrix)
- [Usage](#usage)
- [Get Started](#get-started)
- [Common arguments](#common-arguments)
- [RTN](#rtn)
- [GPTQ](#gptq)
- [AutoRound](#autoround)
@@ -14,21 +16,68 @@ PyTorch Weight Only Quantization
- [Saving and Loading](#saving-and-loading)
- [Examples](#examples)


## Introduction

Intel(R) Neural Compressor's 3.x new API provides support for quantizing PyTorch models with weight-only quantization (WeightOnlyQuant).
As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that can meet the computational demands of these architectures while maintaining accuracy. Compared with normal quantization such as W8A8, weight-only quantization is usually a better trade-off between performance and accuracy: as discussed below, the bottleneck of deploying LLMs is memory bandwidth, and weight-only quantization typically preserves accuracy better.

Model inference: Roughly speaking, two key steps are required to get the model's result. The first is moving the model from memory to the cache piece by piece, where memory bandwidth $B$ and parameter count $P$ are the key factors; theoretically, the time cost is $P \times 4 / B$ (assuming 4 bytes per fp32 parameter). The second is computation, where the device's computation capacity $C$ measured in FLOPS and the forward FLOPs $F$ play the key roles; theoretically, the cost is $F / C$.

Text generation: The most famous application of LLMs is text generation, which predicts the next token/word based on the inputs/context. To generate a sequence of texts, the model predicts tokens one by one. In this scenario, $F \approx P$ if some operations like bmm are ignored and past key values have been cached. However, the $C/B$ ratio of modern devices can be on the order of 100×, which makes memory bandwidth the bottleneck for text generation.
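To make this concrete, here is a back-of-the-envelope sketch; the parameter count, bandwidth, and FLOPS numbers below are illustrative assumptions, not measurements:

```python
# Rough per-token cost estimate for decoding with a hypothetical 7B-parameter LLM.
P = 7e9        # parameter count (illustrative)
B = 2e12       # memory bandwidth in bytes/s, ~2 TB/s (illustrative)
C = 300e12     # compute capacity in FLOPS, ~300 TFLOPS (illustrative)
F = P          # forward FLOPs per generated token, using the approximation F ≈ P above

t_mem_fp32 = P * 4 / B    # time to stream fp32 weights (4 bytes/parameter)
t_mem_int4 = P * 0.5 / B  # time to stream 4-bit weights (0.5 bytes/parameter)
t_compute = F / C         # time spent on arithmetic

print(f"fp32 weight streaming: {t_mem_fp32 * 1e3:.2f} ms/token")  # ~14 ms
print(f"int4 weight streaming: {t_mem_int4 * 1e3:.2f} ms/token")  # ~1.75 ms
print(f"compute:               {t_compute * 1e3:.3f} ms/token")   # ~0.023 ms
# Streaming the weights dominates, so shrinking them (weight-only quantization)
# directly reduces per-token latency.
```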

Besides, as mentioned in many papers [1][2], activation quantization is the main cause of the accuracy drop. So for text-generation tasks, weight-only quantization is the preferred option in most cases.

Theoretically, round-to-nearest (RTN) is the most straightforward way to quantize weights using scales. However, when the number of bits is small (e.g. 3), the MSE loss is larger than expected. A group size is introduced so that fewer elements share the same scale, which improves accuracy.
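As a minimal sketch of group-wise RTN on a single weight tensor (illustrative only; the actual implementation in Neural Compressor differs in details such as data layout, zero points, and supported dtypes):

```python
import torch


def rtn_quantize(weight: torch.Tensor, bits: int = 4, group_size: int = 32) -> torch.Tensor:
    """Symmetric round-to-nearest fake quantization with one scale per group."""
    out_ch, in_ch = weight.shape
    w = weight.reshape(out_ch, in_ch // group_size, group_size)
    max_int = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit symmetric
    scale = w.abs().amax(dim=-1, keepdim=True) / max_int
    q = torch.clamp(torch.round(w / scale), -max_int - 1, max_int)
    return (q * scale).reshape(out_ch, in_ch)            # dequantized ("fake quant") weight


w = torch.randn(16, 256)
for gs in (256, 64, 16):
    err = (w - rtn_quantize(w, bits=4, group_size=gs)).abs().mean()
    print(f"group_size={gs}: mean abs error {err:.4f}")  # smaller groups -> lower error, more scales to store
```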


## Supported Matrix

For detailed information on quantization fundamentals, please refer to the [Quantization](./quantization.md) document.
| Algorithms/Backend | PyTorch eager mode |
|--------------|----------|
| RTN | ✔ |
| GPTQ | ✔ |
| AutoRound| ✔ |
| AWQ | ✔ |
| TEQ | ✔ |
| HQQ | ✔ |
> **RTN:** The most intuitive quantization method. It does not require an additional dataset and is a very fast quantization method. Generally speaking, RTN converts the weight into a uniformly distributed integer data type, although some algorithms, such as QLoRA, propose a non-uniform NF4 data type and prove its theoretical optimality.

> **GPTQ:** A one-shot weight quantization method based on approximate second-order information, which is both highly accurate and highly efficient [4]. The weights of each column are updated based on the fixed-scale pseudo-quantization error and the inverse of the Hessian matrix calculated from the activations. The updated columns sharing the same scale may produce a new max/min value, so the scale needs to be saved for restoration.

> **AutoRound:** AutoRound is an advanced weight-only quantization algorithm for low-bit LLM inference. It is tailored for a wide range of models and consistently delivers noticeable improvements, often significantly outperforming SignRound [6], at the cost of more tuning time for quantization.

> **AWQ:** Proves that protecting only 1% of the salient weights can greatly reduce quantization error. The salient weight channels are selected by observing the per-channel distribution of activations and weights. The salient weights are multiplied by a large scale factor before quantization so that they are preserved.

> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights.

> **HQQ:** The HQQ[7] method focuses specifically on minimizing errors in the weights rather than the layer activation. Additionally, by incorporating a sparsity-promoting loss, such as the $l_{p<1}$-norm, we effectively model outliers through a hyper-Laplacian distribution. This distribution more accurately captures the heavy-tailed nature of outlier errors compared to the squared error, resulting in a more nuanced representation of error distribution.

## Usage

### Get Started

PyTorch models are quantized for WeightOnlyQuant with the prepare and convert [APIs](./PyTorch.md#quantization-apis).

#### Common arguments
| Config | Capability |
|---|---|
| dtype (str)| ['int', 'nf4', 'fp4'] |
| bits (int)| [1, ..., 8] |
| group_size (int)| [-1, 1, ..., $C_{in}$] |
| use_sym (bool)| [True, False] |

Notes:
- *group_size = -1* refers to **per output channel quantization**. Taking a linear layer (input channel = $C_{in}$, output channel = $C_{out}$) as an example, when *group_size = -1*, quantization calculates $C_{out}$ quantization parameters in total. Otherwise, when *group_size = gs*, quantization parameters are calculated for every $gs$ elements along the input channel, leading to $C_{out} \times (C_{in} / gs)$ quantization parameters in total.
- 4-bit NormalFloat(NF4) is proposed in QLoRA[5]. 'fp4' includes [fp4_e2m1](../../neural_compressor/adaptor/torch_utils/weight_only.py#L37) and [fp4_e2m1_bnb](https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L735). By default, fp4 refers to fp4_e2m1_bnb.
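These common fields map directly onto each algorithm's config object. A minimal sketch using `RTNConfig` (the argument values chosen here are illustrative):

```python
from neural_compressor.torch.quantization import RTNConfig

# 4-bit symmetric integer weights with one scale per 128 input-channel elements.
quant_config = RTNConfig(dtype="int", bits=4, use_sym=True, group_size=128)
```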


#### RTN
| rtn_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| group_dim (int) | Dimension for grouping | Default is 1 |
| use_full_range (bool) | Enables full range for activations | Default is False |
| use_mse_search (bool) | Enables mean squared error (MSE) search | Default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| model_path (str) | Model path that is used to load state_dict per layer | |

``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, RTNConfig

quant_config = RTNConfig()
model = prepare(model, quant_config)
model = convert(model)
```

#### GPTQ
| gptq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| use_mse_search (bool) | Enables mean squared error (MSE) search | Default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| model_path (str) | Model path that is used to load state_dict per layer | |
| use_double_quant (bool) | Enables double quantization | Default is False |
| act_order (bool) | Whether to sort Hessian's diagonal values to rearrange channel-wise quantization order | Default is False |
| percdamp (float) | Percentage of Hessian's diagonal values' average, which will be added to Hessian's diagonal to increase numerical stability | Default is 0.01. |
| block_size (int) | Execute GPTQ quantization per block, block shape = [C_out, block_size] | Default is 128 |
| static_groups (bool) | Whether to calculate group wise quantization parameters in advance. This option mitigate actorder's extra computational requirements. | Default is False. |

``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, GPTQConfig

quant_config = GPTQConfig()
model = prepare(model, quant_config)
run_fn(model)  # run_fn: user-defined calibration function
model = convert(model)
```
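Note that `act_order` and `static_groups` interact: sorting channels by the Hessian diagonal changes the quantization order, while precomputing group-wise parameters keeps that reordering cheap at runtime. A sketch with illustrative values:

```python
from neural_compressor.torch.quantization import GPTQConfig

# Hessian-ordered quantization with precomputed group parameters (illustrative values).
quant_config = GPTQConfig(act_order=True, static_groups=True, block_size=128, percdamp=0.01)
```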

#### AutoRound

| autoround_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| enable_full_range (bool) | Whether to enable full range quantization | Default is False |
| batch_size (int) | Batch size for training | Default is 8 |
| lr_scheduler | The learning rate scheduler to be used | None |
| enable_quanted_input (bool) | Whether to use quantized input data | Default is True |
| enable_minmax_tuning (bool) | Whether to enable min-max tuning | Default is True |
| lr (float) | The learning rate | Default is 0 |
| minmax_lr (float) | The learning rate for min-max tuning | Default is None |
| low_gpu_mem_usage (bool) | Whether to use low GPU memory | Default is True |
| iters (int) | Number of iterations | Default is 200 |
| seqlen (int) | Length of the sequence | Default is 2048 |
| n_samples (int) | Number of samples | Default is 512 |
| sampler (str) | The sampling method | Default is "rand" |
| seed (int) | The random seed | Default is 42 |
| n_blocks (int) | Number of blocks | Default is 1 |
| gradient_accumulate_steps (int) | Number of gradient accumulation steps | Default is 1 |
| not_use_best_mse (bool) | Whether to use mean squared error | Default is False |
| dynamic_max_gap (int) | The dynamic maximum gap | Default is -1 |
| scale_dtype (str) | The data type of quantization scale to be used, different kernels have different choices | Default is "float16" |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig

quant_config = AutoRoundConfig()
model = prepare(model, quant_config)
run_fn(model)  # run_fn: user-defined calibration/tuning function
model = convert(model)
```

#### AWQ

| awq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| group_dim (int) | Dimension for grouping | default is 1 |
| use_full_range (bool) | Enables full range for activations | default is False |
| use_mse_search (bool) | Enables mean squared error (MSE) search | default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| use_auto_scale (bool) | Enables best scales search based on activation distribution | default is True |
| use_auto_clip (bool) | Enables clip range search | default is True |
| folding (bool) | Allows inserting a mul before a linear layer when the scale cannot be absorbed by the previous layer | Default is False |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, AWQConfig

quant_config = AWQConfig()
model = prepare(model, quant_config, example_inputs=example_inputs)  # example_inputs: user-provided sample batch
run_fn(model)  # run_fn: user-defined calibration function
model = convert(model)
```

#### TEQ

| teq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| group_dim (int) | Dimension for grouping | default is 1 |
| use_full_range (bool) | Enables full range for activations | default is False |
| use_mse_search (bool) | Enables mean squared error (MSE) search | default is False |
| use_layer_wise (bool) | Enables layer-wise quantization of the model | Default is False |
| use_double_quant (bool) | Enables double quantization | default is False |
| folding (bool) | Allows inserting a mul before a linear layer when the scale cannot be absorbed by the previous layer | Default is False |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, TEQConfig

quant_config = TEQConfig()
model = prepare(model, quant_config, example_inputs=example_inputs)  # example_inputs: user-provided sample batch
train_fn(model)  # train_fn: user-defined function that trains the equivalent transformation
model = convert(model)
```

#### HQQ

| hqq_args | comments | default value |
|----------|-------------|-------------------------------------------------------------------|
| quant_zero (bool) | Whether to quantize zero point | Default is True |
| quant_scale (bool) | Whether to quantize the scale | Default is False |
| scale_quant_group_size (int) | The group size for quantizing scale | Default is 128 |
| skip_lm_head (bool) | Whether to skip quantizing lm_head | Default is True |
``` python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, HQQConfig

quant_config = HQQConfig()
model = prepare(model, quant_config)
model = convert(model)
```
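Quantization rules can also be scoped to individual layers with `set_local`, for example to keep the `lm_head` layer unquantized. A minimal sketch (the `RTNConfig` choice and the `dtype="fp32"` fallback are illustrative):

```python
from neural_compressor.torch.quantization import prepare, convert, RTNConfig

quant_config = RTNConfig()                  # global weight-only settings
lm_head_config = RTNConfig(dtype="fp32")    # keep lm_head in fp32 (illustrative fallback)
quant_config.set_local("lm_head", lm_head_config)

model = prepare(model, quant_config)
model = convert(model)
```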

### Saving and Loading
The saved_results folder contains two files: quantized_model.pt and qconfig.json, and the generated model is a quantized model. The quantized model contains WeightOnlyLinear modules. To support low-memory inference, Intel(R) Neural Compressor implements WeightOnlyLinear, a torch.nn.Module, to compress the fake-quantized fp32 model. Since PyTorch does not provide flexible storage for low-bit data types, WeightOnlyLinear packs the low-bit data (weights and zero points) into a wider integer data type such as torch.int8 or torch.int32. When WeightOnlyLinear is used for inference, it restores the compressed data to float32 and runs the torch linear function.
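To illustrate the packing idea only (this is not Neural Compressor's exact storage layout), the sketch below packs two 4-bit values into each uint8 element and unpacks them again before computation:

```python
import torch

qweight = torch.randint(0, 16, (8,), dtype=torch.uint8)  # eight fake 4-bit values
packed = qweight[0::2] | (qweight[1::2] << 4)             # pack two nibbles per byte

# At inference time the compressed data is unpacked and dequantized back to float.
low = packed & 0xF
high = (packed >> 4) & 0xF
unpacked = torch.stack([low, high], dim=1).flatten()
assert torch.equal(unpacked, qweight)
```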
```python
# Quantization code
from neural_compressor.torch.quantization import prepare, convert, load, RTNConfig

model = convert(prepare(model, RTNConfig()))
model.save("saved_results")  # the save/load argument names shown here are illustrative
loaded_model = load("saved_results", original_model=model)
```
## Examples

Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm) on how to quantize a model with WeightOnlyQuant.

## Reference

[1]. Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022).

[2]. Wei, Xiuying, et al. "Outlier suppression: Pushing the limit of low-bit transformer language models." arXiv preprint arXiv:2209.13325 (2022).

[3]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).

[4]. Frantar, Elias, et al. "Gptq: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).

[5]. Dettmers, Tim, et al. "Qlora: Efficient finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023).

[6]. Cheng, Wenhua, et al. "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs." arXiv preprint arXiv:2309.05516 (2023).

[7]. Badri, Hicham and Shaji, Appu. "Half-Quadratic Quantization of Large Machine Learning Models." [Online] Available: https://mobiusml.github.io/hqq_blog/ (2023).