
Commit

doc update (#1340)
Update SmoothQuant INT8 accuracy
Signed-off-by: Lu, Yintong <[email protected]>
yintong-lu authored Oct 25, 2023
1 parent b065cfc commit 6187c2e
Showing 1 changed file with 18 additions and 15 deletions.
docs/source/smooth_quant.md: 33 changes (18 additions & 15 deletions)
@@ -280,13 +280,15 @@ For most of the models such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced

SmoothQuant method aims to split the quantization difficulty of weight and activation by using a fixed-value $\alpha$ for an entire model. However, as the distributions of activation outliers vary not only across different models but also across different layers within a model, we hereby propose a method to obtain layer-wise optimal $\alpha$ values with the ability to tune automatically.
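
For reference, the quantity that $\alpha$ controls is the per-input-channel smoothing factor from the SmoothQuant formulation, recalled here for convenience; the auto-tuning procedure described below searches this $\alpha$ per layer:

$$
s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}, \qquad j = 1, \ldots, C_{in}
$$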

-Our proposed method consists of 5 major steps:
+Our proposed method consists of 7 major steps:

-- Hook input and output values of all layers using register_forward_hook.
-- Generate a list of $\alpha$ values given user-defined $\alpha$ range and step_sizes.
-- Recalculate smoothing factor given an $\alpha$ value and adjust parameters(weights and activations).
-- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of inputs to predict the layer-wise output corresponding to the given $\alpha$ value.
-- Calculate the mean-squared loss with respect to the actual output value, recover the adjusted parameters and save the layer-wise optimal $\alpha$ values.
+- Hook the input minimum and maximum values of the layers to be smoothed using register_forward_hook.
+- Generate a list of $\alpha$ values from a user-defined range and set a default $\alpha$ value.
+- Calculate the smoothing factor using the default $\alpha$ value, adjust the parameters accordingly, and forward the adjusted model on an input sample.
+- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of activations to predict the output.
+- Calculate the layer-wise loss with respect to the FP32 output, repeat the previous two steps for each $\alpha$ value, and save the layer-wise loss per $\alpha$.
+- Apply the criterion to the input LayerNorm op and obtain the layer-wise optimal $\alpha$ values for a single input sample.
+- Repeat the previous three steps over a number of input samples and save the layer-wise optimal $\alpha$ values.
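
The per-layer $\alpha$ search can be illustrated with the minimal sketch below. It is not the library's implementation: it handles a single Linear layer instead of hooking LayerNorm inputs across a full model, assumes symmetric INT8 fake quantization, and the helper names (`search_layer_alpha`, `fake_quant_per_tensor`, `fake_quant_per_channel`) are hypothetical.

```python
import torch


def fake_quant_per_tensor(x, n_bits=8):
    # Symmetric per-tensor quantize-dequantize.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale


def fake_quant_per_channel(w, n_bits=8):
    # Symmetric per-output-channel quantize-dequantize for a [out, in] weight.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


def search_layer_alpha(linear, calib_x, alphas=(0.3, 0.5, 0.7, 0.9)):
    """Return the alpha whose quantized output has the lowest MSE vs. FP32."""
    w = linear.weight.data                       # [out_features, in_features]
    with torch.no_grad():
        fp32_out = linear(calib_x)               # FP32 reference output
    # Per-input-channel maxima of activations and weights.
    x_max = calib_x.abs().amax(dim=tuple(range(calib_x.dim() - 1)))
    w_max = w.abs().amax(dim=0)
    best_alpha, best_loss = None, float("inf")
    for alpha in alphas:
        # Smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
        s = (x_max.pow(alpha) / w_max.pow(1.0 - alpha)).clamp(min=1e-5)
        x_s, w_s = calib_x / s, w * s            # migrate difficulty to weights
        q_out = torch.nn.functional.linear(
            fake_quant_per_tensor(x_s), fake_quant_per_channel(w_s), linear.bias
        )
        loss = torch.nn.functional.mse_loss(q_out, fp32_out).item()
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss


# Example: pick alpha for one layer from a batch of calibration activations.
layer = torch.nn.Linear(512, 512)
calib_x = torch.randn(8, 512)
alpha, loss = search_layer_alpha(layer, calib_x)
```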



@@ -350,23 +352,24 @@ A list of models that achieved a <1% accuracy drop is shown below.
| LLaMa-13b | 0.7627 | 0.7590 | alpha=0.7, Ipex 2.1 |
| LLaMa-30b | 0.7759 | 0.7840 | alpha=0.7, Ipex 2.1 |
| LLaMa-65b | 0.7908 | 0.7957 | alpha=0.9, Ipex 2.1 |
-| LLaMa-2-7b | 0.7392 | 0.7262/0.7330 | alpha=Auto, Ipex 2.1/Pytorch |
-| LLaMa-2-7b-Chat | 0.7058 | 0.6994 | alpha=Auto, Ipex 2.1 |
-| EleutherAI/gpt-j-6B | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
+| LLaMa-2-7b-hf* | 0.7392 | 0.7335 | alpha=Auto, Ipex 2.1 |
+| LLaMa-2-7b-Chat* | 0.7058 | 0.6994 | alpha=Auto, Ipex 2.1 |
+| EleutherAI/gpt-j-6B* | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
| MBZUAI/LaMini-GPT-124m | 0.3804 | 0.3887 | alpha=0.5, Ipex 2.1 |
| MBZUAI/LaMini-GPT-774m | 0.5048 | 0.5057 | alpha=0.5, Ipex 2.1 |
| MBZUAI/LaMini-GPT-1.5b | 0.5443 | 0.5436 | alpha=0.5, Ipex 2.1 |
| mosaicml/mpt-7b-chat | 0.655 | 0.6499 | alpha=0.7, Ipex 2.1 |
| stabilityai/stablelm-base-alpha-3b | 0.4172 | 0.4149 | alpha=0.6, Ipex 2.1 |
| togethercomputer/RedPajama-INCITE-Base-3B-v1 | 0.6542 | 0.6735 | alpha=0.5, Ipex 2.1 |
-| togethercomputer/RedPajama-INCITE-Chat-3B-v1 | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Instruct-3B-v1 | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Base-7B-v0.1 | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1 | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
-| databricks/dolly-v1-6b | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
-| databricks/dolly-v2-3b | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
+| togethercomputer/RedPajama-INCITE-Chat-3B-v1* | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Instruct-3B-v1* | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Base-7B-v0.1* | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1* | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
+| databricks/dolly-v1-6b* | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
+| databricks/dolly-v2-3b* | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
| tiiuae/falcon-7b-instruct | 0.6437 | 0.6392 | alpha=0.7, Pytorch |

+Please note that for models marked with an asterisk (*), we set all add ops to FP32 during the quantization step to achieve desirable results.
## Example

Users can refer to the [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant/README.md) for how to use smooth quant.
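
For orientation, a hedged sketch of how SmoothQuant is typically enabled through the post-training static quantization API (neural-compressor 2.x) is shown below. The toy model, the dataloader, and the exact recipe keys and arguments are assumptions for illustration; follow the linked README for the authoritative configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

# Toy stand-ins: in practice `model` is the FP32 LLM and the dataloader
# yields real calibration samples.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
calib_dataloader = DataLoader(
    TensorDataset(torch.randn(32, 64), torch.zeros(32)), batch_size=8
)

# Enable SmoothQuant via the PTQ recipes; "alpha" may be a float or "auto".
conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```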
