
Commit

doc update (#1340)
Update SmoothQuant INT8 accuracy
Signed-off-by: Lu, Yintong <[email protected]>
yintong-lu authored Oct 25, 2023
1 parent b065cfc commit 6187c2e
Showing 1 changed file with 18 additions and 15 deletions.
docs/source/smooth_quant.md: 33 changes (18 additions & 15 deletions)
@@ -280,13 +280,15 @@ For most of the models such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced

SmoothQuant method aims to split the quantization difficulty of weight and activation by using a fixed-value $\alpha$ for an entire model. However, as the distributions of activation outliers vary not only across different models but also across different layers within a model, we hereby propose a method to obtain layer-wise optimal $\alpha$ values with the ability to tune automatically.
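
For reference, the quantity that $\alpha$ controls is the per-input-channel smoothing factor from the SmoothQuant formulation, recalled here for convenience; the auto-tuning procedure described below searches this $\alpha$ per layer:

$$
s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}, \qquad j = 1, \ldots, C_{in}
$$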

-Our proposed method consists of 5 major steps:
+Our proposed method consists of 7 major steps:

-- Hook input and output values of all layers using register_forward_hook.
-- Generate a list of $\alpha$ values given user-defined $\alpha$ range and step_sizes.
-- Recalculate smoothing factor given an $\alpha$ value and adjust parameters(weights and activations).
-- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of inputs to predict the layer-wise output corresponding to the given $\alpha$ value.
-- Calculate the mean-squared loss with respect to the actual output value, recover the adjusted parameters and save the layer-wise optimal $\alpha$ values.
+- Hook the input minimum and maximum values of the layers to be smoothed using register_forward_hook.
+- Generate a list of $\alpha$ values from a user-defined range and set a default $\alpha$ value.
+- Calculate the smoothing factor using the default $\alpha$ value, adjust the parameters accordingly, and forward the adjusted model on an input sample.
+- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of activations to predict the output.
+- Calculate the layer-wise loss with respect to the FP32 output, repeat the previous two steps for each $\alpha$ value, and save the layer-wise loss per $\alpha$.
+- Apply the criterion to the input LayerNorm op and obtain the layer-wise optimal $\alpha$ values for a single input sample.
+- Repeat the previous three steps over a number of input samples and save the layer-wise optimal $\alpha$ values.
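
The per-layer $\alpha$ search can be illustrated with the minimal sketch below. It is not the library's implementation: it handles a single Linear layer instead of hooking LayerNorm inputs across a full model, assumes symmetric INT8 fake quantization, and the helper names (`search_layer_alpha`, `fake_quant_per_tensor`, `fake_quant_per_channel`) are hypothetical.

```python
import torch


def fake_quant_per_tensor(x, n_bits=8):
    # Symmetric per-tensor quantize-dequantize.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale


def fake_quant_per_channel(w, n_bits=8):
    # Symmetric per-output-channel quantize-dequantize for a [out, in] weight.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


def search_layer_alpha(linear, calib_x, alphas=(0.3, 0.5, 0.7, 0.9)):
    """Return the alpha whose quantized output has the lowest MSE vs. FP32."""
    w = linear.weight.data                       # [out_features, in_features]
    with torch.no_grad():
        fp32_out = linear(calib_x)               # FP32 reference output
    # Per-input-channel maxima of activations and weights.
    x_max = calib_x.abs().amax(dim=tuple(range(calib_x.dim() - 1)))
    w_max = w.abs().amax(dim=0)
    best_alpha, best_loss = None, float("inf")
    for alpha in alphas:
        # Smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
        s = (x_max.pow(alpha) / w_max.pow(1.0 - alpha)).clamp(min=1e-5)
        x_s, w_s = calib_x / s, w * s            # migrate difficulty to weights
        q_out = torch.nn.functional.linear(
            fake_quant_per_tensor(x_s), fake_quant_per_channel(w_s), linear.bias
        )
        loss = torch.nn.functional.mse_loss(q_out, fp32_out).item()
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss


# Example: pick alpha for one layer from a batch of calibration activations.
layer = torch.nn.Linear(512, 512)
calib_x = torch.randn(8, 512)
alpha, loss = search_layer_alpha(layer, calib_x)
```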



@@ -350,23 +352,24 @@ A list of models that achieved a <1% accuracy drop is shown below.
| LLaMa-13b | 0.7627 | 0.7590 | alpha=0.7, Ipex 2.1 |
| LLaMa-30b | 0.7759 | 0.7840 | alpha=0.7, Ipex 2.1 |
| LLaMa-65b | 0.7908 | 0.7957 | alpha=0.9, Ipex 2.1 |
-| LLaMa-2-7b | 0.7392 | 0.7262/0.7330 | alpha=Auto, Ipex 2.1/Pytorch |
-| LLaMa-2-7b-Chat | 0.7058 | 0.6994 | alpha=Auto, Ipex 2.1 |
-| EleutherAI/gpt-j-6B | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
+| LLaMa-2-7b-hf* | 0.7392 | 0.7335 | alpha=Auto, Ipex 2.1 |
+| LLaMa-2-7b-Chat* | 0.7058 | 0.6994 | alpha=Auto, Ipex 2.1 |
+| EleutherAI/gpt-j-6B* | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
| MBZUAI/LaMini-GPT-124m | 0.3804 | 0.3887 | alpha=0.5, Ipex 2.1 |
| MBZUAI/LaMini-GPT-774m | 0.5048 | 0.5057 | alpha=0.5, Ipex 2.1 |
| MBZUAI/LaMini-GPT-1.5b | 0.5443 | 0.5436 | alpha=0.5, Ipex 2.1 |
| mosaicml/mpt-7b-chat | 0.655 | 0.6499 | alpha=0.7, Ipex 2.1 |
| stabilityai/stablelm-base-alpha-3b | 0.4172 | 0.4149 | alpha=0.6, Ipex 2.1 |
| togethercomputer/RedPajama-INCITE-Base-3B-v1 | 0.6542 | 0.6735 | alpha=0.5, Ipex 2.1 |
-| togethercomputer/RedPajama-INCITE-Chat-3B-v1 | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Instruct-3B-v1 | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Base-7B-v0.1 | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1 | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
-| databricks/dolly-v1-6b | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
-| databricks/dolly-v2-3b | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
+| togethercomputer/RedPajama-INCITE-Chat-3B-v1* | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Instruct-3B-v1* | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Base-7B-v0.1* | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1* | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
+| databricks/dolly-v1-6b* | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
+| databricks/dolly-v2-3b* | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
| tiiuae/falcon-7b-instruct | 0.6437 | 0.6392 | alpha=0.7, Pytorch |

+Please note that for models marked with an asterisk (*), we set all add ops to FP32 during the quantization step to achieve desirable results.
## Example

Users can refer to the [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant/README.md) for how to use smooth quant.
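
For orientation, a hedged sketch of how SmoothQuant is typically enabled through the post-training static quantization API (neural-compressor 2.x) is shown below. The toy model, the dataloader, and the exact recipe keys and arguments are assumptions for illustration; follow the linked README for the authoritative configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

# Toy stand-ins: in practice `model` is the FP32 LLM and the dataloader
# yields real calibration samples.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
calib_dataloader = DataLoader(
    TensorDataset(torch.randn(32, 64), torch.zeros(32)), batch_size=8
)

# Enable SmoothQuant via the PTQ recipes; "alpha" may be a float or "auto".
conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```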
