From 6187c2eb9bf0902179ad68173c0941c7399ff85e Mon Sep 17 00:00:00 2001
From: yintong-lu <108845308+yintong-lu@users.noreply.github.com>
Date: Wed, 25 Oct 2023 15:04:34 +0800
Subject: [PATCH] doc update (#1340)

Update SmoothQuant INT8 accuracy

Signed-off-by: Lu, Yintong
---
 docs/source/smooth_quant.md | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/docs/source/smooth_quant.md b/docs/source/smooth_quant.md
index f052d92e9d8..332cda493f4 100644
--- a/docs/source/smooth_quant.md
+++ b/docs/source/smooth_quant.md
@@ -280,13 +280,15 @@ For most of the models such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced
 
 SmoothQuant method aims to split the quantization difficulty of weight and activation by using a fixed-value $\alpha$ for an entire model. However, as the distributions of activation outliers vary not only across different models but also across different layers within a model, we hereby propose a method to obtain layer-wise optimal $\alpha$ values with the ability to tune automatically.
 
-Our proposed method consists of 5 major steps:
+Our proposed method consists of 7 major steps:
 
-- Hook input and output values of all layers using register_forward_hook.
-- Generate a list of $\alpha$ values given user-defined $\alpha$ range and step_sizes.
-- Recalculate smoothing factor given an $\alpha$ value and adjust parameters(weights and activations).
-- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of inputs to predict the layer-wise output corresponding to the given $\alpha$ value.
-- Calculate the mean-squared loss with respect to the actual output value, recover the adjusted parameters and save the layer-wise optimal $\alpha$ values.
+- Hook the input minimum and maximum values of the layers to be smoothed using register_forward_hook.
+- Generate a list of $\alpha$ values over a user-defined range and set a default $\alpha$ value.
+- Calculate the smoothing factor using the default $\alpha$ value, adjust the parameters accordingly and forward the adjusted model on an input sample.
+- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of activations to predict the output.
+- Calculate the layer-wise loss with respect to the FP32 output, iterate the previous two steps for each $\alpha$ value and save the layer-wise loss per $\alpha$.
+- Apply the criterion to the input LayerNorm op and obtain the layer-wise optimal $\alpha$ values for a single input sample.
+- Iterate the previous three steps over a number of input samples and save the layer-wise optimal $\alpha$ values.
 
 
 
@@ -350,23 +352,24 @@ A list of models that achieved a <1% accuracy drop is shown below.
 | LLaMa-13b | 0.7627 | 0.7590 | alpha=0.7, Ipex 2.1 |
 | LLaMa-30b | 0.7759 | 0.7840 | alpha=0.7, Ipex 2.1 |
 | LLaMa-65b | 0.7908 | 0.7957 | alpha=0.9, Ipex 2.1 |
-| LLaMa-2-7b | 0.7392 | 0.7262/0.7330 | alpha=Auto, Ipex 2.1/Pytorch |
-| LLaMa-2-7b-Chat | 0.7058 | 0.6994 | alpha=Auto, Ipex 2.1 |
-| EleutherAI/gpt-j-6B | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
+| LLaMa-2-7b-hf* | 0.7392 | 0.7335 | alpha=Auto, Ipex 2.1 |
+| LLaMa-2-7b-Chat* | 0.7058 | 0.6994 | alpha=Auto, Ipex 2.1 |
+| EleutherAI/gpt-j-6B* | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
 | MBZUAI/LaMini-GPT-124m | 0.3804 | 0.3887 | alpha=0.5, Ipex 2.1 |
 | MBZUAI/LaMini-GPT-774m | 0.5048 | 0.5057 | alpha=0.5, Ipex 2.1 |
 | MBZUAI/LaMini-GPT-1.5b | 0.5443 | 0.5436 | alpha=0.5, Ipex 2.1 |
 | mosaicml/mpt-7b-chat | 0.655 | 0.6499 | alpha=0.7, Ipex 2.1 |
 | stabilityai/stablelm-base-alpha-3b | 0.4172 | 0.4149 | alpha=0.6, Ipex 2.1 |
 | togethercomputer/RedPajama-INCITE-Base-3B-v1 | 0.6542 | 0.6735 | alpha=0.5, Ipex 2.1 |
-| togethercomputer/RedPajama-INCITE-Chat-3B-v1 | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Instruct-3B-v1 | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Base-7B-v0.1 | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
-| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1 | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
-| databricks/dolly-v1-6b | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
-| databricks/dolly-v2-3b | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
+| togethercomputer/RedPajama-INCITE-Chat-3B-v1* | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Instruct-3B-v1* | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Base-7B-v0.1* | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
+| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1* | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
+| databricks/dolly-v1-6b* | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
+| databricks/dolly-v2-3b* | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
 | tiiuae/falcon-7b-instruct | 0.6437 | 0.6392 | alpha=0.7, Pytorch |
+Please note that for models marked with an asterisk (*), all add ops were set to FP32 during the quantization step to achieve desirable accuracy.
 
 ## Example
 
 User could refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant/README.md) on how to use smooth quant.
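
For reviewers, here is a minimal Python sketch of the layer-wise alpha auto-tuning loop outlined in the updated 7-step description in the first hunk. It is an illustration only, not the neural_compressor implementation: all helper names (`fake_quant_weight_per_channel`, `fake_quant_act_per_tensor`, `auto_tune_alpha`) are hypothetical, it handles a single Linear layer, and for brevity it keeps whole calibration activations instead of only the hooked min/max statistics.

```python
# Hypothetical sketch of per-layer alpha search for SmoothQuant auto-tuning.
import torch


def fake_quant_weight_per_channel(weight, n_bits=8):
    """Per-output-channel symmetric quantize-dequantize of a 2-D weight."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale


def fake_quant_act_per_tensor(x, n_bits=8):
    """Per-tensor symmetric quantize-dequantize of an activation tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale


def auto_tune_alpha(linear, calib_inputs, alpha_candidates=(0.3, 0.5, 0.7, 0.9)):
    """Return the alpha that minimizes the quant-dequant loss for one Linear layer.

    `calib_inputs` is a list of [batch, in_features] activations captured with
    register_forward_hook (or a forward pre-hook) on the layer to be smoothed.
    """
    weight = linear.weight.detach()                 # [out_features, in_features]
    act_max = torch.stack([x.abs().amax(dim=0) for x in calib_inputs]).amax(dim=0)
    w_max = weight.abs().amax(dim=0)                # per input channel

    losses = {alpha: 0.0 for alpha in alpha_candidates}
    for x in calib_inputs:
        fp32_out = torch.nn.functional.linear(x, weight, linear.bias)
        for alpha in alpha_candidates:
            # SmoothQuant scale: s_j = max(|X_j|)^alpha / max(|W_j|)^(1 - alpha)
            s = (act_max.pow(alpha) / w_max.pow(1.0 - alpha).clamp(min=1e-8)).clamp(min=1e-8)
            # Smooth, then fake-quantize weights per channel and activations per tensor.
            q_w = fake_quant_weight_per_channel(weight * s)
            q_x = fake_quant_act_per_tensor(x / s)
            q_out = torch.nn.functional.linear(q_x, q_w, linear.bias)
            # Layer-wise loss with respect to the FP32 output, accumulated per alpha.
            losses[alpha] += torch.nn.functional.mse_loss(q_out, fp32_out).item()
    return min(losses, key=losses.get)
```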
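
The linked example README walks through the full recipe; as a quick orientation, a minimal usage sketch along the lines of the 2.x `PostTrainingQuantConfig` recipe looks like the following, where `model` and `calib_dataloader` are user-provided and the exact arguments should be taken from the example, not from this snippet.

```python
# Hypothetical minimal usage sketch; see the linked example README for the
# authoritative arguments and evaluation flow.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    backend="ipex",  # IPEX backend, matching the accuracy table above
    recipes={
        "smooth_quant": True,
        # "alpha" may be a float (e.g. 0.5) or "auto" for layer-wise auto-tuning
        "smooth_quant_args": {"alpha": "auto"},
    },
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```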