diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
index 62c51476f..efc5c0451 100644
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -9,7 +9,7 @@ Contribution Guidelines
7. [Contributor Covenant Code of Conduct](#contributor-covenant-code-of-conduct)

## Create Pull Request
-If you have improvements to ONNX Neural Compressor, send your pull requests for
+If you have improvements to Neural Compressor, send your pull requests for
[review](https://github.com/onnx/neural-compressor/pulls). If you are new to GitHub, view the pull request [How To](https://help.github.com/articles/using-pull-requests/).
### Step-by-Step guidelines
@@ -27,7 +27,7 @@ If you are new to GitHub, view the pull request [How To](https://help.github.com
Before sending your pull requests, follow the information below:

- Add unit tests in [Unit Tests](https://github.com/onnx/neural-compressor/tree/main/test) to cover the code you would like to contribute.
-- ONNX Neural Compressor has adopted the [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin), you must agree to the terms of Developer Certificate of Origin by signing off each of your commits with `-s`, e.g. `git commit -s -m 'This is my commit message'`.
+- Neural Compressor has adopted the [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin); you must agree to the terms of the Developer Certificate of Origin by signing off each of your commits with `-s`, e.g. `git commit -s -m 'This is my commit message'`.

## Pull Request Template

@@ -43,7 +43,7 @@ See [PR template](/.github/pull_request_template.md)
- Third-party dependency license compatible

## Pull Request Status Checks Overview
-ONNX Neural Compressor use [Azure DevOps](https://learn.microsoft.com/en-us/azure/devops/pipelines/?view=azure-devops) for CI test.
+Neural Compressor uses [Azure DevOps](https://learn.microsoft.com/en-us/azure/devops/pipelines/?view=azure-devops) for CI tests.
And generally use [Azure Cloud Instance](https://azure.microsoft.com/en-us/pricing/purchase-options/pay-as-you-go) to deploy pipelines, e.g. Standard E16s v5.
| Test Name | Test Scope | Test Pass Criteria |
|-------------------------------|-----------------------------------------------|---------------------------|
diff --git a/docs/autotune.md b/docs/autotune.md
new file mode 100644
index 000000000..715b27884
--- /dev/null
+++ b/docs/autotune.md
@@ -0,0 +1,75 @@
+AutoTune
+========================================
+
+1. [Overview](#overview)
+2. [How it Works](#how-it-works)
+3. [Working with Autotune](#working-with-autotune)
+4. [Get Started](#get-started)
+
+
+## Overview
+
+Neural Compressor aims to help users quickly deploy low-precision models by leveraging popular compression techniques, such as post-training quantization and weight-only quantization algorithms. Despite the variety of these algorithms, finding the appropriate configuration for a model can be difficult and time-consuming. To address this, we built the `autotune` module, which identifies the best algorithm configuration for a model to achieve optimal performance while meeting the given accuracy criteria. This module allows users to easily use predefined tuning recipes and to customize the tuning space as needed.
+
+## How it Works
+
+The autotune module constructs the tuning space from the pre-defined tuning set or the user's tuning set.
+It iterates over the tuning space, applies each configuration to the given float model, and then records and compares the evaluation result with the baseline. The tuning process stops when the exit policy is met.
+The workflow is as below:
+
+![Workflow](./imgs/workflow.png)
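+
+The sketch below illustrates this loop in simplified form. It is explanatory pseudocode, not the actual `tuning.autotune` implementation, and `apply_config` is a hypothetical helper that quantizes the model with a single candidate configuration:
+
+```python
+def autotune_sketch(float_model, config_set, eval_fn, tolerable_loss, max_trials):
+    baseline = eval_fn(float_model)                    # evaluate the original float model first
+    for trial, cfg in enumerate(config_set, start=1):  # traversal order is decided by the sampler
+        candidate = apply_config(float_model, cfg)     # quantize with one candidate configuration
+        accuracy = eval_fn(candidate)                  # evaluate the quantized candidate
+        if accuracy >= baseline * (1 - tolerable_loss):
+            return candidate                           # exit policy: accuracy goal reached
+        if trial >= max_trials:
+            break                                      # exit policy: trial budget exhausted
+    return None                                        # no candidate met the accuracy goal
+```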
+
+
+## Working with Autotune
+
+The `autotune` API can be used across all algorithms supported by Neural Compressor. It accepts three primary arguments: `model_input`, `tune_config`, and `eval_fn`.
+
+The `TuningConfig` class defines the tuning process, including the tuning space, order, and exit policy.
+
+- Define the tuning space
+
+  Users can define the tuning space by setting `config_set` with an algorithm configuration or a set of configurations.
+  ```python
+  # Use the default tuning space
+  config_set = config.get_woq_tuning_config()
+
+  # Customize the tuning space with one algorithm configuration
+  config_set = config.RTNConfig(weight_sym=False, weight_group_size=[32, 64])
+
+  # Customize the tuning space with two algorithm configurations
+  config_set = [
+      config.RTNConfig(weight_sym=False, weight_group_size=32),
+      config.GPTQConfig(weight_group_size=128, weight_sym=False),
+  ]
+  ```
+
+- Define the tuning order
+
+  The tuning order determines how the process traverses the tuning space and samples configurations. Users can customize it by configuring the `sampler`. Currently, we provide the [`default_sampler`](https://github.com/onnx/neural-compressor/blob/main/onnx_neural_compressor/quantization/tuning.py#L210), which samples configurations sequentially, always in the same order.
+
+- Define the exit policy
+
+  The exit policy includes two components: the accuracy goal (`tolerable_loss`) and the allowed number of trials (`max_trials`). The tuning process stops when either condition is met.
+
+## Get Started
+The example below demonstrates how to autotune an ONNX model over four `RTNConfig` configurations.
+
+```python
+from onnx_neural_compressor.quantization import config, tuning
+
+
+def eval_fn(model) -> float:
+    return ...
+
+
+tune_config = tuning.TuningConfig(
+    config_set=config.RTNConfig(
+        weight_sym=[False, True],
+        weight_group_size=[32, 128]
+    ),
+    tolerable_loss=0.2,
+    max_trials=10,
+)
+q_model = tuning.autotune(model, tune_config=tune_config, eval_fn=eval_fn)
+```
\ No newline at end of file
diff --git a/docs/calibration.md b/docs/calibration.md
index 33914cb38..8de90fd80 100644
--- a/docs/calibration.md
+++ b/docs/calibration.md
@@ -10,7 +10,7 @@ Quantization proves beneficial in terms of reducing the memory and computational
## Calibration Algorithms

-Currently, ONNX Neural Compressor supports three popular calibration algorithms:
+Currently, Neural Compressor supports three popular calibration algorithms:

- MinMax: This method gets the maximum and minimum of input values as $α$ and $β$ [^1]. It preserves the entire range and is the simplest approach.

@@ -18,7 +18,7 @@ Currently, ONNX Neural Compressor supports three popular calibration algorithms:
- Percentile: This method only considers a specific percentage of values for calculating the range, ignoring the remainder which may contain outliers [^3]. It enhances resolution by excluding extreme values but still retaining noteworthy data.

-> `kl` is used to represent the Entropy calibration algorithm in ONNX Neural Compressor.
+> `kl` is used to represent the Entropy calibration algorithm in Neural Compressor.
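+
+To make the difference concrete, below is a small illustrative sketch (plain NumPy, for explanation only; it is not the library's implementation) of how the MinMax and Percentile methods derive the calibration range from collected activation values. The Entropy (`kl`) method instead searches for the clipping threshold whose quantized distribution has the smallest KL divergence from the original distribution.
+
+```python
+import numpy as np
+
+values = np.random.randn(100_000).astype(np.float32)  # activation values gathered during calibration
+
+# MinMax: keep the full observed range.
+min_max_range = (float(values.min()), float(values.max()))
+
+# Percentile (e.g. 99.99%): clip both tails so rare outliers do not stretch the range.
+p = 99.99
+percentile_range = (float(np.percentile(values, 100.0 - p)), float(np.percentile(values, p)))
+```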
## Reference
diff --git a/docs/design.md b/docs/design.md
index 833d59d38..5b6f50b0b 100644
--- a/docs/design.md
+++ b/docs/design.md
@@ -1,6 +1,6 @@
Design
=====
-ONNX Neural Compressor features an architecture and workflow that aids in increasing performance and faster deployments across infrastructures.
+Neural Compressor features an architecture and workflow that help increase performance and speed up deployments across infrastructures.

## Architecture
diff --git a/docs/imgs/workflow.png b/docs/imgs/workflow.png
index b45df9547..87be8e660 100644
Binary files a/docs/imgs/workflow.png and b/docs/imgs/workflow.png differ
diff --git a/docs/installation_guide.md b/docs/installation_guide.md
index 5337808de..79c2e8c0b 100644
--- a/docs/installation_guide.md
+++ b/docs/installation_guide.md
@@ -40,7 +40,7 @@ The following prerequisites and requirements must be satisfied for a successful
## System Requirements

### Validated Hardware Environment
-#### ONNX Neural Compressor supports CPUs based on [Intel 64 architecture or compatible processors](https://en.wikipedia.org/wiki/X86-64):
+#### Neural Compressor supports CPUs based on [Intel 64 architecture or compatible processors](https://en.wikipedia.org/wiki/X86-64):

* Intel Xeon Scalable processor (formerly Skylake, Cascade Lake, Cooper Lake, Ice Lake, and Sapphire Rapids)
* Intel Xeon CPU Max Series (formerly Sapphire Rapids HBM)
diff --git a/docs/quantization.md b/docs/quantization.md
index 39f274c4e..723d72b07 100644
--- a/docs/quantization.md
+++ b/docs/quantization.md
@@ -3,12 +3,14 @@ Quantization

1. [Quantization Introduction](#quantization-introduction)
2. [Quantization Fundamentals](#quantization-fundamentals)
-3. [Accuracy Aware Tuning](#with-or-without-accuracy-aware-tuning)
-4. [Get Started](#get-started)
-   4.1 [Post Training Quantization](#post-training-quantization)
-   4.2 [Specify Quantization Rules](#specify-quantization-rules)
-   4.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
-5. [Examples](#examples)
+3. [Get Started](#get-started)
+
+   3.1 [Post Training Quantization](#post-training-quantization)
+
+   3.2 [Specify Quantization Rules](#specify-quantization-rules)
+
+   3.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
+4. [Examples](#examples)

## Quantization Introduction

@@ -18,19 +20,19 @@ Quantization is a very popular deep learning model optimization technique invent
`Affine quantization` and `Scale quantization` are two common range mapping techniques used in tensor conversion between different data types.

-The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$.
+The math equation is: $X_{int8} = round(X_{fp32} / Scale + ZeroPoint)$.

**Affine Quantization**

-This is so-called `asymmetric quantization`, in which we map the min/max range in the float tensor to the integer range. Here int8 range is [-128, 127], uint8 range is [0, 255].
+This is the so-called `asymmetric quantization`, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

here:

-If INT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 127$ and $ZeroPoint = -128 - X_{f_{min}} / Scale$.
+If INT8 is specified, $Scale = (|X_{max} - X_{min}|) / 127$ and $ZeroPoint = -128 - X_{min} / Scale$.

or

-If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$.
+If UINT8 is specified, $Scale = (|X_{max} - X_{min}|) / 255$ and $ZeroPoint = - X_{min} / Scale$.
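+
+As a quick worked example (illustrative Python, not code from the library), the snippet below applies the uint8 affine formulas above to a tensor whose observed float range is [-1.2, 2.4]:
+
+```python
+x_min, x_max = -1.2, 2.4            # observed float range of the tensor
+
+scale = abs(x_max - x_min) / 255    # ≈ 0.0141
+zero_point = round(-x_min / scale)  # = 85
+
+def quantize_u8(x):
+    # round to the nearest uint8 value and clamp to [0, 255]
+    return max(0, min(255, round(x / scale + zero_point)))
+
+print(quantize_u8(x_min), quantize_u8(0.8), quantize_u8(x_max))  # 0 142 255
+```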

**Scale Quantization**

@@ -40,11 +42,11 @@ The math equation is like:

here:

-If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$.
+If INT8 is specified, $Scale = max(abs(X_{max}), abs(X_{min})) / 127$ and $ZeroPoint = 0$.

or

-If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ and $ZeroPoint = 128$.
+If UINT8 is specified, $Scale = max(abs(X_{max}), abs(X_{min})) / 255$ and $ZeroPoint = 128$.

*NOTE*

@@ -54,29 +56,25 @@ Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 dat
| Framework | Backend Library |  Symmetric Quantization | Asymmetric Quantization |
| :-------------- |:---------------:| ---------------:|---------------:|
-| ONNX Runtime | [MLAS](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/mlas) | Weight (int8) | Activation (uint8) |
+| ONNX Runtime | [MLAS](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/mlas) | Activation (int8/uint8), Weight (int8/uint8) | Activation (int8/uint8), Weight (int8/uint8) |

+> ***Note***
+>
+> Activation (uint8) + Weight (int8) is recommended for performance on x86-64 machines with AVX2 and AVX512 extensions.

-#### Quantization Scheme
-+ Symmetric Quantization
-  + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
-+ Asymmetric Quantization
-  + uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)

#### Reference
+ MLAS: [MLAS Quantization](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/onnx_quantizer.py)

### Quantization Approaches

-Quantization has three different approaches:
+Quantization has two different approaches, both of which optimize the model for inference:

1) post training dynamic quantization
2) post training static quantization

-The first two approaches belong to optimization on inference. The last belongs to optimization during training. Currently. ONNX Runtime doesn't support the last one.
-
#### Post Training Dynamic Quantization

-The weights of the neural network get quantized into int8 format from float32 format offline. The activations of the neural network is quantized as well with the min/max range collected during inference runtime.
+The weights of the neural network are quantized offline from float32 into an 8-bit format. The activations are quantized as well, using the min/max ranges collected at inference runtime.

This approach is widely used in dynamic length neural networks, like NLP model.

@@ -86,42 +84,20 @@ Compared with `post training dynamic quantization`, the min/max range in weights
This approach is major quantization approach people should try because it could provide the better performance comparing with `post training dynamic quantization`.

-## With or Without Accuracy Aware Tuning
-
-Accuracy aware tuning is one of unique features provided by Neural Compressor, compared with other 3rd party model compression tools. This feature can be used to solve accuracy loss pain points brought by applying low precision quantization and other lossy optimization methods.
-
-This tuning algorithm creates a tuning space based on user-defined configurations, generates quantized graph, and evaluates the accuracy of this quantized graph. The optimal model will be yielded if the pre-defined accuracy goal is met.
-
-Neural compressor also support to quantize all quantizable ops without accuracy tuning, user can decide whether to tune the model accuracy or not. Please refer to "Get Start" below.
-
-### Working Flow
-
-Currently `accuracy aware tuning` only supports `post training quantization`.
-
-User could refer to below chart to understand the whole tuning flow.
-
-accuracy aware tuning working flow
-
-
## Get Started

-The design philosophy of the quantization interface of ONNX Neural Compressor is easy-of-use. It requests user to provide `model`, `calibration dataloader`, and `evaluation function`. Those parameters would be used to quantize and tune the model.
+The design philosophy of the quantization interface of Neural Compressor is ease of use. It requires the user to provide `model_input`, `model_output`, and `quant_config`. These parameters are used to quantize and save the model.

-`model` is the framework model location or the framework model object.
+`model_input` is the ONNX model location or the ONNX model object.

-`calibration dataloader` is used to load the data samples for calibration phase. In most cases, it could be the partial samples of the evaluation dataset.
+`model_output` is the path where the quantized ONNX model is saved.

-If a user needs to tune the model accuracy, the user should provide `evaluation function`.
+`quant_config` is the configuration that defines how the model is quantized.

-`evaluation function` is a function used to evaluate model accuracy. It is a optional. This function should be same with how user makes evaluation on fp32 model, just taking `model` as input and returning a scalar value represented the evaluation accuracy.
+Users can leverage Neural Compressor to directly generate a fully quantized model without accuracy validation. Currently, Neural Compressor supports `Post Training Static Quantization` and `Post Training Dynamic Quantization`.

-User could execute:
### Post Training Quantization

-1. Without Accuracy Aware Tuning
-
-This means user could leverage ONNX Neural Compressor to directly generate a fully quantized model without accuracy aware tuning. It's user responsibility to ensure the accuracy of the quantized model meets expectation. ONNX Neural Compressor supports `Post Training Static Quantization` and `Post Training Dynamic Quantization`.
-
``` python
from onnx_neural_compressor.quantization import quantize, config
from onnx_neural_compressor import data_reader
@@ -138,47 +114,15 @@ qconfig = config.StaticQuantConfig(calibration_data_reader) # or qconfig = Dyna
quantize(model, q_model_path, qconfig)
```
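+
+For static quantization, `calibration_data_reader` feeds calibration samples to the quantizer. A minimal sketch of a concrete reader is shown below; it assumes, as in ONNX Runtime's calibration interface, that `get_next` returns a dict mapping input names to numpy arrays and `None` once the data is exhausted (the input name, shape, and random data are placeholders to adapt to your own model):
+
+```python
+import numpy as np
+from onnx_neural_compressor import data_reader
+
+
+class RandomDataReader(data_reader.CalibrationDataReader):
+    def __init__(self, input_name="input", num_samples=32):
+        # In practice, load a small representative slice of your real dataset instead of random data.
+        self._samples = [
+            {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)} for _ in range(num_samples)
+        ]
+        self._iter = iter(self._samples)
+
+    def get_next(self):
+        return next(self._iter, None)  # None signals that the calibration data is exhausted
+
+    def rewind(self):
+        self._iter = iter(self._samples)  # restart iteration from the first sample
+```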
-2. With Accuracy Aware Tuning
-
-This means user could leverage the advance feature of ONNX Neural Compressor to tune out a best quantized model which has best accuracy and good performance. User should provide `eval_fn`.
-
-``` python
-from onnx_neural_compressor import data_reader
-from onnx_neural_compressor.quantization import tuning, config
-
-class DataReader(data_reader.CalibrationDataReader):
-    def get_next(self): ...
-
-    def rewind(self): ...
-
-
-data_reader = DataReader()
-
-# TuningConfig can accept:
-# 1) a set of candidate configs like tuning.TuningConfig(config_set=[config.RTNConfig(weight_bits=4), config.GPTQConfig(weight_bits=4)])
-# 2) one config with a set of candidate parameters like tuning.TuningConfig(config_set=[config.GPTQConfig(weight_group_size=[32, 64])])
-# 3) our pre-defined config set like tuning.TuningConfig(config_set=config.get_woq_tuning_config())
-custom_tune_config = tuning.TuningConfig(config_set=[config.RTNConfig(weight_bits=4), config.GPTQConfig(weight_bits=4)])
-best_model = tuning.autotune(
-    model_input=model,
-    tune_config=custom_tune_config,
-    eval_fn=eval_fn,
-    calibration_data_reader=data_reader,
-)
-```
-
### Specify Quantization Rules
-ONNX Neural Compressor support specify quantization rules by operator name. Users can use `set_local` API of configs to achieve the above purpose by below code:
+Neural Compressor supports specifying quantization rules by operator name. Users can use the `set_local` API of the configs to do this, as shown below:

```python
-fp32_config = config.GPTQConfig(weight_dtype="fp32")
-quant_config = config.GPTQConfig(
-    weight_bits=4,
-    weight_dtype="int",
-    weight_sym=False,
-    weight_group_size=32,
+op_config = config.StaticQuantConfig(per_channel=False)
+quant_config = config.StaticQuantConfig(
+    per_channel=True,
)
-quant_config.set_local("/h.4/mlp/fc_out/MatMul", fp32_config)
+quant_config.set_local("/h.4/mlp/fc_out/MatMul", op_config)
```

@@ -235,4 +179,4 @@ Neural-Compressor will quantized models with user-specified backend or detecting

## Examples

-User could refer to [examples](../../examples/onnxrt) on how to quantize a new model.
+Users can refer to [examples](../../examples) for how to quantize a new model.
diff --git a/onnx_neural_compressor/__init__.py b/onnx_neural_compressor/__init__.py
index 2175e2eba..b514e101a 100644
--- a/onnx_neural_compressor/__init__.py
+++ b/onnx_neural_compressor/__init__.py
@@ -11,4 +11,4 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-"""ONNX Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNXRuntime Framework."""
+"""Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNX models."""
diff --git a/onnx_neural_compressor/version.py b/onnx_neural_compressor/version.py
index 08d071fc2..02d80e0b1 100644
--- a/onnx_neural_compressor/version.py
+++ b/onnx_neural_compressor/version.py
@@ -11,5 +11,5 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-"""ONNX Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNX."""
+"""Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNX models."""
__version__ = "1.0"