
Refine PT2ONNX export #901

Merged
merged 17 commits on May 29, 2023
2 changes: 2 additions & 0 deletions .azure-pipelines/scripts/codeScan/pyspelling/inc_dict.txt
@@ -2613,3 +2613,5 @@ efb
netflix
DeBERTa
unilm
aten
hardswish
241 changes: 107 additions & 134 deletions docs/source/export.md
@@ -9,32 +9,70 @@ Export

4. [Appendix](#appendix)

# Introduction
Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models. Exporting FP32 PyTorch/TensorFlow models has become popular and easy to use. However, for Intel Neural Compressor, we hope to export the INT8 model into the ONNX format to achieve higher applicability in multiple frameworks.

Here we briefly introduce our export API for PyTorch FP32/INT8 models. Note that the INT8 ONNX model is not exported directly from the INT8 PyTorch model; instead, it is quantized after the FP32 ONNX model is obtained with the mature torch.onnx.export API. To ensure that the ONNX quantization process is largely consistent with PyTorch, we reuse three key pieces of information from the Neural Compressor model to perform ONNX quantization.

- Quantized operations: only operations quantized in PyTorch will be quantized during ONNX quantization.
- Scale info: scale information is collected from the PyTorch quantization process.
- Weights of quantization-aware training (QAT): for quantization-aware training, the updated weights are passed to the ONNX model.
## Introduction
Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models. Exporting FP32 PyTorch/TensorFlow models has become popular and easy to use. For Intel Neural Compressor, we hope to export the INT8 model into the ONNX format to achieve higher applicability in multiple frameworks.

Here is the workflow of our export API for PyTorch/TensorFlow FP32/INT8 models.
<a target="_blank" href="./imgs/export.png" text-align:center>
<center>
<img src="./imgs/export.png" alt="Architecture" width=650 height=200>
<img src="./imgs/export.png" alt="Architecture" width=700 height=200>
</center>
</a>
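
For orientation, the following is a minimal sketch of this workflow, assuming a torchvision ResNet-18 and a small random calibration dataloader purely for illustration (the file name and dataloader are placeholders, not part of the export API):

```python
import torch
import torchvision
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor.quantization import fit
from neural_compressor.config import PostTrainingQuantConfig, Torch2ONNXConfig

model = torchvision.models.resnet18()

# Random calibration data, for illustration only; use a real calibration set in practice.
calib_data = TensorDataset(torch.randn(8, 3, 224, 224), torch.zeros(8, dtype=torch.long))
calib_dataloader = DataLoader(calib_data, batch_size=1)

# 1. Quantize the FP32 PyTorch model with Neural Compressor (post-training static quantization).
conf = PostTrainingQuantConfig(approach="static")
q_model = fit(model, conf=conf, calib_dataloader=calib_dataloader)

# 2. Export the INT8 Neural Compressor model to an INT8 ONNX model,
#    reusing the quantized ops and scale info collected in step 1.
int8_onnx_config = Torch2ONNXConfig(
    dtype="int8",
    opset_version=14,
    quant_format="QDQ",
    example_inputs=torch.randn(1, 3, 224, 224),
    input_names=['input'],
    output_names=['output'],
)
q_model.export('int8-model.onnx', int8_onnx_config)
```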

# Supported Framework Model Matrix
## Supported Framework Model Matrix

<table>
<thead>
<tr>
<th>Framework</th>
<th>model type</th>
<th>exported ONNX model type</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4">PyTorch</td>
<td>FP32</td>
<td>FP32</td>
</tr>
<tr>
<td>Post-Training Static Quantized INT8</td>
<td>QLinear/QDQ INT8</td>
</tr>
<tr>
<td>Post-Training Dynamic Quantized INT8</td>
<td>/</td>
</tr>
<tr>
<td>Quantization-aware Training INT8</td>
<td>QLinear/QDQ INT8</td>
</tr>
<tr>
<td rowspan="3">TensorFlow</td>
<td>FP32</td>
<td>FP32</td>
</tr>
<tr>
<td>Post-Training Static Quantized INT8</td>
<td>QDQ INT8</td>
</tr>
<tr>
<td>Quantization-aware Training INT8</td>
<td>QDQ INT8</td>
</tr>
</tbody>
</table>

> **Note**: Follow these steps to export a post-training dynamic quantized ONNX model from a PyTorch model (see the sketch below): \
1. Export the FP32 PyTorch model to an FP32 ONNX model. \
2. Use the FP32 ONNX model as the input model for post-training dynamic quantization.
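
A minimal sketch of these two steps; the model and file names are illustrative, and dynamic quantization here relies on Neural Compressor's ONNX Runtime backend, which needs no calibration data:

```python
import torch
import torchvision
from neural_compressor.experimental.common import Model
from neural_compressor.quantization import fit
from neural_compressor.config import Torch2ONNXConfig, PostTrainingQuantConfig

model = torchvision.models.resnet18()
inc_model = Model(model)

# Step 1: export the FP32 PyTorch model to an FP32 ONNX model.
fp32_onnx_config = Torch2ONNXConfig(
    dtype="fp32",
    example_inputs=torch.randn(1, 3, 224, 224),
    input_names=['input'],
    output_names=['output'],
)
inc_model.export('fp32-model.onnx', fp32_onnx_config)

# Step 2: run post-training dynamic quantization on the FP32 ONNX model.
conf = PostTrainingQuantConfig(approach="dynamic")
q_onnx_model = fit('fp32-model.onnx', conf=conf)
q_onnx_model.save('int8-dynamic-model.onnx')
```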

| Export | PyTorch | TensorFlow |
| :---: | :---: |:----------:|
| FP32 Model -> FP32 ONNX Model | &#10004; | &#10004; |
| INT8 Model -> INT8 QDQ ONNX Model | &#10004; | &#10004; |
| INT8 Model -> INT8 QLinear ONNX Model | &#10004; | :x: |
## Examples

# Examples
### PyTorch Model

#### FP32 Model Export

## FP32 Model Export
```python
from neural_compressor.experimental.common import Model
from neural_compressor.config import Torch2ONNXConfig
inc_model = Model(model)
fp32_onnx_config = Torch2ONNXConfig(
dtype="fp32",
example_inputs=torch.randn(1, 3, 224, 224),
input_names=['input'],
output_names=['output'],
dynamic_axes={"input": {0: "batch_size"},
"output": {0: "batch_size"}},
)
inc_model.export('fp32-model.onnx', fp32_onnx_config)
```
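
As a quick sanity check, the exported model can be loaded and run with ONNX Runtime; the input name `'input'` matches the `input_names` given above:

```python
import numpy as np
import onnxruntime as ort

# Load the exported FP32 ONNX model and run one dummy inference.
session = ort.InferenceSession('fp32-model.onnx', providers=['CPUExecutionProvider'])
dummy_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {'input': dummy_input})
print(outputs[0].shape)
```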

## INT8 Model Export
#### INT8 Model Export

```python
# q_model is a Neural Compressor model after performing quantization.
from neural_compressor.config import Torch2ONNXConfig
int8_onnx_config = Torch2ONNXConfig(
dtype="int8",
opset_version=14,
quant_format="QDQ", # or QLinear
quant_format="QLinear", # or QDQ
example_inputs=torch.randn(1, 3, 224, 224),
input_names=['input'],
output_names=['output'],
dynamic_axes={"input": {0: "batch_size"},
"output": {0: "batch_size"}},
)
q_model.export('int8-model.onnx', int8_onnx_config)
```
- [Image recognition](/examples/pytorch/image_recognition/torchvision_models/export/fx/)
- [Text classification](/examples/pytorch/nlp/huggingface_models/text-classification/export/fx/)

# Appendix

Since there is a known quantization gap between PyTorch 'nn.Linear' module and ONNX 'MatMul + Add' subgraph, we provide three recipes.

For different recipes and ONNX INT8 model formats, 'nn.quantized.Linear' will be exported to the following subgraphs:


<table class="docutils">
<thead>
<tr>
<th align="center">Recipe</th>
<th align="center">QDQ</th>
<th align="center">QLinear</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">QDQ_OP_FP32_BIAS</td>
<td>
<pre>
QuantizeLinear
|
DequantizeLinear
|
MatMul
|
Add
</pre>
</td>
<td>
<pre>
QuantizeLinear
|
MatMulIntegerToFloat
|
Add
</pre>
</td>
</tr>
<tr>
<td align="center">QDQ_OP_INT32_BIAS</td>
<td>
<pre>
QuantizeLinear
|
MatMulInteger
|
Add
|
Cast
|
Mul
</pre>
</td>
<td>
<pre>
QuantizeLinear
|
MatMulInteger
|
Add
|
Cast
|
Mul
</pre>
</td>
</tr>
<tr>
<td align="center">QDQ_OP_FP32_BIAS_QDQ</td>
<td>
<pre>
QuantizeLinear
|
DequantizeLinear
|
MatMul
|
Add
|
QuantizeLinear
|
DequantizeLinear
</pre>
</td>
<td>
<pre>
QuantizeLinear
|
MatMulIntegerToFloat
|
Add
|
QuantizeLinear
|
DequantizeLinear
</pre>
</td>
</tr>
</tbody>
</table>
### TensorFlow Model

#### FP32 Model Export

```python
from neural_compressor.experimental.common import Model
from neural_compressor.config import TF2ONNXConfig
inc_model = Model(model)
config = TF2ONNXConfig(dtype='fp32')
inc_model.export('fp32-model.onnx', config)
```

#### INT8 Model Export

The default recipe is `QDQ_OP_FP32_BIAS`. If the accuracy of the exported ONNX INT8 model does not meet your criterion, we recommend trying the `QDQ_OP_INT32_BIAS` and `QDQ_OP_FP32_BIAS_QDQ` recipes as follows:
```python
# q_model is a Neural Compressor model after performing quantization.
from neural_compressor.config import Torch2ONNXConfig
int8_onnx_config = Torch2ONNXConfig(
dtype="int8",
opset_version=14,
quant_format="QDQ", # or QLinear
example_inputs=torch.randn(1, 3, 224, 224),
input_names=['input'],
output_names=['output'],
dynamic_axes={"input": {0: "batch_size"},
"output": {0: "batch_size"}},
recipe='QDQ_OP_INT32_BIAS', # or QDQ_OP_FP32_BIAS_QDQ
)
q_model.export('int8-model.onnx', int8_onnx_config)
```

```python
from neural_compressor.config import TF2ONNXConfig
config = TF2ONNXConfig(dtype='int8')
q_model.export('int8-model.onnx', config)
```

> **Note**: Several export examples for computer vision tasks are provided in the examples directory. Users can leverage them to verify the accuracy and performance of the exported ONNX model.
- [resnet50_v1_5](/examples/tensorflow/image_recognition/tensorflow_models/resnet50_v1_5/export)
- [resnet50_v1](/examples/tensorflow/image_recognition/tensorflow_models/resnet50_v1/export)
- [vgg16](/examples/tensorflow/image_recognition/tensorflow_models/vgg16/export)
- [ssd_mobilenet_v1](/examples/tensorflow/object_detection/tensorflow_models/ssd_mobilenet_v1/export)
- [mobilenet_v2](/examples/tensorflow/image_recognition/tensorflow_models/mobilenet_v2/export)
- [faster_rcnn_resnet50](/examples/tensorflow/object_detection/tensorflow_models/faster_rcnn_resnet50/export)

## Appendix

### Supported quantized ops

This table lists the TorchScript operators that are supported by ONNX export with torch v2.0. Refer to this [link](https://pytorch.org/docs/stable/onnx_supported_aten_ops.html) for more supported/unsupported ops.

| Operator | opset_version(s) |
| ---------------------------- | ---------------- |
| ``quantized::add`` | Since opset 10 |
| ``quantized::add_relu`` | Since opset 10 |
| ``quantized::cat`` | Since opset 10 |
| ``quantized::conv1d_relu`` | Since opset 10 |
| ``quantized::conv2d`` | Since opset 10 |
| ``quantized::conv2d_relu`` | Since opset 10 |
| ``quantized::group_norm`` | Since opset 10 |
| ``quantized::hardswish`` | Since opset 10 |
| ``quantized::instance_norm`` | Since opset 10 |
| ``quantized::layer_norm`` | Since opset 10 |
| ``quantized::leaky_relu`` | Since opset 10 |
| ``quantized::linear`` | Since opset 10 |
| ``quantized::mul`` | Since opset 10 |
| ``quantized::sigmoid`` | Since opset 10 |

> **Note**: The export function may fail due to unsupported operations. Please fall back unsupported quantized ops to FP32 by setting `op_type_dict` or `op_name_dict` in the `QuantizationAwareTrainingConfig` or `PostTrainingQuantConfig` config, as sketched below. For fallback examples, refer to [Text classification](/examples/pytorch/nlp/huggingface_models/text-classification/export/fx/)
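
For instance, a minimal sketch of such a fallback, mirroring the `op_type_dict={"Embedding": FP32}` setting used in the text-classification example updated in this PR:

```python
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

# Keep unsupported quantized ops (here: Embedding) in FP32 so that
# the exported ONNX model contains only exportable quantized ops.
conf = PostTrainingQuantConfig(
    approach="static",
    op_type_dict={"Embedding": FP32},
)
```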

Binary file modified docs/source/imgs/export.png
@@ -3,9 +3,6 @@
import random
import shutil
import time
import warnings
import sys

import torch
import torch.nn as nn
import torch.nn.parallel
@@ -202,7 +199,7 @@ def eval_func(model):
int8_onnx_config = Torch2ONNXConfig(
dtype="int8",
opset_version=14,
quant_format="QDQ",
quant_format=args.quant_format,
example_inputs=torch.randn(1, 3, 224, 224),
input_names=['input'],
output_names=['output'],
@@ -535,6 +535,7 @@ def eval_func(model):
if model_args.export_dtype == 'int8':
from neural_compressor.quantization import fit
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion
from neural_compressor.utils.constant import FP32
tuning_criterion = TuningCriterion(
strategy="mse_v2",
strategy_kwargs={"confidence_batches": 1},
@@ -544,6 +545,7 @@
approach="static",
quant_level=1,
tuning_criterion=tuning_criterion,
op_type_dict={"Embedding":FP32},
calibration_sampling_size=[300],
)
q_model = fit(model, conf=conf, calib_dataloader=eval_dataloader, eval_func=eval_func)
30 changes: 30 additions & 0 deletions neural_compressor/adaptor/pytorch.py
@@ -829,6 +829,12 @@ def __init__(self, framework_specific_info):
if not self.benchmark:
assert False, "Unsupport approach: {}".format(self.approach)

# TODO: will be removed once 'op_type_dict' and 'op_name_dicts'
# for quant_aware_training can be handled in strategy
if self.approach == 'quant_aware_training':
self.qat_optype_wise = framework_specific_info.get('qat_optype_wise', None)
self.qat_op_wise = framework_specific_info.get('qat_op_wise', None)

self.fp32_results = []
self.fp32_preds_as_label = False

@@ -3608,6 +3614,7 @@ def _pre_hook_for_qat(self, dataloader=None):
quantizable_ops = []
tmp_model = self.fuse_fx_model(self.model, is_qat=True)
self._get_quantizable_ops_recursively(tmp_model, '', quantizable_ops)
self._remove_fallback_ops_for_qat(quantizable_ops)
bf16_ops = []
if self.version.release >= Version("1.11.0").release and self.use_bf16 and \
(CpuInfo().bf16 or os.getenv('FORCE_BF16') == '1'): # pragma: no cover
@@ -3719,6 +3726,29 @@ def _post_hook_for_qat(self):
self._dump_model_op_stats(self.model._model, self.model.q_config, self.approach)
torch_utils.util.get_embedding_contiguous(self.model._model)

def _get_fallback_ops_for_qat(self):
# get fallback ops for quant aware training approach
fallback_ops = {'op_wise': [], 'optype_wise': []}
if self.qat_optype_wise is not None:
for optype, optype_config in self.qat_optype_wise.items():
if 'weight' in optype_config and optype_config['weight']['dtype'] == ['fp32']:
fallback_ops['optype_wise'].append(optype)
if self.qat_op_wise is not None:
for op, op_config in self.qat_op_wise.items():
if 'weight' in op_config and op_config['weight']['dtype'] == ['fp32']:
fallback_ops['op_wise'].append(op)
return fallback_ops

def _remove_fallback_ops_for_qat(self, quantizable_ops):
# remove fallback ops from quantizable_ops for quant aware training approach
fallback_ops = self._get_fallback_ops_for_qat()
remove_ops = []
for (op_name, op_type) in quantizable_ops:
if op_name in fallback_ops['op_wise'] or op_type in fallback_ops['optype_wise']:
remove_ops.append((op_name, op_type))
for (op_name, op_type) in remove_ops:
quantizable_ops.remove((op_name, op_type))
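
As a rough illustration of the fallback matching above (hypothetical op names; assumes the `FP32` constant from `neural_compressor.utils.constant`):

```python
from neural_compressor.utils.constant import FP32  # {'weight': {'dtype': ['fp32']}, 'activation': {'dtype': ['fp32']}}

# Hypothetical QAT fallback config as it reaches the adaptor via framework_specific_info.
qat_optype_wise = {"Embedding": FP32}

# Given quantizable ops like these, _remove_fallback_ops_for_qat drops the Embedding
# entry, so only the Linear op is prepared for quantization-aware training.
quantizable_ops = [
    ("bert.embeddings.word_embeddings", "Embedding"),
    ("bert.encoder.layer.0.attention.self.query", "Linear"),
]
```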

def train(self, model, dataloader, optimizer_tuple, criterion_tuple, hooks, **kwargs):
"""Execute the train process on the specified model.

2 changes: 0 additions & 2 deletions neural_compressor/config.py
@@ -1936,7 +1936,6 @@ def __init__(
input_names=None,
output_names=None,
dynamic_axes=None,
recipe='QDQ_OP_FP32_BIAS',
**kwargs,
):
"""Init a Torch2ONNXConfig object."""
@@ -1949,7 +1948,6 @@
output_names=output_names,
dynamic_axes=dynamic_axes,
)
self.recipe = recipe
self.kwargs = kwargs

