
Commit

Documentation refine for examples/quantization/pruning/orchestration (intel#988)

Co-authored-by: Wenxin Zhang <[email protected]>
Co-authored-by: hanwen.chang <[email protected]>
Co-authored-by: Tian, Feng <[email protected]>
Co-authored-by: hshen14 <[email protected]>
5 people authored Jun 18, 2022
1 parent fe70e0b commit 16a4a12
Showing 14 changed files with 2,363 additions and 1,571 deletions.
18 changes: 8 additions & 10 deletions README.md
@@ -37,9 +37,7 @@ Intel® Neural Compressor has been one of the critical AI software components in
# install stable version from from conda
conda install neural-compressor -c conda-forge -c intel
```
More installation methods can be found at [Installation Guide](./docs/installation_guide.md).
> **Note:**
> Run into installation issues, please check [FAQ](./docs/faq.md).
More installation methods can be found at [Installation Guide](./docs/installation_guide.md). Please check out our [FAQ](./docs/faq.md) for more details.

## Getting Started
* Quantization with Python API
@@ -122,8 +120,8 @@ Intel® Neural Compressor supports systems based on [Intel 64 architecture or co
</tbody>
</table>

> Note: 1.Starting from official TensorFlow 2.6.0, oneDNN has been default in the binary. Please set the environment variable TF_ENABLE_ONEDNN_OPTS=1 to enable the oneDNN optimizations.
> 2.Starting from official TensorFlow 2.9.0, oneDNN optimizations are enabled by default on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, etc. No need to set environment variable.
> **Note:**
> Please set the environment variable TF_ENABLE_ONEDNN_OPTS=1 to enable oneDNN optimizations if you are using TensorFlow v2.6 to v2.8. Starting from TensorFlow v2.9, oneDNN optimizations are enabled by default.
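
For illustration, one way to set the variable is from Python before TensorFlow is imported (exporting it in the shell before launching works equally well); the snippet below is only an example, not part of the official setup steps.

```python
import os

# Must be set before TensorFlow is imported so that oneDNN picks it up.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf
print(tf.__version__)
```
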
### Validated Models
Intel® Neural Compressor validated 420+ [examples](./examples) with a geomean performance speedup of 2.2x and up to 4.2x on VNNI while minimizing accuracy loss.
@@ -143,7 +141,7 @@ More details for validated models are available [here](docs/validated_model_list
</thead>
<tbody>
<tr>
<td colspan="3" align="center"><a href="docs/infrastructure.md">Infrastructure</a></td>
<td colspan="3" align="center"><a href="docs/design.md">Architecture</a></td>
<td colspan="2" align="center"><a href="docs/tutorial.md">Tutorial</a></td>
<td colspan="2" align="center"><a href="./examples">Examples</a></td>
<td colspan="1" align="center"><a href="docs/bench.md">GUI</a></td>
@@ -177,7 +175,7 @@ More details for validated models are available [here](docs/validated_model_list
<td colspan="2" align="center"><a href="docs/Quantization.md">Quantization</a></td>
<td colspan="1" align="center"><a href="docs/pruning.md">Pruning</a> <a href="docs/sparsity.md">(Sparsity)</a> </td>
<td colspan="3" align="center"><a href="docs/distillation.md">Knowledge Distillation</a></td>
<td colspan="3" align="center"><a href="docs/mixed_precision.md">Mixed precision</a></td>
<td colspan="3" align="center"><a href="docs/mixed_precision.md">Mixed Precision</a></td>
</tr>
<tr>
<td colspan="2" align="center"><a href="docs/benchmark.md">Benchmarking</a></td>
@@ -207,7 +205,7 @@ More details for validated models are available [here](docs/validated_model_list
* [Quantizing ONNX Models using Intel® Neural Compressor](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Quantizing-ONNX-Models-using-Intel-Neural-Compressor/post/1355237) (Feb 2022)
* [Quantize AI Model by Intel® oneAPI AI Analytics Toolkit on Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html) (Feb 2022)

> View the [full publication list](docs/publication_list.md).
> Please check out our [full publication list](docs/publication_list.md).
## Additional Content

@@ -217,6 +215,6 @@ More details for validated models are available [here](docs/validated_model_list
* [Security Policy](docs/security_policy.md)
* [Intel® Neural Compressor Website](https://intel.github.io/neural-compressor)

## Hiring
## Hiring :star:

We are hiring. Please send your resume to [email protected] if you have interests in model compression techniques.
We are actively hiring. Please send your resume to [email protected] if you are interested in model compression techniques.
107 changes: 49 additions & 58 deletions docs/QAT.md
@@ -1,75 +1,56 @@
# QAT
# Quantization-aware Training

## Design

At its core, QAT simulates low-precision inference-time computation in the forward pass of the training process. With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while "aware" of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization.
Quantization-aware training (QAT) simulates low-precision inference-time computation in the forward pass of the training process. With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while "aware" of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization.

The overall workflow for performing QAT is very similar to post-training static quantization (PTQ):

* We can use the same model as PTQ; no additional preparation is needed for quantization-aware training.
* We need to use a qconfig specifying what kind of fake-quantization is to be inserted after weights and activations, instead of specifying observers.
<img src="../docs/imgs/fake_quant.png" width=700 height=433 alt="fake quantize">
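
As a rough intuition (this is an illustrative sketch, not Intel® Neural Compressor code; `fake_quantize` is a made-up helper), fake quantization snaps values onto an int8 grid while keeping every tensor in floating point, so training can continue to backpropagate through the quantization error:

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Snap values to the int8 grid, then map them back to float.
    # All arithmetic stays in floating point, which is what lets QAT
    # keep training while "seeing" the quantization error.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

x = torch.randn(4)
print(x)                                            # original float values
print(fake_quantize(x, scale=0.05, zero_point=0))   # int8-constrained approximation
```
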

## Usage

### MobileNetV2 Model Architecture

Refer to the [PTQ Model Usage](PTQ.md#mobilenetv2-model-architecture).

### Helper Functions

Refer to [PTQ Helper Functions](PTQ.md#helper-functions).

### QAT

First, define a training function:
First, define a training function as below. (The accuracy helper used here is defined in the [PTQ Helper Functions](PTQ.md#helper-functions).)

```python
def train_one_epoch(model, criterion, optimizer, data_loader, device, ntrain_batches):
    model.train()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    avgloss = AverageMeter('Loss', '1.5f')

    cnt = 0
    for image, target in data_loader:
        start_time = time.time()
        print('.', end='')
        cnt += 1
        image, target = image.to(device), target.to(device)
        output = model(image)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        top1.update(acc1[0], image.size(0))
        top5.update(acc5[0], image.size(0))
        avgloss.update(loss, image.size(0))
        if cnt >= ntrain_batches:
            print('Loss', avgloss.avg)
            print('Training: * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
                  .format(top1=top1, top5=top5))
            return

    print('Full imagenet train set: * Acc@1 {top1.global_avg:.3f} Acc@5 {top5.global_avg:.3f}'
          .format(top1=top1, top5=top5))
    return

def training_func_for_nc(model):
    epochs = 8
    iters = 30
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)
    for nepoch in range(epochs):
        model.train()
        cnt = 0
        for image, target in train_loader:
            print('.', end='')
            cnt += 1
            output = model(image)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if cnt >= iters:
                break
        if nepoch > 3:
            # Freeze quantizer parameters
            model.apply(torch.quantization.disable_observer)
        if nepoch > 2:
            # Freeze batch norm mean and variance estimates
            model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
    return model
```
Fuse modules as PTQ:
Fuse modules:
```python
model.fuse_model()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.0001)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
```
Finally, prepare_qat performs the "fake quantization", preparing the model for quantization-aware training:
Finally, prepare_qat performs the "fake quantization", preparing the model for quantization-aware training. In Intel® Neural Compressor, this step is already implemented as a hook:
```python
torch.quantization.prepare_qat(model, inplace=True)
```
Training a quantized model with high accuracy requires accurate modeling of numerics at inference. For quantization-aware training, therefore, modify the training loop by doing the following:

Training a quantized model with high accuracy requires accurate modeling of numerics at inference. Intel® Neural Compressor therefore does the following in its training loop:
* Switch batch norm to use running mean and variance towards the end of training to better match inference numerics.
* Freeze the quantizer parameters (scale and zero-point) and fine tune the weights.

```python
num_train_batches = 20
# Train and check accuracy after each epoch
@@ -88,6 +69,20 @@ for nepoch in range(8):
    print('Epoch %d :Evaluation accuracy on %d images, %2.2f'%(nepoch, num_eval_batches * eval_batch_size, top1.avg))
```

When using QAT in Intel® Neural Compressor, you only need the following APIs:
```python
from neural_compressor.experimental import Quantization, common
quantizer = Quantization("./conf.yaml")
quantizer.model = common.Model(model)
quantizer.q_func = training_func_for_nc
quantizer.eval_dataloader = val_loader
q_model = quantizer.fit()
```

The quantizer.fit() function returns the best quantized model found within the timeout constraint.
<br>
A YAML configuration example: [YAML example](/examples/pytorch/image_recognition/torchvision_models/quantization/qat/fx)

Here, we just perform quantization-aware training for a small number of epochs. Nevertheless, quantization-aware training yields an accuracy of over 71% on the entire imagenet dataset, which is close to the floating point accuracy of 71.9%.

More on quantization-aware training:
@@ -96,10 +91,6 @@ More on quantization-aware training:
* We can simulate the accuracy of a quantized model in floating points since we are using fake-quantization to model the numerics of actual quantized arithmetic.
* We can easily mimic post-training quantization.

Intel® Neural Compressor can support QAT calibration for
PyTorch models. Refer to the [QAT model](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/eager/image_recognition/imagenet/cpu/qat/README.md) for step-by-step tuning.

### Example
View a [QAT example of PyTorch resnet50](/examples/pytorch/image_recognition/torchvision_models/quantization/qat).

### Examples
For related examples, please refer to the [QAT models](/examples/README.md).

82 changes: 72 additions & 10 deletions docs/Quantization.md
@@ -1,15 +1,77 @@
Quantization
============
# Quantization

Quantization refers to processes that enable lower precision inference and training by performing computations at fixed point integers that are lower than floating points. This often leads to smaller model sizes and faster inference time. Quantization is particularly useful in deep learning inference and training, where moving data more quickly and reducing bandwidth bottlenecks is optimal. Intel is actively working on techniques that use lower numerical precision by using training with 16-bit multipliers and inference with 8-bit or 16-bit multipliers. Refer to the Intel article on [lower numerical precision inference and training in deep learning](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html).
Quantization is a widely-used model compression technique that can reduce model size while also improving inference and training latency.</br>
Full-precision data is converted to low precision with little degradation in model accuracy, and the quantized model achieves higher inference performance by saving memory bandwidth and accelerating computations with low-precision instructions. Intel provides several lower-precision instructions (e.g., 8-bit or 16-bit multipliers), and both training and inference can benefit from them.
Refer to the Intel article on [lower numerical precision inference and training in deep learning](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html).

Quantization methods include the following three classes:
## Quantization Support Matrix

* [Post-Training Quantization (PTQ)](./PTQ.md)
* [Quantization-Aware Training (QAT)](./QAT.md)
* [Dynamic Quantization](./dynamic_quantization.md)
Quantization methods include the following three types:
<table class="center">
<thead>
<tr>
<th>Types</th>
<th>Quantization</th>
<th>Dataset Requirements</th>
<th>Framework</th>
<th>Backend</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align="center">Post-Training Static Quantization (PTQ)</td>
<td rowspan="3" align="center">weights and activations</td>
<td rowspan="3" align="center">calibration</td>
<td align="center">PyTorch</td>
<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch Eager</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch FX</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
</tr>
<tr>
<td align="center">TensorFlow</td>
<td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
</tr>
<tr>
<td align="center">ONNX Runtime</td>
<td align="center"><a href="https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/quantize.py">QLinearops/QDQ</a></td>
</tr>
<tr>
<td rowspan="2" align="center">Post-Training Dynamic Quantization</td>
<td rowspan="2" align="center">weights</td>
<td rowspan="2" align="center">none</td>
<td align="center">PyTorch</td>
<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch eager mode</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch fx mode</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
</tr>
<tr>
<td align="center">ONNX Runtime</td>
<td align="center"><a href="https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/quantize.py">QIntegerops</a></td>
</tr>
<tr>
<td rowspan="2" align="center">Quantization-aware Training (QAT)</td>
<td rowspan="2" align="center">weights and activations</td>
<td rowspan="2" align="center">fine-tuning</td>
<td align="center">PyTorch</td>
<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch eager mode</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch fx mode</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
</tr>
<tr>
<td align="center">TensorFlow</td>
<td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
</tr>
</tbody>
</table>
<br>
<br>

> **Note**
>
> Dynamic Quantization currently only supports the onnxruntime backend.

### [Post-Training Static Quantization](./PTQ.md) performs quantization on already-trained models. It requires an additional pass over a dataset for calibration; only activations need to be calibrated.
<img src="../docs/imgs/PTQ.png" width=256 height=129 alt="PTQ">
<br>
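
As an illustration, a post-training static quantization run follows the same API pattern as the QAT snippet in [QAT](./QAT.md); the names fp32_model, calib_loader, and conf_ptq.yaml below are placeholders for a user's own model, dataloader, and configuration.

```python
from neural_compressor.experimental import Quantization, common

# Placeholder names: fp32_model, calib_loader, and conf_ptq.yaml are not
# shipped with Neural Compressor; substitute your own model, dataloader,
# and YAML configuration.
quantizer = Quantization("./conf_ptq.yaml")
quantizer.model = common.Model(fp32_model)
quantizer.calib_dataloader = calib_loader   # calibration pass over a representative dataset
q_model = quantizer.fit()
```
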

### [Post-Training Dynamic Quantization](./dynamic_quantization.md) simply multiplies input values by a scaling factor and rounds the result to the nearest integer. The scale factor for activations is determined dynamically, based on the data range observed at runtime. Weights are quantized ahead of time, while activations are quantized dynamically during inference.
<img src="../docs/imgs/dynamic_quantization.png" width=270 height=124 alt="Dynamic Quantization">
<br>
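
The scale computation can be illustrated with a toy sketch (this is not the library implementation; it assumes symmetric signed-int8 quantization):

```python
import numpy as np

def dynamic_quantize(x, num_bits=8):
    # Derive the scale from the data range observed at runtime,
    # then snap the values onto the signed integer grid.
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(float(x.min())), abs(float(x.max()))) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

activations = np.random.randn(8).astype(np.float32)
q, scale = dynamic_quantize(activations)
print(q)           # int8 values
print(q * scale)   # dequantized approximation of the original activations
```
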

### [Quantization-aware Training (QAT)](./QAT.md) quantizes models during training and typically provides higher accuracy compared with post-training quantization, but QAT may require additional hyper-parameter tuning and may take more time to deploy.
<img src="../docs/imgs/QAT.png" width=244 height=147 alt="QAT">

## Examples of Quantization

For quantization-related examples, please refer to [Quantization examples](/examples/README.md)
2 changes: 1 addition & 1 deletion docs/infrastructure.md → docs/design.md
@@ -1,4 +1,4 @@
Infrastructure
Design
=====
Intel® Neural Compressor features an architecture and workflow that aid in increasing performance and enabling faster deployments across infrastructures.

Binary file added docs/imgs/PTQ.png
Binary file added docs/imgs/QAT.png
Binary file added docs/imgs/dynamic_quantization.png
Binary file added docs/imgs/fake_quant.png
57 changes: 57 additions & 0 deletions docs/orchestration.md
@@ -0,0 +1,57 @@
Optimization Orchestration
============

## Introduction

Intel Neural Compressor supports arbitrary meaningful combinations of the supported optimization methods, executed in a one-shot or multi-shot way, such as pruning during quantization-aware training, pruning followed by post-training quantization,
or pruning followed by distillation and then quantization.

## Validated Orchestration Types

### One-shot

- Pruning during quantization-aware training
- Distillation with pattern lock pruning
- Distillation with pattern lock pruning and quantization-aware training

### Multi-shot

- Pruning and then post-training quantization
- Distillation and then post-training quantization

## Orchestration User-facing API

Neural Compressor defines a `Scheduler` class to automatically pipeline the execution of model optimizations in a one-shot or multi-shot way.

Users instantiate model optimization components, such as quantization, pruning, and distillation, separately. After that, they can append
those optimization objects to the scheduler's pipeline, and the scheduler API executes them one by one.

The following example executes pruning followed by post-training quantization in a two-shot way.

```python
from neural_compressor.experimental import Quantization, Pruning, Scheduler
prune = Pruning(prune_conf)
quantizer = Quantization(post_training_quantization_conf)
scheduler = Scheduler()
scheduler.model = model
scheduler.append(prune)
scheduler.append(quantizer)
opt_model = scheduler.fit()
```

If you want to execute pruning and quantization-aware training in a one-shot way, the code looks like the example below.

```python
from neural_compressor.experimental import Quantization, Pruning, Scheduler
prune = Pruning(prune_conf)
quantizer = Quantization(quantization_aware_training_conf)
scheduler = Scheduler()
scheduler.model = model
combination = scheduler.combine(prune, quantizer)
scheduler.append(combination)
opt_model = scheduler.fit()
```

### Examples

For orchestration-related examples, please refer to [Orchestration examples](../examples/README.md).