
Commit

Documentation refine for examples/quantization/pruning/orchestration (intel#988)

Co-authored-by: Wenxin Zhang <[email protected]>
Co-authored-by: hanwen.chang <[email protected]>
Co-authored-by: Tian, Feng <[email protected]>
Co-authored-by: hshen14 <[email protected]>
5 people authored Jun 18, 2022
1 parent fe70e0b commit 16a4a12
Showing 14 changed files with 2,363 additions and 1,571 deletions.
18 changes: 8 additions & 10 deletions README.md
@@ -37,9 +37,7 @@ Intel® Neural Compressor has been one of the critical AI software components in
# install stable version from from conda
conda install neural-compressor -c conda-forge -c intel
```
More installation methods can be found at [Installation Guide](./docs/installation_guide.md).
> **Note:**
> Run into installation issues, please check [FAQ](./docs/faq.md).
More installation methods can be found at [Installation Guide](./docs/installation_guide.md). Please check out our [FAQ](./docs/faq.md) for more details.

## Getting Started
* Quantization with Python API
@@ -122,8 +120,8 @@ Intel® Neural Compressor supports systems based on [Intel 64 architecture or co
</tbody>
</table>

> Note: 1.Starting from official TensorFlow 2.6.0, oneDNN has been default in the binary. Please set the environment variable TF_ENABLE_ONEDNN_OPTS=1 to enable the oneDNN optimizations.
> 2.Starting from official TensorFlow 2.9.0, oneDNN optimizations are enabled by default on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, etc. No need to set environment variable.
> **Note:**
> Please set the environment variable TF_ENABLE_ONEDNN_OPTS=1 to enable oneDNN optimizations if you are using TensorFlow v2.6 to v2.8. Starting from TensorFlow v2.9, oneDNN optimizations are enabled by default.
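
For illustration, one way to set the variable is from Python before TensorFlow is imported (exporting it in the shell before launching works equally well); the snippet below is only an example, not part of the official setup steps.

```python
import os

# Must be set before TensorFlow is imported so that oneDNN picks it up.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf
print(tf.__version__)
```
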
### Validated Models
Intel® Neural Compressor validated 420+ [examples](./examples) with a geomean performance speedup of 2.2x and up to 4.2x on VNNI while minimizing accuracy loss.
@@ -143,7 +141,7 @@ More details for validated models are available [here](docs/validated_model_list
</thead>
<tbody>
<tr>
<td colspan="3" align="center"><a href="docs/infrastructure.md">Infrastructure</a></td>
<td colspan="3" align="center"><a href="docs/design.md">Architecture</a></td>
<td colspan="2" align="center"><a href="docs/tutorial.md">Tutorial</a></td>
<td colspan="2" align="center"><a href="./examples">Examples</a></td>
<td colspan="1" align="center"><a href="docs/bench.md">GUI</a></td>
@@ -177,7 +175,7 @@ More details for validated models are available [here](docs/validated_model_list
<td colspan="2" align="center"><a href="docs/Quantization.md">Quantization</a></td>
<td colspan="1" align="center"><a href="docs/pruning.md">Pruning</a> <a href="docs/sparsity.md">(Sparsity)</a> </td>
<td colspan="3" align="center"><a href="docs/distillation.md">Knowledge Distillation</a></td>
<td colspan="3" align="center"><a href="docs/mixed_precision.md">Mixed precision</a></td>
<td colspan="3" align="center"><a href="docs/mixed_precision.md">Mixed Precision</a></td>
</tr>
<tr>
<td colspan="2" align="center"><a href="docs/benchmark.md">Benchmarking</a></td>
@@ -207,7 +205,7 @@ More details for validated models are available [here](docs/validated_model_list
* [Quantizing ONNX Models using Intel® Neural Compressor](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Quantizing-ONNX-Models-using-Intel-Neural-Compressor/post/1355237) (Feb 2022)
* [Quantize AI Model by Intel® oneAPI AI Analytics Toolkit on Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html) (Feb 2022)

> View the [full publication list](docs/publication_list.md).
> Please check out our [full publication list](docs/publication_list.md).
## Additional Content

@@ -217,6 +215,6 @@ More details for validated models are available [here](docs/validated_model_list
* [Security Policy](docs/security_policy.md)
* [Intel® Neural Compressor Website](https://intel.github.io/neural-compressor)

## Hiring
## Hiring :star:

We are hiring. Please send your resume to [email protected] if you have interests in model compression techniques.
We are actively hiring. Please send your resume to [email protected] if you are interested in model compression techniques.
107 changes: 49 additions & 58 deletions docs/QAT.md
@@ -1,75 +1,56 @@
# QAT
# Quantization-aware Training

## Design

At its core, QAT simulates low-precision inference-time computation in the forward pass of the training process. With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while "aware" of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization.
Quantization-aware training (QAT) simulates low-precision inference-time computation in the forward pass of the training process. With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while "aware" of the fact that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield higher accuracy than either dynamic quantization or post-training static quantization.

The overall workflow for performing QAT is very similar to post-training static quantization (PTQ):

* We can use the same model as PTQ; no additional preparation is needed for quantization-aware training.
* We need to use a qconfig specifying what kind of fake-quantization is to be inserted after weights and activations, instead of specifying observers.
<img src="../docs/imgs/fake_quant.png" width=700 height=433 alt="fake quantize">
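
As a rough intuition (this is an illustrative sketch, not Intel® Neural Compressor code; `fake_quantize` is a made-up helper), fake quantization snaps values onto an int8 grid while keeping every tensor in floating point, so training can continue to backpropagate through the quantization error:

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Snap values to the int8 grid, then map them back to float.
    # All arithmetic stays in floating point, which is what lets QAT
    # keep training while "seeing" the quantization error.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

x = torch.randn(4)
print(x)                                            # original float values
print(fake_quantize(x, scale=0.05, zero_point=0))   # int8-constrained approximation
```
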

## Usage

### MobileNetV2 Model Architecture

Refer to the [PTQ Model Usage](PTQ.md#mobilenetv2-model-architecture).

### Helper Functions

Refer to [PTQ Helper Functions](PTQ.md#helper-functions).

### QAT

First, define a training function:
First, define a training function as below. (The accuracy helper used here is defined in the [PTQ Helper Functions](PTQ.md#helper-functions).)

```python
def train_one_epoch(model, criterion, optimizer, data_loader, device, ntrain_batches):
    model.train()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    avgloss = AverageMeter('Loss', '1.5f')

    cnt = 0
    for image, target in data_loader:
        start_time = time.time()
        print('.', end='')
        cnt += 1
        image, target = image.to(device), target.to(device)
        output = model(image)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        top1.update(acc1[0], image.size(0))
        top5.update(acc5[0], image.size(0))
        avgloss.update(loss, image.size(0))
        if cnt >= ntrain_batches:
            print('Loss', avgloss.avg)
            print('Training: * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
                  .format(top1=top1, top5=top5))
            return

    print('Full imagenet train set: * Acc@1 {top1.global_avg:.3f} Acc@5 {top5.global_avg:.3f}'
          .format(top1=top1, top5=top5))
    return

def training_func_for_nc(model):
    epochs = 8
    iters = 30
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)
    for nepoch in range(epochs):
        model.train()
        cnt = 0
        for image, target in train_loader:
            print('.', end='')
            cnt += 1
            output = model(image)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if cnt >= iters:
                break
        if nepoch > 3:
            # Freeze quantizer parameters
            model.apply(torch.quantization.disable_observer)
        if nepoch > 2:
            # Freeze batch norm mean and variance estimates
            model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
    return model
```
Fuse modules as PTQ:
Fuse modules:
```python
model.fuse_model()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.0001)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
```
Finally, prepare_qat performs the "fake quantization", preparing the model for quantization-aware training:
Finally, prepare_qat performs the "fake quantization", preparing the model for quantization-aware training. In Intel® Neural Compressor, this step is already implemented as a hook:
```python
torch.quantization.prepare_qat(model, inplace=True)
```
Training a quantized model with high accuracy requires accurate modeling of numerics at inference. For quantization-aware training, therefore, modify the training loop by doing the following:

Training a quantized model with high accuracy requires accurate modeling of numerics at inference. Intel® Neural Compressor therefore does the following in its training loop:
* Switch batch norm to use running mean and variance towards the end of training to better match inference numerics.
* Freeze the quantizer parameters (scale and zero-point) and fine tune the weights.

```python
num_train_batches = 20
# Train and check accuracy after each epoch
@@ -88,6 +69,20 @@ for nepoch in range(8):
    print('Epoch %d :Evaluation accuracy on %d images, %2.2f'%(nepoch, num_eval_batches * eval_batch_size, top1.avg))
```

When using QAT in Intel® Neural Compressor, you only need the following APIs:
```python
from neural_compressor.experimental import Quantization, common
quantizer = Quantization("./conf.yaml")
quantizer.model = common.Model(model)
quantizer.q_func = training_func_for_nc
quantizer.eval_dataloader = val_loader
q_model = quantizer.fit()
```

The quantizer.fit() function returns the best quantized model found within the timeout constraint.
<br>
A YAML configuration example: [YAML example](/examples/pytorch/image_recognition/torchvision_models/quantization/qat/fx)

Here, we just perform quantization-aware training for a small number of epochs. Nevertheless, quantization-aware training yields an accuracy of over 71% on the entire imagenet dataset, which is close to the floating point accuracy of 71.9%.

More on quantization-aware training:
@@ -96,10 +91,6 @@ More on quantization-aware training:
* We can simulate the accuracy of a quantized model in floating points since we are using fake-quantization to model the numerics of actual quantized arithmetic.
* We can easily mimic post-training quantization.

Intel® Neural Compressor can support QAT calibration for
PyTorch models. Refer to the [QAT model](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/eager/image_recognition/imagenet/cpu/qat/README.md) for step-by-step tuning.

### Example
View a [QAT example of PyTorch resnet50](/examples/pytorch/image_recognition/torchvision_models/quantization/qat).

### Examples
For related examples, please refer to the [QAT models](/examples/README.md).

82 changes: 72 additions & 10 deletions docs/Quantization.md
@@ -1,15 +1,77 @@
Quantization
============
# Quantization

Quantization refers to processes that enable lower precision inference and training by performing computations at fixed point integers that are lower than floating points. This often leads to smaller model sizes and faster inference time. Quantization is particularly useful in deep learning inference and training, where moving data more quickly and reducing bandwidth bottlenecks is optimal. Intel is actively working on techniques that use lower numerical precision by using training with 16-bit multipliers and inference with 8-bit or 16-bit multipliers. Refer to the Intel article on [lower numerical precision inference and training in deep learning](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html).
Quantization is a widely-used model compression technique that can reduce model size while also improving inference and training latency.</br>
Full-precision data is converted to low precision with little degradation in model accuracy, and the quantized model achieves higher inference performance by saving memory bandwidth and accelerating computations with low-precision instructions. Intel provides several lower-precision instructions (e.g., 8-bit or 16-bit multipliers), and both training and inference can benefit from them.
Refer to the Intel article on [lower numerical precision inference and training in deep learning](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html).

Quantization methods include the following three classes:
## Quantization Support Matrix

* [Post-Training Quantization (PTQ)](./PTQ.md)
* [Quantization-Aware Training (QAT)](./QAT.md)
* [Dynamic Quantization](./dynamic_quantization.md)
Quantization methods include the following three types:
<table class="center">
<thead>
<tr>
<th>Types</th>
<th>Quantization</th>
<th>Dataset Requirements</th>
<th>Framework</th>
<th>Backend</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align="center">Post-Training Static Quantization (PTQ)</td>
<td rowspan="3" align="center">weights and activations</td>
<td rowspan="3" align="center">calibration</td>
<td align="center">PyTorch</td>
<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch Eager</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch FX</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
</tr>
<tr>
<td align="center">TensorFlow</td>
<td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
</tr>
<tr>
<td align="center">ONNX Runtime</td>
<td align="center"><a href="https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/quantize.py">QLinearops/QDQ</a></td>
</tr>
<tr>
<td rowspan="2" align="center">Post-Training Dynamic Quantization</td>
<td rowspan="2" align="center">weights</td>
<td rowspan="2" align="center">none</td>
<td align="center">PyTorch</td>
<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch eager mode</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch fx mode</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
</tr>
<tr>
<td align="center">ONNX Runtime</td>
<td align="center"><a href="https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/quantize.py">QIntegerops</a></td>
</tr>
<tr>
<td rowspan="2" align="center">Quantization-aware Training (QAT)</td>
<td rowspan="2" align="center">weights and activations</td>
<td rowspan="2" align="center">fine-tuning</td>
<td align="center">PyTorch</td>
<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch eager mode</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch fx mode</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
</tr>
<tr>
<td align="center">TensorFlow</td>
<td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
</tr>
</tbody>
</table>
<br>
<br>

> **Note**
>
> Dynamic Quantization currently only supports the onnxruntime backend.

### [Post-Training Static Quantization](./PTQ.md) performs quantization on already-trained models. It requires an additional pass over a dataset for calibration; only activations need to be calibrated.
<img src="../docs/imgs/PTQ.png" width=256 height=129 alt="PTQ">
<br>
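
As an illustration, a post-training static quantization run follows the same API pattern as the QAT snippet in [QAT](./QAT.md); the names fp32_model, calib_loader, and conf_ptq.yaml below are placeholders for a user's own model, dataloader, and configuration.

```python
from neural_compressor.experimental import Quantization, common

# Placeholder names: fp32_model, calib_loader, and conf_ptq.yaml are not
# shipped with Neural Compressor; substitute your own model, dataloader,
# and YAML configuration.
quantizer = Quantization("./conf_ptq.yaml")
quantizer.model = common.Model(fp32_model)
quantizer.calib_dataloader = calib_loader   # calibration pass over a representative dataset
q_model = quantizer.fit()
```
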

### [Post-Training Dynamic Quantization](./dynamic_quantization.md) simply multiplies input values by a scaling factor and rounds the result to the nearest integer. The scale factor for activations is determined dynamically, based on the data range observed at runtime. Weights are quantized ahead of time, while activations are quantized dynamically during inference.
<img src="../docs/imgs/dynamic_quantization.png" width=270 height=124 alt="Dynamic Quantization">
<br>
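
The scale computation can be illustrated with a toy sketch (this is not the library implementation; it assumes symmetric signed-int8 quantization):

```python
import numpy as np

def dynamic_quantize(x, num_bits=8):
    # Derive the scale from the data range observed at runtime,
    # then snap the values onto the signed integer grid.
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(float(x.min())), abs(float(x.max()))) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

activations = np.random.randn(8).astype(np.float32)
q, scale = dynamic_quantize(activations)
print(q)           # int8 values
print(q * scale)   # dequantized approximation of the original activations
```
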

### [Quantization-aware Training (QAT)](./QAT.md) quantizes models during training and typically provides higher accuracy compared with post-training quantization, but QAT may require additional hyper-parameter tuning and may take more time to deploy.
<img src="../docs/imgs/QAT.png" width=244 height=147 alt="QAT">

## Examples of Quantization

For quantization-related examples, please refer to [Quantization examples](/examples/README.md)
2 changes: 1 addition & 1 deletion docs/infrastructure.md → docs/design.md
@@ -1,4 +1,4 @@
Infrastructure
Design
=====
Intel® Neural Compressor features an architecture and workflow that aid in increasing performance and enabling faster deployments across infrastructures.

Binary file added docs/imgs/PTQ.png
Binary file added docs/imgs/QAT.png
Binary file added docs/imgs/dynamic_quantization.png
Binary file added docs/imgs/fake_quant.png
57 changes: 57 additions & 0 deletions docs/orchestration.md
@@ -0,0 +1,57 @@
Optimization Orchestration
============

## Introduction

Intel Neural Compressor supports arbitrary meaningful combinations of the supported optimization methods, executed in a one-shot or multi-shot way, such as pruning during quantization-aware training, pruning followed by post-training quantization,
or pruning followed by distillation and then quantization.

## Validated Orchestration Types

### One-shot

- Pruning during quantization-aware training
- Distillation with pattern lock pruning
- Distillation with pattern lock pruning and quantization-aware training

### Multi-shot

- Pruning and then post-training quantization
- Distillation and then post-training quantization

## Orchestration User-facing API

Neural Compressor defines a `Scheduler` class to automatically pipeline the execution of model optimizations in a one-shot or multi-shot way.

Users instantiate model optimization components, such as quantization, pruning, and distillation, separately. After that, they can append
those optimization objects to the scheduler's pipeline, and the scheduler API executes them one by one.

The following example executes pruning followed by post-training quantization in a two-shot way.

```python
from neural_compressor.experimental import Quantization, Pruning, Scheduler
prune = Pruning(prune_conf)
quantizer = Quantization(post_training_quantization_conf)
scheduler = Scheduler()
scheduler.model = model
scheduler.append(prune)
scheduler.append(quantizer)
opt_model = scheduler.fit()
```

If you want to execute pruning and quantization-aware training in a one-shot way, the code looks like the example below.

```python
from neural_compressor.experimental import Quantization, Pruning, Scheduler
prune = Pruning(prune_conf)
quantizer = Quantization(quantization_aware_training_conf)
scheduler = Scheduler()
scheduler.model = model
combination = scheduler.combine(prune, quantizer)
scheduler.append(combination)
opt_model = scheduler.fit()
```

### Examples

For orchestration-related examples, please refer to [Orchestration examples](../examples/README.md).