Commit 16a4a12 (parent fe70e0b)

Documentation refine for examples/quantization/pruning/orchestration (intel#988)

Co-authored-by: Wenxin Zhang <[email protected]>
Co-authored-by: hanwen.chang <[email protected]>
Co-authored-by: Tian, Feng <[email protected]>
Co-authored-by: hshen14 <[email protected]>

Showing 14 changed files with 2,363 additions and 1,571 deletions.
@@ -37,9 +37,7 @@ Intel® Neural Compressor has been one of the critical AI software components in
# install stable version from conda
conda install neural-compressor -c conda-forge -c intel
```
-More installation methods can be found at [Installation Guide](./docs/installation_guide.md).
-> **Note:**
-> Run into installation issues, please check [FAQ](./docs/faq.md).
+More installation methods can be found at [Installation Guide](./docs/installation_guide.md). Please check out our [FAQ](./docs/faq.md) for more details.

## Getting Started
* Quantization with Python API

@@ -122,8 +120,8 @@ Intel® Neural Compressor supports systems based on [Intel 64 architecture or co
</tbody>
</table>

-> Note: 1.Starting from official TensorFlow 2.6.0, oneDNN has been default in the binary. Please set the environment variable TF_ENABLE_ONEDNN_OPTS=1 to enable the oneDNN optimizations.
-> 2.Starting from official TensorFlow 2.9.0, oneDNN optimizations are enabled by default on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, etc. No need to set environment variable.
+> **Note:**
+> Please set the environment variable TF_ENABLE_ONEDNN_OPTS=1 to enable oneDNN optimizations if you are using TensorFlow v2.6 to v2.8. oneDNN optimizations are enabled by default starting with TensorFlow v2.9.
### Validated Models
Intel® Neural Compressor validated 420+ [examples](./examples) with performance speedup geomean 2.2x and up to 4.2x on VNNI while minimizing the accuracy loss.

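For TensorFlow v2.6–v2.8, the variable has to be set before TensorFlow is initialized. A minimal sketch (illustrative only, not part of this commit) of doing that from Python rather than the shell:

```python
import os

# TF_ENABLE_ONEDNN_OPTS must be set before TensorFlow is imported so that
# oneDNN optimizations are picked up at initialization time (needed only
# for TF v2.6-v2.8; from v2.9 they are on by default).
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf
print(tf.__version__)
```
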
@@ -143,7 +141,7 @@ More details for validated models are available [here](docs/validated_model_list
</thead>
<tbody>
<tr>
-<td colspan="3" align="center"><a href="docs/infrastructure.md">Infrastructure</a></td>
+<td colspan="3" align="center"><a href="docs/design.md">Architecture</a></td>
<td colspan="2" align="center"><a href="docs/tutorial.md">Tutorial</a></td>
<td colspan="2" align="center"><a href="./examples">Examples</a></td>
<td colspan="1" align="center"><a href="docs/bench.md">GUI</a></td>

@@ -177,7 +175,7 @@ More details for validated models are available [here](docs/validated_model_list
<td colspan="2" align="center"><a href="docs/Quantization.md">Quantization</a></td>
<td colspan="1" align="center"><a href="docs/pruning.md">Pruning</a> <a href="docs/sparsity.md">(Sparsity)</a></td>
<td colspan="3" align="center"><a href="docs/distillation.md">Knowledge Distillation</a></td>
-<td colspan="3" align="center"><a href="docs/mixed_precision.md">Mixed precision</a></td>
+<td colspan="3" align="center"><a href="docs/mixed_precision.md">Mixed Precision</a></td>
</tr>
<tr>
<td colspan="2" align="center"><a href="docs/benchmark.md">Benchmarking</a></td>

@@ -207,7 +205,7 @@ More details for validated models are available [here](docs/validated_model_list
* [Quantizing ONNX Models using Intel® Neural Compressor](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Quantizing-ONNX-Models-using-Intel-Neural-Compressor/post/1355237) (Feb 2022)
* [Quantize AI Model by Intel® oneAPI AI Analytics Toolkit on Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html) (Feb 2022)

-> View the [full publication list](docs/publication_list.md).
+> Please check out our [full publication list](docs/publication_list.md).
## Additional Content

@@ -217,6 +215,6 @@ More details for validated models are available [here](docs/validated_model_list
* [Security Policy](docs/security_policy.md)
* [Intel® Neural Compressor Website](https://intel.github.io/neural-compressor)

-## Hiring
+## Hiring :star:

-We are hiring. Please send your resume to [email protected] if you have interests in model compression techniques.
+We are actively hiring. Please send your resume to [email protected] if you are interested in model compression techniques.
@@ -1,15 +1,77 @@
-Quantization
-============
+# Quantization

-Quantization refers to processes that enable lower precision inference and training by performing computations at fixed point integers that are lower than floating points. This often leads to smaller model sizes and faster inference time. Quantization is particularly useful in deep learning inference and training, where moving data more quickly and reducing bandwidth bottlenecks is optimal. Intel is actively working on techniques that use lower numerical precision by using training with 16-bit multipliers and inference with 8-bit or 16-bit multipliers. Refer to the Intel article on [lower numerical precision inference and training in deep learning](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html).
+Quantization is a widely used model compression technique that can reduce model size while also improving inference and training latency.<br>
+Converting full-precision data to lower precision causes only a small degradation in model accuracy, while the quantized model gains performance by saving memory bandwidth and accelerating computation with low-precision instructions. Intel provides several lower-precision instruction sets (e.g., 8-bit and 16-bit multipliers) that benefit both training and inference.
+Refer to the Intel article on [lower numerical precision inference and training in deep learning](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html).

-Quantization methods include the following three classes:
+## Quantization Support Matrix

-* [Post-Training Quantization (PTQ)](./PTQ.md)
-* [Quantization-Aware Training (QAT)](./QAT.md)
-* [Dynamic Quantization](./dynamic_quantization.md)
+Quantization methods include the following three types:
+<table class="center">
+<thead>
+<tr>
+<th>Types</th>
+<th>Quantization</th>
+<th>Dataset Requirements</th>
+<th>Framework</th>
+<th>Backend</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td rowspan="3" align="center">Post-Training Static Quantization (PTQ)</td>
+<td rowspan="3" align="center">weights and activations</td>
+<td rowspan="3" align="center">calibration</td>
+<td align="center">PyTorch</td>
+<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch Eager</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch FX</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
+</tr>
+<tr>
+<td align="center">TensorFlow</td>
+<td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
+</tr>
+<tr>
+<td align="center">ONNX Runtime</td>
+<td align="center"><a href="https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/quantize.py">QLinearops/QDQ</a></td>
+</tr>
+<tr>
+<td rowspan="2" align="center">Post-Training Dynamic Quantization</td>
+<td rowspan="2" align="center">weights</td>
+<td rowspan="2" align="center">none</td>
+<td align="center">PyTorch</td>
+<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch Eager</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch FX</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
+</tr>
+<tr>
+<td align="center">ONNX Runtime</td>
+<td align="center"><a href="https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/quantize.py">QIntegerops</a></td>
+</tr>
+<tr>
+<td rowspan="2" align="center">Quantization-aware Training (QAT)</td>
+<td rowspan="2" align="center">weights and activations</td>
+<td rowspan="2" align="center">fine-tuning</td>
+<td align="center">PyTorch</td>
+<td align="center"><a href="https://pytorch.org/docs/stable/quantization.html#eager-mode-quantization">PyTorch Eager</a>/<a href="https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization">PyTorch FX</a>/<a href="https://github.com/intel/intel-extension-for-pytorch">IPEX</a></td>
+</tr>
+<tr>
+<td align="center">TensorFlow</td>
+<td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
+</tr>
+</tbody>
+</table>
+<br>
+<br>
+
+> **Note**
+>
+> Dynamic Quantization currently supports only the ONNX Runtime backend.
+
+### [Post-Training Static Quantization](./PTQ.md) performs quantization on an already-trained model. It requires an additional pass over the dataset for calibration, and only activations need calibration.
+<img src="../docs/imgs/PTQ.png" width=256 height=129 alt="PTQ">
+<br>
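To make the calibration-based flow above concrete, here is a minimal sketch (not part of this commit) using the `neural_compressor.experimental` API that also appears in the orchestration document later in this commit; the config file, model path, and calibration dataset are placeholders:

```python
from neural_compressor.experimental import Quantization, common

# ptq_conf.yaml is a placeholder config selecting post-training static
# quantization for the chosen framework.
quantizer = Quantization("ptq_conf.yaml")
quantizer.model = common.Model("./fp32_model")                 # trained FP32 model (placeholder path)
quantizer.calib_dataloader = common.DataLoader(calib_dataset)  # extra pass over calibration data
q_model = quantizer.fit()                                      # returns the quantized model
q_model.save("./int8_model")
```
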
+
+### [Post-Training Dynamic Quantization](./dynamic_quantization.md) multiplies input values by a scaling factor and rounds the result to the nearest integer. The scale factor for activations is determined dynamically from the data range observed at runtime; weights are quantized ahead of time, while activations are quantized on the fly during inference.
+<img src="../docs/imgs/dynamic_quantization.png" width=270 height=124 alt="Dynamic Quantization">
+<br>
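A corresponding sketch for the dynamic case (again illustrative, not part of the commit), assuming a placeholder `dynamic_conf.yaml` whose quantization approach is set to dynamic quantization and an ONNX model, since the note above limits this mode to the ONNX Runtime backend:

```python
from neural_compressor.experimental import Quantization

# dynamic_conf.yaml is a placeholder config selecting post-training
# dynamic quantization; no calibration dataloader is required because
# activation scales are computed at runtime.
quantizer = Quantization("dynamic_conf.yaml")
quantizer.model = "model.onnx"   # placeholder ONNX model path
q_model = quantizer.fit()
q_model.save("./int8_model")
```
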
+
+### [Quantization-aware Training (QAT)](./QAT.md) quantizes models during training and typically delivers higher accuracy than post-training quantization, but it may require additional hyperparameter tuning and more time to deploy.
+<img src="../docs/imgs/QAT.png" width=244 height=147 alt="QAT">
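A sketch of the fine-tuning flow (illustrative, not part of the commit); the assumption here is that the training loop is handed to the quantizer through the experimental API's `q_func` hook, with the config, model, and training function all being placeholders:

```python
from neural_compressor.experimental import Quantization, common

def train_func(model):
    # Placeholder fine-tuning loop: train the fake-quantized model for a few
    # epochs so it adapts to quantization noise, then return it.
    return model

# qat_conf.yaml is a placeholder config selecting quantization-aware training.
quantizer = Quantization("qat_conf.yaml")
quantizer.model = common.Model(fp32_model)   # fp32_model is a placeholder trained model
quantizer.q_func = train_func                # fine-tuning step required by QAT
q_model = quantizer.fit()
```
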
+
+## Examples of Quantization
+
+For quantization-related examples, please refer to the [Quantization examples](/examples/README.md).
(Additional changed files in this commit are not displayable in the diff view.)
@@ -0,0 +1,57 @@
+Optimization Orchestration
+============
+
+## Introduction
+
+Intel Neural Compressor supports arbitrary meaningful combinations of the supported optimization methods, executed in a one-shot or multi-shot way: for example, pruning during quantization-aware training, pruning followed by post-training quantization, or pruning followed by distillation and then quantization.
+
+## Validated Orchestration Types
+
+### One-shot
+
+- Pruning during quantization-aware training
+- Distillation with pattern lock pruning
+- Distillation with pattern lock pruning and quantization-aware training
+
+### Multi-shot
+
+- Pruning and then post-training quantization
+- Distillation and then post-training quantization
+
+## Orchestration user-facing API
+
+Neural Compressor defines a `Scheduler` class that automatically pipelines the execution of model optimizations in a one-shot or multi-shot way.
+
+The user instantiates the model optimization components, such as quantization, pruning, and distillation, separately. Those separate optimization objects are then appended to the scheduler's pipeline, and the scheduler API executes them one by one.
+
+The following example executes pruning and then post-training quantization in a two-shot way.
+
+```python
+from neural_compressor.experimental import Quantization, Pruning, Scheduler
+prune = Pruning(prune_conf)
+quantizer = Quantization(post_training_quantization_conf)
+scheduler = Scheduler()
+scheduler.model = model
+scheduler.append(prune)
+scheduler.append(quantizer)
+opt_model = scheduler.fit()
+```
+
+To execute pruning and quantization-aware training in a one-shot way, combine the two components before appending them, as shown below.
+
+```python
+from neural_compressor.experimental import Quantization, Pruning, Scheduler
+prune = Pruning(prune_conf)
+quantizer = Quantization(quantization_aware_training_conf)
+scheduler = Scheduler()
+scheduler.model = model
+combination = scheduler.combine(prune, quantizer)
+scheduler.append(combination)
+opt_model = scheduler.fit()
+```
+
+### Examples
+
+For orchestration-related examples, please refer to the [Orchestration examples](../examples/README.md).