Intel® Neural Compressor v2.2 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- External Contributions
- Validated Configurations
Highlights
- Expanded SmoothQuant support to mainstream frameworks, including PyTorch/IPEX, TensorFlow/ITEX, and ONNX Runtime, and validated it on popular large language models (LLMs) such as GPT-J, LLaMA, OPT, BLOOM, Dolly, MPT, LaMini-LM and RedPajama-INCITE.
- Introduced two productivity components: Neural Solution for distributed quantization and Neural Insights for quantization accuracy debugging.
- Integrated Intel Neural Compressor into Microsoft Olive (#157) and DeepSpeed (#3300).
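
For readers new to SmoothQuant, the core idea can be sketched in a few lines of plain Python. This is an illustrative toy and not the library's implementation: the helper names (`smooth_scales`, `apply_smoothing`, `matmul`), the toy matrices, and the choice of `alpha=0.5` are all assumptions made for the example. It shows per-channel scales moving activation outliers into the weights while leaving the matrix product unchanged.

```python
# Toy SmoothQuant-style smoothing (illustrative only, hypothetical helper names).

def smooth_scales(X, W, alpha=0.5, eps=1e-5):
    """Per-input-channel scale s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(max(abs(row[j]) for row in X), eps)  # activation range of channel j
        wgt_max = max(max(abs(w) for w in W[j]), eps)       # weight range of channel j
        scales.append(act_max ** alpha / wgt_max ** (1 - alpha))
    return scales

def apply_smoothing(X, W, scales):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Toy data: activation channel 1 carries an outlier.
X = [[0.5, 40.0, -1.0],
     [-0.3, -35.0, 0.8]]
W = [[0.2, -0.1],
     [0.05, 0.03],
     [-0.4, 0.3]]

scales = smooth_scales(X, W, alpha=0.5)
X_s, W_s = apply_smoothing(X, W, scales)

# The transform is mathematically equivalent: X @ W == X_s @ W_s up to rounding.
reference = matmul(X, W)
smoothed = matmul(X_s, W_s)
max_err = max(abs(r - s) for rr, sr in zip(reference, smoothed) for r, s in zip(rr, sr))

act_range_before = max(abs(x) for row in X for x in row)
act_range_after = max(abs(x) for row in X_s for x in row)
```

After smoothing, the activation range shrinks, which makes the activations easier to quantize, at the cost of a wider weight range in the affected channel.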
Features
- [Quantization] Support TensorFlow SmoothQuant (1f4127)
- [Quantization] Support ITEX SmoothQuant (1f4127)
- [Quantization] Support PyTorch FX SmoothQuant (6a39f6, 603811)
- [Quantization] Support ONNX Runtime SmoothQuant (3df647, 1e1d70)
- [Quantization] Support dictionary inputs for IPEX quantization (4ba233)
- [Quantization] Enable calibration algorithm Entropy/KL & Percentile for ONNX Runtime (dae494)
- [MixedPrecision] Support mixed precision op name/type dict option (a9c2cb)
- [Strategy] Support block-wise tuning (9c26ed)
- [Strategy] Enable mse_v2 for ONNX Runtime (62122d)
- [Pruning] Support retrain-free sparse pruning (d29aa0)
- [Pruning] Support TensorFlow pruning with 2.x API (072c13)
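
As background for the Percentile calibration item above, the following is a minimal sketch of the general technique, assuming symmetric int8 quantization and a nearest-rank percentile. It does not reproduce the ONNX Runtime adaptor code; the function names and the toy data are hypothetical. It compares a clipping threshold taken from the absolute maximum with one taken from a percentile of the observed magnitudes.

```python
import math

def percentile_threshold(values, percentile=99.0):
    """Nearest-rank percentile of |values|, used as the clipping threshold."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(percentile * len(mags) / 100.0))
    return mags[rank - 1]

def quantize_int8(values, threshold):
    """Symmetric int8 quantization: scale = threshold / 127, clip to [-127, 127]."""
    scale = threshold / 127.0
    return [max(-127, min(127, round(v / scale))) for v in values]

# Toy calibration data: 999 values in [-1, 1) plus a single large outlier.
values = [((i % 200) - 100) / 100 for i in range(999)] + [50.0]

max_threshold = max(abs(v) for v in values)            # min-max style calibration
pct_threshold = percentile_threshold(values, 99.0)     # percentile calibration

q_max = quantize_int8(values, max_threshold)
q_pct = quantize_int8(values, pct_threshold)

# Number of distinct integer levels actually used by each scheme.
levels_max = len(set(q_max))
levels_pct = len(set(q_pct))
```

With the outlier clipped, the percentile threshold spends the int8 range on the bulk of the distribution, so far more distinct levels are used than with the absolute maximum.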
Improvement
- [Quantization] Enhance Keras functional model quantization with Keras model in, quantized Keras model out (699751)
- [Quantization] Enhance MatMul and Gather quantization for ONNX Runtime (1f9c4f)
- [Quantization] Add new recipe for ONNX Runtime NLP models (10d82c)
- [MixedPrecision] Add more FP16 OPs support for ONNX Runtime (15d551)
- [MixedPrecision] Add more BF16 OPs support for TensorFlow (369b9d)
- [Pruning] Enhance multihead-attention slim (f3de50)
- [Pruning] Enable progressive pruning in N:M pattern (483e80)
- [Model Export] Refine PT2ONNX export (877adb)
- Remove redundant classes for quantization, benchmark and mixed precision (c51096)
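
The N:M pattern named in the pruning items above can be illustrated with a short sketch. This is an assumption-laden toy, not the library's pruner: `nm_prune` is a hypothetical helper, and here `n` counts the weights kept in every group of `m` consecutive weights (conventions differ on whether N counts kept or pruned weights; the two coincide for 2:4). The progressive schedule and any retraining between steps are not shown.

```python
def nm_prune(weights, n=2, m=4):
    """Zero all but the n largest-magnitude entries in each consecutive group of m."""
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes within this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = nm_prune(row, n=2, m=4)
```

Each group of four in `sparse_row` retains exactly two non-zero weights, which is the regular structure that sparse hardware kernels can exploit.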
Productivity
- [Neural Solution] Support multi-node distributed tuning with model-level parallelism (ee049c)
- [Neural Insights] Support quantization and benchmark diagnosis with GUI (5dc9ea, 3bde2e, 898344)
- [Neural Coder] Migrate Neural Coder support into 2.x API (113ca1, e74a8a)
- [Ecosystem] MSFT Olive integration (#157)
- [Ecosystem] MSFT DeepSpeed integration (#3300)
- Support ITEX 1.2 (5519e2)
- Support Python 3.11 (6fa053)
- Enhance documentation for mixed precision, diagnosis, dataloader, metric, etc.
Bug Fixes
- Fix ONNX Runtime SmoothQuant issues (85c6a0, 1b26c0)
- Fix bug in IPEX fallback (b4f9c7)
- Fix ITEX quantize/dequantize before BN u8 issue (5519e2)
- Fix example inputs issue for IPEX SmoothQuant (c8b753)
- Fix IPEX mixed precision (d1e734)
- Fix inspect tensor (8f5f5d)
- Fix accuracy issues in the PyTorch models peleenet and 3dunet after migrating to the 2.x API
- Fix CVEs (04c482, efcd98, 6e9f7b, 7abe32)
Examples
- Enable 4 ONNX Runtime examples: layoutlmv3, layoutlmft, deberta-v3, and GPTJ-6B.
- Enable 2 TensorFlow LLMs with SmoothQuant: facebook-opt-125m and gpt2-medium.
External Contributions
- Add a mathematical check for SmoothQuant transform (5c04ac)
- Fix mismatched absorb layers caused by tracing and named modules for SmoothQuant (bccc89)
- Fix tracing issue when the input is a dictionary for SmoothQuant (6a3c64)
- Allow dictionary model inputs for ONNX export (17b642)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.7, 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.10.0, 2.11.0, 2.12.0
- ITEX 1.1.0, 1.2.0
- PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.1+cpu
- ONNX Runtime 1.13.1, 1.14.1, 1.15.0
- MXNet 1.9.1