
Intel® Neural Compressor v2.1 Release

@chensuyue released this 29 Mar 10:17
· 1104 commits to master since this release
  • Highlights
  • Features
  • Improvements
  • Bug Fixes
  • Examples
  • Documentation
  • Validated Configurations

Highlights

  • Support and enhance SmoothQuant on popular large language models (LLMs) such as BLOOM-176B, OPT-30B, and GPT-J-6B (see the sketch after this list)
  • Support native Keras model quantization (Keras model as input, and quantized Keras model as output)
  • Provide auto-tuning strategy to improve quantization productivity
  • Support model conversion from TensorFlow INT8 to ONNX INT8 model
  • Polish documentation to help users get started more easily
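
A minimal sketch of the SmoothQuant workflow under the v2.x API. The tiny OPT checkpoint, the two-sentence calibration set, and the alpha value below are illustrative assumptions, not the release's validated recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

class CalibDataloader:
    """Any iterable of (input, label) pairs with a batch_size attribute works."""
    batch_size = 1
    def __iter__(self):
        for text in ["Hello world.", "SmoothQuant calibrates activation scales."]:
            yield tokenizer(text, return_tensors="pt")["input_ids"], None

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,                 # enable the SmoothQuant recipe
        "smooth_quant_args": {"alpha": 0.5},  # how much difficulty is migrated from activations to weights
    }
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("./opt-125m-int8")
```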

Features

  • [Quantization] Support SmoothQuant and verify it with popular LLMs (commit cbb5cf) (commit 08e255) (commit 12c101) (see the sketch under Highlights)
  • [Quantization] Support Keras functional model quantization, with a Keras model in and a quantized Keras model out (commit efd737) (see the first sketch after this list)
  • [Strategy] Add the auto quantization level as the default tuning process (commit cdfb99) (see the tuning sketch after this list)
  • [Strategy] Integrate quantization recipes into tuning strategy (commit 44d176)
  • [Strategy] Extend the strategy capability to support adding new data types (commit d0059c)
  • [Strategy] Enable multi-node distributed quantization at the tuning-strategy level (commit e1fe50)
  • [AMP] Support FP16 mixed precision with ONNX Runtime (commit 108c24) (see the mixed-precision sketch after this list)
  • [Productivity] Export TensorFlow models to ONNX QDQ format at both FP32 and INT8 precision (commit 33a235) (see the export sketch after this list)
  • [Productivity] Support PT/IPEX v2.0 (commit dbf138)
  • [Productivity] Support ONNX Runtime v1.14.1 (commit 146759)
  • [Productivity] GitHub.io docs now support historical versions
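
A minimal Keras-in/Keras-out sketch under the v2.x API; the toy model, the random calibration data, and the "itex" backend name below are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf
from neural_compressor import PostTrainingQuantConfig, quantization

# A small functional Keras model standing in for a real one.
inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(8, 3, activation="relu")(inputs)
x = tf.keras.layers.Flatten()(x)
model = tf.keras.Model(inputs, tf.keras.layers.Dense(10)(x))

class CalibDataloader:
    """Any iterable of (input, label) batches with a batch_size attribute."""
    batch_size = 8
    def __iter__(self):
        for _ in range(4):
            yield np.random.rand(8, 28, 28, 1).astype("float32"), np.zeros(8)

conf = PostTrainingQuantConfig(backend="itex")  # assumed backend name for Keras
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("quantized_keras_model")  # output is again a Keras model
```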
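The auto quantization level plugs into the same fit() entry point. A tuning sketch, assuming the quant_level knob and the TuningCriterion/AccuracyCriterion config objects; the toy model, loader, and metric are placeholders:

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.config import TuningCriterion, AccuracyCriterion

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))

class CalibDataloader:
    batch_size = 8
    def __iter__(self):
        for _ in range(4):
            yield torch.randn(8, 16), torch.zeros(8, dtype=torch.long)

def eval_func(m):
    return 1.0  # stand-in metric; a real eval_func returns validation accuracy

conf = PostTrainingQuantConfig(
    quant_level="auto",  # assumed name of the auto-level knob
    tuning_criterion=TuningCriterion(max_trials=100),
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),  # allow <=1% accuracy drop
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader(), eval_func=eval_func)
```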
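For FP16 on ONNX Runtime, the mixed-precision API converts an FP32 model in place of a full quantization run. A sketch, assuming the MixedPrecisionConfig parameter names and the "onnxrt_cuda_ep" backend identifier; model paths are placeholders:

```python
from neural_compressor import mix_precision
from neural_compressor.config import MixedPrecisionConfig

conf = MixedPrecisionConfig(
    device="gpu",              # FP16 targets GPU execution providers
    backend="onnxrt_cuda_ep",  # assumed backend identifier
    precisions="fp16",
)
converted = mix_precision.fit("model_fp32.onnx", conf)  # placeholder input path
converted.save("model_fp16.onnx")
```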
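And for the TensorFlow-to-ONNX path, the quantized model object exports directly. An export sketch, assuming the TF2ONNXConfig API:

```python
from neural_compressor.config import TF2ONNXConfig

# `q_model` is the object returned by quantization.fit on a TensorFlow model.
q_model.export("int8_model.onnx", TF2ONNXConfig(dtype="int8"))  # "fp32" also supported
```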

Improvements

  • Remove the dependency on experimental API (commit 6e10ef)
  • Enhance the GUI diagnosis function's display of model graphs and tensor histograms (commit 9f0891)
  • Optimize memory usage for the PyTorch adaptor (commit c295a7), ONNX adaptor (commit 8cbf2e), TensorFlow adaptor (commit ad0f1e), and tuning strategy (commit c49300) to support LLMs
  • Refine ONNX Runtime QDQ quantization graph (commit c64a5b)
  • Enable ONNX model quantization with the NVIDIA GPU TensorRT execution provider (TRT EP) (commit ba42d0) (see the sketch after this list)
  • Improve code line coverage to 85%
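
A sketch of quantizing an ONNX model on an NVIDIA GPU through the TensorRT execution provider, assuming the "onnxrt_trt_ep" backend identifier; the model path and input shape are placeholders for your own model:

```python
import numpy as np
from neural_compressor import PostTrainingQuantConfig, quantization

class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        for _ in range(10):
            # Replace the shape with your model's actual input shape.
            yield np.random.rand(1, 3, 224, 224).astype(np.float32), None

conf = PostTrainingQuantConfig(device="gpu", backend="onnxrt_trt_ep")
q_model = quantization.fit("model.onnx", conf, calib_dataloader=CalibDataloader())
q_model.save("model_int8.onnx")
```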

Bug Fixes

  • Fix mix precision config setting (commit 4b71a8)
  • Fix multi-instance benchmark on Windows (commit 1f89aa)
  • Fix domain detection for large ONNX models (commit 70a566)

Examples

  • Migrate examples to the INC v2.0 API
  • Enable LLMs (e.g., GPT-NeoX, T5 Large, BLOOM-176B, OPT-30B, GPT-J-6B)
  • Enable Keras-in/Keras-out examples (commit efd737)
  • Enable multi-node training examples on CPU (e.g., RN50 distillation, QAT, pruning examples)
  • Add 15+ Hugging Face (HF) examples with the ONNX Runtime backend and upload the quantized models to HF (commit a4228d)
  • Add 2 examples for PT2ONNX model export (commit 26db4a) (see the sketch after this list)
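
A sketch of the PT2ONNX export path, assuming the v2.x Torch2ONNXConfig API; the ResNet-18 model, shapes, and names below are illustrative, not one of the two new examples:

```python
import torch
import torchvision
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.config import Torch2ONNXConfig

model = torchvision.models.resnet18(weights=None)

class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        for _ in range(4):
            yield torch.randn(1, 3, 224, 224), torch.zeros(1, dtype=torch.long)

# Quantize first, then export the quantized model to ONNX QDQ format.
q_model = quantization.fit(model, PostTrainingQuantConfig(), calib_dataloader=CalibDataloader())
q_model.export(
    "resnet18_int8.onnx",
    Torch2ONNXConfig(
        dtype="int8",
        opset_version=14,
        quant_format="QDQ",  # Q/DQ node pairs, matching the ONNX QDQ examples above
        example_inputs=torch.randn(1, 3, 224, 224),
        input_names=["input"],
        output_names=["output"],
    ),
)
```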

Documentation

  • Polish documentation: a simplified GitHub main page, an easier-to-read IO docs structure, a hands-on API migration guide, more detailed instructions for the new API, a refreshed API docs template, etc.

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.7, 3.8, 3.9, 3.10
  • TensorFlow 2.10.1, 2.11.0, 2.12.0
  • ITEX 1.0.0, 1.1.0
  • PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.0+cpu
  • ONNX Runtime 1.12.1, 1.13.1, 1.14.1
  • MXNet 1.9.1