
Intel® Neural Compressor v2.1 Release

@chensuyue released this 29 Mar 10:17
· 1104 commits to master since this release
  • Highlights
  • Features
  • Improvements
  • Bug Fixes
  • Examples
  • Documentation
  • Validated Configurations

Highlights

  • Support and enhance SmoothQuant on popular large language models (LLMs) such as BLOOM-176B, OPT-30B, and GPT-J-6B (see the sketch after this list)
  • Support native Keras model quantization (Keras model as input, and quantized Keras model as output)
  • Provide auto-tuning strategy to improve quantization productivity
  • Support model conversion from TensorFlow INT8 to ONNX INT8 model
  • Polish documentation to help users get started more easily
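
A minimal sketch of the SmoothQuant workflow under the v2.x API. The tiny OPT checkpoint, the two-sentence calibration set, and the alpha value below are illustrative assumptions, not the release's validated recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

class CalibDataloader:
    """Any iterable of (input, label) pairs with a batch_size attribute works."""
    batch_size = 1
    def __iter__(self):
        for text in ["Hello world.", "SmoothQuant calibrates activation scales."]:
            yield tokenizer(text, return_tensors="pt")["input_ids"], None

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,                 # enable the SmoothQuant recipe
        "smooth_quant_args": {"alpha": 0.5},  # how much difficulty is migrated from activations to weights
    }
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("./opt-125m-int8")
```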

Features

  • [Quantization] Support SmoothQuant and verify it with popular LLMs (commit cbb5cf) (commit 08e255) (commit 12c101) (see the sketch under Highlights)
  • [Quantization] Support Keras functional model quantization, with a Keras model in and a quantized Keras model out (commit efd737) (see the first sketch after this list)
  • [Strategy] Add the auto quantization level as the default tuning process (commit cdfb99) (see the tuning sketch after this list)
  • [Strategy] Integrate quantization recipes into tuning strategy (commit 44d176)
  • [Strategy] Extend the strategy capability to support adding new data types (commit d0059c)
  • [Strategy] Enable multi-node distributed quantization at the tuning-strategy level (commit e1fe50)
  • [AMP] Support FP16 mixed precision with ONNX Runtime (commit 108c24) (see the mixed-precision sketch after this list)
  • [Productivity] Export TensorFlow models to ONNX QDQ format at both FP32 and INT8 precision (commit 33a235) (see the export sketch after this list)
  • [Productivity] Support PT/IPEX v2.0 (commit dbf138)
  • [Productivity] Support ONNX Runtime v1.14.1 (commit 146759)
  • [Productivity] GitHub.io docs now support historical versions
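
A minimal Keras-in/Keras-out sketch under the v2.x API; the toy model, the random calibration data, and the "itex" backend name below are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf
from neural_compressor import PostTrainingQuantConfig, quantization

# A small functional Keras model standing in for a real one.
inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(8, 3, activation="relu")(inputs)
x = tf.keras.layers.Flatten()(x)
model = tf.keras.Model(inputs, tf.keras.layers.Dense(10)(x))

class CalibDataloader:
    """Any iterable of (input, label) batches with a batch_size attribute."""
    batch_size = 8
    def __iter__(self):
        for _ in range(4):
            yield np.random.rand(8, 28, 28, 1).astype("float32"), np.zeros(8)

conf = PostTrainingQuantConfig(backend="itex")  # assumed backend name for Keras
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("quantized_keras_model")  # output is again a Keras model
```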
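The auto quantization level plugs into the same fit() entry point. A tuning sketch, assuming the quant_level knob and the TuningCriterion/AccuracyCriterion config objects; the toy model, loader, and metric are placeholders:

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.config import TuningCriterion, AccuracyCriterion

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))

class CalibDataloader:
    batch_size = 8
    def __iter__(self):
        for _ in range(4):
            yield torch.randn(8, 16), torch.zeros(8, dtype=torch.long)

def eval_func(m):
    return 1.0  # stand-in metric; a real eval_func returns validation accuracy

conf = PostTrainingQuantConfig(
    quant_level="auto",  # assumed name of the auto-level knob
    tuning_criterion=TuningCriterion(max_trials=100),
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),  # allow <=1% accuracy drop
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader(), eval_func=eval_func)
```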
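For FP16 on ONNX Runtime, the mixed-precision API converts an FP32 model in place of a full quantization run. A sketch, assuming the MixedPrecisionConfig parameter names and the "onnxrt_cuda_ep" backend identifier; model paths are placeholders:

```python
from neural_compressor import mix_precision
from neural_compressor.config import MixedPrecisionConfig

conf = MixedPrecisionConfig(
    device="gpu",              # FP16 targets GPU execution providers
    backend="onnxrt_cuda_ep",  # assumed backend identifier
    precisions="fp16",
)
converted = mix_precision.fit("model_fp32.onnx", conf)  # placeholder input path
converted.save("model_fp16.onnx")
```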
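And for the TensorFlow-to-ONNX path, the quantized model object exports directly. An export sketch, assuming the TF2ONNXConfig API:

```python
from neural_compressor.config import TF2ONNXConfig

# `q_model` is the object returned by quantization.fit on a TensorFlow model.
q_model.export("int8_model.onnx", TF2ONNXConfig(dtype="int8"))  # "fp32" also supported
```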

Improvements

  • Remove the dependency on experimental API (commit 6e10ef)
  • Enhance the GUI diagnosis function's display of model graphs and tensor histograms (commit 9f0891)
  • Optimize memory usage for the PyTorch adaptor (commit c295a7), ONNX adaptor (commit 8cbf2e), TensorFlow adaptor (commit ad0f1e), and tuning strategy (commit c49300) to support LLMs
  • Refine ONNX Runtime QDQ quantization graph (commit c64a5b)
  • Enable ONNX model quantization with the NVIDIA GPU TensorRT execution provider (TRT EP) (commit ba42d0) (see the sketch after this list)
  • Improve code line coverage to 85%
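
A sketch of quantizing an ONNX model on an NVIDIA GPU through the TensorRT execution provider, assuming the "onnxrt_trt_ep" backend identifier; the model path and input shape are placeholders for your own model:

```python
import numpy as np
from neural_compressor import PostTrainingQuantConfig, quantization

class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        for _ in range(10):
            # Replace the shape with your model's actual input shape.
            yield np.random.rand(1, 3, 224, 224).astype(np.float32), None

conf = PostTrainingQuantConfig(device="gpu", backend="onnxrt_trt_ep")
q_model = quantization.fit("model.onnx", conf, calib_dataloader=CalibDataloader())
q_model.save("model_int8.onnx")
```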

Bug Fixes

  • Fix mix precision config setting (commit 4b71a8)
  • Fix multi-instance benchmark on Windows (commit 1f89aa)
  • Fix domain detection for large ONNX models (commit 70a566)

Examples

  • Migrate examples to the INC v2.0 API
  • Enable LLMs (e.g., GPT-NeoX, T5 Large, BLOOM-176B, OPT-30B, GPT-J-6B)
  • Enable Keras-in/Keras-out examples (commit efd737)
  • Enable multi-node training examples on CPU (e.g., RN50 distillation, QAT, pruning examples)
  • Add 15+ Hugging Face (HF) examples with the ONNX Runtime backend and upload the quantized models to HF (commit a4228d)
  • Add 2 examples for PT2ONNX model export (commit 26db4a) (see the sketch after this list)
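
A sketch of the PT2ONNX export path, assuming the v2.x Torch2ONNXConfig API; the ResNet-18 model, shapes, and names below are illustrative, not one of the two new examples:

```python
import torch
import torchvision
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.config import Torch2ONNXConfig

model = torchvision.models.resnet18(weights=None)

class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        for _ in range(4):
            yield torch.randn(1, 3, 224, 224), torch.zeros(1, dtype=torch.long)

# Quantize first, then export the quantized model to ONNX QDQ format.
q_model = quantization.fit(model, PostTrainingQuantConfig(), calib_dataloader=CalibDataloader())
q_model.export(
    "resnet18_int8.onnx",
    Torch2ONNXConfig(
        dtype="int8",
        opset_version=14,
        quant_format="QDQ",  # Q/DQ node pairs, matching the ONNX QDQ examples above
        example_inputs=torch.randn(1, 3, 224, 224),
        input_names=["input"],
        output_names=["output"],
    ),
)
```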

Documentation

  • Polish documentation: a simplified GitHub main page, an easier-to-read IO docs structure, a hands-on API migration guide, more detailed instructions for the new API, a refreshed API docs template, etc.

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.7, 3.8, 3.9, 3.10
  • TensorFlow 2.10.1, 2.11.0, 2.12.0
  • ITEX 1.0.0, 1.1.0
  • PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.0+cpu
  • ONNX Runtime 1.12.1, 1.13.1, 1.14.1
  • MXNet 1.9.1