Neural Compressor provides popular model compression techniques inherited from Intel Neural Compressor, focused on ONNX model quantization through ONNX Runtime, such as SmoothQuant and weight-only quantization. In particular, the tool provides the following key features, typical examples, and open collaborations:
- Support a wide range of Intel hardware, such as Intel Xeon Scalable Processors and AIPC
- Validate popular LLMs such as Llama2, Llama3, and Qwen2, and broad models such as BERT-base and ResNet50, from popular model hubs such as Hugging Face and the ONNX Model Zoo, by leveraging automatic accuracy-driven quantization strategies (see the sketch after this list)
- Collaborate with software platforms such as Microsoft Olive, and with the open AI ecosystem, including Hugging Face, ONNX, and ONNX Runtime
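The accuracy-driven strategies mentioned above follow an autotune-style loop that tries candidate quantization configs until an evaluation function meets the accuracy target. Below is a hedged sketch of that flow; `tuning.TuningConfig`, `tuning.autotune`, and the parameter names are assumptions based on the project's documented tuning API and may differ between versions:

```python
from onnx_neural_compressor.quantization import config, tuning

# Candidate quantization configs to search over (class and parameter
# names are assumptions; check the project's tuning documentation).
custom_tune_config = tuning.TuningConfig(config_set=[config.RTNConfig(weight_bits=4)])


def eval_fn(model) -> float:
    """Return task accuracy for a candidate model; higher is better."""
    ...


# Tries each candidate and returns the first model meeting the accuracy goal.
best_model = tuning.autotune(
    model_input="model.onnx",  # illustrative path
    tune_config=custom_tune_config,
    eval_fn=eval_fn,
)
```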
Install from source:

```bash
git clone https://github.com/onnx/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
pip install .
```
Note: further installation methods can be found in the Installation Guide.
Setting up the environment:

```bash
pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx
```
After successfully installing these packages, try your first quantization program.
Note: please install from source until the formal PyPI release is available.
The following example demonstrates weight-only quantization (WOQ) on LLMs; when multiple devices are available, the device is selected automatically for efficiency. Run the example:
```python
import onnx

from onnx_neural_compressor.quantization import matmul_nbits_quantizer

# Load the FP32 ONNX model to quantize (path is illustrative).
model = onnx.load("model.onnx")

# RTN (round-to-nearest) is the weight-only quantization algorithm used here.
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig()
quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model,
    n_bits=4,  # quantize weights to 4 bits
    block_size=32,  # group size for per-block scales
    is_symmetric=True,  # symmetric quantization around zero
    algo_config=algo_config,
)
quant.process()
best_model = quant.model
```
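To persist the result, write the quantized model back to disk. A minimal sketch, assuming `quant.model` is a plain `onnx.ModelProto` (check the returned type in your installed version):

```python
import onnx

# Write the quantized model to disk (output path is illustrative).
# Very large models may additionally need save_as_external_data=True.
onnx.save(best_model, "model-int4.onnx")
```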
The next example applies static post-training quantization, which calibrates activation ranges with a data reader:

```python
from onnx_neural_compressor.quantization import quantize, config
from onnx_neural_compressor import data_reader


class DataReader(data_reader.CalibrationDataReader):
    def __init__(self):
        self.encoded_list = []
        # Append calibration samples (dicts mapping model input names
        # to numpy arrays) to self.encoded_list.
        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        return next(self.iter_next, None)

    def rewind(self):
        self.iter_next = iter(self.encoded_list)


model = "model.onnx"  # illustrative input path
output_model_path = "model-int8.onnx"  # illustrative output path

calibration_reader = DataReader()  # renamed to avoid shadowing the data_reader module
qconfig = config.StaticQuantConfig(calibration_data_reader=calibration_reader)
quantize(model, output_model_path, qconfig)
```
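For illustration, here is one way to fill the reader with data; the input name `input_ids`, the shape, and the vocabulary range are placeholders for whatever your model actually expects:

```python
import numpy as np

from onnx_neural_compressor import data_reader


class RandomDataReader(data_reader.CalibrationDataReader):
    """Feeds a few random samples; replace with real calibration data."""

    def __init__(self, input_name="input_ids", shape=(1, 128), num_samples=8):
        # Each sample maps a model input name to a numpy array.
        self.encoded_list = [
            {input_name: np.random.randint(0, 1000, size=shape, dtype=np.int64)}
            for _ in range(num_samples)
        ]
        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        return next(self.iter_next, None)

    def rewind(self):
        self.iter_next = iter(self.encoded_list)
```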
| Overview | | |
|---|---|---|
| Architecture | Workflow | Examples |

| Feature | |
|---|---|
| Quantization | SmoothQuant |
| Weight-Only Quantization (INT8/INT4) | Layer-Wise Quantization |
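The feature table lists SmoothQuant, which smooths activation outliers before static quantization. A minimal sketch of enabling it through the static quantization config; the `extra_options` keys below mirror ONNX Runtime's quantization options and are assumptions here, so verify them against your installed version:

```python
from onnx_neural_compressor.quantization import quantize, config

# SmoothQuant settings passed through extra_options (key names are
# assumptions mirroring ONNX Runtime's quantization options).
sq_config = config.StaticQuantConfig(
    calibration_data_reader=RandomDataReader(),  # reader sketched above
    extra_options={"SmoothQuant": True, "SmoothQuantAlpha": 0.5},
)
quantize("model.onnx", "model-sq-int8.onnx", sq_config)
```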
- GitHub Issues: mainly for bug reports, new feature requests, questions, etc.
- Email: feel free to reach out with research ideas on model compression techniques or to discuss collaborations.