Intel® auto-round v0.1 Release
Overview
AutoRound introduces an innovative weight-only quantization algorithm designed specifically for low-bit LLM inference, achieving near-lossless compression for a range of popular models, including gemma-7B, Mistral-7B, Mixtral-8x7B-v0.1, Mixtral-8x7B-Instruct-v0.1, Phi-2, Llama-2, and more at W4G128. AutoRound consistently outperforms established methods across the majority of scenarios at W4G128, W4G-1, W3G128, and W2G128.
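As a quick illustration, here is a minimal sketch of quantizing a model at W4G128 (4-bit weights, group size 128) with the AutoRound Python API; the model name is only an example, and exact argument names may differ slightly in this release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "mistralai/Mistral-7B-v0.1"  # example model; any supported LLM works
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# W4G128: 4-bit weight-only quantization with a group size of 128
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./Mistral-7B-v0.1-int4")
```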
Key Features
- Wide Model Support: AutoRound caters to a diverse range of models; about 20 model families have been verified.
- Export Flexibility: Effortlessly export quantized models to the ITREX [1] and AutoGPTQ [2] formats for seamless deployment on Intel CPU and Nvidia GPU platforms, respectively (see the sketch after this list).
- Device Compatibility: Compatible with tuning devices including Intel CPUs, Intel Gaudi2, and Nvidia GPUs.
- Dataset Flexibility: AutoRound supports calibration with the pile-10k and MBPP datasets and can be easily extended to additional datasets.
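As a rough sketch of how the device, dataset, and export options fit together, the snippet below tunes on CPU with pile-10k and exports to the AutoGPTQ format; the `device`, `dataset`, and `format` keyword names are assumptions and may differ in this release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # small example model for a quick run
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    device="cpu",                  # assumed keyword: tune on an Intel CPU
    dataset="NeelNanda/pile-10k",  # assumed keyword: calibration dataset
)
autoround.quantize()

# Export for Nvidia GPU inference; the "auto_gptq" format string is an assumption
autoround.save_quantized("./opt-125m-int4-gptq", format="auto_gptq")
```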
Examples
- Explore language modeling and code generation examples to unlock the full potential of AutoRound.
Additional Benefits
- Pre-quantized Models: Access a variety of pre-quantized models on Hugging Face for immediate integration into your projects (see the loading sketch after this list), with more models under review and coming soon.
- Comprehensive Accuracy Data: Extensive accuracy data is provided to simplify deployment decisions.
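For pre-quantized checkpoints exported in the AutoGPTQ format, loading typically goes through Transformers (with auto-gptq and optimum installed); the repository id below is a hypothetical placeholder, not a real release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id for a W4G128 AutoRound checkpoint in AutoGPTQ format;
# substitute the id of an actual pre-quantized model from Hugging Face.
repo_id = "your-org/your-model-int4-autoround"

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```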
Known Issues
- baichuan-inc/Baichuan2-13B-Chat has known issues; we will support it soon.
References
[1] https://github.com/intel/intel-extension-for-transformers
[2] https://github.com/AutoGPTQ/AutoGPTQ