Commit

…nto jianyu_3.0_onlinedoc
arthw committed Jul 19, 2024
2 parents 74fdd2b + 437c8e7 commit 64b7e4b
Showing 58 changed files with 450 additions and 7,022 deletions.
4 changes: 0 additions & 4 deletions .azure-pipelines/scripts/install_nc.sh
@@ -10,10 +10,6 @@ elif [[ $1 = *"3x_tf"* ]]; then
    python -m pip install --no-cache-dir -r requirements_tf.txt
    python setup.py tf bdist_wheel
    pip install dist/neural_compressor*.whl --force-reinstall
elif [[ $1 = *"3x_ort" ]]; then
    python -m pip install --no-cache-dir -r requirements_ort.txt
    python setup.py ort bdist_wheel
    pip install dist/neural_compressor*.whl --force-reinstall
else
    python -m pip install --no-cache-dir -r requirements.txt
    python setup.py bdist_wheel
15 changes: 0 additions & 15 deletions .azure-pipelines/scripts/ut/3x/coverage.3x_ort

This file was deleted.

35 changes: 0 additions & 35 deletions .azure-pipelines/scripts/ut/3x/run_3x_ort.sh

This file was deleted.

109 changes: 0 additions & 109 deletions .azure-pipelines/ut-3x-ort.yml

This file was deleted.

13 changes: 0 additions & 13 deletions .github/checkgroup.yml
@@ -140,16 +140,3 @@ subprojects:
- "UT-3x-Torch (Coverage Compare CollectDatafiles)"
- "UT-3x-Torch (Unit Test 3x Torch Unit Test 3x Torch)"
- "UT-3x-Torch (Unit Test 3x Torch baseline Unit Test 3x Torch baseline)"

- id: "Unit Tests 3x-ONNXRT workflow"
paths:
- "neural_compressor/common/**"
- "neural_compressor/onnxrt/**"
- "test/3x/onnxrt/**"
- "setup.py"
- "requirements_ort.txt"
checks:
- "UT-3x-ONNXRT"
- "UT-3x-ONNXRT (Coverage Compare CollectDatafiles)"
- "UT-3x-ONNXRT (Unit Test 3x ONNXRT Unit Test 3x ONNXRT)"
- "UT-3x-ONNXRT (Unit Test 3x ONNXRT baseline Unit Test 3x ONNXRT baseline)"
11 changes: 8 additions & 3 deletions README.md
@@ -19,20 +19,25 @@ Intel® Neural Compressor aims to provide popular model compression techniques s
as well as Intel extensions such as [Intel Extension for TensorFlow](https://github.com/intel/intel-extension-for-tensorflow) and [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).
In particular, the tool provides the key features, typical examples, and open collaborations listed below:

* Support a wide range of Intel hardware such as [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html), [Intel Xeon CPU Max Series](https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html), [Intel Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html), and [Intel Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) with extensive testing; support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testing
* Support a wide range of Intel hardware such as [Intel Gaudi AI Accelerators](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html), [Intel Core Ultra Processors](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html), [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html), [Intel Xeon CPU Max Series](https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html), [Intel Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html), and [Intel Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) with extensive testing; support AMD CPU, ARM CPU, and NVIDIA GPU through ONNX Runtime with limited testing; support NVIDIA GPU for some WOQ algorithms like AutoRound and HQQ.

* Validate popular LLMs such as [LLama2](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Falcon](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [GPT-J](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Bloom](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [OPT](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), and more than 10,000 broad models such as [Stable Diffusion](/examples/pytorch/nlp/huggingface_models/text-to-image/quantization), [BERT-Large](/examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx), and [ResNet50](/examples/pytorch/image_recognition/torchvision_models/quantization/ptq/cpu/fx) from popular model hubs such as [Hugging Face](https://huggingface.co/), [Torch Vision](https://pytorch.org/vision/stable/index.html), and [ONNX Model Zoo](https://github.com/onnx/models#models), with automatic [accuracy-driven](/docs/source/design.md#workflow) quantization strategies

* Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst)

## What's New
* [2024/03] A new SOTA approach [AutoRound](https://github.com/intel/auto-round) Weight-Only Quantization on [Intel Gaudi2 AI accelerator](https://habana.ai/products/gaudi2/) is available for LLMs.
* [2024/07] From the 3.0 release, the framework extension API is recommended for quantization.
* [2024/07] Performance optimizations and usability improvements on the [client side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).

## Installation

### Install from PyPI
```Shell
pip install neural-compressor
# Install 2.X API + Framework extension API + PyTorch dependency
pip install neural-compressor[pt]
# Install 2.X API + Framework extension API + TensorFlow dependency
pip install neural-compressor[tf]
```
> **Note**:
> Further installation methods can be found in the [Installation Guide](https://github.com/intel/neural-compressor/blob/master/docs/source/installation_guide.md). Check out our [FAQ](https://github.com/intel/neural-compressor/blob/master/docs/source/faq.md) for more details.
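
As a quick sanity check after installation, the package can be imported and its version printed (a minimal sketch; it assumes the installed wheel exposes `__version__`, as released packages typically do):

```python
# Confirm the installation is importable and report the installed version.
import neural_compressor

print(neural_compressor.__version__)
```
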
50 changes: 50 additions & 0 deletions docs/3x/client_quant.md
@@ -0,0 +1,50 @@
Quantization on Client
==========================================

1. [Introduction](#introduction)
2. [Get Started](#get-started) \
2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\
2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage)


## Introduction

For the `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.


## Get Started

### Get Default Algorithm Configuration

Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.

```python
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)
quant_config = get_default_rtn_config()
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```

> [!TIP]
> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling `get_default_rtn_config`.
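
For example, a minimal sketch that explicitly requests the client-tuned configuration rather than relying on automatic detection:

```python
from neural_compressor.torch.quantization import get_default_rtn_config

# Explicitly request the lightweight configuration tailored for client devices.
client_config = get_default_rtn_config(processor_type="client")
```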

For Windows machines, run the following command to utilize all available cores automatically:

```bash
python main.py
```

> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.

### Optimal Performance and Peak Memory Usage

Below are approximate performance and memory usage figures measured on a client machine with 24 cores and 32 GB of RAM. These figures provide a rough estimate for quick reference and may vary based on specific hardware and configurations.

- 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB.
- 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)): the quantization process takes about 20 seconds, with a peak memory usage of around 5GB.
6 changes: 6 additions & 0 deletions docs/source/3x/PT_WeightOnlyQuant.md
@@ -15,6 +15,7 @@ PyTorch Weight Only Quantization
- [HQQ](#hqq)
- [Specify Quantization Rules](#specify-quantization-rules)
- [Saving and Loading](#saving-and-loading)
- [Efficient Usage on Client-Side](#efficient-usage-on-client-side)
- [Examples](#examples)

## Introduction
@@ -276,6 +277,11 @@ loaded_model = load(
) # Please note that the original_model parameter passes the original model.
```

## Efficient Usage on Client-Side

For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).


## Examples

Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only) on how to quantize a model with WeightOnlyQuant.
9 changes: 8 additions & 1 deletion docs/source/3x/TensorFlow.md
@@ -23,14 +23,15 @@ Intel(R) Neural Compressor provides `quantize_model` and `autotune` as main inte

**quantize_model**

The design philosophy of the `quantize_model` interface is easy-of-use. With minimal parameters requirement, including `model`, `quant_config`, `calib_dataloader` and `calib_iteration`, it offers a straightforward choice of quantizing TF model in one-shot.
The design philosophy of the `quantize_model` interface is ease of use. Requiring only a minimal set of parameters (`model`, `quant_config`, `calib_dataloader`, `calib_iteration`), it offers a straightforward way to quantize a TF model in one shot.

```python
def quantize_model(
    model: Union[str, tf.keras.Model, BaseModel],
    quant_config: Union[BaseConfig, list],
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
    calib_func: Callable = None,
):
```
`model` should be a string of the model's location, a Keras model object, or an INC TF model wrapper object.
@@ -41,6 +42,9 @@ def quantize_model(

`calib_iteration` is used to decide how many iterations the calibration process will run.

`calib_func` is a substitute for `calib_dataloader` when INC's built-in calibration function does not work for the model's inference.
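
For illustration, here is a rough sketch that supplies a custom calibration routine instead of a dataloader. The exact contract of `calib_func` is an assumption in this sketch (it is treated as a callable that receives the model and runs a few inference passes), and the model path is hypothetical:

```python
import tensorflow as tf

from neural_compressor.tensorflow import StaticQuantConfig, quantize_model


def calib_func(model):
    # Assumed contract: run a handful of forward passes on representative
    # inputs so that calibration statistics can be collected.
    for _ in range(10):
        model(tf.random.uniform((1, 224, 224, 3)))


quant_config = StaticQuantConfig()
qmodel = quantize_model(
    model="/path/to/saved_model",  # hypothetical model location
    quant_config=quant_config,
    calib_func=calib_func,
)
```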


Here is a simple example of using the `quantize_model` interface with a dummy calibration dataloader and the default `StaticQuantConfig`:
```python
from neural_compressor.tensorflow import StaticQuantConfig, quantize_model
@@ -68,6 +72,7 @@ def autotune(
    eval_args: Optional[Tuple[Any]] = None,
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
    calib_func: Callable = None,
) -> Optional[BaseModel]:
```
`model` should be a string of the model's location, a Keras model object, or an INC TF model wrapper object.
@@ -82,6 +87,8 @@

`calib_iteration` is used to decide how many iterations the calibration process will run.

`calib_func` is a substitute for `calib_dataloader` when INC's built-in calibration function does not work for the model's inference.

Here is a simple example of using the `autotune` interface with different quantization rules defined by a list of `StaticQuantConfig`:
```python
from neural_compressor.common.base_tuning import TuningConfig