diff --git a/README.md b/README.md
index ad97ad451fc..4aac08639b2 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testi
 ## What's New
 * [2024/07] From 3.0 release, framework extension API is recommended to be used for quantization.
-* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).
 ## Installation
diff --git a/docs/source/3x/PT_WeightOnlyQuant.md b/docs/source/3x/PT_WeightOnlyQuant.md
index 65d9367e9fa..1b7a6e760ff 100644
--- a/docs/source/3x/PT_WeightOnlyQuant.md
+++ b/docs/source/3x/PT_WeightOnlyQuant.md
@@ -15,6 +15,7 @@ PyTorch Weight Only Quantization
 - [HQQ](#hqq)
 - [Specify Quantization Rules](#specify-quantization-rules)
 - [Saving and Loading](#saving-and-loading)
+- [Layer Wise Quantization](#layer-wise-quantization)
 - [Efficient Usage on Client-Side](#efficient-usage-on-client-side)
 - [Examples](#examples)
@@ -277,9 +278,33 @@ loaded_model = load(
 )  # Please note that the original_model parameter passes the original model.
 ```
+## Layer Wise Quantization
+
+As the size of LLMs continues to grow, loading the entire model into a single GPU card or the RAM of a client machine becomes impractical. To address this challenge, we introduce Layer-wise Quantization (LWQ), a method that quantizes LLMs layer by layer or block by block. This approach significantly reduces memory consumption. The diagram below illustrates the LWQ process.
+
+![Layer-wise quantization process](./imgs/lwq.png)
+
+*Figure 1: The process of layer-wise quantization for a PyTorch model. Grey indicates empty parameters, blue indicates parameters that need to be quantized, and each rectangle inside the model represents one layer.*
+
+Currently, we support LWQ for `RTN`, `AutoRound`, and `GPTQ`.
+
+Here, we take the `RTN` algorithm as an example to demonstrate the usage of LWQ.
+
+```python
+from neural_compressor.torch.quantization import RTNConfig, convert, prepare
+from neural_compressor.torch import load_empty_model
+
+model_state_dict_path = "/path/to/model/state/dict"
+float_model = load_empty_model(model_state_dict_path)
+quant_config = RTNConfig(use_layer_wise=True)
+prepared_model = prepare(float_model, quant_config)
+quantized_model = convert(prepared_model)
+```
+
 ## Efficient Usage on Client-Side
 
-For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).
 
 ## Examples
diff --git a/docs/3x/client_quant.md b/docs/source/3x/client_quant.md
similarity index 51%
rename from docs/3x/client_quant.md
rename to docs/source/3x/client_quant.md
index 181834caf23..9921560c798 100644
--- a/docs/3x/client_quant.md
+++ b/docs/source/3x/client_quant.md
@@ -2,20 +2,15 @@ Quantization on Client
 ==========================================
 1. [Introduction](#introduction)
-2. [Get Started](#get-started) \
- 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\
- 2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage)
-
+2. [Get Started](#get-started)
 
 ## Introduction
 
-For `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `sever`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
+For the `RTN` and `GPTQ` algorithms, we provide default configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
 
 ## Get Started
 
-### Get Default Algorithm Configuration
-
 Here, we take the `RTN` algorithm as example to demonstrate the usage on a client machine.
 
 ```python
@@ -42,9 +37,4 @@ python main.py
 > [!TIP]
 > For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the `OMP_NUM_THREADS` explicitly. For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
 
-### Optimal Performance and Peak Memory Usage
-
-Below are approximate performance and memory usage figures conducted on a client machine with 24 cores and 32GB of RAM. These figures provide a rough estimate for quick reference and may vary based on specific hardware and configurations.
-
-- 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB.
-- 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB.
+RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). However, for higher accuracy, the GPTQ algorithm is recommended, though it requires a longer quantization time.
diff --git a/docs/source/3x/imgs/lwq.png b/docs/source/3x/imgs/lwq.png
new file mode 100644
index 00000000000..b2e75bc5d8e
Binary files /dev/null and b/docs/source/3x/imgs/lwq.png differ
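
The updated `client_quant.md` text recommends `GPTQ` when higher accuracy justifies a longer quantization time, but the patch itself only shows the `RTN` flow. Below is a minimal, non-authoritative sketch of how that recommendation could be followed with the same 3.x `prepare`/`convert` API used elsewhere in this diff; the `run_fn` calibration helper is a user-supplied placeholder, and passing `use_layer_wise=True` to `GPTQConfig` is an assumption based on the layer-wise section above rather than something this patch documents.

```python
from neural_compressor.torch.quantization import GPTQConfig, convert, prepare
from neural_compressor.torch import load_empty_model


def run_fn(model):
    """Placeholder calibration helper: forward a few representative samples
    through the prepared model so GPTQ can collect per-layer statistics."""
    ...


model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# Layer-wise mode mirrors the RTN example in PT_WeightOnlyQuant.md and keeps
# peak memory low on client machines; GPTQ trades a longer quantization time
# for better accuracy than RTN.
quant_config = GPTQConfig(use_layer_wise=True)

prepared_model = prepare(float_model, quant_config)
run_fn(prepared_model)  # calibration pass required by GPTQ
quantized_model = convert(prepared_model)
```

As with the RTN snippet, the quantized model can then be saved and reloaded through the documented `save`/`load` helpers; only the config object and the extra calibration step differ.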