Update doc for client-usage and LWQ #1947

Merged: 7 commits, Jul 24, 2024
README.md: 2 changes (1 addition, 1 deletion)
@@ -28,7 +28,7 @@ support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testing

## What's New
* [2024/07] From 3.0 release, framework extension API is recommended to be used for quantization.
- * [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+ * [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).

## Installation

docs/source/3x/PT_WeightOnlyQuant.md: 27 changes (26 additions, 1 deletion)
@@ -15,6 +15,7 @@ PyTorch Weight Only Quantization
- [HQQ](#hqq)
- [Specify Quantization Rules](#specify-quantization-rules)
- [Saving and Loading](#saving-and-loading)
- [Layer Wise Quantization](#layer-wise-quantization)
- [Efficient Usage on Client-Side](#efficient-usage-on-client-side)
- [Examples](#examples)

@@ -277,9 +278,33 @@ loaded_model = load(
) # Please note that the original_model parameter passes the original model.
```

## Layer Wise Quantization

As the size of LLMs continues to grow, loading the entire model into a single GPU card or the RAM of a client machine becomes impractical. To address this challenge, we introduce Layer-wise Quantization (LWQ), a method that quantizes LLMs layer by layer or block by block. This approach significantly reduces memory consumption. The diagram below illustrates the LWQ process.

<img src="./imgs/lwq.png" width=780 height=429>

*Figure 1: The process of layer-wise quantization for a PyTorch model. Grey indicates empty parameters, blue indicates parameters that need to be quantized, and each rectangle inside the model represents one layer.*


Currently, we support LWQ for `RTN`, `AutoRound`, and `GPTQ`.

Here, we take the `RTN` algorithm as an example to demonstrate the usage of LWQ.

```python
from neural_compressor.torch.quantization import RTNConfig, convert, prepare
from neural_compressor.torch import load_empty_model

model_state_dict_path = "/path/to/model/state/dict"

# Load an empty model shell; weights are loaded one layer (or block) at a time
# during quantization instead of all at once.
float_model = load_empty_model(model_state_dict_path)

# Enable layer-wise quantization for RTN.
quant_config = RTNConfig(use_layer_wise=True)

prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```
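The flow for the other supported algorithms is the same; only the configuration object changes, and calibration-based methods additionally need calibration data. Below is a minimal sketch for layer-wise `GPTQ`, assuming `GPTQConfig` accepts the same `use_layer_wise` flag and that calibration is performed by running a few batches through the prepared model; `calibration_batches` is an illustrative placeholder, not part of the original document.

```python
from neural_compressor.torch.quantization import GPTQConfig, convert, prepare
from neural_compressor.torch import load_empty_model

model_state_dict_path = "/path/to/model/state/dict"

# Same empty-model entry point as in the RTN example above.
float_model = load_empty_model(model_state_dict_path)

# Assumption: GPTQConfig exposes use_layer_wise analogously to RTNConfig.
quant_config = GPTQConfig(use_layer_wise=True)
prepared_model = prepare(float_model, quant_config)

# GPTQ needs calibration: feed a few representative batches through the
# prepared model before conversion.
calibration_batches = []  # placeholder: fill with real tokenized input batches
for batch in calibration_batches:
    prepared_model(batch)

quantized_model = convert(prepared_model)
```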

## Efficient Usage on Client-Side

- For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+ For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).
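
As a quick orientation before following that link, here is a minimal sketch of how a client-tailored default configuration could be obtained; `get_default_rtn_config` and its `processor_type` argument are assumptions, not something stated in this section.

```python
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

# Assumption: get_default_rtn_config returns the lightweight, client-tailored
# RTN configuration when asked for the "client" processor type.
quant_config = get_default_rtn_config(processor_type="client")

float_model = load_empty_model("/path/to/model/state/dict")
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```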


## Examples
Expand Down
docs/3x/client_quant.md → docs/source/3x/client_quant.md: 16 changes (3 additions, 13 deletions)
@@ -2,20 +2,15 @@ Quantization on Client
==========================================

1. [Introduction](#introduction)
- 2. [Get Started](#get-started) \
- 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\
- 2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage)
+ 2. [Get Started](#get-started)

## Introduction

- For `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `sever`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
+ For the `RTN` and `GPTQ` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.


## Get Started

- ### Get Default Algorithm Configuration

Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.

```python
# ... (unchanged example code, collapsed in the diff view)
```

@@ -42,9 +37,4 @@ python main.py
> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the `OMP_NUM_THREADS` explicitly. For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.

- ### Optimal Performance and Peak Memory Usage
-
- Below are approximate performance and memory usage figures conducted on a client machine with 24 cores and 32GB of RAM. These figures provide a rough estimate for quick reference and may vary based on specific hardware and configurations.
-
- - 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB.
- - 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB.
+ RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). However, for higher accuracy, the GPTQ algorithm is recommended, but be prepared for a longer quantization time.
Binary file added docs/source/3x/imgs/lwq.png