Commit

Update woq tuning document (#1347)
* update woq docs

Signed-off-by: Kaihui-intel <[email protected]>

* fix spell

Signed-off-by: Kaihui-intel <[email protected]>

* add onnxrt docs and tuning summary

Signed-off-by: Kaihui-intel <[email protected]>

* weight only -> weight-only

Signed-off-by: Kaihui-intel <[email protected]>

* update woq tune doc

Signed-off-by: yuwenzho <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

Signed-off-by: yuwenzho <[email protected]>

---------

Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: yuwenzho <[email protected]>
Co-authored-by: yuwenzho <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Oct 26, 2023
1 parent 6187c2e commit fe16829
Showing 2 changed files with 32 additions and 0 deletions.
30 changes: 30 additions & 0 deletions docs/source/quantization_weight_only.md
@@ -129,6 +129,36 @@ torch.save(compressed_model.state_dict(), "compressed_model.pt")

The saved_results folder contains two files: `best_model.pt` and `qconfig.json`, and the generated q_model is a fake quantized model.


### **WOQ algorithms tuning**

To find the best algorithm, users can simply omit the algorithm setting. Instead of applying one specified algorithm, the tuning process traverses a set of pre-defined WOQ configurations and identifies the one that yields the best result. For detailed usage, please refer to the [tuning strategy](./tuning_strategies.md#Basic).

> **Note:** Currently, this behavior is specific to the `ONNX Runtime` backend.

**Pre-defined configurations**

| WOQ configuration | Setting |
|:------------------:|:-------:|
|RTN_G32ASYM| {"algorithm": "RTN", "group_size": 32, "scheme": "asym"}|
|GPTQ_G32ASYM| {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"}|
|GPTQ_G32ASYM_DISABLE_LAST_MATMUL| {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"} <br> & disable last MatMul|
|GPTQ_G128ASYM| {"algorithm": "GPTQ", "group_size": 128, "scheme": "asym"}|
|AWQ_G32ASYM| {"algorithm": "AWQ", "group_size": 32, "scheme": "asym"}|
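
Each entry in the table maps to an explicit weight-only setting. As a rough sketch (an illustration only, not taken from this document), the `RTN_G32ASYM` entry corresponds approximately to configuring the algorithm manually through `op_type_dict`:

```python
from neural_compressor import PostTrainingQuantConfig

# Sketch only: an explicit configuration roughly matching the RTN_G32ASYM entry.
# When an algorithm is pinned this way, no WOQ configuration tuning takes place.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all op types
            "weight": {
                "bits": 4,  # bit width is not listed in the table; 4 is a common WOQ choice
                "group_size": 32,  # per-group quantization granularity
                "scheme": "asym",  # asymmetric quantization
                "algorithm": "RTN",  # round-to-nearest
            },
        },
    },
)
```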

**User code example**

```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    quant_level="auto",  # quant_level supports "auto" or 1 for WOQ config tuning
)
q_model = quantization.fit(model, conf, eval_func=eval_func, calib_dataloader=dataloader)
q_model.save("saved_results")
```
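
Because the tuning loop ranks candidate configurations by the score that `eval_func` returns, the function must accept a model and return a single scalar metric. A minimal sketch, assuming a PyTorch-style model and a hypothetical `eval_dataloader` (neither is defined in this document):

```python
import torch


def eval_func(model):
    # Hypothetical evaluation function: returns a scalar accuracy used to
    # rank the pre-defined WOQ configurations during tuning.
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in eval_dataloader:  # placeholder dataloader
            logits = model(inputs)
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.numel()
    return correct / total
```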

Refer to this [link](../../examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only) for an example of WOQ algorithm tuning on ONNX Llama models.

## Layer Wise Quantization

Large language models (LLMs) have shown exceptional performance across various tasks; meanwhile, their substantial parameter sizes pose significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint required to quantize LLMs, usually by 80-90%, which means that users can quantize LLMs even on a single node using a GPU or CPU. It allows quantization on memory-constrained devices, making quantization of huge LLMs possible.
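
A minimal sketch of enabling LWQ together with weight-only quantization, assuming the `layer_wise_quant` recipe key (verify the exact key against the installed version and the LWQ documentation):

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Sketch only: enable layer-wise quantization so layers are loaded and
# quantized one at a time, keeping peak memory usage low.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={"layer_wise_quant": True},  # assumed recipe key; check the LWQ docs
)
q_model = quantization.fit(model, conf, calib_dataloader=dataloader)
```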
2 changes: 2 additions & 0 deletions docs/source/tuning_strategies.md
@@ -181,6 +181,8 @@ flowchart TD
> For [smooth quantization](./smooth_quant.md), users can tune the smooth quantization alpha by providing a list of scalars for the `alpha` item. The tuning process will take place at the **start stage** of the tuning procedure. For detailed usage, please refer to the [smooth quantization example](./smooth_quant.md#Example).
> For [weight-only quantization](./quantization_weight_only.md), users can tune the weight-only algorithms from the available [pre-defined configurations](./quantization_weight_only.md#woq-algorithms-tuning). The tuning process will take place at the **start stage** of the tuning procedure, preceding the smooth quantization alpha tuning. For detailed usage, please refer to the [weight-only quantization example](./quantization_weight_only.md#woq-algorithms-tuning).
*Please note that this behavior is specific to the `ONNX Runtime` backend.*

**1.** Default quantization

