Commit

Update woq tuning document (#1347)
* update woq docs

Signed-off-by: Kaihui-intel <[email protected]>

* fix spell

Signed-off-by: Kaihui-intel <[email protected]>

* add onnxrt docs and tuning summary

Signed-off-by: Kaihui-intel <[email protected]>

* weight only -> weight-only

Signed-off-by: Kaihui-intel <[email protected]>

* update woq tune doc

Signed-off-by: yuwenzho <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

Signed-off-by: yuwenzho <[email protected]>

---------

Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: yuwenzho <[email protected]>
Co-authored-by: yuwenzho <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Oct 26, 2023
1 parent 6187c2e commit fe16829
Showing 2 changed files with 32 additions and 0 deletions.
30 changes: 30 additions & 0 deletions docs/source/quantization_weight_only.md
@@ -129,6 +129,36 @@ torch.save(compressed_model.state_dict(), "compressed_model.pt")

The saved_results folder contains two files: `best_model.pt` and `qconfig.json`, and the generated q_model is a fake quantized model.


### **WOQ algorithms tuning**

To find the best algorithm, users can simply omit the algorithm setting. Instead of applying one specified algorithm, the tuning process traverses a set of pre-defined WOQ configurations and identifies the one that yields the best result. For detailed usage, please refer to the [tuning strategy](./tuning_strategies.md#Basic).

> **Note:** Currently, this behavior is specific to the `ONNX Runtime` backend.

**Pre-defined configurations**

| WOQ configuration | Setting |
|:------------------:|:-------:|
|RTN_G32ASYM| {"algorithm": "RTN", "group_size": 32, "scheme": "asym"}|
|GPTQ_G32ASYM| {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"}|
|GPTQ_G32ASYM_DISABLE_LAST_MATMUL| {"algorithm": "GPTQ", "group_size": 32, "scheme": "asym"} <br> & disable last MatMul|
|GPTQ_G128ASYM| {"algorithm": "GPTQ", "group_size": 128, "scheme": "asym"}|
|AWQ_G32ASYM| {"algorithm": "AWQ", "group_size": 32, "scheme": "asym"}|
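
Each entry in the table maps to an explicit weight-only setting. As a rough sketch (an illustration only, not taken from this document), the `RTN_G32ASYM` entry corresponds approximately to configuring the algorithm manually through `op_type_dict`:

```python
from neural_compressor import PostTrainingQuantConfig

# Sketch only: an explicit configuration roughly matching the RTN_G32ASYM entry.
# When an algorithm is pinned this way, no WOQ configuration tuning takes place.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all op types
            "weight": {
                "bits": 4,  # bit width is not listed in the table; 4 is a common WOQ choice
                "group_size": 32,  # per-group quantization granularity
                "scheme": "asym",  # asymmetric quantization
                "algorithm": "RTN",  # round-to-nearest
            },
        },
    },
)
```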

**User code example**

```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    quant_level="auto",  # quant_level supports "auto" or 1 for WOQ config tuning
)
q_model = quantization.fit(model, conf, eval_func=eval_func, calib_dataloader=dataloader)
q_model.save("saved_results")
```
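
Because the tuning loop ranks candidate configurations by the score that `eval_func` returns, the function must accept a model and return a single scalar metric. A minimal sketch, assuming a PyTorch-style model and a hypothetical `eval_dataloader` (neither is defined in this document):

```python
import torch


def eval_func(model):
    # Hypothetical evaluation function: returns a scalar accuracy used to
    # rank the pre-defined WOQ configurations during tuning.
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in eval_dataloader:  # placeholder dataloader
            logits = model(inputs)
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.numel()
    return correct / total
```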

Refer to this [link](../../examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only) for an example of WOQ algorithm tuning on ONNX Llama models.

## Layer Wise Quantization

Large language models (LLMs) have shown exceptional performance across various tasks; meanwhile, their substantial parameter sizes pose significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint required to quantize LLMs, usually by 80-90%, which means that users can quantize LLMs even on a single node using a GPU or CPU. It allows quantization on memory-constrained devices, making quantization of huge LLMs possible.
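
A minimal sketch of enabling LWQ together with weight-only quantization, assuming the `layer_wise_quant` recipe key (verify the exact key against the installed version and the LWQ documentation):

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Sketch only: enable layer-wise quantization so layers are loaded and
# quantized one at a time, keeping peak memory usage low.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={"layer_wise_quant": True},  # assumed recipe key; check the LWQ docs
)
q_model = quantization.fit(model, conf, calib_dataloader=dataloader)
```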
2 changes: 2 additions & 0 deletions docs/source/tuning_strategies.md
@@ -181,6 +181,8 @@ flowchart TD
> For [smooth quantization](./smooth_quant.md), users can tune the smooth quantization alpha by providing a list of scalars for the `alpha` item. The tuning process will take place at the **start stage** of the tuning procedure. For detailed usage, please refer to the [smooth quantization example](./smooth_quant.md#Example).
> For [weight-only quantization](./quantization_weight_only.md), users can tune the weight-only algorithms from the available [pre-defined configurations](./quantization_weight_only.md#woq-algorithms-tuning). The tuning process will take place at the **start stage** of the tuning procedure, preceding the smooth quantization alpha tuning. For detailed usage, please refer to the [weight-only quantization example](./quantization_weight_only.md#woq-algorithms-tuning).
*Please note that this behavior is specific to the `ONNX Runtime` backend.*

**1.** Default quantization

