Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyTorch 3x document #1824

Merged
merged 44 commits into from
Jun 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
a655138
draft for document
xin3he May 28, 2024
ebc260b
Update PT_MXQuant.md
mengniwang95 May 31, 2024
004af16
Update PT_MXQuant.md
mengniwang95 May 31, 2024
b2056de
add doc for sq
violetch24 May 31, 2024
c73145e
update PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
ab0dde4
update PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
36282b3
add static quant doc
violetch24 May 31, 2024
fd4aeed
minor fix
violetch24 May 31, 2024
dc68e5d
upload PT_WeightOnlyQuant.md
Kaihui-intel May 31, 2024
1a0cfad
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2024
bdecc7c
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel May 31, 2024
ed055b3
update Torch.md
xin3he May 31, 2024
c62c6f0
add quantization.md
xin3he May 31, 2024
7eb0574
add mix-precision docs
zehao-intel May 31, 2024
884c100
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2024
14d0c54
Merge remote-tracking branch 'origin/master' into pt_doc
xin3he Jun 3, 2024
ea6849c
merge TF docs and move folder
xin3he Jun 3, 2024
1cc9afd
docs refine
violetch24 Jun 3, 2024
fd649bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
bb2111f
Add pt dynamic quant (#1833)
yiliu30 Jun 3, 2024
213c60f
add save
xin3he Jun 3, 2024
c51a39c
remove style
xin3he Jun 3, 2024
0b0ee32
correct filename
yiliu30 Jun 3, 2024
036838a
remove file
yiliu30 Jun 3, 2024
990b576
update per review
xin3he Jun 3, 2024
c8bdcca
update PT_WeightOnlyQuant.md
Kaihui-intel Jun 3, 2024
4d0e9d8
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel Jun 3, 2024
6a50b93
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
b8d0d28
add pt2e static
yiliu30 Jun 3, 2024
68e436c
fix for comment
violetch24 Jun 3, 2024
f7836dd
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
violetch24 Jun 3, 2024
b869599
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
86f9957
update PT_WeightOnlyQuant.md
Kaihui-intel Jun 3, 2024
191f191
Merge branch 'pt_doc' of https://github.com/intel/neural-compressor i…
Kaihui-intel Jun 3, 2024
633e522
update PT_WeightOnlyQuant.md layer-wise
Kaihui-intel Jun 3, 2024
b518efe
Update PT_MXQuant.md
mengniwang95 Jun 3, 2024
a66456f
dix typo
yiliu30 Jun 3, 2024
e581516
update imgs for 3x pt
xin3he Jun 4, 2024
a7429f4
update per review
xin3he Jun 4, 2024
bdca2be
Update PT_MXQuant.md
mengniwang95 Jun 4, 2024
dac08cf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 4, 2024
28b3fdc
Merge branch 'master' into pt_doc
xin3he Jun 4, 2024
006879c
refine mix-precision
zehao-intel Jun 4, 2024
40ddaed
add note for ipex
yiliu30 Jun 4, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added .coverage
Binary file not shown.
42 changes: 42 additions & 0 deletions docs/3x/PT_DynamicQuant.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
Dynamic Quantization
===============

1. [Introduction](#introduction)
2. [Getting Started with Dynamic Quantization](#Getting-Started-with-Dynamic-Quantization)
3. [Examples](#examples)


## Introduction
Quantization is the process of converting floating point weights and activations to lower bitwidth tensors by multiplying the floating point values by a scale factor and rounding the results to whole numbers. Dynamic quantization determines the scale factor for activations dynamically based on the data range observed at runtime. We support W8A8 (quantizing weights and activations into 8 bits) dynamic quantization by leveraging torch's [`X86InductorQuantizer`](https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html?highlight=x86inductorquantizer).


## Getting Started with Dynamic Quantization
There are four steps to perform W8A8 dynamic quantization: `export`, `prepare`, `convert` and `compile`.

```python
import torch
from neural_compressor.torch.export import export
from neural_compressor.torch.quantization import DynamicQuantConfig, prepare, convert

# Prepare the float model and example inputs for export model
model = UserFloatModel()
example_inputs = ...

# Export eager model into FX graph model
exported_model = export(model=model, example_inputs=example_inputs)
# Quantize the model
quant_config = DynamicQuantConfig()
prepared_model = prepare(exported_model, quant_config=quant_config)
q_model = convert(prepared_model)
# Compile the quantized model and replace the Q/DQ pattern with Q-operator
from torch._inductor import config

config.freezing = True
opt_model = torch.compile(q_model)
```

> Note: The `set_local` of `DynamicQuantConfig` will be supported after the torch 2.4 release.


## Examples
Example will be added later.
107 changes: 107 additions & 0 deletions docs/3x/PT_MXQuant.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
Microscaling Quantization
===============

1. [Introduction](#introduction)
2. [Get Started with Microscaling Quantization API](#get-start-with-microscaling-quantization-api)
3. [Examples](#examples)
4. [Reference](#reference)

## Introduction

Numerous breakthroughs have emerged across various fields, such as text analysis, language translation and chatbot technologies, fueled by the development of large language models (LLMs). Nevertheless, their increasing power comes with the challenge of explosive growth in parameters, posing obstacles for practical use. To balance memory limits and accuracy preservation for AI models, the Microscaling (MX) specification was promoted from the well-known Microsoft Floating Point (MSFP) data type [1, 2]:

<table>
<tr>
<th>Format Name</th>
<th>Element Data type</th>
<th>Element Bits</th>
<th>Scaling Block Size</th>
<th>Scale Data Type</th>
<th>Scale Bits</th>
</tr>
<tr>
<td rowspan="2">MXFP8</td>
<td>FP8 (E5M2)</td>
<td rowspan="2">8</td>
<td rowspan="2">32</td>
<td rowspan="2">E8M0</td>
<td rowspan="2">8</td>
</tr>
<tr>
<td>FP8 (E4M3)</td>
</tr>
<tr>
<td rowspan="2">MXFP6</td>
<td>FP6 (E3M2)</td>
<td rowspan="2">6</td>
<td rowspan="2">32</td>
<td rowspan="2">E8M0</td>
<td rowspan="2">8</td>
</tr>
<tr>
<td>FP6 (E2M3)</td>
</tr>
<tr>
<td>MXFP4</td>
<td>FP4 (E2M1)</td>
<td>4</td>
<td>32</td>
<td>E8M0</td>
<td>8</td>
</tr>
<tr>
<td>MXINT8</td>
<td>INT8</td>
<td>8</td>
<td>32</td>
<td>E8M0</td>
<td>8</td>
</tr>
</table>


At an equivalent accuracy level, the MX data type demonstrates the ability to occupy a smaller area and incur lower energy costs for multiply-accumulate compared to other conventional data types on the same silicon [1].

Neural Compressor seamlessly applies the MX data type to post-training quantization, offering meticulously crafted recipes to empower users to quantize LLMs without sacrificing accuracy. The workflow is shown as below.

<a target="_blank" href="./imgs/mx_workflow.png" text-align:left>
<left>
<img src="./imgs/mx_workflow.png" alt="Workflow of MX Quant (source [3])" height=120>
</left>
</a>

The memory and computational limits of LLMs are more severe than other general neural networks, so our exploration focuses on LLMs first. The following table shows the basic MX quantization recipes in Neural Compressor and enumerates distinctions among various data types. The MX data type replaces general float scale with powers of two to be more hardware-friendly. It adapts a granularity falling between per-channel and per-tensor to balance accuracy and memory consumption.

| | MX Format | INT8 | FP8 |
|------------|--------------|------------|------------|
| Scale | $2^{exp}$ | $\frac{MAX}{amax}$ | $\frac{MAX}{amax}$ |
| Zero point | 0 (None) | $2^{bits - 1}$ or $-min * scale$ | 0 (None) |
| Granularity | per-block (default blocksize is 32) | per-channel or per-tensor | per-channel or per-tensor |

The exponent (exp) is equal to torch.floor(torch.log2(amax)), MAX is the representation range of the data type, amax is the max absolute value of per-block tensor, and rmin is the minimum value of the per-block tensor.


## Get Started with Microscaling Quantization API

To get a model quantized with Microscaling Data Types, users can use the Microscaling Quantization API as follows.

```python
from neural_compressor.torch.quantization import MXQuantConfig, prepare, convert

quant_config = MXQuantConfig(w_dtype=args.w_dtype, act_dtype=args.act_dtype, weight_only=args.woq)
user_model = prepare(model=user_model, quant_config=quant_config)
user_model = convert(model=user_model)
```

## Examples

- PyTorch [huggingface models](/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/mx)


## Reference

[1]: Darvish Rouhani, Bita, et al. "Pushing the limits of narrow precision inferencing at cloud scale with microsoft floating point." Advances in neural information processing systems 33 (2020): 10271-10281

[2]: OCP Microscaling Formats (MX) Specification

[3]: Rouhani, Bita Darvish, et al. "Microscaling Data Formats for Deep Learning." arXiv preprint arXiv:2310.10537 (2023).
103 changes: 103 additions & 0 deletions docs/3x/PT_MixPrecision.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
PyTorch Mixed Precision
========================================

1. [Introduction](#introduction)
2. [Mixed Precision Support Matrix](#mixed-precision-support-matrix)
3. [Get Started](#get-start)
4. [Examples](#examples)

## Introduction

The recent growth of Deep Learning has driven the development of more complex models that require significantly more compute and memory capabilities. Several low precision numeric formats have been proposed to address the problem. Google's [bfloat16](https://cloud.google.com/tpu/docs/bfloat16) and the [FP16: IEEE](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) half-precision format are two of the most widely used sixteen bit formats. [Mixed precision](https://arxiv.org/abs/1710.03740) training and inference using low precision formats have been developed to reduce compute and bandwidth requirements.

The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake), featuring Intel® Deep Learning Boost, is the first general-purpose x86 CPU to support the bfloat16 format. Specifically, three new bfloat16 instructions are added as a part of the AVX512_BF16 extension within Intel Deep Learning Boost: VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions allow converting to and from bfloat16 data type, while the last one performs a dot product of bfloat16 pairs. Further details can be found in the [hardware numerics document](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-deep-learning-boost-new-instruction-bfloat16.html) published by Intel.

The 4th Gen Intel® Xeon® Scalable processor supports FP16 instruction set architecture (ISA) for Intel®
Advanced Vector Extensions 512 (Intel® AVX-512). The new ISA supports a wide range of general-purpose numeric
operations for 16-bit half-precision IEEE-754 floating-point and complements the existing 32-bit and 64-bit floating-point instructions already available in the Intel Xeon processor based products. Further details can be found in the [hardware numerics document](https://www.intel.com/content/www/us/en/content-details/669773/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide.html) published by Intel.

<p align="center" width="100%">
<img src="./imgs/data_format.png" alt="Architecture" height=230>
</p>

## Mixed Precision Support Matrix

<table class="center">
<thead>
<tr>
<th>Framework</th>
<th>Backend</th>
<th>Backend Library</th>
<th>Backend Value</th>
<th>Support Device(cpu as default)</th>
<th>Support BF16</th>
<th>Support FP16</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" align="left">PyTorch</td>
<td align="left">FX</td>
xin3he marked this conversation as resolved.
Show resolved Hide resolved
<td align="left">FBGEMM</td>
<td align="left">"default"</td>
<td align="left">cpu</td>
<td align="left">&#10004;</td>
<td align="left">&#10004;</td>
</tr>
</tbody>
</table>


### Hardware and Software requests for **BF16**
- PyTorch
1. Hardware: CPU supports `avx512_bf16` instruction set.
2. Software: torch >= [1.11.0](https://download.pytorch.org/whl/torch_stable.html).


### Hardware and Software requests for **FP16**
- PyTorch
1. Hardware: CPU supports `avx512_fp16` instruction set.
2. Software: torch >= [1.11.0](https://download.pytorch.org/whl/torch_stable.html).


### Accuracy-driven mixed precision
BF16/FP16 conversion may lead to accuracy drop. Intel® Neural Compressor provides an accuracy-driven tuning function to reduce accuracy loss,
which could fallback converted ops to FP32, if set in config, to get better accuracy. To enable this function, users only to provide
`eval_fn` and `eval_args` for `autotune`.
To be noticed, IPEX backend doesn't support accuracy-driven mixed precision.

## Get Started with autotune API

To get a bf16/fp16 model, users can use the `autotune` interface with `MixPrecisionConfig` as follows.

- BF16:

```python
from neural_compressor.torch.quantization import MixPrecisionConfig, TuningConfig, autotune

def eval_acc_fn(model):
......
return acc

# modules might be fallback to fp32 to get better accuracy
custom_tune_config = TuningConfig(config_set=[MixPrecisionConfig(dtype=["bf16", "fp32"])], max_trials=3)
best_model = autotune(model=build_torch_model(), tune_config=custom_tune_config, eval_fn=eval_acc_fn)
```

- FP16:

```python
from neural_compressor.torch.quantization import MixPrecisionConfig, TuningConfig, autotune

def eval_acc_fn(model):
......
return acc

# modules might be fallback to fp32 to get better accuracy
custom_tune_config = TuningConfig(config_set=[MixPrecisionConfig(dtype=["fp16", "fp32"])], max_trials=3)
best_model = autotune(model=build_torch_model(), tune_config=custom_tune_config, eval_fn=eval_acc_fn)
```

## Examples

Example will be added later.
112 changes: 112 additions & 0 deletions docs/3x/PT_SmoothQuant.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
PyTorch Smooth Quantization
========================================

1. [Introduction](#Introduction)
2. [Usage](#Usage)
3. [Validated Models](#Validated-Models)
4. [Supported Framework Matrix](#Supported-Framework-Matrix)


## Introduction
Quantization is a common compression operation to reduce memory and accelerate inference by converting the floating point matrix to an integer matrix. For large language models (LLMs) with gigantic parameters, the systematic outliers make quantification of activations difficult. [SmoothQuant](https://arxiv.org/abs/2211.10438), a training free post-training quantization (PTQ) solution, offline migrates this difficulty from activations to weights with a mathematically equivalent transformation.


## Usage
### Fixed Alpha
To set a fixed alpha for the entire model, users can follow this example:

```python
from neural_compressor.torch.quantization import SmoothQuantConfig, convert, prepare


def run_fn(model):
model(example_inputs)


quant_config = SmoothQuantConfig(alpha=0.5)
prepared_model = prepare(fp32_model, quant_config=quant_config, example_inputs=example_inputs)
run_fn(prepared_model)
q_model = convert(prepared_model)
```
`SmoothQuantConfig` description:

`alpha`: a smooth factor to calculate the conversion per-channel scale and balance the quantization difficulty of activation and weight. Float value, default is 0.5.

> **Note:** Alpha="auto" and alpha auto-tuning was supported in old API, please stay tuned for the new API's support for auto alpha.

### Specify Quantization Rules
Intel(R) Neural Compressor support specify quantization rules by operator type for Smooth Quantization. Users can use `set_local` to fallback op type in `SmoothQuantConfig` to achieve the above purpose.

Here we don't quantize `Linear` layers.
```python
# fallback by op_type
quant_config.set_local("Linear", SmoothQuantConfig(w_dtype="fp32", act_dtype="fp32"))
prepared_model = prepare(model, quant_config=quant_config, example_inputs=example_inputs)
run_fn(prepared_model)
q_model = convert(prepared_model)
```

To get more information, please refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm).


## Validated Models
Neural Compressor: 2.1

IPEX (Intel Extension for PyTorch): 2.0/2.1

Dataset: lambada_openai

Task: text-generation provided by [ITREX](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-generation/quantization)

alpha [0.4, 0.6] is sweet spot region in SmoothQuant paper.

A list of models that achieved a <1% accuracy drop is shown below.

| Model/Last token accuracy | FP32 Accuracy | INT8 (w/ SmoothQuant) | Notes |
|:----------:|:------:|:------:|-----------------------------------|
| bigscience/bloom-560m | 0.354 | 0.3542 | alpha=0.5, Ipex 2.1 |
| bigscience/bloom-1b7 | 0.4634 | 0.4936 | alpha=0.5, Ipex 2.0 |
| bigscience/bloom-3b | 0.518 | 0.5185 | alpha=0.8, Ipex 2.1 |
| bigscience/bloom-7b1 | 0.5764 | 0.5977 | alpha=0.5, Ipex 2.0 |
| bigscience/bloomz-560m | 0.3947 | 0.3930 | alpha=0.8, Ipex 2.1 |
| bigscience/bloomz-1b7 | 0.4828 | 0.4906 | alpha=0.5, Ipex 2.1 |
| bigscience/bloomz-3b | 0.5018 | 0.4980 | alpha=0.5, Ipex 2.1 |
| bigscience/bloomz-7b1 | 0.5593 | 0.5552 | alpha=0.5, Ipex 2.1 |
| facebook/opt-125m | 0.379 | 0.3757 | alpha=0.5, Ipex 2.1 |
| facebook/opt-350m | 0.4516 | 0.4533 | alpha=0.8, Ipex 2.1 |
| facebook/opt-1.3b | 0.5789 | 0.5742 | alpha=0.8, Ipex 2.0 |
| facebook/opt-2.7b | 0.6365 | 0.6404 | alpha=0.5, Ipex 2.0 |
| facebook/opt-6.7b | 0.6769 | 0.6804 | alpha=0.5, Ipex 2.0 |
| facebook/opt-13b | 0.6872 | 0.6814 | alpha=0.5, Ipex 2.1 |
| facebook/opt-30b | 0.7149 | 0.7128 | alpha=0.5, Ipex 2.1 |
| facebook/opt-66b | 0.7398 | 0.7326 | alpha=0.5, Ipex 2.1 |
| LLaMa-7b | 0.7361 | 0.7357 | alpha=0.8, Ipex 2.1 |
| LLaMa-13b | 0.7627 | 0.7590 | alpha=0.7, Ipex 2.1 |
| LLaMa-30b | 0.7759 | 0.7840 | alpha=0.7, Ipex 2.1 |
| LLaMa-65b | 0.7908 | 0.7957 | alpha=0.9, Ipex 2.1 |
| EleutherAI/gpt-j-6B* | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
| MBZUAI/LaMini-GPT-124m | 0.3804 | 0.3887 | alpha=0.5, Ipex 2.1 |
| MBZUAI/LaMini-GPT-774m | 0.5048 | 0.5057 | alpha=0.5, Ipex 2.1 |
| MBZUAI/LaMini-GPT-1.5b | 0.5443 | 0.5436 | alpha=0.5, Ipex 2.1 |
| mosaicml/mpt-7b-chat | 0.655 | 0.6499 | alpha=0.7, Ipex 2.1 |
| stabilityai/stablelm-base-alpha-3b | 0.4172 | 0.4149 | alpha=0.6, Ipex 2.1 |
| togethercomputer/RedPajama-INCITE-Base-3B-v1 | 0.6542 | 0.6735 | alpha=0.5, Ipex 2.1 |
| togethercomputer/RedPajama-INCITE-Chat-3B-v1* | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
| togethercomputer/RedPajama-INCITE-Instruct-3B-v1* | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
| togethercomputer/RedPajama-INCITE-Base-7B-v0.1* | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1* | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
| databricks/dolly-v1-6b* | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
| databricks/dolly-v2-3b* | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
| tiiuae/falcon-7b-instruct | 0.6437 | 0.6392 | alpha=0.7, Pytorch |

Please refer to the step-by-step [instruction](../../examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/ipex/README.md) for details.

Please note that for models with asterisk(*), we have set all add ops to FP32 during quantization step to achieve desirable results.


## Supported Framework Matrix

| Framework | Alpha | Folding |
|:---------:|--------------|------------|
| PyTorch | [0-1] | False |
| IPEX | [0-1] | True / False(Version>2.1) |
Loading
Loading