Cherry pick v1.17.0 #1964

Merged
merged 58 commits into master from cherry_pick_v1.17.0 on Aug 10, 2024
Commits
58 commits
a9df7da
[SW-184941] INC CI, CD and Promotion
Yantom1 May 19, 2024
14f031e
[SW-183320]updated setup.py
RonBenMosheHabana Jun 6, 2024
ee7e5c8
[SW-177474] add HQT FP8 porting code
zyuwen-habana May 22, 2024
ca1444b
[SW-189361] Fix white list extend
ulivne Jun 19, 2024
dfec104
[SW-191317] Raise exception according to hqt config object
ulivne Jul 3, 2024
216d94b
[SW-184714] Port HQT code into INC
ulivne Jul 6, 2024
96bffd9
[SW-184714] Add internal folder to fp8 quant
ulivne Jul 7, 2024
90838a4
[SW-177468] Removed unused code + cleanup
HolyFalafel Jun 20, 2024
b76f002
Fix errors in regression_detection
smarkovichgolan Jul 3, 2024
90b10d3
[SW-187731] Save orig module as member of patched module
ulivne Jun 23, 2024
62026c2
[SW-190899] Install packages according to configuration
ulivne Jul 8, 2024
4a0d704
[SW-184689] use finalize_calibration intrenaly for one step flow
ulivne Jul 9, 2024
dfa8833
[SW-191945] align requirement_pt.txt in gerrit INC with Github INC
Jul 9, 2024
604d664
[SW-192358] Remove HQT reference in INC
ulivne Jul 11, 2024
a493d7c
[SW-191415] update fp8 maxAbs observer using torch.copy_
dudilester Jul 11, 2024
8803808
[SW-184943] Enhance INC WOQ model loading
zyuwen-habana Jun 13, 2024
a14c5c6
[SW-190303] Implement HPUWeightOnlyLinear class in INC
Yantom1 Jul 9, 2024
35b5bd2
[SW-192809] fix json_file bug when instantiating FP8Config class
zyuwen-habana Jul 15, 2024
f45e0aa
[SW-192931] align setup.py with github INC and remove fp8_convert
Jul 16, 2024
165ce63
[SW-192917] Update all HQT logic files with pre-commit check
Jul 16, 2024
853bb8d
update docstring
yuwenzho Jul 26, 2024
86d8dfa
add fp8 example and document (#1639)
xin3he Jul 29, 2024
051fee8
Update settings to be compatible with gerrit
xin3he Jul 30, 2024
2737870
enhance ut
yuwenzho Jul 30, 2024
402c16f
move fp8 sample to helloworld folder
yuwenzho Aug 1, 2024
34855fa
update torch version of habana docker
Aug 6, 2024
8c57adb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2024
7fbceaf
update readme demo
Aug 6, 2024
3e5552e
update WeightOnlyLinear to INCWeightOnlyLinear
Aug 6, 2024
e200364
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2024
2496153
Merge branch 'master' into cherry_pick_v1.17.0
xin3he Aug 6, 2024
ec45a27
add docstring for FP8Config
Aug 6, 2024
ab212c8
fix pylint
Aug 6, 2024
bfde945
Merge branch 'master' into cherry_pick_v1.17.0
xin3he Aug 7, 2024
d42660a
update fp8 test scripts
chensuyue Aug 7, 2024
b675220
Merge branch 'cherry_pick_v1.17.0' of https://github.com/intel/neural…
chensuyue Aug 7, 2024
1b01e61
delete deps
chensuyue Aug 8, 2024
60b98f8
update container into v1.17.0
chensuyue Aug 8, 2024
8437a65
update docker version
Aug 8, 2024
7fadea9
update pt ut
chensuyue Aug 8, 2024
8fe70fe
Merge branch 'cherry_pick_v1.17.0' of https://github.com/intel/neural…
chensuyue Aug 8, 2024
5a055c1
add lib path
chensuyue Aug 9, 2024
45406db
fix dir issue
Aug 9, 2024
cb4735b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 9, 2024
f5b74b9
Merge branch 'cherry_pick_v1.17.0' of https://github.com/intel/neural…
chensuyue Aug 9, 2024
62addf2
update fp8 test scope
chensuyue Aug 9, 2024
f569f21
Merge branch 'master' into cherry_pick_v1.17.0
chensuyue Aug 9, 2024
0df24a6
fix typo
Aug 9, 2024
93e4aa0
update fp8 test scope
chensuyue Aug 9, 2024
a586ae3
Merge branch 'cherry_pick_v1.17.0' of https://github.com/intel/neural…
chensuyue Aug 9, 2024
f087298
update pre-commit-ci
chensuyue Aug 9, 2024
a858bab
work around for hpu
Aug 9, 2024
9ab10e6
fix UT
Aug 9, 2024
6642962
fix parameter
chensuyue Aug 9, 2024
02d490e
omit some test
chensuyue Aug 9, 2024
5e02321
update main page example to llm loading
Aug 9, 2024
084e244
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 9, 2024
3763723
fix autotune
Aug 10, 2024
78 changes: 32 additions & 46 deletions README.md
@@ -68,67 +68,52 @@ pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
```
After successfully installing these packages, try your first quantization program.

### Weight-Only Quantization (LLMs)
The following example code demonstrates Weight-Only Quantization on LLMs. It supports Intel CPU, Intel Gaudi2 AI Accelerator, and Nvidia GPU; the best available device is selected automatically.
### [FP8 Quantization](./examples/3.x_api/pytorch/cv/fp8_quant/)
The following example code demonstrates FP8 Quantization, which is supported by the Intel Gaudi2 AI Accelerator.

To try this on Intel Gaudi2, a docker image with the Gaudi Software Stack is recommended; please refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
```bash
# Run a container with an interactive shell
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest

# Install the optimum-habana
pip install --upgrade-strategy eager optimum[habana]

# Install INC/auto_round
pip install neural-compressor auto_round
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```
Run the example:
```python
from transformers import AutoModel, AutoTokenizer

from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit
from neural_compressor.adaptor.torch_utils.auto_round import get_dataloader

model_name = "EleutherAI/gpt-neo-125m"
float_model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
dataloader = get_dataloader(tokenizer, seqlen=2048)

woq_conf = PostTrainingQuantConfig(
approach="weight_only",
op_type_dict={
".*": { # match all ops
"weight": {
"dtype": "int",
"bits": 4,
"algorithm": "AUTOROUND",
},
}
},
from neural_compressor.torch.quantization import (
FP8Config,
prepare,
convert,
)
quantized_model = fit(model=float_model, conf=woq_conf, calib_dataloader=dataloader)
import torchvision.models as models

model = models.resnet18()
qconfig = FP8Config(fp8_config="E4M3")
model = prepare(model, qconfig)
# customer defined calibration
calib_func(model)
model = convert(model)
```
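
The `calib_func` above is left to the user. Below is a minimal sketch of such a routine, assuming a `calib_loader` DataLoader of representative images is available (both the name and the loader are illustrative, not part of the INC API):

```python
import torch


def calib_func(model):
    """Run a few forward passes so the observers can collect tensor statistics."""
    model.eval()
    with torch.no_grad():
        for step, (images, _) in enumerate(calib_loader):  # calib_loader: user-supplied DataLoader
            model(images)
            if step >= 10:  # a few batches are usually enough for calibration
                break
```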
**Note:**

To try INT4 model inference, please directly use [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers), which leverages Intel Neural Compressor for model quantization.
### [Weight-Only Quantization (LLMs)](./examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/)

### Static Quantization (Non-LLMs)
The following example code demonstrates Weight-Only Quantization on LLMs. It supports Intel CPU, Intel Gaudi2 AI Accelerator, and Nvidia GPU; the best available device is selected automatically.

```python
from torchvision import models
from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig

from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit
model_name = "EleutherAI/gpt-neo-125m"
model = AutoModel.from_pretrained(model_name)

float_model = models.resnet18()
dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)
static_quant_conf = PostTrainingQuantConfig()
quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloader=calib_dataloader)
quant_config = AutoRoundConfig()
model = prepare(model, quant_config)
# customer defined calibration
run_fn(model) # calibration
model = convert(model)
```
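
The `run_fn` calibration routine above is also user-supplied. A minimal sketch, assuming the tokenizer from the same checkpoint and a handful of short calibration texts (the sample texts below are illustrative only):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
calib_texts = [
    "Intel Neural Compressor provides model compression techniques.",
    "Weight-only quantization reduces the memory footprint of LLMs.",
]


def run_fn(model):
    # Forward a few tokenized samples so the algorithm can collect statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt")
        model(**inputs)
```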

**Note:**

To try INT4 model inference, please directly use [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers), which leverages Intel Neural Compressor for model quantization.

## Documentation

<table class="docutils">
@@ -154,12 +139,13 @@ quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloade
<tbody>
<tr>
<td colspan="2" align="center"><a href="./docs/source/3x/PyTorch.md">Overview</a></td>
<td colspan="2" align="center"><a href="./docs/source/3x/PT_StaticQuant.md">Static Quantization</a></td>
<td colspan="2" align="center"><a href="./docs/source/3x/PT_DynamicQuant.md">Dynamic Quantization</a></td>
<td colspan="2" align="center"><a href="./docs/source/3x/PT_StaticQuant.md">Static Quantization</a></td>
<td colspan="2" align="center"><a href="./docs/source/3x/PT_SmoothQuant.md">Smooth Quantization</a></td>
</tr>
<tr>
<td colspan="4" align="center"><a href="./docs/source/3x/PT_WeightOnlyQuant.md">Weight-Only Quantization</a></td>
<td colspan="2" align="center"><a href="./docs/source/3x/PT_WeightOnlyQuant.md">Weight-Only Quantization</a></td>
<td colspan="2" align="center"><a href="./docs/3x/PT_FP8Quant.md">FP8 Quantization</a></td>
<td colspan="2" align="center"><a href="./docs/source/3x/PT_MXQuant.md">MX Quantization</a></td>
<td colspan="2" align="center"><a href="./docs/source/3x/PT_MixedPrecision.md">Mixed Precision</a></td>
</tr>
113 changes: 113 additions & 0 deletions docs/3x/PT_FP8Quant.md
@@ -0,0 +1,113 @@
FP8 Quantization
=======

1. [Introduction](#introduction)
2. [Supported Parameters](#supported-parameters)
3. [Getting Started with FP8 Quantization](#getting-started-with-fp8-quantization)
4. [Examples](#examples)

## Introduction

8-bit floating point (FP8) is a promising data type for low-precision quantization. It provides a data distribution that differs substantially from INT8's, as shown below.

<div align="center">
<img src="./imgs/fp8_dtype.png" height="250"/>
</div>

Intel Gaudi2, also known as HPU, supports this data type for low-precision quantization in two formats, `E4M3` and `E5M2`. For more information about these two data types, please refer to [this paper](https://arxiv.org/abs/2209.05433).
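
For a quick sense of the numeric range and precision trade-off between the two formats, the FP8 dtypes can be inspected directly in recent PyTorch builds (this assumes a PyTorch version that exposes `torch.float8_e4m3fn` and `torch.float8_e5m2`, available since 2.1):

```python
import torch

# E4M3: more mantissa bits -> finer precision, smaller dynamic range (max ~448).
print(torch.finfo(torch.float8_e4m3fn))

# E5M2: more exponent bits -> wider dynamic range (max ~57344), coarser precision.
print(torch.finfo(torch.float8_e5m2))
```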

Intel Neural Compressor provides general quantization APIs to leverage HPU FP8 capability, producing an 8-bit model with lower memory usage and lower compute cost.

## Supported Parameters

<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
<tr>
<th class="tg-fymr">Attribute</th>
<th class="tg-fymr">Description</th>
<th class="tg-fymr">Values</th>
</tr></thead>
<tbody>
<tr>
<td class="tg-0pky">fp8_config</td>
<td class="tg-0pky">The target data type of FP8 quantization.</td>
<td class="tg-0pky">E4M3 (default) - As Fig. 2<br>E5M2 - As Fig. 1.</td>
</tr>
<tr>
<td class="tg-0pky">hp_dtype</td>
<td class="tg-0pky">The high precision data type of non-FP8 operators.</td>
<td class="tg-0pky">bf16 (default) - torch.bfloat16<br>fp16 - torch.float16.<br>fp32 - torch.float32.</td>
</tr>
<tr>
<td class="tg-0pky">observer</td>
<td class="tg-0pky">The observer to measure the statistics.</td>
<td class="tg-0pky">maxabs (default), saves all tensors to files.</td>
</tr>
<tr>
<td class="tg-0pky">allowlist</td>
<td class="tg-0pky">List of nn.Module names or types to quantize. When setting an empty list, all the supported modules will be quantized by default. See Supported Modules. Not setting the list at all is not recommended as it will set the allowlist to these modules only: torch.nn.Linear, torch.nn.Conv2d, and BMM.</td>
<td class="tg-0pky">Default = {'names': [], 'types': <span title=["Matmul","Linear","FalconLinear","KVCache","Conv2d","LoRACompatibleLinear","LoRACompatibleConv","Softmax","ModuleFusedSDPA","LinearLayer","LinearAllreduce","ScopedLinearAllReduce","LmHeadLinearAllreduce"]>FP8_WHITE_LIST}</span></td>
</tr>
<tr>
<td class="tg-0pky">blocklist</td>
<td class="tg-0pky">List of nn.Module names or types not to quantize. Defaults to empty list, so you may omit it from the config file.</td>
<td class="tg-0pky">Default = {'names': [], 'types': ()}</td>
</tr>
<tr>
<td class="tg-0pky">mode</td>
<td class="tg-0pky">The mode, measure or quantize, to run HQT with.</td>
<td class="tg-0pky">MEASURE - Measure statistics of all modules and emit the results to dump_stats_path.<br>QUANTIZE - Quantize and run the model according to the provided measurements.<br>AUTO (default) - Select from [MEASURE, QUANTIZE] automatically.</td>
</tr>
<tr>
<td class="tg-0pky">dump_stats_path</td>
<td class="tg-0pky">The path to save and load the measurements. The path is created up until the level before last "/". The string after the last / will be used as prefix to all the measurement files that will be created.</td>
<td class="tg-0pky">Default = "./hqt_output/measure"</td>
</tr>
<tr>
<td class="tg-0pky">scale_method</td>
<td class="tg-0pky">The method for calculating the scale from the measurement.</td>
<td class="tg-0pky">- without_scale - Convert to/from FP8 without scaling.<br>- unit_scale - Always use scale of 1.<br>- maxabs_hw (default) - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then aligned to the corresponding HW accelerated scale.<br>- maxabs_pow2 - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then rounded to the power of 2.<br>- maxabs_hw_opt_weight - Scale of model params (weights) is chosen as the scale that provides minimal mean-square-error between quantized and non-quantized weights, from all possible HW accelerated scales. Scale of activations is calculated the same as maxabs_hw.<br>- act_maxabs_pow2_weights_pcs_opt_pow2 - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as maxabs_hw_opt_weight. Scale of activations is calculated the same as maxabs_pow2.<br>- act_maxabs_hw_weights_pcs_maxabs_pow2 - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as maxabs_pow2. Scale of activations is calculated the same as maxabs_hw.</td>
</tr>
<tr>
<td class="tg-0pky">measure_exclude</td>
<td class="tg-0pky">If this attribute is not defined, the default is OUTPUT. Since most models do not require measuring output tensors, you can exclude it to speed up the measurement process.</td>
<td class="tg-0pky">NONE - All tensors are measured.<br>OUTPUT (default) - Excludes measurement of output tensors.</td>
</tr>
</tbody></table>
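
As a sketch of how these attributes map onto the Python API, the snippet below builds an `FP8Config` with several of the parameters from the table. The keyword names mirror the table but should be verified against the installed neural-compressor version, and the values are examples rather than recommendations:

```python
from neural_compressor.torch.quantization import FP8Config

qconfig = FP8Config(
    fp8_config="E4M3",  # target FP8 format
    hp_dtype="bf16",  # precision of non-FP8 operators
    observer="maxabs",  # statistics observer
    allowlist={"types": [], "names": []},  # empty -> quantize all supported modules
    blocklist={"types": [], "names": ["lm_head"]},  # keep lm_head in high precision (illustrative)
    mode="AUTO",  # MEASURE / QUANTIZE / AUTO
    dump_stats_path="./hqt_output/measure",
    scale_method="maxabs_hw",
    measure_exclude="OUTPUT",
)
```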

## Getting Started with FP8 Quantization

### Demo Usage

```python
from neural_compressor.torch.quantization import (
FP8Config,
prepare,
convert,
)
import torchvision.models as models

model = models.resnet18()
qconfig = FP8Config(fp8_config="E4M3")
model = prepare(model, qconfig)
# customer defined calibration
calib_func(model)
model = convert(model)
```
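
Measurement and quantization can also be run as two explicit passes by setting `mode`, which is convenient when calibration and deployment happen in separate jobs. The sketch below assumes that the statistics written to `dump_stats_path` in the first pass are reused by the second, that `finalize_calibration` is exported by `neural_compressor.torch.quantization` to flush measurements to disk, and that `convert` accepts the quantize-mode config; `calib_func` is the same user-defined routine as in the demo above:

```python
from neural_compressor.torch.quantization import (
    FP8Config,
    prepare,
    convert,
    finalize_calibration,
)
import torchvision.models as models

# Pass 1: measure statistics on the float model and dump them to disk.
measure_config = FP8Config(fp8_config="E4M3", mode="MEASURE", dump_stats_path="./hqt_output/measure")
model = prepare(models.resnet18(), measure_config)
calib_func(model)  # user-defined calibration
finalize_calibration(model)  # write the collected statistics to dump_stats_path

# Pass 2 (possibly a separate job): quantize a fresh model from the saved measurements.
quant_config = FP8Config(fp8_config="E4M3", mode="QUANTIZE", dump_stats_path="./hqt_output/measure")
model = convert(models.resnet18(), quant_config)
```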

## Examples

| Task | Example |
|----------------------|---------|
| Computer Vision (CV) | [Link](../../examples/3.x_api/pytorch/cv/fp8_quant/) |
| Large Language Model (LLM) | [Link](https://github.com/HabanaAI/optimum-habana-fork/tree/habana-main/examples/text-generation#running-with-fp8) |

> Note: For LLMs, Optimum-habana provides higher performance based on modified modeling files, so the LLM link above points to Optimum-habana, which utilizes Intel Neural Compressor for FP8 quantization internally.
7 changes: 7 additions & 0 deletions examples/.config/model_params_pytorch_3x.json
@@ -140,6 +140,13 @@
"main_script": "main.py",
"batch_size": 1
},
"resnet18_fp8_static":{
"model_src_dir": "cv/fp8_quant",
"dataset_location": "/tf_dataset/pytorch/ImageNet/raw",
"input_model": "",
"main_script": "main.py",
"batch_size": 1
},
"opt_125m_pt2e_static":{
"model_src_dir": "nlp/huggingface_models/language-modeling/quantization/static_quant/pt2e",
"dataset_location": "",
28 changes: 28 additions & 0 deletions examples/3.x_api/pytorch/cv/fp8_quant/README.md
@@ -0,0 +1,28 @@
# ImageNet FP8 Quantization

This example implements FP8 quantization of popular model architectures, such as ResNet, on the ImageNet dataset. It is supported by the Intel Gaudi2 AI Accelerator.

## Requirements

To try this on Intel Gaudi2, a docker image with the Gaudi Software Stack is recommended; please refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
```bash
# Run a container with an interactive shell
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```

- Install requirements
- `pip install -r requirements.txt`
- Download the ImageNet dataset from http://www.image-net.org/
- Then, move and extract the training and validation images to labeled subfolders, using [the following shell script](extract_ILSVRC.sh)

## Quantization

To quantize a model and validate its accuracy, run `main.py` with the desired model architecture and the path to the ImageNet dataset:

```bash
python main.py --pretrained -t -a resnet50 -b 30 /path/to/imagenet
```
or
```bash
bash run_quant.sh --input_model=resnet50 --dataset_location=/path/to/imagenet
```
80 changes: 80 additions & 0 deletions examples/3.x_api/pytorch/cv/fp8_quant/extract_ILSVRC.sh
@@ -0,0 +1,80 @@
#!/bin/bash
#
# script to extract ImageNet dataset
# ILSVRC2012_img_train.tar (about 138 GB)
# ILSVRC2012_img_val.tar (about 6.3 GB)
# make sure ILSVRC2012_img_train.tar & ILSVRC2012_img_val.tar in your current directory
#
# Adapted from:
# https://github.com/facebook/fb.resnet.torch/blob/master/INSTALL.md
# https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4
#
# imagenet/train/
# ├── n01440764
# │ ├── n01440764_10026.JPEG
# │ ├── n01440764_10027.JPEG
# │ ├── ......
# ├── ......
# imagenet/val/
# ├── n01440764
# │ ├── ILSVRC2012_val_00000293.JPEG
# │ ├── ILSVRC2012_val_00002138.JPEG
# │ ├── ......
# ├── ......
#
#
# Make imagenet directory
#
mkdir imagenet
#
# Extract the training data:
#
# Create train directory; move .tar file; change directory
mkdir imagenet/train && mv ILSVRC2012_img_train.tar imagenet/train/ && cd imagenet/train
# Extract training set; remove compressed file
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
#
# At this stage imagenet/train will contain 1000 compressed .tar files, one for each category
#
# For each .tar file:
# 1. create directory with same name as .tar file
# 2. extract and copy contents of .tar file into directory
# 3. remove .tar file
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
#
# This results in a training directory like so:
#
# imagenet/train/
# ├── n01440764
# │ ├── n01440764_10026.JPEG
# │ ├── n01440764_10027.JPEG
# │ ├── ......
# ├── ......
#
# Change back to original directory
cd ../..
#
# Extract the validation data and move images to subfolders:
#
# Create validation directory; move .tar file; change directory; extract validation .tar; remove compressed file
mkdir imagenet/val && mv ILSVRC2012_img_val.tar imagenet/val/ && cd imagenet/val && tar -xvf ILSVRC2012_img_val.tar && rm -f ILSVRC2012_img_val.tar
# get script from soumith and run; this script creates all class directories and moves images into corresponding directories
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
#
# This results in a validation directory like so:
#
# imagenet/val/
# ├── n01440764
# │ ├── ILSVRC2012_val_00000293.JPEG
# │ ├── ILSVRC2012_val_00002138.JPEG
# │ ├── ......
# ├── ......
#
#
# Check total files after extract
#
# $ find train/ -name "*.JPEG" | wc -l
# 1281167
# $ find val/ -name "*.JPEG" | wc -l
# 50000
#