# Emulated FP8 Quantization

1. Introduction
2. Supported Framework
3. Get Started with FP8 Quantization
   - 3.1. Old API Configuration
   - 3.2. New API Configuration
   - 3.3. Automatic Tuning Strategy
   - 3.4. Global Environment Variables
4. Examples

## Introduction

Floating point 8 (FP8) is a promising data type for low-precision quantization. In Intel Neural Compressor, emulated FP8 quantization is supported in the `fp8_adaptor` branch. By specifying the precision (`fp8_e5m2`, `fp8_e4m3`, or `fp8_e3m4`), users can validate the accuracy of the quantized FP8 model.
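
For reference, the three precisions differ only in how the eight bits are split between exponent and mantissa; fewer mantissa bits trade precision for dynamic range. A small illustrative sketch (the exact numeric behavior, e.g. special-value handling, is defined by the FP8 Emulation Toolkit):

```python
# Bit layout of the three emulated FP8 formats (each has 1 sign bit).
fp8_formats = {
    "fp8_e5m2": {"exponent_bits": 5, "mantissa_bits": 2},  # widest dynamic range
    "fp8_e4m3": {"exponent_bits": 4, "mantissa_bits": 3},
    "fp8_e3m4": {"exponent_bits": 3, "mantissa_bits": 4},  # highest precision
}
for name, fmt in fp8_formats.items():
    assert 1 + fmt["exponent_bits"] + fmt["mantissa_bits"] == 8
    print(name, fmt)
```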

## Supported Framework

| Framework | Emulated FP8 Quantization |
|-----------|---------------------------|
| PyTorch   | &#10004;                  |
| ONNX      | &#10004;                  |

Note: the FP8 Emulation Toolkit needs to be installed first.

```bash
# install mpemu (FP8 Emulation Toolkit)
git clone https://github.com/IntelLabs/FP8-Emulation-Toolkit.git
cd FP8-Emulation-Toolkit
python setup.py install

# install neural compressor (fp8_adaptor branch)
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git checkout fp8_adaptor
python setup.py install
```
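
A quick way to confirm both packages are importable after installation (`mpemu` is the package name used by the FP8 Emulation Toolkit install step above):

```python
# Sanity check: both packages should import without errors.
import mpemu
import neural_compressor

print("Intel Neural Compressor version:", neural_compressor.__version__)
```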

## Get Started with FP8 Quantization

Compared with INT8 quantization, only one parameter is added: `precision` (`fp8_e5m2`/`fp8_e4m3`/`fp8_e3m4`).

Also, for models with BatchNorm, it is recommended to calibrate the BatchNorm statistics in train mode with the FP8 data type before quantization.
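
The helper below is a hypothetical plain-PyTorch sketch of that idea, omitting the FP8 emulation itself; in Intel Neural Compressor this step is driven by the `batchnorm_sampling_size` / `batchnorm_calibration_sampling_size` options shown in the configurations below:

```python
import torch

def calibrate_batchnorm_stats(model, calib_dataloader, num_batches):
    # BatchNorm updates running_mean/running_var only in train mode.
    model.train()
    with torch.no_grad():
        for i, (inputs, _) in enumerate(calib_dataloader):
            model(inputs)
            if i + 1 >= num_batches:
                break
    model.eval()
```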

### Old API Configuration for Intel Neural Compressor 1.x

```yaml
model:
    name: xxx
    framework: pytorch

quantization:
    approach: post_training_static_quant    # not needed for fp8_e5m2
    precision: fp8_e4m3    # allowed precision: fp8_e5m2, fp8_e4m3, fp8_e3m4
    calibration:
        batchnorm_sampling_size: 3000    # only needed for models w/ BatchNorm
        sampling_size: 300

tuning:
    accuracy_criterion:
        relative: 0.01
    exit_strategy:
        timeout: 0
    random_seed: 9527
```

### New API Configuration for Intel Neural Compressor 2.0

```python
from neural_compressor.config import PostTrainingQuantConfig

quant_conf = PostTrainingQuantConfig(
    precision="fp8_e5m2",
    calibration_sampling_size=[300],
    batchnorm_calibration_sampling_size=[3000],
)
```

### Automatic Tuning Strategy

Unlike the INT8 basic strategy, the FP8 auto-tuning strategy tunes per operator type. It first aggressively quantizes all op types; if the accuracy requirement is not met, it tries quantizing one op type at a time and accumulates the op types that pass. Finally, the user gets output like the following.

```
[INFO] Suggested op types with KL algorithm are: ['Matmul', 'LayerNorm', 'Linear']
[INFO] Suggested FP8 op types are: ['Matmul', 'Embedding', 'LayerNorm', 'Linear']; Accuracy is 0.5560059529291749
```
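
For illustration only, a minimal sketch of the accumulation idea described above, not the actual Intel Neural Compressor implementation; `quantize_with`, `evaluate`, and `accuracy_goal` are hypothetical placeholders:

```python
def fp8_auto_tune(op_types, quantize_with, evaluate, accuracy_goal):
    # Step 1: aggressively quantize every op type at once.
    if evaluate(quantize_with(op_types)) >= accuracy_goal:
        return list(op_types)
    # Step 2: add one op type at a time and keep it only if the
    # accumulated set still meets the accuracy goal.
    kept = []
    for op_type in op_types:
        candidate = kept + [op_type]
        if evaluate(quantize_with(candidate)) >= accuracy_goal:
            kept = candidate
    return kept
```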

### Global Environment Variables

To facilitate customization, several global environment variables are supported.

| Environment Variable | Usage | Supported Values |
|----------------------|-------|------------------|
| FP8_OP_TYPE_LIST | Specify the module types covered by emulated FP8 quantization | 'linear', 'conv2d', 'bmm', 'amm', 'mm', 'add', 'mul', 'div', 'embedding', 'embeddingbag', 'layernorm' |
| DISABLE_FIRST_CONV | Whether to quantize the first convolution layer | True/False |
| DISABLE_LAST_LINEAR | Whether to quantize the last linear layer | True/False |
| MIX_PRECISION | Whether to allow mixed precision and automatic data type selection | True/False |
| E4M3_SCALE | Whether to fix the scale to 1, i.e., cast fp32 directly to fp8_e4m3 | 1/- |
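
These variables are read from the process environment, so they can be exported in the shell or set before the quantization run; the exact value syntax expected for `FP8_OP_TYPE_LIST` below is an assumption and should be checked against the `fp8_adaptor` branch:

```python
import os

# Hypothetical example: limit emulated FP8 quantization to a few op types and
# keep the first conv / last linear layer unquantized.
os.environ["FP8_OP_TYPE_LIST"] = "['linear', 'bmm', 'mm']"  # value format is an assumption
os.environ["DISABLE_FIRST_CONV"] = "True"
os.environ["DISABLE_LAST_LINEAR"] = "True"
```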

## Examples

```python
from neural_compressor.experimental import Quantization

# "fake.yaml" is the user's configuration file (see the old API configuration above);
# model and self.cv_dataloader are supplied by the user.
quantizer = Quantization("fake.yaml")
quantizer.model = model
quantizer.calib_dataloader = self.cv_dataloader
q_model = quantizer.fit()
```

or

```python
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

quant_conf = PostTrainingQuantConfig(
    precision="fp8_e5m2",
    calibration_sampling_size=[300],
    batchnorm_calibration_sampling_size=[3000],
)
q_model = quantization.fit(
    model,
    quant_conf,
    eval_func=eval_func,
    calib_dataloader=self.cv_dataloader,
)
```
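
The `eval_func` passed above is user supplied; a minimal sketch of what it might look like (`val_dataloader` is a hypothetical validation dataloader):

```python
import torch

def eval_func(model):
    # Return a single accuracy number for the tuning strategy to compare
    # against the accuracy criterion.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in val_dataloader:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```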