
INT8EntropyCalibrator2 implicit quantization superseded by explicit quantization #4095

Open
adaber opened this issue Aug 23, 2024 · 22 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@adaber commented Aug 23, 2024

Description

Hi,

I have been using the INT8 Entropy Calibrator 2 for INT8 quantization in Python, and it's been working well (TensorRT 10.0.1). The way I use it follows the official TRT GitHub sample (TensorRT/samples/python/efficientdet/build_engine.py at release/10.0 in the NVIDIA/TensorRT GitHub repo).

Starting with TensorRT 10.1, I've been getting a warning that implicit quantization via the INT8 Entropy Calibrator 2 is deprecated and has been superseded by explicit quantization.

I’ve read the official document on the difference between the implicit and explicit quantization processes (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation) and they seem to work differently. The explicit quantization seems to expect a network to have QuantizeLayer and DequantizeLayer layers which my networks don’t. The implicit quantization can be used when those layers are not present in a network. Therefore, I am confused about how the implicit quantization can be superseded by the explicit quantization since they seem to work differently.

So, my question is: what needs to be modified in the standard INT8 Entropy Calibrator 2 quantization method (TensorRT/samples/python/efficientdet/build_engine.py at release/10.0 in the NVIDIA/TensorRT GitHub repo) so that the deprecation warning no longer shows up? Or, what is the proper way to implement INT8 Entropy Calibrator 2 calibration now that the current approach is deprecated? I couldn't find any example that uses a newer TensorRT version (10.1 and up).
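
For reference, the pattern I'm using is condensed below (the sample itself uses its own CUDA helper module; pycuda here is just my shorthand for the device buffer, and `calib_batches` is a placeholder for the calibration data loader):

```python
import numpy as np
import pycuda.autoinit          # stand-in for the sample's CUDA helpers
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Condensed version of the calibrator pattern from the linked sample."""

    def __init__(self, batches, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)     # iterable of float32 arrays, batch dimension 1
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return 1                         # matches the batch dimension of the arrays above

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None                  # no more data: calibration is finished
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# ... after parsing the ONNX model into `network` with a `builder`:
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(calib_batches)  # this implicit-quantization path
                                                           # is what triggers the 10.1 warning
```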

Thank you!

Environment

TensorRT Version: 10.1

NVIDIA GPU: 3090

Operating System: Windows 10

Python Version: 3.9.19

@moraxu (Collaborator) commented Aug 30, 2024

Does your project require the specific calibrator from our samples, or have you tried looking into Model Optimizer (https://github.com/NVIDIA/TensorRT-Model-Optimizer) for calibration (it should ensure your model has Q/DQ nodes for the explicit-quantization approach)? If the former, we can look into updating that sample code.

@moraxu added the triaged (Issue has been triaged by maintainers) and Quantization: PTQ labels on Aug 30, 2024
@adaber (Author) commented Sep 3, 2024

Hey moraxu,

Sorry for the late response. I missed your message.

I haven't looked at or worked with Model Optimizer yet. Thanks for letting me know.

I'm not sure I understood the question, but I apply INT8 Entropy Calibrator 2 quantization to my CNN models; it gives the best results. Is that what you meant?

So, will implicit quantization still be supported, or are you switching completely to explicit quantization? I would like to avoid changing my processing pipeline (training models in PyTorch, converting to ONNX, implicit PTQ quantization) due to possible incompatibility issues and such.

Therefore, it would be great if you could update the INT8 Entropy Calibrator 2 sample code. Would that be possible?

Thanks!

@moraxu (Collaborator) commented Sep 3, 2024

So, will implicit quantization still be supported, or are you switching completely to explicit quantization? I would like to avoid changing my processing pipeline (training models in PyTorch, converting to ONNX, implicit PTQ quantization) due to possible incompatibility issues and such.

Therefore, it would be great if you could update the INT8 Entropy Calibrator 2 sample code. Would that be possible?

Thanks for clarifying. In that case I'll request that the sample code be updated and will post further updates in this ticket. Thanks for reporting this.

@adaber (Author) commented Sep 3, 2024

That sounds great, moraxu! Thank you.

@CoinCheung commented

@moraxu Hi, does this mean that, after TensorRT 10, the recommended flow is: train the PyTorch model -> INT8-quantize with Model Optimizer -> export to ONNX -> compile with TensorRT?

@moraxu (Collaborator) commented Sep 4, 2024

@CoinCheung, yes. If you're using PyTorch, Model Optimizer can automate more parts of that flow (https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html), in contrast to manually configuring the optimization and quantization steps through the TRT API (unless you need to do that for more control).
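
For context, the PyTorch-side flow from that guide looks roughly like the sketch below; the model, calibration loader, and input shape are placeholders, and the export details may differ slightly between ModelOpt releases:

```python
import torch
import modelopt.torch.quantization as mtq   # ModelOpt's PyTorch quantization API

model = MyCNN().eval().cuda()                # placeholder: your trained PyTorch model

def forward_loop(m):
    # Feed a few hundred representative batches through the model so ModelOpt can
    # calibrate the quantizers it inserts; `calib_loader` is a placeholder DataLoader.
    for images, _ in calib_loader:
        m(images.cuda())

# Insert Q/DQ (fake-quant) modules and calibrate them with one of the predefined configs.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export with the Q/DQ nodes baked in; TensorRT then treats the ONNX as explicitly quantized.
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "model_int8.onnx", opset_version=17)
```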

@adaber (Author) commented Sep 5, 2024

@moraxu I've got a few questions about Model Optimizer, if you don't mind.

  1. What quantization config would yield results similar to INT8 Entropy Calibrator 2?

  2. Say I INT8-quantize a neural net using Model Optimizer and convert it to ONNX.
    a) How does TRT handle the ONNX model when converting it to a TRT engine?
    b) How do different TRT flags (FP16 and such) affect the engine-building process in this case?
    c) In general, I am not sure if/how TRT changes a model that has already been quantized with Model Optimizer during the TRT engine conversion process.
    d) Not sure if you're familiar with the following, but how is such an already-quantized model converted to a TRT timing cache in ONNX Runtime?

Sorry if this is too many questions, but in my experience, the more libraries are involved, the higher the chance that something will be incompatible or not work. Hence, I would like to know whether Model Optimizer can give me exactly the same results as the pipelines I mentioned before:

  1. PyTorch FP32 -> ONNX -> FP16 TRT engine
  2. PyTorch FP32 -> ONNX -> INT8 TRT engine
  3. PyTorch FP32 -> ONNX -> FP16 TRT timing cache (ONNX Runtime)
  4. PyTorch FP32 -> ONNX -> INT8 TRT timing cache (ONNX Runtime)
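
For pipelines 3 and 4 specifically, the ONNX Runtime TensorRT EP setup I have in mind looks roughly like the sketch below (option names as documented for the TensorRT execution provider; whether the exact set applies depends on the ORT build):

```python
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,              # pipeline 3
        # "trt_int8_enable": True,            # pipeline 4 (expects an explicitly quantized
        #                                     #  ONNX or a calibration table)
        "trt_engine_cache_enable": True,      # cache built engines between runs
        "trt_engine_cache_path": "./trt_cache",
        "trt_timing_cache_enable": True,      # reuse kernel-timing results across builds
    }),
    "CUDAExecutionProvider",                  # fallback for unsupported nodes
]
session = ort.InferenceSession("model.onnx", providers=providers)
```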

Thanks!

P.S. I would still appreciate it if you could update the INT8 quantization sample code so I can use that while investigating and testing Model Optimizer. Thanks!

@moraxu (Collaborator) commented Sep 5, 2024

Hi @adaber,

  1. I'm not up to speed with that sample; I'd have to check how to translate that calibrator (https://github.com/NVIDIA/TensorRT/blob/release/10.0/samples/python/efficientdet/build_engine.py#L37) into Model Optimizer's config. If this is an immediate blocker, could you please ask on their issue tracker: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues ?

a) You mean how does it handle a quantized ONNX? TRT reads the scale factors from the ONNX file (for INT8 layers). These scale factors define how the weights and activations are quantized. If these scale factors are provided, TRT will use them directly.
b)

  • FP16 flag: TRT will maintain the INT8 quantization for layers that are pre-quantized but may choose to use FP16 for layers that were not quantized or for activations that can benefit from FP16 precision.
  • INT8 flag: For models already quantized using Model Optimizer, the flag ensures that INT8 execution will be used for the quantized layers. TRT will use the existing quantization scales if provided in the ONNX model.

c) In practice, TRT should not drastically change the quantization itself (i.e., INT8 scale factors should remain intact). Still, the engine-building process may involve optimizations that improve inference performance while maintaining accuracy.
d) I don't believe there's any difference for quantized models for the timing caches, but now the process includes determining how best to execute quantized layers (e.g., which kernels to use for INT8) etc.
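
For illustration, here is a minimal builder-side sketch with the TensorRT 10 Python API, assuming an ONNX model that already contains Q/DQ nodes (file names are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)          # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)

with open("model.quant.onnx", "rb") as f:    # ONNX that already carries Q/DQ nodes
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)        # honor the scales baked into the Q/DQ nodes
config.set_flag(trt.BuilderFlag.FP16)        # allow FP16 for the layers that are not quantized
# Note: no config.int8_calibrator here -- explicit quantization replaces calibration.

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```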

@akhilg-nv , please correct me or add anything, since I believe you've worked with quantized workflows more than me.

@adaber (Author) commented Sep 5, 2024

Hi @moraxu

Firstly, thank you so much for responding so quickly. It is very appreciated!

  1. Thanks for the links. This is very important, since INT8 Entropy Calibrator 2 is recommended for CNNs and works really well in my experience. I've seen a few posts, for example, where people couldn't achieve the same INT8 quantization results using the ONNX Runtime TRT execution provider. Therefore, I'm inclined to stick with the TensorRT API, since I would otherwise have to spend additional time figuring out how to achieve the same results (INT8 quantization in this case) with a different library/wrapper on top of TensorRT.

I will definitely check their forums. Please let me know if you manage to find out, presumably from your coworkers on the Model Optimizer team, which Model Optimizer quantization config would yield results similar to INT8 Entropy Calibrator 2.

Thank you for providing the info on the TRT engine-creation process and timing caches, too.

Thanks!

@riyadshairi979 commented Sep 5, 2024

@adaber please follow the ModelOpt example here or the Python APIs to quantize an ONNX model. Note that modelopt.onnx.quantization supports ONNX Runtime-provided entropy calibration; see the command-line help for other options.

Then you can compile the output explicit ONNX model with the TensorRT tool (trtexec --onnx=model.quant.onnx --best) or use build_engine.py-like Python APIs.
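
A rough Python sketch of that flow is below; the exact argument names can vary between ModelOpt releases, so treat them as placeholders and see the command-line help mentioned above for the options your version exposes:

```python
import numpy as np
from modelopt.onnx.quantization import quantize   # ModelOpt's ONNX PTQ entry point

# Argument names below are illustrative; check `python -m modelopt.onnx.quantization --help`
# for the exact spelling in your ModelOpt release.
calib = np.load("calib_batches.npy")               # hypothetical pre-saved calibration batches
quantize(
    onnx_path="model.onnx",
    calibration_data=calib,
    calibration_method="entropy",                  # the ONNX Runtime-provided entropy calibration
    output_path="model.quant.onnx",
)
# Then: trtexec --onnx=model.quant.onnx --best
```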

@adaber (Author) commented Sep 6, 2024

Hi @riyadshairi979,

Thanks for your input. It is appreciated.

I'm familiar with these approaches; however, my concern is that there are posts where people complain about not getting results as good as TensorRT's Int8EntropyCalibrator2-based INT8 quantization (with both ONNX Runtime and ModelOpt). You even mentioned something similar (NVIDIA/TensorRT-Model-Optimizer#46).

I do appreciate that you have been working on ModelOpt; it seems like a really good tool. I am just trying to get familiar with some aspects of it before I commit to spending the time to get it working and incorporate it into my processing pipeline.

@riyadshairi979 Quick question: you mention using build_engine.py, but I assume I don't need to include the implicit IInt8EntropyCalibrator2, since the model has already been INT8-quantized and IInt8EntropyCalibrator2 is deprecated?

@riyadshairi979 @moraxu Thank you again for your prompt responses and willingness to help!

@riyadshairi979 commented Sep 6, 2024

people complain about not getting as good of a result

It means that sometimes the latency of a TensorRT-deployed EQ (explicitly quantized) network is higher than that of an IQ (implicitly quantized) network. The ModelOpt team is actively working with the TensorRT team to minimize this type of gap for various models.

when compared to the TensorRT Int8EntropyCalibrator2

The choice of calibrator might have an impact on the accuracy of the model but not on latency. If you see an accuracy regression with ModelOpt quantization, please file a bug here with a reproducible model and commands.

I assume I don't need to include the implicit IInt8EntropyCalibrator2

Right.

@CoinCheung commented

@moraxu Hi, can we use explicit INT8 quantization now? I mean: use ModelOpt to quantize the PyTorch model to INT8, export it from PyTorch to ONNX, and then use TensorRT to build it into a TensorRT engine?

I ask because I just tried that, and I got an error message like this:

[screenshot of the error message]

As for my code, I just commented out the INT8 calibration part on the TensorRT side and set the calibrator to nullptr:

[screenshot of the engine-build code with the calibrator commented out]

Do you know how I could make this work?

@moraxu (Collaborator) commented Sep 6, 2024

I mean: use ModelOpt to quantize the PyTorch model to INT8, export it from PyTorch to ONNX, and then use TensorRT to build it into a TensorRT engine?

I ask because I just tried that, and I got an error message like this

Could you share your full code snippet, @CoinCheung? I can open a bug internally and have someone update that calibrator code at the same time. Are you using ModelOpt in your code or not?

@CoinCheung commented Sep 7, 2024

@moraxu Hi, I packed up the code and the associated ONNX file; it is accessible here:

https://github.com/CoinCheung/eewee/releases/download/0.0.0/code.zip

There is a readme file in the zip that describes the steps to reproduce this error.

Are you using ModelOpt in your code or not?

Yes, I used ModelOpt to quantize the model on the PyTorch side and then exported it to ONNX. Then I used the ONNX file on the TensorRT side.

@adaber (Author) commented Sep 7, 2024

@moraxu @riyadshairi979

A quick question: does Model Optimizer work on Windows?

The documentation says Linux, but there is one post asking about a Python version issue where the listed environment is Windows. The person who helped with that issue didn't comment on the OS, so I assume it works on Windows, too (NVIDIA/TensorRT-Model-Optimizer#26).

Thanks again for all the help!

@moraxu (Collaborator) commented Sep 9, 2024

@moraxu Hi, I packed up the code and the associated ONNX file; it is accessible here:

https://github.com/CoinCheung/eewee/releases/download/0.0.0/code.zip

There is a readme file in the zip that describes the steps to reproduce this error.

Are you using ModelOpt in your code or not?

Yes, I used ModelOpt to quantize the model on the PyTorch side and then exported it to ONNX. Then I used the ONNX file on the TensorRT side.

@riyadshairi979, would you be able to check whether @CoinCheung's ModelOpt code in his zipped export_onnx.py file is correct? If it is, I'll open an internal bug on the TRT side for someone to look at it.

@moraxu (Collaborator) commented Sep 9, 2024

@ckolluru commented

I have a somewhat related question on this topic. For the two pipelines described below, is there good evidence that inference times for #2 are significantly faster than for #1? I assume GPU memory requirements are lower for #2.

  1. PyTorch FP32 -> ONNX -> FP16 TRT engine
  2. PyTorch FP32 -> ONNX -> INT8 TRT engine

I'm working with a CNN-based segmentation model, and the inference times I get for the two pipelines are similar.

Also, if there is a complete, end-to-end tutorial script for pipeline #2 for a simple CNN model, can that please be shared?

@moraxu (Collaborator) commented Sep 20, 2024

@ckolluru, in theory INT8 inference should generally offer better performance than FP16 due to lower-precision calculations, which use less compute and memory bandwidth. However, I believe that for CNN-based models the convolution operations might not be as bottlenecked by precision reduction, so the speed improvement might not be that visible.

Another possible reason for similar inference times between the FP16 and INT8 pipelines could be poor INT8 calibration.

GPU memory consumption should indeed be lower with INT8 than FP16 because INT8 uses 1 byte per weight/activation, while FP16 uses 2 bytes.

Also, if there is a complete, end-to-end tutorial script for pipeline #2 for a simple CNN model, can that please be shared?

If https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html is not sufficient then please check on https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues

@maisa32 commented Dec 5, 2024

@moraxu Hi, I have a somewhat related question. The TensorRT developer guide mentions that implicit quantization is deprecated:

Section 7.1.2: Explicit vs Implicit Quantization

Note: Implicit quantization is deprecated. It is recommended to use TensorRT’s Quantization Toolkit to create models with explicit quantization.

However, when working with DLA, the same guide mentions that DLA does not support explicit quantization:

Section 13: Working with DLA

It does not support Explicit Quantization.

Is the plan to eventually support explicit quantization on DLA, and do we have to use implicit quantization (which is deprecated) in the meantime?

@moraxu (Collaborator) commented Dec 13, 2024

@maisa32 sorry for the late reply, I've just checked with the PM team and:

  • Long term solution: DLA3 will support both EQ and IQ
  • Short/medium term solution:
    • TRT team is still discussing the options; please stay tuned
    • goal: we will offer a solution for customers until they get DLA3
