
INT8EntropyCalibrator2 implicit quantization superseded by explicit quantization #4095

Open
adaber opened this issue Aug 23, 2024 · 22 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@adaber commented Aug 23, 2024

Description

Hi,

I have been using the INT8 Entropy Calibrator 2 for INT8 quantization in Python, and it's been working well (TensorRT 10.0.1). The way I use it follows the official TRT GitHub sample (TensorRT/samples/python/efficientdet/build_engine.py at release/10.0 in the NVIDIA/TensorRT GitHub repo).

Starting with TensorRT 10.1, I've been getting a warning that implicit quantization via the INT8 Entropy Calibrator 2 is deprecated and has been superseded by explicit quantization.

I’ve read the official document on the difference between the implicit and explicit quantization processes (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation) and they seem to work differently. The explicit quantization seems to expect a network to have QuantizeLayer and DequantizeLayer layers which my networks don’t. The implicit quantization can be used when those layers are not present in a network. Therefore, I am confused about how the implicit quantization can be superseded by the explicit quantization since they seem to work differently.

So, my question is: what needs to be modified in the standard INT8 Entropy Calibrator 2 quantization method (TensorRT/samples/python/efficientdet/build_engine.py at release/10.0 in the NVIDIA/TensorRT GitHub repo) so that the deprecation warning no longer shows up? Or, what is the proper way to implement INT8 Entropy Calibrator 2 calibration now that the current approach is deprecated? I couldn't find any example that uses a newer TensorRT version (10.1 and up).
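
For reference, the pattern I'm using is condensed below (the sample itself uses its own CUDA helper module; pycuda here is just my shorthand for the device buffer, and `calib_batches` is a placeholder for the calibration data loader):

```python
import numpy as np
import pycuda.autoinit          # stand-in for the sample's CUDA helpers
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Condensed version of the calibrator pattern from the linked sample."""

    def __init__(self, batches, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)     # iterable of float32 arrays, batch dimension 1
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return 1                         # matches the batch dimension of the arrays above

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None                  # no more data: calibration is finished
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# ... after parsing the ONNX model into `network` with a `builder`:
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(calib_batches)  # this implicit-quantization path
                                                           # is what triggers the 10.1 warning
```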

Thank you!

Environment

TensorRT Version: 10.1

NVIDIA GPU: 3090

Operating System: Windows 10

Python Version: 3.9.19

@moraxu (Collaborator) commented Aug 30, 2024

Does your project require the specific calibrator from our samples, or have you tried looking into Model Optimizer (https://github.com/NVIDIA/TensorRT-Model-Optimizer) for calibration (it should ensure your model has Q/DQ nodes for the explicit-quantization approach)? If the former, we can look into updating that sample code.

@moraxu added the triaged (Issue has been triaged by maintainers) and Quantization: PTQ labels on Aug 30, 2024
@adaber (Author) commented Sep 3, 2024

Hey moraxu,

Sorry for the late response. I missed your message.

I haven't looked at or worked with Model Optimizer yet. Thanks for letting me know.

I'm not sure I understood the question, but I apply INT8 Entropy Calibrator 2 quantization to my CNN models; it gives the best results. Is that what you meant?

So, will implicit quantization still be supported, or are you switching completely to explicit quantization? I would like to avoid changing my processing pipeline (training models in PyTorch, converting to ONNX, implicit PTQ quantization) due to possible incompatibility issues and such.

Therefore, it would be great if you could update the INT8 Entropy Calibrator 2 sample code. Would that be possible?

Thanks!

@moraxu (Collaborator) commented Sep 3, 2024

So, will implicit quantization still be supported, or are you switching completely to explicit quantization? I would like to avoid changing my processing pipeline (training models in PyTorch, converting to ONNX, implicit PTQ quantization) due to possible incompatibility issues and such.

Therefore, it would be great if you could update the INT8 Entropy Calibrator 2 sample code. Would that be possible?

Thanks for clarifying. In that case I'll request that the sample code be updated and will post further updates in this ticket. Thanks for reporting this.

@adaber (Author) commented Sep 3, 2024

That sounds great, moraxu! Thank you.

@CoinCheung commented

@moraxu Hi, does this mean that, after TensorRT 10, the recommended flow is: train the PyTorch model -> INT8-quantize with Model Optimizer -> export to ONNX -> compile with TensorRT?

@moraxu (Collaborator) commented Sep 4, 2024

@CoinCheung, yes. If you're using PyTorch, Model Optimizer can automate more parts of that flow (https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html), in contrast to manually configuring the optimization and quantization steps through the TRT API (unless you need to do that for more control).
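
For context, the PyTorch-side flow from that guide looks roughly like the sketch below; the model, calibration loader, and input shape are placeholders, and the export details may differ slightly between ModelOpt releases:

```python
import torch
import modelopt.torch.quantization as mtq   # ModelOpt's PyTorch quantization API

model = MyCNN().eval().cuda()                # placeholder: your trained PyTorch model

def forward_loop(m):
    # Feed a few hundred representative batches through the model so ModelOpt can
    # calibrate the quantizers it inserts; `calib_loader` is a placeholder DataLoader.
    for images, _ in calib_loader:
        m(images.cuda())

# Insert Q/DQ (fake-quant) modules and calibrate them with one of the predefined configs.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export with the Q/DQ nodes baked in; TensorRT then treats the ONNX as explicitly quantized.
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "model_int8.onnx", opset_version=17)
```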

@adaber (Author) commented Sep 5, 2024

@moraxu I've got a few questions about Model Optimizer, if you don't mind.

  1. What quantization config would yield results similar to INT8 Entropy Calibrator 2?

  2. Say I INT8-quantize a neural net using Model Optimizer and convert it to ONNX.
    a) How does TRT handle the ONNX model when converting it to a TRT engine?
    b) How do different TRT flags (FP16 and such) affect the engine-building process in this case?
    c) In general, I am not sure if/how TRT changes a model that has already been quantized with Model Optimizer during the TRT engine conversion process.
    d) Not sure if you're familiar with the following, but how is such an already-quantized model converted to a TRT timing cache in ONNX Runtime?

Sorry if this is too many questions, but in my experience, the more libraries are involved, the higher the chance that something will be incompatible or not work. Hence, I would like to know whether Model Optimizer can give me exactly the same results as the pipelines I mentioned before:

  1. PyTorch FP32 -> ONNX -> FP16 TRT engine
  2. PyTorch FP32 -> ONNX -> INT8 TRT engine
  3. PyTorch FP32 -> ONNX -> FP16 TRT timing cache (ONNX Runtime)
  4. PyTorch FP32 -> ONNX -> INT8 TRT timing cache (ONNX Runtime)
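
For pipelines 3 and 4 specifically, the ONNX Runtime TensorRT EP setup I have in mind looks roughly like the sketch below (option names as documented for the TensorRT execution provider; whether the exact set applies depends on the ORT build):

```python
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,              # pipeline 3
        # "trt_int8_enable": True,            # pipeline 4 (expects an explicitly quantized
        #                                     #  ONNX or a calibration table)
        "trt_engine_cache_enable": True,      # cache built engines between runs
        "trt_engine_cache_path": "./trt_cache",
        "trt_timing_cache_enable": True,      # reuse kernel-timing results across builds
    }),
    "CUDAExecutionProvider",                  # fallback for unsupported nodes
]
session = ort.InferenceSession("model.onnx", providers=providers)
```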

Thanks!

P.S. I would still appreciate it if you could update the INT8 quantization sample code so I can use that while investigating and testing Model Optimizer. Thanks!

@moraxu (Collaborator) commented Sep 5, 2024

Hi @adaber,

  1. I'm not up to speed with that sample; I'd have to check how to translate that calibrator (https://github.com/NVIDIA/TensorRT/blob/release/10.0/samples/python/efficientdet/build_engine.py#L37) into Model Optimizer's config. If this is an immediate blocker, could you please ask on their issue tracker: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues ?

a) You mean how does it handle a quantized ONNX? TRT reads the scale factors from the ONNX file (for INT8 layers). These scale factors define how the weights and activations are quantized. If these scale factors are provided, TRT will use them directly.
b)

  • FP16 flag: TRT will maintain the INT8 quantization for layers that are pre-quantized but may choose to use FP16 for layers that were not quantized or for activations that can benefit from FP16 precision.
  • INT8 flag: For models already quantized using Model Optimizer, the flag ensures that INT8 execution will be used for the quantized layers. TRT will use the existing quantization scales if provided in the ONNX model.

c) In practice, TRT should not drastically change the quantization itself (i.e., INT8 scale factors should remain intact). Still, the engine-building process may involve optimizations that improve inference performance while maintaining accuracy.
d) I don't believe there's any difference for quantized models for the timing caches, but now the process includes determining how best to execute quantized layers (e.g., which kernels to use for INT8) etc.
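
For illustration, here is a minimal builder-side sketch with the TensorRT 10 Python API, assuming an ONNX model that already contains Q/DQ nodes (file names are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)          # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)

with open("model.quant.onnx", "rb") as f:    # ONNX that already carries Q/DQ nodes
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)        # honor the scales baked into the Q/DQ nodes
config.set_flag(trt.BuilderFlag.FP16)        # allow FP16 for the layers that are not quantized
# Note: no config.int8_calibrator here -- explicit quantization replaces calibration.

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```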

@akhilg-nv , please correct me or add anything, since I believe you've worked with quantized workflows more than me.

@adaber (Author) commented Sep 5, 2024

Hi @moraxu

Firstly, thank you so much for responding so quickly. It is very appreciated!

  1. Thanks for the links. This is very important, since INT8 Entropy Calibrator 2 is recommended for CNNs and works really well in my experience. I've seen a few posts, for example, where people couldn't achieve the same INT8 quantization results using the ONNX Runtime TRT execution provider. Therefore, I'm inclined to stick with the TensorRT API, since I would otherwise have to spend additional time figuring out how to achieve the same results (INT8 quantization in this case) with a different library/wrapper on top of TensorRT.

I will definitely check their forums. Please let me know if you manage to find out, presumably from your coworkers on the Model Optimizer team, which Model Optimizer quantization config would yield results similar to INT8 Entropy Calibrator 2.

Thank you for providing the info on the TRT engine-creation process and timing caches, too.

Thanks!

@riyadshairi979 commented Sep 5, 2024

@adaber please follow the ModelOpt example here or the Python APIs to quantize an ONNX model. Note that modelopt.onnx.quantization supports ONNX Runtime-provided entropy calibration; see the command-line help for other options.

Then you can compile the output explicit ONNX model with the TensorRT tool (trtexec --onnx=model.quant.onnx --best) or use build_engine.py-like Python APIs.
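
A rough Python sketch of that flow is below; the exact argument names can vary between ModelOpt releases, so treat them as placeholders and see the command-line help mentioned above for the options your version exposes:

```python
import numpy as np
from modelopt.onnx.quantization import quantize   # ModelOpt's ONNX PTQ entry point

# Argument names below are illustrative; check `python -m modelopt.onnx.quantization --help`
# for the exact spelling in your ModelOpt release.
calib = np.load("calib_batches.npy")               # hypothetical pre-saved calibration batches
quantize(
    onnx_path="model.onnx",
    calibration_data=calib,
    calibration_method="entropy",                  # the ONNX Runtime-provided entropy calibration
    output_path="model.quant.onnx",
)
# Then: trtexec --onnx=model.quant.onnx --best
```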

@adaber (Author) commented Sep 6, 2024

Hi @riyadshairi979,

Thanks for your input. It is appreciated.

I'm familiar with these approaches; however, my concern is that there are posts where people complain about not getting results as good as TensorRT's Int8EntropyCalibrator2-based INT8 quantization (with both ONNX Runtime and ModelOpt). You even mentioned something similar (NVIDIA/TensorRT-Model-Optimizer#46).

I do appreciate that you have been working on ModelOpt; it seems like a really good tool. I am just trying to get familiar with some aspects of it before I commit to spending the time to get it working and incorporate it into my processing pipeline.

@riyadshairi979 Quick question: you mention using build_engine.py, but I assume I don't need to include the implicit IInt8EntropyCalibrator2, since the model has already been INT8-quantized and IInt8EntropyCalibrator2 is deprecated?

@riyadshairi979 @moraxu Thank you again for your prompt responses and willingness to help!

@riyadshairi979 commented Sep 6, 2024

people complain about not getting as good of a result

It means that sometimes the latency of a TensorRT-deployed EQ (explicitly quantized) network is higher than that of an IQ (implicitly quantized) network. The ModelOpt team is actively working with the TensorRT team to minimize this type of gap for various models.

when compared to the TensorRT Int8EntropyCalibrator2

The choice of calibrator might have an impact on the accuracy of the model but not on latency. If you see an accuracy regression with ModelOpt quantization, please file a bug here with a reproducible model and commands.

I assume I don't need to include the implicit IInt8EntropyCalibrator2

Right.

@CoinCheung commented

@moraxu Hi, can we use explicit INT8 quantization now? I mean: use ModelOpt to quantize the PyTorch model to INT8, export it from PyTorch to ONNX, and then use TensorRT to build it into a TensorRT engine?

I ask because I just tried that, and I got an error message like this:

[screenshot of the error message]

As for my code, I just commented out the INT8 calibration part on the TensorRT side and set the calibrator to nullptr:

[screenshot of the engine-build code with the calibrator commented out]

Do you know how I could make this work?

@moraxu (Collaborator) commented Sep 6, 2024

I mean: use ModelOpt to quantize the PyTorch model to INT8, export it from PyTorch to ONNX, and then use TensorRT to build it into a TensorRT engine?

I ask because I just tried that, and I got an error message like this

Could you share your full code snippet, @CoinCheung? I can open a bug internally and have someone update that calibrator code at the same time. Are you using ModelOpt in your code or not?

@CoinCheung commented Sep 7, 2024

@moraxu Hi, I packed up the code and the associated ONNX file; it is accessible here:

https://github.com/CoinCheung/eewee/releases/download/0.0.0/code.zip

There is a readme file in the zip that describes the steps to reproduce this error.

Are you using ModelOpt in your code or not?

Yes, I used ModelOpt to quantize the model on the PyTorch side and then exported it to ONNX. Then I used the ONNX file on the TensorRT side.

@adaber (Author) commented Sep 7, 2024

@moraxu @riyadshairi979

A quick question: does Model Optimizer work on Windows?

The documentation says Linux, but there is one post asking about a Python version issue where the listed environment is Windows. The person who helped with that issue didn't comment on the OS, so I assume it works on Windows, too (NVIDIA/TensorRT-Model-Optimizer#26).

Thanks again for all the help!

@moraxu (Collaborator) commented Sep 9, 2024

@moraxu Hi, I packed up the code and the associated ONNX file; it is accessible here:

https://github.com/CoinCheung/eewee/releases/download/0.0.0/code.zip

There is a readme file in the zip that describes the steps to reproduce this error.

Are you using ModelOpt in your code or not?

Yes, I used ModelOpt to quantize the model on the PyTorch side and then exported it to ONNX. Then I used the ONNX file on the TensorRT side.

@riyadshairi979, would you be able to check whether @CoinCheung's ModelOpt code in his zipped export_onnx.py file is correct? If it is, I'll open an internal bug on the TRT side for someone to look at it.

@moraxu (Collaborator) commented Sep 9, 2024

@ckolluru commented

I have a somewhat related question on this topic. For the two pipelines described below, is there good evidence that inference times for #2 are significantly faster than for #1? I assume GPU memory requirements are lower for #2.

  1. PyTorch FP32 -> ONNX -> FP16 TRT engine
  2. PyTorch FP32 -> ONNX -> INT8 TRT engine

I'm working with a CNN-based segmentation model, and the inference times I get for the two pipelines are similar.

Also, if there is a complete, end-to-end tutorial script for pipeline #2 for a simple CNN model, can that please be shared?

@moraxu (Collaborator) commented Sep 20, 2024

@ckolluru, in theory INT8 inference should generally offer better performance than FP16 due to lower-precision calculations, which use less compute and memory bandwidth. However, I believe that for CNN-based models the convolution operations might not be as bottlenecked by precision reduction, so the speed improvement might not be that visible.

Another possible reason for similar inference times between the FP16 and INT8 pipelines could be poor INT8 calibration.

GPU memory consumption should indeed be lower with INT8 than FP16 because INT8 uses 1 byte per weight/activation, while FP16 uses 2 bytes.

Also, if there is a complete, end-to-end tutorial script for pipeline #2 for a simple CNN model, can that please be shared?

If https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html is not sufficient then please check on https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues

@maisa32 commented Dec 5, 2024

@moraxu Hi, I have a somewhat related question. The TensorRT developer guide mentions that implicit quantization is deprecated:

Section 7.1.2: Explicit vs Implicit Quantization

Note: Implicit quantization is deprecated. It is recommended to use TensorRT’s Quantization Toolkit to create models with explicit quantization.

However, when working with DLA, the same guide mentions that DLA does not support explicit quantization:

Section 13: Working with DLA

It does not support Explicit Quantization.

Is the plan to eventually support explicit quantization on DLA, and do we have to use implicit quantization (which is deprecated) in the meantime?

@moraxu (Collaborator) commented Dec 13, 2024

@maisa32 sorry for the late reply, I've just checked with the PM team and:

  • Long term solution: DLA3 will support both EQ and IQ
  • Short/medium term solution:
    • TRT team is still discussing the options; please stay tuned
    • goal: we will offer a solution for customers until they get DLA3
