
GPU Latency failure for FP16, INT8, mixed precision (FP16+INT8) models of TensorRT 8.6 when running trtexec on GPU A100 #3593

bcd8697 opened this issue Jan 12, 2024 · 16 comments
bcd8697 commented Jan 12, 2024

Description

Hello,
I'm converting a model along the torch -> ONNX -> TRT path, producing FP16, INT8, and mixed-precision (FP16 + INT8) variants.
However, after conversion the FP16 model has the lowest latency, i.e. the FP16 engine is faster than both the INT8 and the mixed-precision engines. Why is that?

Environment

TensorRT Version: 8.6

NVIDIA GPU: A100

NVIDIA Driver Version: 530.30.02

CUDA Version: 12.1

CUDNN Version:

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.10

PyTorch Version (if applicable): 2.1

Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:23.08-py3

Relevant Files

Model link: "vit_base_patch32_224_clip_laion2b" model from timm.models

Steps To Reproduce

  1. Using the pytorch_quantization library, we set up the default quantization descriptors:
from pytorch_quantization import quant_modules, quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Monkey-patch torch layers with their quantized counterparts.
quant_modules.initialize()

quant_desc = QuantDescriptor(num_bits=16)
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc)
quant_nn.QuantConv2d.set_default_quant_desc_weight(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(quant_desc)
  2. Create the model object in Python (FakeQuant nodes are added automatically because of the quant_modules.initialize() call).
from timm import create_model

m_name = "vit_base_patch32_224_clip_laion2b"
qat_model = create_model(m_name, num_classes=8, exportable=True)

(Optional) If the target precision is INT8 rather than FP16, specify num_bits=8 in step 1:

quant_desc = QuantDescriptor(num_bits=8)

(Optional) For mixed precision, start with num_bits=16 and then, for selected individual layers, replace the input_quantizer and weight_quantizer with 8-bit ones, for example:

qat_model.patch_embed.proj._input_quantizer = TensorQuantizer(quant_desc=QuantDescriptor(num_bits=8))
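A minimal sketch of the same per-layer switch applied to several layers; the set of layers picked for INT8 here is a placeholder, not the exact selection used in this issue:

from pytorch_quantization import quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

int8_desc = QuantDescriptor(num_bits=8)

# Hypothetical set of layers to run in INT8; the real choice is model-specific.
int8_layers = [qat_model.patch_embed.proj]

for layer in int8_layers:
    # Replace both the input and the weight fake-quantizers with 8-bit ones.
    layer._input_quantizer = quant_nn.TensorQuantizer(int8_desc)
    layer._weight_quantizer = quant_nn.TensorQuantizer(int8_desc)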
  3. We calibrate the FakeQuant nodes and do QAT; a rough calibration sketch follows.
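The sketch below assumes the standard pytorch_quantization calibration flow with the default max calibrator; calib_loader is a placeholder data loader:

import torch
from pytorch_quantization import quant_nn

# Put every TensorQuantizer into calibration mode.
for name, module in qat_model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()

# Run a few representative batches to collect activation statistics.
with torch.no_grad():
    for images, _ in calib_loader:
        qat_model(images)

# Load the collected amax values and switch quantization back on before QAT fine-tuning.
for name, module in qat_model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.disable_calib()
        module.enable_quant()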

  4. We export the model with torch.onnx.export; a hedged export example follows.
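The example assumes a 224x224 input, opset 13, and a dynamic batch axis; pytorch_quantization's use_fb_fake_quant flag is commonly enabled so that the INT8 fake-quant nodes are exported as ONNX QuantizeLinear/DequantizeLinear pairs:

import os
import torch
from pytorch_quantization import quant_nn

# Export INT8 fake-quant nodes as ONNX Q/DQ pairs.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
    qat_model.eval().cuda(),
    dummy_input,
    os.path.join(SAVE_PATH, "<model_name>.onnx"),
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # dynamic batch so the trtexec shape ranges apply
    opset_version=13,                      # assumed; any opset with Q/DQ support should work
)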

  5. We simplify the ONNX model:

import os
import onnx
import onnxsim as onnx_simplifier  # pip package: onnx-simplifier

onnx_model = onnx.load(os.path.join(SAVE_PATH, "<model_name>.onnx"))
model_simp, check = onnx_simplifier.simplify(onnx_model, check_n=0)
onnx.save(model_simp, os.path.join(SAVE_PATH, "<model_name>.onnx"))
  6. Then we convert the ONNX model to a TRT engine using the trtexec utility.
    For FP16 or INT8 precision, the commands are as follows.
    fp16:
trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
     --exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16

int8:

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
     --exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --int8

If this is mixed precision, we first build a string variable LAYERS_PRECISION by iterating over the ONNX layers of the model and collecting a precision for each layer.
The result is something like:
LAYERS_PRECISION="layer1:int8,layer2:int8,layer3:fp16,...,layerN:fp16,"
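A possible sketch of how such a string could be assembled by walking the ONNX graph; the rule deciding which nodes go to int8 is a placeholder, and note that --layerPrecisions matches TensorRT layer names, which usually but not always follow the ONNX node names:

import os
import onnx

model = onnx.load(os.path.join(SAVE_PATH, "<model_name>.onnx"))

def wants_int8(node):
    # Placeholder rule: run the patch-embedding layers in int8, everything else in fp16.
    return "patch_embed" in node.name

entries = [f"{node.name}:{'int8' if wants_int8(node) else 'fp16'}"
           for node in model.graph.node]
LAYERS_PRECISION = ",".join(entries)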
And then we execute the following command

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --fp16 --int8 \
     --precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}

Having done all of the above, we get TRT engine files which, when benchmarked both through trtexec and through the model-analyzer utility for the TRT server, show that the INT8 and mixed-precision models run slower than the FP16 model.

Commands or scripts: see above

Have you tried the latest release?: yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): N/A

@zerollzeng
Collaborator

How is the perf of

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --fp16 --int8 \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}

zerollzeng self-assigned this Jan 15, 2024
zerollzeng added the triaged (Issue has been triaged by maintainers) label Jan 15, 2024
@bcd8697
Author

bcd8697 commented Jan 15, 2024

@zerollzeng I tried this but the latency is still higher than just fp16...

@zerollzeng
Collaborator

Could you please share the ONNX here? If it's a QAT model, --int8 should be required; otherwise TRT will throw an error.

@bcd8697
Author

bcd8697 commented Jan 18, 2024

@zerollzeng
Here is a link to a zip archive with my 2 ONNX models: FP16 and mixed precision (FP16 + INT8), generated without the
--precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} flags, as you proposed earlier.
https://drive.google.com/file/d/1dfIufa2aOnLKg2z1zwxd491730mMZcMt/view?usp=sharing

After converting to TRT engines, FP16 still turns out to be faster than the mixed-precision model.

BTW, what exactly is the issue in TRT 8.6 that is fixed in TRT 9.2?

Thanks

@zerollzeng
Collaborator

You just hit a bug that is fixed in TRT 9.2 :-)

@bcd8697
Author

bcd8697 commented Jan 22, 2024

@zerollzeng
Thanks
I have installed and tried TRT 9.2.
It doesn't seem to help: the latency of FP16 is still lower than that of mixed precision (FP16 + INT8).

Do you have any other suggestions?

@zerollzeng
Collaborator

Could you please share the onnx that can reproduce this issue?

@bcd8697
Author

bcd8697 commented Jan 27, 2024

@zerollzeng
Yes, sure, here you are:
https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing

I'm also interested to know when TRT v9.2 will be released in Docker images.

@zerollzeng
Collaborator

zerollzeng commented Jan 27, 2024

We didn't release it in the official Docker image since it's a limited EA release, but you can build the Docker image yourself using https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile

@zerollzeng
Collaborator

https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing

May I ask why I see 2 onnx models here?

@bcd8697
Author

bcd8697 commented Jan 27, 2024

@zerollzeng
One ONNX is for FP16 precision and the second one is for mixed precision (FP16 + INT8).

@zerollzeng
Collaborator

That's weird, you should only need 1 ONNX. What if you compare the perf using only 1 ONNX, building the full-FP16 engine and the mixed-precision engine separately?

@bcd8697
Author

bcd8697 commented Jan 28, 2024

@zerollzeng
When FakeQuant nodes are created using

quant_modules.initialize()
quant_desc = QuantDescriptor(num_bits=16)

for further QAT, I must specify the num_bits parameter.

Isn't it the case that when all layers have num_bits=16 I should only pass the --fp16 flag,
and when I switch some FakeQuant nodes to INT8 I should pass both --fp16 and --int8 to trtexec?

I mean, how does trtexec know which layers I need to convert to int8 precision if I don't specify it anywhere?

I have two ONNX files because, for FP16, I set num_bits=16, run QAT, export to ONNX, and then build a TRT engine.
For mixed precision, I set num_bits=16, then manually set num_bits=8 for the layers I want, and go through the same QAT, ONNX, and TRT stages.
If mixed precision needs to be done differently, please tell me.

@zerollzeng
Collaborator

@ttyio for the questions above.

@bcd8697
Author

bcd8697 commented Feb 15, 2024

@zerollzeng @ttyio
Should I wait for an answer on this issue?
