
GPU Latency failure for FP16, INT8, mixed precision (FP16+INT8) models of TensorRT 8.6 when running trtexec on GPU A100 #3593

bcd8697 opened this issue Jan 12, 2024 · 16 comments
bcd8697 commented Jan 12, 2024

Description

Hello,
I'm converting a model along the torch -> ONNX -> TRT path, producing FP16, INT8, and mixed-precision (FP16 + INT8) variants.
However, after conversion the FP16 model has the lowest latency, i.e. the FP16 engine is faster than both the INT8 and the mixed-precision engines. Why is that?

Environment

TensorRT Version: 8.6

NVIDIA GPU: A100

NVIDIA Driver Version: 530.30.02

CUDA Version: 12.1

CUDNN Version:

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.10

PyTorch Version (if applicable): 2.1

Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:23.08-py3

Relevant Files

Model link: "vit_base_patch32_224_clip_laion2b" model from timm.models

Steps To Reproduce

  1. Using the pytorch_quantization library, we set up the default quantization descriptors:
from pytorch_quantization import quant_modules, quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Monkey-patch torch layers with their quantized counterparts.
quant_modules.initialize()

quant_desc = QuantDescriptor(num_bits=16)
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc)
quant_nn.QuantConv2d.set_default_quant_desc_weight(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(quant_desc)
  2. Create the model object in Python (FakeQuant nodes are added automatically because of the quant_modules.initialize() call).
from timm import create_model

m_name = "vit_base_patch32_224_clip_laion2b"
qat_model = create_model(m_name, num_classes=8, exportable=True)

(Optional) If the target precision is INT8 rather than FP16, specify num_bits=8 in step 1:

quant_desc = QuantDescriptor(num_bits=8)

(Optional) For mixed precision, start with num_bits=16 and then, for selected individual layers, replace the input_quantizer and weight_quantizer with 8-bit ones, for example:

qat_model.patch_embed.proj._input_quantizer = TensorQuantizer(quant_desc=QuantDescriptor(num_bits=8))
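A minimal sketch of the same per-layer switch applied to several layers; the set of layers picked for INT8 here is a placeholder, not the exact selection used in this issue:

from pytorch_quantization import quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

int8_desc = QuantDescriptor(num_bits=8)

# Hypothetical set of layers to run in INT8; the real choice is model-specific.
int8_layers = [qat_model.patch_embed.proj]

for layer in int8_layers:
    # Replace both the input and the weight fake-quantizers with 8-bit ones.
    layer._input_quantizer = quant_nn.TensorQuantizer(int8_desc)
    layer._weight_quantizer = quant_nn.TensorQuantizer(int8_desc)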
  3. We calibrate the FakeQuant nodes and do QAT; a rough calibration sketch follows.
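The sketch below assumes the standard pytorch_quantization calibration flow with the default max calibrator; calib_loader is a placeholder data loader:

import torch
from pytorch_quantization import quant_nn

# Put every TensorQuantizer into calibration mode.
for name, module in qat_model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()

# Run a few representative batches to collect activation statistics.
with torch.no_grad():
    for images, _ in calib_loader:
        qat_model(images)

# Load the collected amax values and switch quantization back on before QAT fine-tuning.
for name, module in qat_model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.disable_calib()
        module.enable_quant()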

  4. We export the model with torch.onnx.export; a hedged export example follows.
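The example assumes a 224x224 input, opset 13, and a dynamic batch axis; pytorch_quantization's use_fb_fake_quant flag is commonly enabled so that the INT8 fake-quant nodes are exported as ONNX QuantizeLinear/DequantizeLinear pairs:

import os
import torch
from pytorch_quantization import quant_nn

# Export INT8 fake-quant nodes as ONNX Q/DQ pairs.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
    qat_model.eval().cuda(),
    dummy_input,
    os.path.join(SAVE_PATH, "<model_name>.onnx"),
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # dynamic batch so the trtexec shape ranges apply
    opset_version=13,                      # assumed; any opset with Q/DQ support should work
)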

  5. We simplify the ONNX model:

import os
import onnx
import onnxsim as onnx_simplifier  # pip package: onnx-simplifier

onnx_model = onnx.load(os.path.join(SAVE_PATH, "<model_name>.onnx"))
model_simp, check = onnx_simplifier.simplify(onnx_model, check_n=0)
onnx.save(model_simp, os.path.join(SAVE_PATH, "<model_name>.onnx"))
  6. Then we convert the ONNX model to a TRT engine using the trtexec utility.
    For FP16 or INT8 precision, the commands are as follows.
    fp16:
trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
     --exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16

int8:

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
     --exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --int8

If this is mixed precision, we first build a string variable LAYERS_PRECISION by iterating over the ONNX layers of the model and collecting a precision for each layer.
The result is something like:
LAYERS_PRECISION="layer1:int8,layer2:int8,layer3:fp16,...,layerN:fp16,"
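A possible sketch of how such a string could be assembled by walking the ONNX graph; the rule deciding which nodes go to int8 is a placeholder, and note that --layerPrecisions matches TensorRT layer names, which usually but not always follow the ONNX node names:

import os
import onnx

model = onnx.load(os.path.join(SAVE_PATH, "<model_name>.onnx"))

def wants_int8(node):
    # Placeholder rule: run the patch-embedding layers in int8, everything else in fp16.
    return "patch_embed" in node.name

entries = [f"{node.name}:{'int8' if wants_int8(node) else 'fp16'}"
           for node in model.graph.node]
LAYERS_PRECISION = ",".join(entries)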
And then we execute the following command

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --fp16 --int8 \
     --precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}

Having done all of the above, we get TRT engine files which, when benchmarked both through trtexec and through the model-analyzer utility for the TRT server, show that the INT8 and mixed-precision models run slower than the FP16 model.

Commands or scripts: see above

Have you tried the latest release?: yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): N/A

@zerollzeng
Collaborator

How is the perf of

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --fp16 --int8 \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}

zerollzeng self-assigned this Jan 15, 2024
zerollzeng added the triaged (Issue has been triaged by maintainers) label Jan 15, 2024
@bcd8697
Author

bcd8697 commented Jan 15, 2024

@zerollzeng I tried this but the latency is still higher than just fp16...

@zerollzeng
Collaborator

Could you please share the ONNX here? If it's a QAT model, --int8 should be required; otherwise TRT will throw an error.

@bcd8697
Author

bcd8697 commented Jan 18, 2024

@zerollzeng
Here is a link to a zip archive with my 2 ONNX models: FP16 and mixed precision (FP16 + INT8), generated without the
--precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} flags, as you proposed earlier.
https://drive.google.com/file/d/1dfIufa2aOnLKg2z1zwxd491730mMZcMt/view?usp=sharing

After converting to TRT engines, FP16 still turns out to be faster than the mixed-precision model.

BTW, what exactly is the issue in TRT 8.6 that is fixed in TRT 9.2?

Thanks

@zerollzeng
Collaborator

You just hit a bug that is fixed in TRT 9.2 :-)

@bcd8697
Author

bcd8697 commented Jan 22, 2024

@zerollzeng
Thanks
I have installed and tried TRT 9.2.
It doesn't seem to help: the latency of FP16 is still lower than that of mixed precision (FP16 + INT8).

Do you have any other suggestions?

@zerollzeng
Collaborator

Could you please share the onnx that can reproduce this issue?

@bcd8697
Author

bcd8697 commented Jan 27, 2024

@zerollzeng
Yes, sure, here you are:
https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing

I'm also interested to know when TRT v9.2 will be released in Docker images.

@zerollzeng
Collaborator

zerollzeng commented Jan 27, 2024

We didn't release it in the official Docker image since it's a limited EA release, but you can build the Docker image yourself using https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile

@zerollzeng
Collaborator

https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing

May I ask why I see 2 onnx models here?

@bcd8697
Author

bcd8697 commented Jan 27, 2024

@zerollzeng
One ONNX is for FP16 precision and the second one is for mixed precision (FP16 + INT8).

@zerollzeng
Collaborator

That's weird, you should only need 1 ONNX. What if you compare the perf using only 1 ONNX, building the full-FP16 engine and the mixed-precision engine separately?

@bcd8697
Author

bcd8697 commented Jan 28, 2024

@zerollzeng
When FakeQuant nodes are created using

quant_modules.initialize()
quant_desc = QuantDescriptor(num_bits=16)

for further QAT, I must specify the num_bits parameter.

Isn't it the case that when all layers have num_bits=16 I should only pass the --fp16 flag,
and when I switch some FakeQuant nodes to INT8 I should pass both --fp16 and --int8 to trtexec?

I mean, how does trtexec know which layers I need to convert to int8 precision if I don't specify it anywhere?

I have two ONNX files because, for FP16, I set num_bits=16, run QAT, export to ONNX, and then build a TRT engine.
For mixed precision, I set num_bits=16, then manually set num_bits=8 for the layers I want, and go through the same QAT, ONNX, and TRT stages.
If mixed precision needs to be done differently, please tell me.

@zerollzeng
Collaborator

@ttyio for the questions above.

@bcd8697
Author

bcd8697 commented Feb 15, 2024

@zerollzeng @ttyio
Should I wait for an answer on this issue?
