
tensorrt engine of fp32 precision lost accuracy compared with onnx #3700

Closed · miraiaroha opened this issue Mar 6, 2024 · 20 comments
Labels: triaged (Issue has been triaged by maintainers)

@miraiaroha

Description

I evaluated my detection model (resnet101-rtdetr) and found a significant accuracy loss at fp32, as below:

| model | onnx | trt-fp32 | trt-fp16 | trt-fp16-int8 |
|-------|------|----------|----------|---------------|
| mAP   | 70.7 | 66.1     | 65.8     | 65.4          |

I also used Polygraphy (mark all) to compare the outputs of all layers; the FAILED layers are listed in the attachment below:
step3-op16-fp32.txt

The problematic layers are mostly MatMul_output and Pow_output.

Are there any methods to recover the accuracy?
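
For reference, the layer-by-layer comparison above was done with a Polygraphy command roughly like this (the path and tolerances here are placeholders, not the exact values used):

```bash
# Compare every layer output between ONNX-Runtime and TensorRT (fp32 build).
polygraphy run model.onnx --onnxrt --trt \
    --onnx-outputs mark all --trt-outputs mark all \
    --atol 1e-4 --rtol 1e-4 --fail-fast
```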

Environment

TensorRT Version: 8.6

Polygraphy Version: 0.49

@miraiaroha (Author)

@zerollzeng

@zerollzeng (Collaborator)

TRT doesn't guarantee bit-wise accuracy with respect to other frameworks due to optimizations and floating-point error. I guess in your case the accuracy drop happens early and is amplified by the Pow/MatMul layers as it propagates through the network.

@zerollzeng (Collaborator)

It would be great if you could:

  1. Provide a reproduction or share the model.
  2. Try the latest TRT 9.2/9.3 to see if there is any improvement.

zerollzeng self-assigned this on Mar 8, 2024
zerollzeng added the triaged (Issue has been triaged by maintainers) label on Mar 8, 2024
@miraiaroha (Author)

> TRT doesn't guarantee bit-wise accuracy with respect to other frameworks due to optimizations and floating-point error. I guess in your case the accuracy drop happens early and is amplified by the Pow/MatMul layers as it propagates through the network.

I tested the trt-fp32 engine generated by Polygraphy (mark all), which breaks all of the op fusions, and the resulting mAP is still 66.1, so I think the accuracy drop is due to float64 precision truncation, though that situation is quite uncommon.
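
One quick way to check whether fp64 actually appears in the graph is a small script like this (a sketch only; the file name is a placeholder):

```python
import onnx
from onnx import TensorProto

model = onnx.load("model.onnx")

# Initializers stored as fp64.
fp64_inits = [init.name for init in model.graph.initializer
              if init.data_type == TensorProto.DOUBLE]

# Cast nodes that cast to fp64.
fp64_casts = [node.name for node in model.graph.node
              if node.op_type == "Cast"
              and any(attr.name == "to" and attr.i == TensorProto.DOUBLE
                      for attr in node.attribute)]

print("fp64 initializers:", fp64_inits)
print("casts to fp64:", fp64_casts)
```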

@miraiaroha (Author)

> It would be great if you could:
>
> 1. Provide a reproduction or share the model.
> 2. Try the latest TRT 9.2/9.3 to see if there is any improvement.

The model is large (400 MB); how can I provide it to you?

@chinakook

```python
from ultralytics import RTDETR

# Export the RT-DETR-L checkpoint to TorchScript and ONNX with the default settings.
model = RTDETR('rtdetr-l.pt')
model.export(format='torchscript')
model.export(format='onnx')
```

@ttyio (Collaborator) commented Mar 9, 2024

Are you using an A10? @chinakook, have you tried disabling TF32?

@chinakook

> Are you using an A10? @chinakook, have you tried disabling TF32?

I'll try that.

@chinakook

@ttyio No effect on an Ampere device.
First, my case is this:

  1. TensorRT FP32 precision matches that of Torch.
  2. Torch FP32 precision is consistent with FP16.
  3. On non-Ampere GPUs (e.g. Turing), TensorRT FP16 precision aligns with FP32.
  4. On Ampere GPUs, TensorRT FP16 and FP32 precision diverge.

I have already tested the 4 combinations below on an Ampere device (all with fp16 on):

  1. Set NVIDIA_TF32_OVERRIDE=0 and build the engine with tf32 (without timing cache).
  2. Set NVIDIA_TF32_OVERRIDE=0 and build the engine without tf32 (without timing cache).
  3. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine with tf32 (without timing cache).
  4. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine without tf32 (without timing cache).

None of these results match fp32 on an Ampere device. Therefore, I believe there may be bugs when using Ampere GPUs to build and run FP16 rtdetr engines with TensorRT (especially the transformer and LayerNorm layers). This leads to poor FP16 accuracy on this card, which deviates significantly from what is observed in FP32.
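
For readers trying to reproduce this, the four builds correspond roughly to trtexec invocations of this shape (a sketch with placeholder paths; the actual builds may have been done through the API instead):

```bash
# Env var {default, 0} x builder TF32 flag {on, off}, all with fp16 enabled.
trtexec --onnx=model.onnx --fp16 --saveEngine=tf32_on.engine
trtexec --onnx=model.onnx --fp16 --noTF32 --saveEngine=tf32_off.engine
NVIDIA_TF32_OVERRIDE=0 trtexec --onnx=model.onnx --fp16 --saveEngine=override_tf32_on.engine
NVIDIA_TF32_OVERRIDE=0 trtexec --onnx=model.onnx --fp16 --noTF32 --saveEngine=override_tf32_off.engine
```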

@miraiaroha (Author)

> @ttyio No effect on an Ampere device. First, my case is this:
>
> 1. TensorRT FP32 precision matches that of Torch.
> 2. Torch FP32 precision is consistent with FP16.
> 3. On non-Ampere GPUs (e.g. Turing), TensorRT FP16 precision aligns with FP32.
> 4. On Ampere GPUs, TensorRT FP16 and FP32 precision diverge.
>
> I have already tested the 4 combinations below on an Ampere device (all with fp16 on):
>
> 1. Set NVIDIA_TF32_OVERRIDE=0 and build the engine with tf32 (without timing cache).
> 2. Set NVIDIA_TF32_OVERRIDE=0 and build the engine without tf32 (without timing cache).
> 3. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine with tf32 (without timing cache).
> 4. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine without tf32 (without timing cache).
>
> None of these results match fp32 on an Ampere device. Therefore, I believe there may be bugs when using Ampere GPUs to build and run FP16 rtdetr engines with TensorRT (especially the transformer and LayerNorm layers). This leads to poor FP16 accuracy on this card, which deviates significantly from what is observed in FP32.

I have also encountered an accuracy drop with dinov2-rtdetr at fp16 precision on the Orin platform, and found that the problematic layers are the self-attentions of the RTDETR decoder. See #3657.

But in this issue, my resnet-rtdetr model at fp32 precision has an accuracy drop of 4.6 compared with onnx, and I have no idea how to solve this problem.

@chinakook commented Mar 9, 2024

@miraiaroha I used opset 17 (the ultralytics default version), and the fp32 precision matches. The onnx must be deployed in fp32 mode (without amp and without model.half()); otherwise it will cause a parse error, as you mentioned in #3567.
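
For reference, an export call with those settings would look something like this (argument names as in current ultralytics releases; treat it as a sketch):

```python
from ultralytics import RTDETR

# Export in full fp32 (no AMP / no model.half()) with ONNX opset 17.
model = RTDETR('rtdetr-l.pt')
model.export(format='onnx', opset=17, half=False)
```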

@lix19937 commented Mar 16, 2024

Usually fp32 is quite slow. Considering both accuracy and speed, why not specify that certain layers use fp32 and the others use fp16?

@miraiaroha (Author)

> Usually fp32 is quite slow. Considering both accuracy and speed, why not specify that certain layers use fp32 and the others use fp16?

It's a balance between accuracy and speed. Usually a -4% accuracy drop is unacceptable.
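
For completeness, pinning selected layers to fp32 while keeping the rest in fp16 looks roughly like this with the TensorRT Python API (the layer-name heuristic and file names are assumptions, not the actual model's layer names):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make TensorRT honor the per-layer precision requests below.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Heuristic: keep precision-sensitive ops (MatMul/Pow, e.g. inside LayerNorm
    # and attention) in fp32; everything else stays eligible for fp16.
    if "MatMul" in layer.name or "Pow" in layer.name:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
with open("model_mixed.engine", "wb") as f:
    f.write(engine_bytes)
```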

@lix19937

@miraiaroha You can mark all ONNX nodes except the real output nodes as outputs (i.e. add output nodes for them) to disable TRT fusion, and use --noTF32.

@miraiaroha (Author)

> @miraiaroha You can mark all ONNX nodes except the real output nodes as outputs (i.e. add output nodes for them) to disable TRT fusion, and use --noTF32.

I have tried marking all nodes as output nodes to disable op fusion, and the mAP is still 66.1 at fp32 precision.
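
For anyone who wants to reproduce the fusion-free build, a rough onnx-graphsurgeon sketch for marking every node output as a graph output (file names are placeholders):

```python
import onnx
import onnx_graphsurgeon as gs

model = onnx.load("model.onnx")
# Run shape inference first so intermediate tensors carry dtype/shape info,
# which ONNX expects for graph outputs.
model = onnx.shape_inference.infer_shapes(model)

graph = gs.import_onnx(model)
marked = {t.name for t in graph.outputs}
for node in graph.nodes:
    for out in node.outputs:
        if out.name and out.name not in marked:
            graph.outputs.append(out)
            marked.add(out.name)

onnx.save(gs.export_onnx(graph), "model_all_outputs.onnx")
```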

@zerollzeng (Collaborator)

Hi, we just released TRT 10 EA; could you please try the new version?

If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!

@chinakook

> Hi, we just released TRT 10 EA; could you please try the new version?
>
> If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!

Yes, I have tried TRT 10 EA, and I found that it loses even more accuracy...

@miraiaroha (Author)

> Hi, we just released TRT 10 EA; could you please try the new version?
>
> If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!

I have found the cause of the problem: the image preprocessing was not aligned, so it is not related to quantization. Thank you for your patience.

@chinakook

> > Hi, we just released TRT 10 EA; could you please try the new version?
> > If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!
>
> I have found the cause of the problem: the image preprocessing was not aligned, so it is not related to quantization. Thank you for your patience.

I think you are not using an Ampere card. My Turing cards are all OK in fp32/fp16.

@chinakook

I think we can close this issue, as @miraiaroha has found a solution. Please reopen #3652 to track the Ampere accuracy loss issue.
