
tensorrt engine of fp32 precision lost accuracy compared with onnx #3700

Closed · miraiaroha opened this issue Mar 6, 2024 · 20 comments
Labels: triaged (Issue has been triaged by maintainers)

@miraiaroha

Description

I evaluated my detection model (resnet101-rtdetr) and found a significant accuracy loss at fp32, as below:

| model | onnx | trt-fp32 | trt-fp16 | trt-fp16-int8 |
|-------|------|----------|----------|---------------|
| mAP   | 70.7 | 66.1     | 65.8     | 65.4          |

I also used Polygraphy (mark all) to compare the outputs of all layers; the FAILED layers are listed in the attachment below:
step3-op16-fp32.txt

The problematic layers are mostly MatMul_output and Pow_output.

Are there any methods to recover the accuracy?
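
For reference, the layer-by-layer comparison above was done with a Polygraphy command roughly like this (the path and tolerances here are placeholders, not the exact values used):

```bash
# Compare every layer output between ONNX-Runtime and TensorRT (fp32 build).
polygraphy run model.onnx --onnxrt --trt \
    --onnx-outputs mark all --trt-outputs mark all \
    --atol 1e-4 --rtol 1e-4 --fail-fast
```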

Environment

TensorRT Version: 8.6

Polygraphy Version: 0.49

@miraiaroha (Author)

@zerollzeng

@zerollzeng (Collaborator)

TRT doesn't guarantee bit-wise accuracy with respect to other frameworks due to optimizations and floating-point error. I guess in your case the accuracy drop happens early and is amplified by the Pow/MatMul layers as it propagates through the network.

@zerollzeng (Collaborator)

It would be great if you could:

  1. Provide a reproduction or share the model.
  2. Try the latest TRT 9.2/9.3 to see if there is any improvement.

zerollzeng self-assigned this on Mar 8, 2024
zerollzeng added the triaged (Issue has been triaged by maintainers) label on Mar 8, 2024
@miraiaroha (Author)

> TRT doesn't guarantee bit-wise accuracy with respect to other frameworks due to optimizations and floating-point error. I guess in your case the accuracy drop happens early and is amplified by the Pow/MatMul layers as it propagates through the network.

I tested the trt-fp32 engine generated by Polygraphy (mark all), which breaks all of the op fusions, and the resulting mAP is still 66.1, so I think the accuracy drop is due to float64 precision truncation, though that situation is quite uncommon.
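
One quick way to check whether fp64 actually appears in the graph is a small script like this (a sketch only; the file name is a placeholder):

```python
import onnx
from onnx import TensorProto

model = onnx.load("model.onnx")

# Initializers stored as fp64.
fp64_inits = [init.name for init in model.graph.initializer
              if init.data_type == TensorProto.DOUBLE]

# Cast nodes that cast to fp64.
fp64_casts = [node.name for node in model.graph.node
              if node.op_type == "Cast"
              and any(attr.name == "to" and attr.i == TensorProto.DOUBLE
                      for attr in node.attribute)]

print("fp64 initializers:", fp64_inits)
print("casts to fp64:", fp64_casts)
```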

@miraiaroha (Author)

> It would be great if you could:
>
> 1. Provide a reproduction or share the model.
> 2. Try the latest TRT 9.2/9.3 to see if there is any improvement.

The model is large (400 MB); how can I provide it to you?

@chinakook

```python
from ultralytics import RTDETR

# Export the RT-DETR-L checkpoint to TorchScript and ONNX with the default settings.
model = RTDETR('rtdetr-l.pt')
model.export(format='torchscript')
model.export(format='onnx')
```

@ttyio (Collaborator) commented Mar 9, 2024

Are you using an A10? @chinakook, have you tried disabling TF32?

@chinakook

> Are you using an A10? @chinakook, have you tried disabling TF32?

I'll try that.

@chinakook

@ttyio No effect on an Ampere device.
First, my case is this:

  1. TensorRT FP32 precision matches that of Torch.
  2. Torch FP32 precision is consistent with FP16.
  3. On non-Ampere GPUs (e.g. Turing), TensorRT FP16 precision aligns with FP32.
  4. On Ampere GPUs, TensorRT FP16 and FP32 precision diverge.

I have already tested the 4 combinations below on an Ampere device (all with fp16 on):

  1. Set NVIDIA_TF32_OVERRIDE=0 and build the engine with tf32 (without timing cache).
  2. Set NVIDIA_TF32_OVERRIDE=0 and build the engine without tf32 (without timing cache).
  3. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine with tf32 (without timing cache).
  4. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine without tf32 (without timing cache).

None of these results match fp32 on an Ampere device. Therefore, I believe there may be bugs when using Ampere GPUs to build and run FP16 rtdetr engines with TensorRT (especially the transformer and LayerNorm layers). This leads to poor FP16 accuracy on this card, which deviates significantly from what is observed in FP32.
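
For readers trying to reproduce this, the four builds correspond roughly to trtexec invocations of this shape (a sketch with placeholder paths; the actual builds may have been done through the API instead):

```bash
# Env var {default, 0} x builder TF32 flag {on, off}, all with fp16 enabled.
trtexec --onnx=model.onnx --fp16 --saveEngine=tf32_on.engine
trtexec --onnx=model.onnx --fp16 --noTF32 --saveEngine=tf32_off.engine
NVIDIA_TF32_OVERRIDE=0 trtexec --onnx=model.onnx --fp16 --saveEngine=override_tf32_on.engine
NVIDIA_TF32_OVERRIDE=0 trtexec --onnx=model.onnx --fp16 --noTF32 --saveEngine=override_tf32_off.engine
```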

@miraiaroha (Author)

> @ttyio No effect on an Ampere device. First, my case is this:
>
> 1. TensorRT FP32 precision matches that of Torch.
> 2. Torch FP32 precision is consistent with FP16.
> 3. On non-Ampere GPUs (e.g. Turing), TensorRT FP16 precision aligns with FP32.
> 4. On Ampere GPUs, TensorRT FP16 and FP32 precision diverge.
>
> I have already tested the 4 combinations below on an Ampere device (all with fp16 on):
>
> 1. Set NVIDIA_TF32_OVERRIDE=0 and build the engine with tf32 (without timing cache).
> 2. Set NVIDIA_TF32_OVERRIDE=0 and build the engine without tf32 (without timing cache).
> 3. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine with tf32 (without timing cache).
> 4. Leave NVIDIA_TF32_OVERRIDE at its default and build the engine without tf32 (without timing cache).
>
> None of these results match fp32 on an Ampere device. Therefore, I believe there may be bugs when using Ampere GPUs to build and run FP16 rtdetr engines with TensorRT (especially the transformer and LayerNorm layers). This leads to poor FP16 accuracy on this card, which deviates significantly from what is observed in FP32.

I have also encountered an accuracy drop with dinov2-rtdetr at fp16 precision on the Orin platform, and found that the problematic layers are the self-attentions of the RTDETR decoder. See #3657.

But in this issue, my resnet-rtdetr model at fp32 precision has an accuracy drop of 4.6 compared with onnx, and I have no idea how to solve this problem.

@chinakook commented Mar 9, 2024

@miraiaroha I used opset 17 (the ultralytics default version), and the fp32 precision matches. The onnx must be deployed in fp32 mode (without amp and without model.half()); otherwise it will cause a parse error, as you mentioned in #3567.
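
For reference, an export call with those settings would look something like this (argument names as in current ultralytics releases; treat it as a sketch):

```python
from ultralytics import RTDETR

# Export in full fp32 (no AMP / no model.half()) with ONNX opset 17.
model = RTDETR('rtdetr-l.pt')
model.export(format='onnx', opset=17, half=False)
```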

@lix19937 commented Mar 16, 2024

Usually fp32 is quite slow. Considering both accuracy and speed, why not specify that certain layers use fp32 and the others use fp16?

@miraiaroha (Author)

> Usually fp32 is quite slow. Considering both accuracy and speed, why not specify that certain layers use fp32 and the others use fp16?

It's a balance between accuracy and speed. Usually a -4% accuracy drop is unacceptable.
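
For completeness, pinning selected layers to fp32 while keeping the rest in fp16 looks roughly like this with the TensorRT Python API (the layer-name heuristic and file names are assumptions, not the actual model's layer names):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make TensorRT honor the per-layer precision requests below.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Heuristic: keep precision-sensitive ops (MatMul/Pow, e.g. inside LayerNorm
    # and attention) in fp32; everything else stays eligible for fp16.
    if "MatMul" in layer.name or "Pow" in layer.name:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
with open("model_mixed.engine", "wb") as f:
    f.write(engine_bytes)
```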

@lix19937

@miraiaroha You can mark all ONNX nodes except the real output nodes as outputs (i.e. add output nodes for them) to disable TRT fusion, and use --noTF32.

@miraiaroha (Author)

> @miraiaroha You can mark all ONNX nodes except the real output nodes as outputs (i.e. add output nodes for them) to disable TRT fusion, and use --noTF32.

I have tried marking all nodes as output nodes to disable op fusion, and the mAP is still 66.1 at fp32 precision.
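
For anyone who wants to reproduce the fusion-free build, a rough onnx-graphsurgeon sketch for marking every node output as a graph output (file names are placeholders):

```python
import onnx
import onnx_graphsurgeon as gs

model = onnx.load("model.onnx")
# Run shape inference first so intermediate tensors carry dtype/shape info,
# which ONNX expects for graph outputs.
model = onnx.shape_inference.infer_shapes(model)

graph = gs.import_onnx(model)
marked = {t.name for t in graph.outputs}
for node in graph.nodes:
    for out in node.outputs:
        if out.name and out.name not in marked:
            graph.outputs.append(out)
            marked.add(out.name)

onnx.save(gs.export_onnx(graph), "model_all_outputs.onnx")
```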

@zerollzeng (Collaborator)

Hi, we just released TRT 10 EA; could you please try the new version?

If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!

@chinakook

> Hi, we just released TRT 10 EA; could you please try the new version?
>
> If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!

Yes, I have tried TRT 10 EA, and I found that it loses even more accuracy...

@miraiaroha (Author)

> Hi, we just released TRT 10 EA; could you please try the new version?
>
> If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!

I have found the cause of the problem: the image preprocessing was not aligned, so it is not related to quantization. Thank you for your patience.

@chinakook

> > Hi, we just released TRT 10 EA; could you please try the new version?
> > If the accuracy is still bad, you can provide a reproduction and we can take a further look. Thanks!
>
> I have found the cause of the problem: the image preprocessing was not aligned, so it is not related to quantization. Thank you for your patience.

I think you are not using an Ampere card. My Turing cards are all OK in fp32/fp16.

@chinakook

I think we can close this issue, as @miraiaroha has found a solution. Please reopen #3652 to track the Ampere accuracy loss issue.
