Just one element of a batch is correct in TensorRT 8.6.1.6 #3689

Closed
Beshkent opened this issue Feb 29, 2024 · 11 comments
Assignees: zerollzeng
Labels: triaged (Issue has been triaged by maintainers)

Comments

@Beshkent commented Feb 29, 2024

Description

Hello!
I have a pipeline that builds a TRT engine from a torch checkpoint, and it works fine with CUDA 11.4 && TensorRT-7.2.3.4-1.cuda11.1. When I tried to upgrade the GPU libraries (and rebuild the TRT engine), I ran into a strange error during inference with the TRT engine: when batch-size > 1, only one element of the batch produces a correct result (AFAIK the first element). Here are the versions used.

Installed versions:
tensorrt-8.6.1.6-1.cuda12.0
cuda-toolkit-12-0-12.0.1-1
libcudnn8-devel-8.9.7.29-1.cuda12.2

Upgraded the pip packages up to the following versions, but it didn't help:
nvidia-cublas-cu12        12.3.4.1 (also checked with 12.1.0.26)
onnx                      1.13.1
onnxruntime               1.11.1
torch                     2.0.0

Converting the same ONNX file that is used with TensorRT 7 (i.e. skipping the torch->onnx step) also didn't help.

Environment

TensorRT Version: 8.6.1.6-1.cuda12.0
NVIDIA GPU: Tesla V100S
NVIDIA Driver Version: 525.147.05
CUDA Version: 12.0
CUDNN Version: 8.9.7

Operating System: rhel8
Python Version: 3.8
PyTorch Version: 2.0.0

Only the versions of the tools/libs were changed; the conversion code (in Python) and the inference code (in C++) are the same for both TensorRT 7 and TensorRT 8.

Could you help with the following questions, please?

  • Is there some change in the representation of the TRT engine input (way of padding, transposing, and so on)?
  • Prerequisites: what are the minimal versions of the tools used in the torch->onnx->trt conversion chain?
@zerollzeng (Collaborator)

  1. Does the model have dynamic shape inputs?
  2. Can it be reproduced with Polygraphy? Usage would look like polygraphy run model.onnx --trt --onnxrt to compare the outputs between TensorRT and ONNX Runtime.

Thanks!

zerollzeng self-assigned this Mar 4, 2024
zerollzeng added the triaged label Mar 4, 2024
@Beshkent (Author) commented Mar 4, 2024

Does the model have dynamic shape inputs?

Yes, dimension 0. It is marked as dynamic via minShapes && optShapes && maxShapes.

Can it be reproduced with Polygraphy? Usage would look like polygraphy run model.onnx --trt --onnxrt to compare the outputs between TensorRT and ONNX Runtime.

That needs the tensorrt pip package and the polygraphy binary, which I don't have. I will try to install them and post the result.

@OliviaSnail commented Mar 6, 2024

I also encountered the same error! My TRT engine works well in TensorRT 7.1/TensorRT 8.5, but not in TensorRT 8.6... I also use dynamic shape inputs, and I use multiple contexts in different threads. When batch = 1, the result is correct. When batch > 1, all the results are wrong.

@Beshkent (Author) commented Apr 8, 2024

@zerollzeng
polygraphy run model.onnx --trt --onnxrt returns Difference exceeds tolerance (rel=1e-05, abs=1e-05). Attaching the log: polygraphy.log

@zerollzeng (Collaborator)

The diff looks good (< 1e-5) to me; the reason it fails is that polygraphy uses a strict default tolerance for the output diff (rel=1e-05, abs=1e-05):

[I]         Error Metrics: qf0_logits
[I]             Minimum Required Tolerance: elemwise error | [abs=5.4359e-05] OR [rel=0.00030196] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=4.0124e-06, std-dev=5.6803e-06, var=3.2265e-11, median=2.1458e-06, min=0 at (0, 2, 3, 2), max=5.4359e-05 at (3, 14, 5, 0), avg-magnitude=4.0124e-06
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 5.44e-06) |       2795 | ########################################
                    (5.44e-06, 1.09e-05) |        516 | #######
                    (1.09e-05, 1.63e-05) |        163 | ##
                    (1.63e-05, 2.17e-05) |         62 | 
                    (2.17e-05, 2.72e-05) |         26 | 
                    (2.72e-05, 3.26e-05) |         14 | 
                    (3.26e-05, 3.81e-05) |          4 | 
                    (3.81e-05, 4.35e-05) |          5 | 
                    (4.35e-05, 4.89e-05) |          8 | 
                    (4.89e-05, 5.44e-05) |          7 | 

@Beshkent (Author)

So, do you have any ideas why batching doesn't work? Could randomness in the network cause such an error? In our nets we have randomness, which is moved to an input in this polygraphy run.

@Beshkent (Author)

This problem exists in the inference of two different networks. Attaching the ONNX file of one of them and the command we used to convert it:

trtexec --onnx=full_text.onnx --saveEngine=full_text.trt  --minShapes=text_emb:1x1x192,text_mask:1x1,q_labels:1x1x5,bert_emb:1x1x768,bert_mask:1x1,speaker_ids:1,noise_scale_w:1,length_scale:1x1 \
    --optShapes=text_emb:16x400x192,text_mask:16x400,q_labels:16x400x5,bert_emb:16x200x768,bert_mask:16x200,speaker_ids:16,noise_scale_w:16,length_scale:16x400 \
    --maxShapes=text_emb:16x400x192,text_mask:16x400,q_labels:16x400x5,bert_emb:16x200x768,bert_mask:16x200,speaker_ids:16,noise_scale_w:16,length_scale:16x400 \
    --fp16

full_text_random.onnx.zip

@Beshkent (Author)

@zerollzeng
Here is the graph of the second NN. I tried to attach the model itself, but its size is larger than the limit (getting "File size too big: 25 MB are allowed, 41 MB were attempted to upload.").

(Graph image: decoder_random.onnx)

How we prepare the inputs for inference (a sketch is shown after this list):
- shape[1] of the following inputs may differ for each element of the batch, so they are all padded (with 0) up to kMaxAxisValue: attention_weights, attention_weights_cum, processed_memory, encoder_outputs
- padding_mask_zeros_inf[i] is filled with 0 for the first batch[i].shape[1] positions, and with inf for the rest (kMaxAxisValue - shape[1])
- padding_mask_ones_zeros[i]: the same as above, but the pad values are 1s and 0s
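
For illustration only, here is a minimal C++ sketch of the padding scheme described above. The names (kMaxAxisValue, seqLens, the two mask buffers) and the row-major [batch, kMaxAxisValue] layout are assumptions that mirror the description, not the actual inference code:

```cpp
#include <limits>
#include <vector>

constexpr int kMaxAxisValue = 400;  // assumed padded length of axis 1

// seqLens[i] is the unpadded shape[1] of batch element i.
void fillPaddingMasks(const std::vector<int>& seqLens,
                      std::vector<float>& maskZerosInf,    // 0 for valid positions, inf for padding
                      std::vector<float>& maskOnesZeros)   // 1 for valid positions, 0 for padding
{
    const float inf = std::numeric_limits<float>::infinity();
    const int batch = static_cast<int>(seqLens.size());
    maskZerosInf.assign(static_cast<size_t>(batch) * kMaxAxisValue, 0.f);
    maskOnesZeros.assign(static_cast<size_t>(batch) * kMaxAxisValue, 0.f);

    for (int i = 0; i < batch; ++i)
    {
        for (int j = 0; j < kMaxAxisValue; ++j)
        {
            const bool valid = j < seqLens[i];
            maskZerosInf[i * kMaxAxisValue + j] = valid ? 0.f : inf;
            maskOnesZeros[i * kMaxAxisValue + j] = valid ? 1.f : 0.f;
        }
    }
}
```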

@zerollzeng (Collaborator)

This problem exists in the inference of two different networks. Attaching the ONNX file of one of them and the command we used to convert it:

trtexec --onnx=full_text.onnx --saveEngine=full_text.trt  --minShapes=text_emb:1x1x192,text_mask:1x1,q_labels:1x1x5,bert_emb:1x1x768,bert_mask:1x1,speaker_ids:1,noise_scale_w:1,length_scale:1x1 \
    --optShapes=text_emb:16x400x192,text_mask:16x400,q_labels:16x400x5,bert_emb:16x200x768,bert_mask:16x200,speaker_ids:16,noise_scale_w:16,length_scale:16x400 \
    --maxShapes=text_emb:16x400x192,text_mask:16x400,q_labels:16x400x5,bert_emb:16x200x768,bert_mask:16x200,speaker_ids:16,noise_scale_w:16,length_scale:16x400 \
    --fp16

full_text_random.onnx.zip

I did a quick check with this model, and it passed with polygraphy:

[I]     Comparing Output: 'attn_mask' (dtype=int64, shape=(16, 1, 371, 400)) with 'attn_mask' (dtype=int64, shape=(16, 1, 371, 400))
[I]         Tolerance: [abs=0.0001, rel=0.0001] | Checking elemwise error
/home/scratch.zeroz_sw/miniconda3/lib/python3.9/site-packages/polygraphy/util/array.py:677: RuntimeWarning: invalid value encountered in divide
  "numpy": lambda lhs, rhs: lhs / rhs,
[I]         trt-runner-N0-04/27/24-14:05:22: attn_mask | Stats: mean=0.33135, std-dev=0.4707, var=0.22156, median=0, min=0 at (0, 0, 0, 0), max=1 at (0, 0, 0, 2), avg-magnitude=0.33135
[I]         onnxrt-runner-N0-04/27/24-14:05:22: attn_mask | Stats: mean=0.33135, std-dev=0.4707, var=0.22156, median=0, min=0 at (0, 0, 0, 0), max=1 at (0, 0, 0, 2), avg-magnitude=0.33135
[I]         Error Metrics: attn_mask
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=nan] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0, 0), max=0 at (0, 0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=nan, std-dev=nan, var=nan, median=nan, min=nan at (0, 0, 0, 0), max=nan at (0, 0, 0, 0), avg-magnitude=nan
[I]         PASSED | Output: 'attn_mask' | Difference is within tolerance (rel=0.0001, abs=0.0001)
[I]     PASSED | All outputs matched | Outputs: ['m_p', 'logs_p', 'w_ceil', 'attn_mask']
[I] Accuracy Summary | trt-runner-N0-04/27/24-14:05:22 vs. onnxrt-runner-N0-04/27/24-14:05:22 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 147.548s | Command: /home/scratch.zeroz_sw/miniconda3/bin/polygraphy run full_text_random.onnx --trt --trt-opt-shapes text_emb:[16,400,192] text_mask:[16,400] q_labels:[16,400,5] bert_emb:[16,200,768] bert_mask:[16,200] speaker_ids:[16] noise_scale_w:[16] length_scale:[16,400] --trt-min-shapes text_emb:[1,1,192] text_mask:[1,1] q_labels:[1,1,5] bert_emb:[1,1,768] bert_mask:[1,1] speaker_ids:[1] noise_scale_w:[1] length_scale:[1,1] --trt-max-shapes text_emb:[16,400,192] text_mask:[16,400] q_labels:[16,400,5] bert_emb:[16,200,768] bert_mask:[16,200] speaker_ids:[16] noise_scale_w:[16] length_scale:[16,400] --onnxrt --input-shapes text_emb:[16,400,192] text_mask:[16,400] q_labels:[16,400,5] bert_emb:[16,200,768] bert_mask:[16,200] speaker_ids:[16] noise_scale_w:[16] length_scale:[16,400] --atol 1e-4 --rtol 1e-4

@zerollzeng (Collaborator)

Since these work fine with TRT 7, I guess some API usage error may be leading to this; maybe check the TRT 8 release notes?
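
Not confirmed as the cause here, but one detail worth double-checking on the C++ side when moving a dynamic-shape engine from TRT 7 to TRT 8: the runtime input dimensions must be set on the execution context before every enqueue, for every dynamic input; if a stale batch size is left on the context, only part of the batch may come out correct. A rough sketch against the TensorRT 8 C++ API (binding index 0, the 192-wide last dimension, and the bindings/stream arguments are placeholders, not taken from the issue):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Sketch only: assumes an explicit-batch engine with dynamic shapes and
// `bindings` already pointing at device buffers sized for maxShapes.
bool runBatch(nvinfer1::IExecutionContext* context, void** bindings,
              cudaStream_t stream, int batchSize, int seqLen)
{
    // Set the actual batch/sequence dims before every inference;
    // repeat for each dynamic input binding of the engine.
    if (!context->setBindingDimensions(0, nvinfer1::Dims3(batchSize, seqLen, 192)))
        return false;

    // Refuse to run until every dynamic input shape has been specified.
    if (!context->allInputDimensionsSpecified())
        return false;

    return context->enqueueV2(bindings, stream, nullptr);
}
```

(In TRT 8.5+, the name-based setInputShape()/enqueueV3() calls are the non-deprecated equivalents; the binding-index variants above still exist in 8.6.)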

@ttyio (Collaborator) commented Jul 2, 2024

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions. Thanks all!

ttyio closed this as completed Jul 2, 2024