tensorrt test failed #1454

Closed
twmht opened this issue Nov 3, 2021 · 12 comments · Fixed by #1464

twmht (Contributor) commented Nov 3, 2021

When trying to run the tests in test_tensorrt.py, I get many CUDA errors.

My environment is CUDA 11 + PyTorch 1.8.0 + TensorRT 7.1.3.4.

I found that the errors go away if I forward the PyTorch module in CPU mode.

The errors also go away if I remove the TensorRT part.

So the issue is that CUDA errors may be thrown when the PyTorch module is forwarded in GPU mode and the TensorRT module is run at the same time.

Here is the test log:

log.txt

zhouzaida (Collaborator) commented Nov 3, 2021

Hi @twmht, the log file you provided does not include any CUDA error information.

# log.txt
Traceback (most recent call last):
  File "tools/rapid.py", line 4, in <module>
    import torch
  File "/home/acer/.pyenv/versions/pytorch-1.8/lib/python3.7/site-packages/torch/__init__.py", line 196, in <module>
    from torch._C import *
RuntimeError: KeyboardInterrupt: 

twmht (Contributor, Author) commented Nov 3, 2021

@zhouzaida

Oops, I may have uploaded the wrong file. I have updated the post; please download the file again.

zhouzaida (Collaborator) commented

Got it. @grimoire, please have a look.

grimoire (Member) commented Nov 3, 2021

I am not 100% sure, but I guess it is caused by a memory conflict.
PyTorch and TensorRT each have their own memory pool, and the two pools might pollute each other when you use both at the same time.
In a real deployment it is recommended to use TensorRT alone, or you can create a GPU allocator that feeds PyTorch-managed memory to TensorRT, as sketched below.
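In case it helps, here is a minimal sketch of the allocator idea, assuming the TensorRT 7.x C++ IGpuAllocator interface and PyTorch's c10 caching-allocator hooks. The class name TorchGpuAllocator is made up for illustration, and exact method signatures can differ between TensorRT versions.

```cpp
// Sketch only: route TensorRT device allocations through PyTorch's caching
// allocator so both libraries draw from a single memory pool.
#include <cstdint>
#include <NvInferRuntimeCommon.h>           // nvinfer1::IGpuAllocator
#include <c10/cuda/CUDACachingAllocator.h>  // raw_alloc / raw_delete

class TorchGpuAllocator : public nvinfer1::IGpuAllocator {
 public:
  void* allocate(uint64_t size, uint64_t alignment, uint32_t flags) override {
    (void)alignment;  // the caching allocator returns coarsely aligned blocks
    (void)flags;
    // Hand TensorRT a block owned by PyTorch's caching allocator.
    return c10::cuda::CUDACachingAllocator::raw_alloc(size);
  }

  void free(void* memory) override {
    if (memory != nullptr) {
      // Return the block to PyTorch's pool instead of calling cudaFree.
      c10::cuda::CUDACachingAllocator::raw_delete(memory);
    }
  }
};

// Usage sketch: register the allocator before building or deserializing, e.g.
//   TorchGpuAllocator allocator;
//   builder->setGpuAllocator(&allocator);   // or runtime->setGpuAllocator(...)
```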

twmht (Contributor, Author) commented Nov 3, 2021

@grimoire

I don't fully understand the memory pools, but why would there be a memory conflict if each library has its "own" pool?

grimoire (Member) commented Nov 3, 2021

Update:
I found something that might be related: pytorch/pytorch#52663.
I do use thrust in the NMS ops, which pulls in a cub implementation through the instantiation _ZN6thrust8cuda_cub11sort_by_keyINS0_17execute_on_streamEPfPiNS_7greaterIfEEEEvRNS0_16execution_policyIT_EET0_SB_T1_T2_. And if I swap the order in test_batched_nms, the TensorRT NMS op crashes right after thrust::sort_by_key.
According to that issue, cub might conflict with torch.
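For context, that mangled name demangles to a thrust::cuda_cub::sort_by_key instantiation on float keys and int values running on a stream. A simplified sketch of that kind of call follows (illustrative only, not the actual plugin source; the function name is made up):

```cpp
// Illustrative sketch of the kind of thrust call in a TensorRT NMS plugin;
// thrust::sort_by_key on a stream instantiates cub's device radix sort under
// the hood, which is where the second copy of cub comes from.
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

void SortBoxesByScore(float* scores, int* indices, int num_boxes,
                      cudaStream_t stream) {
  // Sort box indices by descending score on the given stream
  // (float* keys, int* values, thrust::greater<float> comparator).
  thrust::sort_by_key(thrust::cuda::par.on(stream),
                      scores, scores + num_boxes,
                      indices, thrust::greater<float>());
}
```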

twmht (Contributor, Author) commented Nov 3, 2021

@grimoire

What causes the conflict? In my test log there are many failed tests, not only test_batched_nms.

grimoire (Member) commented Nov 4, 2021

Just comment out test_batched_nms and all the other tests will pass.

twmht (Contributor, Author) commented Nov 4, 2021

Interesting. Why does PyTorch conflict with thrust::sort_by_key?

grimoire (Member) commented Nov 4, 2021

Both PyTorch and thrust use cub in their source code, and cub has a template function with a static variable: https://github.com/NVIDIA/cub/blob/499a7bad3416fcc71a7c50351d6b3cdbf3fbbc27/cub/util_device.cuh#L210.

After compiling and loading both libraries, there are two different DeviceCountCachedValue instantiations that share the same static variable cache. That is where the problem comes from. According to the related issues in thrust and PyTorch, adding the compile flags -Xcompiler -fno-gnu-unique might solve the problem.
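Roughly, the pattern behind the conflict looks like the following (a simplified sketch of the linked cub code, not the real implementation):

```cpp
// Simplified sketch of the pattern: a function template defined in a header
// with a function-local static variable.
template <typename T = int>
T& DeviceCountCachedValue() {
  static T cache = -1;  // g++ emits this static as an STB_GNU_UNIQUE symbol
  return cache;
}
// When libtorch and a plugin that bundles its own cub each compile such a
// header, the dynamic loader unifies the "unique" symbol, so both libraries
// read and write the same `cache` even though the surrounding cub code may be
// a different version. Compiling with -Xcompiler -fno-gnu-unique keeps a
// separate copy per shared library.
```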


twmht (Contributor, Author) commented Nov 4, 2021

@grimoire

Thank you! Great explanation!

But when running the MMCV NMS op (https://github.com/open-mmlab/mmcv/blob/master/mmcv/ops/nms.py#L26) alone, without running the TensorRT module, the tests are fine. Why does TensorRT introduce the problem?

If the cub version in MMCV were different from the cub version in PyTorch, it should cause CUDA errors with MMCV + PyTorch 1.8 even without running the TensorRT module, at least in some cases (maybe with a different calling order of the APIs). Am I right?

The related issue is also here: pytorch/pytorch#54245. Maybe I can also try PyTorch 1.8.1 to see whether it solves the issue.

grimoire (Member) commented Nov 5, 2021

The TensorRT NMS and the MMCV NMS are different. In the MMCV implementation we use PyTorch to do the sort/topk, which does not bring in another copy of cub (see the sketch below).
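Roughly, the MMCV-side sort looks like this on the C++ side (an illustrative sketch, not the exact MMCV source; the function name is made up): the descending sort is delegated to ATen, so only PyTorch's own kernels, and hence only its bundled cub, are involved.

```cpp
// Illustrative sketch: sorting through ATen instead of thrust, so no extra
// cub instantiation is linked into the op.
#include <ATen/ATen.h>
#include <tuple>

at::Tensor ArgsortByScore(const at::Tensor& scores) {
  // at::Tensor::sort returns (sorted_values, sorted_indices); the indices
  // give the box order consumed by the NMS kernel.
  return std::get<1>(scores.sort(/*dim=*/0, /*descending=*/true));
}
```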

The error is caused by the way the compiler processes static variables. Here is an example (from a blog):
unique_bind.zip
Try commenting or uncommenting SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-gnu-unique") in CMakeLists.txt, then recompile the project and you will get different output.

I am using torch 1.8.1 and the error still exists. I am not sure whether it is fixed in 1.10.0.
