tensorrt test failed #1454

Closed
twmht opened this issue Nov 3, 2021 · 12 comments · Fixed by #1464

twmht (Contributor) commented Nov 3, 2021

When trying to run the tests in test_tensorrt.py, I get many CUDA errors.

My environment is CUDA 11 + PyTorch 1.8.0 + TensorRT 7.1.3.4.

I found that the errors go away if I forward the PyTorch module in CPU mode.

The errors also go away if I remove the TensorRT part.

So the issue is that CUDA errors may be thrown when the PyTorch module is forwarded in GPU mode and the TensorRT module is run at the same time.

Here is the test log:

log.txt

zhouzaida (Collaborator) commented Nov 3, 2021

Hi @twmht, the log file you provided does not include any CUDA error information.

# log.txt
Traceback (most recent call last):
  File "tools/rapid.py", line 4, in <module>
    import torch
  File "/home/acer/.pyenv/versions/pytorch-1.8/lib/python3.7/site-packages/torch/__init__.py", line 196, in <module>
    from torch._C import *
RuntimeError: KeyboardInterrupt: 

twmht (Contributor, Author) commented Nov 3, 2021

@zhouzaida

Oops, I may have uploaded the wrong file. I have updated the post; please download the file again.

zhouzaida (Collaborator) commented

Got it. @grimoire, please have a look.

grimoire (Member) commented Nov 3, 2021

I am not 100% sure, but I guess it is caused by a memory conflict.
PyTorch and TensorRT each have their own memory pool, and the two pools might pollute each other when you use both at the same time.
In a real deployment it is recommended to use TensorRT alone, or you can create a GPU allocator that feeds PyTorch-managed memory to TensorRT, as sketched below.
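In case it helps, here is a minimal sketch of the allocator idea, assuming the TensorRT 7.x C++ IGpuAllocator interface and PyTorch's c10 caching-allocator hooks. The class name TorchGpuAllocator is made up for illustration, and exact method signatures can differ between TensorRT versions.

```cpp
// Sketch only: route TensorRT device allocations through PyTorch's caching
// allocator so both libraries draw from a single memory pool.
#include <cstdint>
#include <NvInferRuntimeCommon.h>           // nvinfer1::IGpuAllocator
#include <c10/cuda/CUDACachingAllocator.h>  // raw_alloc / raw_delete

class TorchGpuAllocator : public nvinfer1::IGpuAllocator {
 public:
  void* allocate(uint64_t size, uint64_t alignment, uint32_t flags) override {
    (void)alignment;  // the caching allocator returns coarsely aligned blocks
    (void)flags;
    // Hand TensorRT a block owned by PyTorch's caching allocator.
    return c10::cuda::CUDACachingAllocator::raw_alloc(size);
  }

  void free(void* memory) override {
    if (memory != nullptr) {
      // Return the block to PyTorch's pool instead of calling cudaFree.
      c10::cuda::CUDACachingAllocator::raw_delete(memory);
    }
  }
};

// Usage sketch: register the allocator before building or deserializing, e.g.
//   TorchGpuAllocator allocator;
//   builder->setGpuAllocator(&allocator);   // or runtime->setGpuAllocator(...)
```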

twmht (Contributor, Author) commented Nov 3, 2021

@grimoire

I don't fully understand the memory pools, but why would there be a memory conflict if each library has its "own" pool?

grimoire (Member) commented Nov 3, 2021

Update:
I found something that might be related: pytorch/pytorch#52663.
I do use thrust in the NMS ops, which pulls in a cub implementation through the instantiation _ZN6thrust8cuda_cub11sort_by_keyINS0_17execute_on_streamEPfPiNS_7greaterIfEEEEvRNS0_16execution_policyIT_EET0_SB_T1_T2_. And if I swap the order in test_batched_nms, the TensorRT NMS op crashes right after thrust::sort_by_key.
According to that issue, cub might conflict with torch.
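For context, that mangled name demangles to a thrust::cuda_cub::sort_by_key instantiation on float keys and int values running on a stream. A simplified sketch of that kind of call follows (illustrative only, not the actual plugin source; the function name is made up):

```cpp
// Illustrative sketch of the kind of thrust call in a TensorRT NMS plugin;
// thrust::sort_by_key on a stream instantiates cub's device radix sort under
// the hood, which is where the second copy of cub comes from.
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

void SortBoxesByScore(float* scores, int* indices, int num_boxes,
                      cudaStream_t stream) {
  // Sort box indices by descending score on the given stream
  // (float* keys, int* values, thrust::greater<float> comparator).
  thrust::sort_by_key(thrust::cuda::par.on(stream),
                      scores, scores + num_boxes,
                      indices, thrust::greater<float>());
}
```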

twmht (Contributor, Author) commented Nov 3, 2021

@grimoire

What causes the conflict? In my test log there are many failed tests, not only test_batched_nms.

grimoire (Member) commented Nov 4, 2021

Just comment out test_batched_nms and all the other tests will pass.

twmht (Contributor, Author) commented Nov 4, 2021

Interesting. Why does PyTorch conflict with thrust::sort_by_key?

grimoire (Member) commented Nov 4, 2021

Both PyTorch and thrust use cub in their source code, and cub has a template function with a static variable: https://github.com/NVIDIA/cub/blob/499a7bad3416fcc71a7c50351d6b3cdbf3fbbc27/cub/util_device.cuh#L210.

After compiling and loading both libraries, there are two different DeviceCountCachedValue instantiations that share the same static variable cache. That is where the problem comes from. According to the related issues in thrust and PyTorch, adding the compile flags -Xcompiler -fno-gnu-unique might solve the problem.
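Roughly, the pattern behind the conflict looks like the following (a simplified sketch of the linked cub code, not the real implementation):

```cpp
// Simplified sketch of the pattern: a function template defined in a header
// with a function-local static variable.
template <typename T = int>
T& DeviceCountCachedValue() {
  static T cache = -1;  // g++ emits this static as an STB_GNU_UNIQUE symbol
  return cache;
}
// When libtorch and a plugin that bundles its own cub each compile such a
// header, the dynamic loader unifies the "unique" symbol, so both libraries
// read and write the same `cache` even though the surrounding cub code may be
// a different version. Compiling with -Xcompiler -fno-gnu-unique keeps a
// separate copy per shared library.
```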


twmht (Contributor, Author) commented Nov 4, 2021

@grimoire

Thank you! Great explanation!

But when running the MMCV NMS op (https://github.com/open-mmlab/mmcv/blob/master/mmcv/ops/nms.py#L26) alone, without running the TensorRT module, the tests are fine. Why does TensorRT introduce the problem?

If the cub version in MMCV were different from the cub version in PyTorch, it should cause CUDA errors with MMCV + PyTorch 1.8 even without running the TensorRT module, at least in some cases (maybe with a different calling order of the APIs). Am I right?

The related issue is also here: pytorch/pytorch#54245. Maybe I can also try PyTorch 1.8.1 to see whether it solves the issue.

grimoire (Member) commented Nov 5, 2021

The TensorRT NMS and the MMCV NMS are different. In the MMCV implementation we use PyTorch to do the sort/topk, which does not bring in another copy of cub (see the sketch below).
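Roughly, the MMCV-side sort looks like this on the C++ side (an illustrative sketch, not the exact MMCV source; the function name is made up): the descending sort is delegated to ATen, so only PyTorch's own kernels, and hence only its bundled cub, are involved.

```cpp
// Illustrative sketch: sorting through ATen instead of thrust, so no extra
// cub instantiation is linked into the op.
#include <ATen/ATen.h>
#include <tuple>

at::Tensor ArgsortByScore(const at::Tensor& scores) {
  // at::Tensor::sort returns (sorted_values, sorted_indices); the indices
  // give the box order consumed by the NMS kernel.
  return std::get<1>(scores.sort(/*dim=*/0, /*descending=*/true));
}
```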

The error is caused by the way the compiler processes static variables. Here is an example (from a blog):
unique_bind.zip
Try commenting or uncommenting SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-gnu-unique") in CMakeLists.txt, then recompile the project and you will get different output.

I am using torch 1.8.1 and the error still exists. I am not sure whether it is fixed in 1.10.0.
