
Failed to utilise CUDA with TRT Engine when running on Jetson AGX Orin (ONNX->TRT, Transformer) #2997

Open · niqbal996 opened this issue May 23, 2023 · 5 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@niqbal996

Description

I am trying to convert a DINO object detection Transformer, trained on a custom dataset, to a TensorRT engine at any precision, using trtexec. The engine file is generated, but trtexec reports the following error:

[05/23/2023-10:55:09] [E] Error[1]: [executionContext.cpp::handleTrainStationRunnerPhase1::146] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
[05/23/2023-10:55:09] [W] The CUDA graph capture on the stream has failed.
[05/23/2023-10:55:09] [W] The built TensorRT engine contains operations that are not permitted under CUDA graph capture mode.
[05/23/2023-10:55:09] [W] The specified --useCudaGraph flag has been ignored. The inference will be launched without using CUDA graph launch.

I can perform inference with a Python script, but I suspect it is not using the GPU, since it only runs at about 3 FPS on the Jetson Orin; I would expect at least roughly 10 FPS.
Any ideas what might be causing this issue and how I can solve it?
I used the following command for conversion (a timing sketch follows it):

trtexec --onnx=dino_simp.onnx --int8 --useCudaGraph --verbose --saveEngine=dino_last.trt --workspace=20000
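
(For reference, a minimal timing sketch like the one below, assuming the engine file name above, the TensorRT 8.x bindings API, static input shapes, and pycuda installed, can confirm whether the engine actually runs on the GPU:)

import time
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401 -- initializes a CUDA context on import
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)
with open("dino_last.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate a device buffer for every binding (assumes static shapes).
bindings = []
for i in range(engine.num_bindings):
    shape = tuple(context.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.zeros(shape, dtype=dtype)
    device = cuda.mem_alloc(host.nbytes)
    if engine.binding_is_input(i):
        cuda.memcpy_htod(device, host)  # a dummy input is enough for timing
    bindings.append(int(device))

# execute_v2 is synchronous, so wall-clock time reflects GPU execution.
n_iters = 50
start = time.perf_counter()
for _ in range(n_iters):
    context.execute_v2(bindings)
print(f"~{n_iters / (time.perf_counter() - start):.1f} FPS")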

Verbose output logs: see the link under Relevant Files below.

Environment

TensorRT Version: 8.5.2

NVIDIA GPU: Jetson AGX Orin

NVIDIA Driver Version: L4T 35.3.1

CUDA Version: 11.4.315

CUDNN Version: 8.6.0.166

Operating System: Ubuntu 20.04 LTS

Python Version (if applicable): 3.8

PyTorch Version (if applicable):

Container (if so, version): nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel

Relevant Files

Model link:
The model ONNX file and the full verbose log output can be downloaded at the following link: drive

Steps To Reproduce

Commands or scripts:
trtexec --onnx=dino_simp.onnx --int8 --useCudaGraph --verbose --saveEngine=dino_last.trt --workspace=20000

Have you tried the latest release?: I can only try the latest TensorRT release on my laptop; the corresponding JetPack release is not yet available for the Jetson Orin. #2949

Can this model run on other frameworks? I can run inference on the model with ONNX Runtime. I also converted the same model on my laptop, where it works without any issues and runs at about 14 FPS, which is expected.
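
(A quick way to confirm which device ONNX Runtime actually uses, a sketch assuming the onnxruntime-gpu package and the model file name above, is to print the session's active providers:)

import onnxruntime as ort

print(ort.get_available_providers())
sess = ort.InferenceSession(
    "dino_simp.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# If only CPUExecutionProvider shows up here, the ~3 FPS is CPU-bound.
print(sess.get_providers())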

Thank you for looking into this.

@zerollzeng
Collaborator

Any ideas what might be causing this issue and how I can solve it?

It's because your model contains operators that cannot be captured by CUDA graphs, such as loop or if-conditional operators. See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-graphs

To get better performance, I have a few suggestions: 1. use the latest TRT, which has better optimizations and new features; 2. use --best, which also enables FP16.
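
For example, the command from the description might become (a sketch; --useCudaGraph is dropped since the capture fails anyway):

trtexec --onnx=dino_simp.onnx --best --verbose --saveEngine=dino_best.trt --workspace=20000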

@j0987834204

Hello @niqbal996, how did you convert the DINO .pth checkpoint to ONNX format?
Which framework did you use: mmdetection, d2, etc.?

Thanks.

@niqbal996
Author

Hey @j0987834204,
I use the detrex repo for my model training and added the ONNX conversion script from detectron2 here: Detrex fork. The script is based on the conversion script from detectron2.
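
(The core of such a script is a torch.onnx.export call. The sketch below is only a generic illustration with a stand-in torchvision model, not the linked detrex script; the model, input size, opset, and tensor names are placeholders:)

import torch
import torchvision

# Stand-in model; the real script exports the trained DINO detector instead.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # placeholder input resolution
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    opset_version=16,
    input_names=["images"],
    output_names=["logits"],
)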

@IamShubhamGupto

@niqbal996
Hey, thanks for the script and the guide. Could you share some stats for the DINO model you're running on the Orin? I would like to know which dataset it was trained on, the FPS, and the mAP / accuracy.

@Coastchb

Any ideas what might be causing this issue and how I can solve it?

It's because your model contains operators that cannot be captured by CUDA graphs, such as loop or if-conditional operators. See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-graphs

To get better performance, I have a few suggestions: 1. use the latest TRT, which has better optimizations and new features; 2. use --best, which also enables FP16.

@niqbal996 @zerollzeng
I also hit this problem. Do you have any ideas for converting the model so that all the operations can be captured?
