
Myelin could not work with Cuda Graph #1614

Closed
feihugis opened this issue Nov 12, 2021 · 10 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@feihugis

Description

CUDA graph capture fails if the engine contains layers generated by Myelin. The log output can be found below:

root@gpu02:/model# trtexec --loadEngine=model_optimized_tensorrt.trt --verbose --useCudaGraph
&&&& RUNNING TensorRT.trtexec [TensorRT v8003] # trtexec --loadEngine=model_optimized_tensorrt.trt --verbose --useCudaGraph
[11/12/2021-21:46:42] [I] === Model Options ===
[11/12/2021-21:46:42] [I] Format: *
[11/12/2021-21:46:42] [I] Model:
[11/12/2021-21:46:42] [I] Output:
[11/12/2021-21:46:42] [I] === Build Options ===
[11/12/2021-21:46:42] [I] Max batch: 1
[11/12/2021-21:46:42] [I] Workspace: 16 MiB
[11/12/2021-21:46:42] [I] minTiming: 1
[11/12/2021-21:46:42] [I] avgTiming: 8
[11/12/2021-21:46:42] [I] Precision: FP32
[11/12/2021-21:46:42] [I] Calibration:
[11/12/2021-21:46:42] [I] Refit: Disabled
[11/12/2021-21:46:42] [I] Sparsity: Disabled
[11/12/2021-21:46:42] [I] Safe mode: Disabled
[11/12/2021-21:46:42] [I] Restricted mode: Disabled
[11/12/2021-21:46:42] [I] Save engine:
[11/12/2021-21:46:42] [I] Load engine: model_optimized_tensorrt.trt
[11/12/2021-21:46:42] [I] NVTX verbosity: 0
[11/12/2021-21:46:42] [I] Tactic sources: Using default tactic sources
[11/12/2021-21:46:42] [I] timingCacheMode: local
[11/12/2021-21:46:42] [I] timingCacheFile:
[11/12/2021-21:46:42] [I] Input(s)s format: fp32:CHW
[11/12/2021-21:46:42] [I] Output(s)s format: fp32:CHW
[11/12/2021-21:46:42] [I] Input build shapes: model
[11/12/2021-21:46:42] [I] Input calibration shapes: model
[11/12/2021-21:46:42] [I] === System Options ===
[11/12/2021-21:46:42] [I] Device: 0
[11/12/2021-21:46:42] [I] DLACore:
[11/12/2021-21:46:42] [I] Plugins:
[11/12/2021-21:46:42] [I] === Inference Options ===
[11/12/2021-21:46:42] [I] Batch: 1
[11/12/2021-21:46:42] [I] Input inference shapes: model
[11/12/2021-21:46:42] [I] Iterations: 10
[11/12/2021-21:46:42] [I] Duration: 3s (+ 200ms warm up)
[11/12/2021-21:46:42] [I] Sleep time: 0ms
[11/12/2021-21:46:42] [I] Streams: 1
[11/12/2021-21:46:42] [I] ExposeDMA: Disabled
[11/12/2021-21:46:42] [I] Data transfers: Enabled
[11/12/2021-21:46:42] [I] Spin-wait: Disabled
[11/12/2021-21:46:42] [I] Multithreading: Disabled
[11/12/2021-21:46:42] [I] CUDA Graph: Enabled
[11/12/2021-21:46:42] [I] Separate profiling: Disabled
[11/12/2021-21:46:42] [I] Time Deserialize: Disabled
[11/12/2021-21:46:42] [I] Time Refit: Disabled
[11/12/2021-21:46:42] [I] Skip inference: Disabled
[11/12/2021-21:46:42] [I] Inputs:
[11/12/2021-21:46:42] [I] === Reporting Options ===
[11/12/2021-21:46:42] [I] Verbose: Enabled
[11/12/2021-21:46:42] [I] Averages: 10 inferences
[11/12/2021-21:46:42] [I] Percentile: 99
[11/12/2021-21:46:42] [I] Dump refittable layers:Disabled
[11/12/2021-21:46:42] [I] Dump output: Disabled
[11/12/2021-21:46:42] [I] Profile: Disabled
[11/12/2021-21:46:42] [I] Export timing to JSON file:
[11/12/2021-21:46:42] [I] Export output to JSON file:
[11/12/2021-21:46:42] [I] Export profile to JSON file:
[11/12/2021-21:46:42] [I]
[11/12/2021-21:46:53] [I] === Device Information ===
[11/12/2021-21:46:53] [I] Selected Device: Tesla V100-SXM2-32GB
[11/12/2021-21:46:53] [I] Compute Capability: 7.0
[11/12/2021-21:46:53] [I] SMs: 80
[11/12/2021-21:46:53] [I] Compute Clock Rate: 1.53 GHz
[11/12/2021-21:46:53] [I] Device Global Memory: 32510 MiB
[11/12/2021-21:46:53] [I] Shared Memory per SM: 96 KiB
[11/12/2021-21:46:53] [I] Memory Bus Width: 4096 bits (ECC enabled)
[11/12/2021-21:46:53] [I] Memory Clock Rate: 0.877 GHz
[11/12/2021-21:46:53] [I]
[11/12/2021-21:46:53] [I] TensorRT version: 8003
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Proposal version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Split version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[11/12/2021-21:46:53] [I] [TRT] [MemUsageChange] Init CUDA: CPU +252, GPU +0, now: CPU 352, GPU 506 (MiB)
[11/12/2021-21:46:53] [I] [TRT] Loaded engine size: 93 MB
[11/12/2021-21:46:53] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 352 MiB, GPU 506 MiB
[11/12/2021-21:46:54] [V] [TRT] Using cublasLt a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +368, GPU +168, now: CPU 721, GPU 764 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Using cuDNN as a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +166, GPU +170, now: CPU 887, GPU 934 (MiB)
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 887, GPU 916 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Deserialization required 1218352 microseconds.
[11/12/2021-21:46:54] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 887 MiB, GPU 916 MiB
[11/12/2021-21:46:54] [I] Engine loaded in 1.71457 sec.
[11/12/2021-21:46:54] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 794 MiB, GPU 916 MiB
[11/12/2021-21:46:54] [V] [TRT] Using cublasLt a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 794, GPU 926 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Using cuDNN as a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 794, GPU 934 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Total per-runner device memory is 0
[11/12/2021-21:46:54] [V] [TRT] Total per-runner host memory is 32
[11/12/2021-21:46:54] [V] [TRT] Allocated activation device memory of size 147456
[11/12/2021-21:46:54] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 796 MiB, GPU 934 MiB
[11/12/2021-21:46:54] [I] Created input binding for src_tokens with dimensions 1x16
[11/12/2021-21:46:54] [I] Created output binding for topk_probs with dimensions 1x16x768
[11/12/2021-21:46:54] [I] Created output binding for topk_index with dimensions 1x16x768
[11/12/2021-21:46:54] [I] Starting inference
[11/12/2021-21:46:54] [V] [TRT] myelinAllocCb allocated GPU 644 bytes at 0x7f299a25b800.
[11/12/2021-21:46:54] [V] [TRT] myelinAllocCb allocated CPU 1540 bytes at 0x7f2948003300.
[11/12/2021-21:46:54] [E] Error[1]: [runtimeUtils.cpp::gateMyelinGraphStartOnStream::154] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
[11/12/2021-21:46:54] [W] The CUDA graph capture on the stream has failed.
[11/12/2021-21:46:54] [W] The built TensorRT engine contains operations that are not permitted under CUDA graph capture mode.
[11/12/2021-21:46:54] [W] The specified --useCudaGraph flag has been ignored. The inference will be launched without using CUDA graph launch.
[11/12/2021-21:46:58] [I] Warmup completed 1728 queries over 200 ms
[11/12/2021-21:46:58] [I] Timing trace has 24820 queries over 3.00012 s
[11/12/2021-21:42:44] [V] [TRT] Engine Layer Information:
Layer(Constant): encoder.embed_tokens.weight, Tactic: 0,  -> (Unnamed Layer* 0) [Constant]_output[Float(30522,768)]
Layer(Gather): Gather_0, Tactic: 9, (Unnamed Layer* 0) [Constant]_output[Float(30522,768)], src_tokens[Int32(1,16)] -> 253[Float(1,16,768)]
Layer(Myelin): {ForeignNode[254 + (Unnamed Layer* 3) [Shuffle]...Cast_9]}, Tactic: 0, src_tokens[Int32(1,16)] -> 262[Int32(1,16)]
Layer(Constant): encoder.embed_positions.weight, Tactic: 0,  -> (Unnamed Layer* 27) [Constant]_output[Float(167,768)]
Layer(Gather): Gather_10, Tactic: 9, (Unnamed Layer* 27) [Constant]_output[Float(167,768)], 262[Int32(1,16)] -> 263[Float(1,16,768)]
Layer(ElementWise): Add_11, Tactic: 1, 253[Float(1,16,768)], 263[Float(1,16,768)] -> 264[Float(1,16,768)]
Layer(Constant): 265 + (Unnamed Layer* 31) [Shuffle], Tactic: 0,  -> (Unnamed Layer* 31) [Shuffle]_output[Float(1,1,1)]
Layer(Constant): 267 + (Unnamed Layer* 34) [Shuffle], Tactic: 0,  -> (Unnamed Layer* 34) [Shuffle]_output[Float(1,1,1)]
Layer(ElementWise): Add_15, Tactic: 1, 264[Float(1,16,768)], (Unnamed Layer* 34) [Shuffle]_output[Float(1,1,1)] -> topk_index[Float(1,16,768)]
Layer(ElementWise): Add_13, Tactic: 1, 264[Float(1,16,768)], (Unnamed Layer* 31) [Shuffle]_output[Float(1,1,1)] -> topk_probs[Float(1,16,768)]

Environment

TensorRT Version: v8003 (8.0.3)
NVIDIA GPU: Tesla V100-SXM2-32GB
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.2.4
Operating System: Ubuntu 20.04.3 LTS
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:21.10-py3

Relevant Files

Steps To Reproduce

  1. Generate the engine file from the ONNX model: trtexec --onnx=model_optimized_tensorrt_debug.onnx --verbose --useCudaGraph --saveEngine=model_optimized_tensorrt.trt --refit

  2. Reproduce the issue: trtexec --loadEngine=model_optimized_tensorrt.trt --verbose --useCudaGraph

@zerollzeng
Collaborator

zerollzeng commented Dec 9, 2021

This is expected: Myelin may perform synchronization on the inference stream. CUDA graph capture will also fail if the network contains loops, conditional layers (if/else), or anything else that needs to synchronize during inference.
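This class of failure can be reproduced outside TensorRT: the CUDA runtime rejects synchronizing calls on a stream that is being captured. A minimal hypothetical sketch (not from trtexec; requires a CUDA-capable device and nvcc):

```cpp
// Sketch of why stream synchronization breaks CUDA graph capture.
// Hypothetical standalone example, not TensorRT code.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin capturing work submitted to this stream into a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // A synchronizing call on a capturing stream is rejected by the
    // runtime; an internal call of this class is what produces
    // "operation not permitted when stream is capturing" in the log.
    cudaError_t err = cudaStreamSynchronize(stream);
    printf("sync during capture: %s\n", cudaGetErrorString(err));

    // The failed call also invalidates the capture, so ending it
    // reports an error as well and no usable graph is produced.
    cudaGraph_t graph = nullptr;
    err = cudaStreamEndCapture(stream, &graph);
    printf("end capture: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    return 0;
}
```

trtexec handles this by detecting the capture failure and falling back to a plain stream launch, which is exactly the warning path shown in the log above.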

@feihugis
Author

feihugis commented Dec 9, 2021

@zerollzeng Thanks for your reply! The model itself should be compatible with CUDA graphs, because it works with CUDA graph capture in both PyTorch and ONNX Runtime. So I guess the failure is caused by Myelin. Do you think this issue could be resolved by changes inside Myelin? Is Myelin open source? If so, I'd like to do more debugging there.

@zerollzeng
Collaborator

No, Myelin is not open source. I think TensorRT handles CUDA graphs a bit differently compared to PyTorch/ONNX Runtime: PyTorch executes the model dynamically, while TensorRT executes an engine statically. But I'm not quite sure.

@zerollzeng
Collaborator

zerollzeng commented Dec 10, 2021

If your model doesn't have a lot of lightweight kernels (where kernel launch time is large compared to execution time), you won't get much speedup with CUDA graphs enabled.

@feihugis
Author

For the model I tested, CUDA graphs reduced latency by around 50% for both PyTorch and ONNX Runtime. We also observed that TensorRT significantly accelerates this model (around 40%) without CUDA graphs, so we wanted to try TensorRT + CUDA graphs, which may outperform PyTorch/ONNX Runtime + CUDA graphs.

@ttyio
Collaborator

ttyio commented Dec 13, 2021

@feihugis, the failure here is a V100-specific problem. Do you have a Turing or Ampere device to run on? We should support running CUDA graphs for your model there. Thanks!

@ttyio ttyio added Release: 8.x triaged Issue has been triaged by maintainers labels Dec 13, 2021
@feihugis
Author

Thanks @ttyio! I will find a Turing/Ampere device to test it and keep you updated.

@ttyio
Collaborator

ttyio commented Mar 16, 2022

I will close this; please reopen if you still have questions, thanks!

@ttyio ttyio closed this as completed Mar 16, 2022
@Coastchb

@feihugis
Hi, I also get the "[11/12/2021-21:46:54] [E] Error[1]: [runtimeUtils.cpp::gateMyelinGraphStartOnStream::154] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)" error.
How did you solve the problem?

@feihugis
Author

feihugis commented Dec 2, 2024

> @feihugis Hi, I also get the "[11/12/2021-21:46:54] [E] Error[1]: [runtimeUtils.cpp::gateMyelinGraphStartOnStream::154] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)" error. How did you solve the problem?

@Coastchb Sorry, I've totally forgotten whether I solved it or not. My suggestion is to try it on a Turing/Ampere device.
