
Myelin could not work with Cuda Graph #1614

Closed
feihugis opened this issue Nov 12, 2021 · 10 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@feihugis

Description

CUDA graph capture fails if the engine contains layers generated by Myelin. The log output can be found below:

root@gpu02:/model# trtexec --loadEngine=model_optimized_tensorrt.trt --verbose --useCudaGraph
&&&& RUNNING TensorRT.trtexec [TensorRT v8003] # trtexec --loadEngine=model_optimized_tensorrt.trt --verbose --useCudaGraph
[11/12/2021-21:46:42] [I] === Model Options ===
[11/12/2021-21:46:42] [I] Format: *
[11/12/2021-21:46:42] [I] Model:
[11/12/2021-21:46:42] [I] Output:
[11/12/2021-21:46:42] [I] === Build Options ===
[11/12/2021-21:46:42] [I] Max batch: 1
[11/12/2021-21:46:42] [I] Workspace: 16 MiB
[11/12/2021-21:46:42] [I] minTiming: 1
[11/12/2021-21:46:42] [I] avgTiming: 8
[11/12/2021-21:46:42] [I] Precision: FP32
[11/12/2021-21:46:42] [I] Calibration:
[11/12/2021-21:46:42] [I] Refit: Disabled
[11/12/2021-21:46:42] [I] Sparsity: Disabled
[11/12/2021-21:46:42] [I] Safe mode: Disabled
[11/12/2021-21:46:42] [I] Restricted mode: Disabled
[11/12/2021-21:46:42] [I] Save engine:
[11/12/2021-21:46:42] [I] Load engine: model_optimized_tensorrt.trt
[11/12/2021-21:46:42] [I] NVTX verbosity: 0
[11/12/2021-21:46:42] [I] Tactic sources: Using default tactic sources
[11/12/2021-21:46:42] [I] timingCacheMode: local
[11/12/2021-21:46:42] [I] timingCacheFile:
[11/12/2021-21:46:42] [I] Input(s)s format: fp32:CHW
[11/12/2021-21:46:42] [I] Output(s)s format: fp32:CHW
[11/12/2021-21:46:42] [I] Input build shapes: model
[11/12/2021-21:46:42] [I] Input calibration shapes: model
[11/12/2021-21:46:42] [I] === System Options ===
[11/12/2021-21:46:42] [I] Device: 0
[11/12/2021-21:46:42] [I] DLACore:
[11/12/2021-21:46:42] [I] Plugins:
[11/12/2021-21:46:42] [I] === Inference Options ===
[11/12/2021-21:46:42] [I] Batch: 1
[11/12/2021-21:46:42] [I] Input inference shapes: model
[11/12/2021-21:46:42] [I] Iterations: 10
[11/12/2021-21:46:42] [I] Duration: 3s (+ 200ms warm up)
[11/12/2021-21:46:42] [I] Sleep time: 0ms
[11/12/2021-21:46:42] [I] Streams: 1
[11/12/2021-21:46:42] [I] ExposeDMA: Disabled
[11/12/2021-21:46:42] [I] Data transfers: Enabled
[11/12/2021-21:46:42] [I] Spin-wait: Disabled
[11/12/2021-21:46:42] [I] Multithreading: Disabled
[11/12/2021-21:46:42] [I] CUDA Graph: Enabled
[11/12/2021-21:46:42] [I] Separate profiling: Disabled
[11/12/2021-21:46:42] [I] Time Deserialize: Disabled
[11/12/2021-21:46:42] [I] Time Refit: Disabled
[11/12/2021-21:46:42] [I] Skip inference: Disabled
[11/12/2021-21:46:42] [I] Inputs:
[11/12/2021-21:46:42] [I] === Reporting Options ===
[11/12/2021-21:46:42] [I] Verbose: Enabled
[11/12/2021-21:46:42] [I] Averages: 10 inferences
[11/12/2021-21:46:42] [I] Percentile: 99
[11/12/2021-21:46:42] [I] Dump refittable layers:Disabled
[11/12/2021-21:46:42] [I] Dump output: Disabled
[11/12/2021-21:46:42] [I] Profile: Disabled
[11/12/2021-21:46:42] [I] Export timing to JSON file:
[11/12/2021-21:46:42] [I] Export output to JSON file:
[11/12/2021-21:46:42] [I] Export profile to JSON file:
[11/12/2021-21:46:42] [I]
[11/12/2021-21:46:53] [I] === Device Information ===
[11/12/2021-21:46:53] [I] Selected Device: Tesla V100-SXM2-32GB
[11/12/2021-21:46:53] [I] Compute Capability: 7.0
[11/12/2021-21:46:53] [I] SMs: 80
[11/12/2021-21:46:53] [I] Compute Clock Rate: 1.53 GHz
[11/12/2021-21:46:53] [I] Device Global Memory: 32510 MiB
[11/12/2021-21:46:53] [I] Shared Memory per SM: 96 KiB
[11/12/2021-21:46:53] [I] Memory Bus Width: 4096 bits (ECC enabled)
[11/12/2021-21:46:53] [I] Memory Clock Rate: 0.877 GHz
[11/12/2021-21:46:53] [I]
[11/12/2021-21:46:53] [I] TensorRT version: 8003
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Proposal version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::Split version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[11/12/2021-21:46:53] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[11/12/2021-21:46:53] [I] [TRT] [MemUsageChange] Init CUDA: CPU +252, GPU +0, now: CPU 352, GPU 506 (MiB)
[11/12/2021-21:46:53] [I] [TRT] Loaded engine size: 93 MB
[11/12/2021-21:46:53] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 352 MiB, GPU 506 MiB
[11/12/2021-21:46:54] [V] [TRT] Using cublasLt a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +368, GPU +168, now: CPU 721, GPU 764 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Using cuDNN as a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +166, GPU +170, now: CPU 887, GPU 934 (MiB)
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 887, GPU 916 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Deserialization required 1218352 microseconds.
[11/12/2021-21:46:54] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 887 MiB, GPU 916 MiB
[11/12/2021-21:46:54] [I] Engine loaded in 1.71457 sec.
[11/12/2021-21:46:54] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 794 MiB, GPU 916 MiB
[11/12/2021-21:46:54] [V] [TRT] Using cublasLt a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 794, GPU 926 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Using cuDNN as a tactic source
[11/12/2021-21:46:54] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 794, GPU 934 (MiB)
[11/12/2021-21:46:54] [V] [TRT] Total per-runner device memory is 0
[11/12/2021-21:46:54] [V] [TRT] Total per-runner host memory is 32
[11/12/2021-21:46:54] [V] [TRT] Allocated activation device memory of size 147456
[11/12/2021-21:46:54] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 796 MiB, GPU 934 MiB
[11/12/2021-21:46:54] [I] Created input binding for src_tokens with dimensions 1x16
[11/12/2021-21:46:54] [I] Created output binding for topk_probs with dimensions 1x16x768
[11/12/2021-21:46:54] [I] Created output binding for topk_index with dimensions 1x16x768
[11/12/2021-21:46:54] [I] Starting inference
[11/12/2021-21:46:54] [V] [TRT] myelinAllocCb allocated GPU 644 bytes at 0x7f299a25b800.
[11/12/2021-21:46:54] [V] [TRT] myelinAllocCb allocated CPU 1540 bytes at 0x7f2948003300.
[11/12/2021-21:46:54] [E] Error[1]: [runtimeUtils.cpp::gateMyelinGraphStartOnStream::154] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
[11/12/2021-21:46:54] [W] The CUDA graph capture on the stream has failed.
[11/12/2021-21:46:54] [W] The built TensorRT engine contains operations that are not permitted under CUDA graph capture mode.
[11/12/2021-21:46:54] [W] The specified --useCudaGraph flag has been ignored. The inference will be launched without using CUDA graph launch.
[11/12/2021-21:46:58] [I] Warmup completed 1728 queries over 200 ms
[11/12/2021-21:46:58] [I] Timing trace has 24820 queries over 3.00012 s
[11/12/2021-21:42:44] [V] [TRT] Engine Layer Information:
Layer(Constant): encoder.embed_tokens.weight, Tactic: 0,  -> (Unnamed Layer* 0) [Constant]_output[Float(30522,768)]
Layer(Gather): Gather_0, Tactic: 9, (Unnamed Layer* 0) [Constant]_output[Float(30522,768)], src_tokens[Int32(1,16)] -> 253[Float(1,16,768)]
Layer(Myelin): {ForeignNode[254 + (Unnamed Layer* 3) [Shuffle]...Cast_9]}, Tactic: 0, src_tokens[Int32(1,16)] -> 262[Int32(1,16)]
Layer(Constant): encoder.embed_positions.weight, Tactic: 0,  -> (Unnamed Layer* 27) [Constant]_output[Float(167,768)]
Layer(Gather): Gather_10, Tactic: 9, (Unnamed Layer* 27) [Constant]_output[Float(167,768)], 262[Int32(1,16)] -> 263[Float(1,16,768)]
Layer(ElementWise): Add_11, Tactic: 1, 253[Float(1,16,768)], 263[Float(1,16,768)] -> 264[Float(1,16,768)]
Layer(Constant): 265 + (Unnamed Layer* 31) [Shuffle], Tactic: 0,  -> (Unnamed Layer* 31) [Shuffle]_output[Float(1,1,1)]
Layer(Constant): 267 + (Unnamed Layer* 34) [Shuffle], Tactic: 0,  -> (Unnamed Layer* 34) [Shuffle]_output[Float(1,1,1)]
Layer(ElementWise): Add_15, Tactic: 1, 264[Float(1,16,768)], (Unnamed Layer* 34) [Shuffle]_output[Float(1,1,1)] -> topk_index[Float(1,16,768)]
Layer(ElementWise): Add_13, Tactic: 1, 264[Float(1,16,768)], (Unnamed Layer* 31) [Shuffle]_output[Float(1,1,1)] -> topk_probs[Float(1,16,768)]

Environment

TensorRT Version: v8003 (8.0.3)
NVIDIA GPU: Tesla V100-SXM2-32GB
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.2.4
Operating System: Ubuntu 20.04.3 LTS
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:21.10-py3

Relevant Files

Steps To Reproduce

  1. Generate the engine file from the ONNX model: trtexec --onnx=model_optimized_tensorrt_debug.onnx --verbose --useCudaGraph --saveEngine=model_optimized_tensorrt.trt --refit

  2. Reproduce the issue: trtexec --loadEngine=model_optimized_tensorrt.trt --verbose --useCudaGraph

@zerollzeng
Collaborator

zerollzeng commented Dec 9, 2021

This is expected: Myelin may perform synchronization on the inference stream. CUDA graph capture will also fail if the network contains loops, conditional layers (if/else), or anything else that needs to synchronize during inference.
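This class of failure can be reproduced outside TensorRT: the CUDA runtime rejects synchronizing calls on a stream that is being captured. A minimal hypothetical sketch (not from trtexec; requires a CUDA-capable device and nvcc):

```cpp
// Sketch of why stream synchronization breaks CUDA graph capture.
// Hypothetical standalone example, not TensorRT code.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin capturing work submitted to this stream into a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // A synchronizing call on a capturing stream is rejected by the
    // runtime; an internal call of this class is what produces
    // "operation not permitted when stream is capturing" in the log.
    cudaError_t err = cudaStreamSynchronize(stream);
    printf("sync during capture: %s\n", cudaGetErrorString(err));

    // The failed call also invalidates the capture, so ending it
    // reports an error as well and no usable graph is produced.
    cudaGraph_t graph = nullptr;
    err = cudaStreamEndCapture(stream, &graph);
    printf("end capture: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    return 0;
}
```

trtexec handles this by detecting the capture failure and falling back to a plain stream launch, which is exactly the warning path shown in the log above.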

@feihugis
Author

feihugis commented Dec 9, 2021

@zerollzeng Thanks for your reply! The model itself should be compatible with CUDA graphs, because it works with CUDA graph capture in both PyTorch and ONNX Runtime. So I guess the failure is caused by Myelin. Do you think this issue could be resolved by changes inside Myelin? Is Myelin open source? If so, I'd like to do more debugging there.

@zerollzeng
Collaborator

No, Myelin is not open source. I think TensorRT handles CUDA graphs a bit differently compared to PyTorch/ONNX Runtime: PyTorch executes the model dynamically, while TensorRT executes an engine statically. But I'm not quite sure.

@zerollzeng
Collaborator

zerollzeng commented Dec 10, 2021

If your model doesn't have a lot of lightweight kernels (where kernel launch time is large compared to execution time), you won't get much speedup with CUDA graphs enabled.

@feihugis
Author

For the model I tested, CUDA graphs reduced latency by around 50% for both PyTorch and ONNX Runtime. We also observed that TensorRT significantly accelerates this model (around 40%) without CUDA graphs, so we wanted to try TensorRT + CUDA graphs, which may outperform PyTorch/ONNX Runtime + CUDA graphs.

@ttyio
Collaborator

ttyio commented Dec 13, 2021

@feihugis, the failure here is a V100-specific problem. Do you have a Turing or Ampere device to run on? We should support running CUDA graphs for your model there. Thanks!

@ttyio ttyio added Release: 8.x triaged Issue has been triaged by maintainers labels Dec 13, 2021
@feihugis
Author

Thanks @ttyio! I will find a Turing/Ampere device to test it and keep you updated.

@ttyio
Collaborator

ttyio commented Mar 16, 2022

I will close this; please reopen if you still have questions, thanks!

@ttyio ttyio closed this as completed Mar 16, 2022
@Coastchb

@feihugis
Hi, I also get the "[11/12/2021-21:46:54] [E] Error[1]: [runtimeUtils.cpp::gateMyelinGraphStartOnStream::154] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)" error.
How did you solve the problem?

@feihugis
Author

feihugis commented Dec 2, 2024

> @feihugis Hi, I also get the "[11/12/2021-21:46:54] [E] Error[1]: [runtimeUtils.cpp::gateMyelinGraphStartOnStream::154] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)" error. How did you solve the problem?

@Coastchb Sorry, I've totally forgotten whether I solved it or not. My suggestion is to try it on a Turing/Ampere device.
