Enable cuda graph in TensorRT EP #10423
Conversation
if (!trt_context->enqueueV2(&buffers[0], stream, nullptr)) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "TensorRT EP execution context enqueue failed.");
}
cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
Should the CUDA calls be wrapped in CUDA_CALL_THROW() to deal with any CUDA call errors?
@@ -482,6 +482,7 @@ typedef struct OrtTensorRTProviderOptions {
  int trt_engine_decryption_enable;             // enable engine decryption. Default 0 = false, nonzero = true
  const char* trt_engine_decryption_lib_path;   // specify engine decryption library path
  int trt_force_sequential_engine_build;        // force building TensorRT engine sequentially. Default 0 = false, nonzero = true
  int trt_cuda_graph_enable;                    // enable cuda graph. Default 0 = false, nonzero = true
Shouldn't the struct get versioned with the addition of a new option?
I thought we agreed we shouldn't be updating provider structs in the C API anymore (for the very reason that Hari brings up about versioning), and should instead only be updating the opaque struct OrtTensorRTProviderOptionsV2.
+@chilo-ms FYI
Yeah, I think it should become OrtTensorRTProviderOptionsV3 if V2 has shipped with the previous ORT release (This was my understanding)
Yeah, I think it should become OrtTensorRTProviderOptionsV3 if V2 has shipped with the previous ORT release (This was my understanding)
we don't need to update the version of the opaque struct when adding fields right? since it's only accessed via api and not directly. if the newly added field can't be represented as a string, then we would need to add another api to access those.
IIRC from the time we were discussing it, we need to version it for any mutation (addition or removal of fields). If we add support for a new field without versioning it, doesn't the UpdateTensorRTProviderOptions API behave differently in ORT 1.10 (where the V2 struct won't support the new field) and in ORT 1.11 (where the V2 struct will support the new field) ?
We had a consensus on using key-value strings for provider options that can be represented as strings, but we didn't explicitly say we support versioning. We do require error reporting for undocumented config keys.
@stevenlix, here is my "enable timing cache" PR; you can reference it to add a new field to the opaque struct OrtTensorRTProviderOptionsV2.
No versioning of struct is required beyond V2. The availability or the unavailability of APIs to manipulate the V2 struct provides the versioning.
"The availability or the unavailability of APIs to manipulate the V2 struct provides the versioning." - The APIs that exist to manipulate the V2 struct will be the same in ORT 1.10 and ORT 1.11, won't they? It is just that the UpdateTensorRTProviderOptions() API will additionally support one more key (enable_cuda_graph) in 1.11 (which obviously won't be supported in the released 1.10).
yes, one uses CreateTensorRTProviderOptions() and UpdateTensorRTProviderOptions() to create and update the struct. The second API deals with strings, and will recognize a new string key "enable_cuda_graph" in ORT 1.11. The API signatures don't change.
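For illustration, the string-key flow described above could look roughly like the sketch below. This is not code from the PR; the exact key name (shown here as "trt_cuda_graph_enable") is an assumption, and the OrtStatus* returns are ignored for brevity where real code should check them.

#include <onnxruntime_c_api.h>

void EnableCudaGraphOnTrtEp(OrtSessionOptions* session_options) {
  const OrtApi* api = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  // The V2 struct is opaque: it is only created and mutated through the C API.
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  api->CreateTensorRTProviderOptions(&trt_options);

  // A newly supported option is just another string key; an unrecognized key
  // surfaces as an error status instead of requiring a struct/ABI change.
  const char* keys[] = {"trt_cuda_graph_enable"};  // hypothetical key for this PR
  const char* values[] = {"1"};
  api->UpdateTensorRTProviderOptions(trt_options, keys, values, 1);

  api->SessionOptionsAppendExecutionProvider_TensorRT_V2(session_options, trt_options);
  api->ReleaseTensorRTProviderOptions(trt_options);
}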
talked to @chilo-ms offline. He is going to make a PR for OrtTensorRTProviderOptionsV2 that is separate from the timing cache PR. After that PR is merged, I will add the cuda_graph option to OrtTensorRTProviderOptionsV2.
cudaGraph_t graph;
*cuda_graph = &(trt_state->cuda_graph_instance);
// warm up for cuda graph capturing
if (!trt_context->enqueueV2(&buffers[0], stream, nullptr)) {
Does enqueueV2() synchronize with the GPU before returning? If not, we may have to wait for the warm-up tasks queued on the stream to finish before the stream capture...
why is this warm up even needed?
seems like it's for handling a known issue with dynamic shapes? (please add a comment)
Warm-up is needed to do initialization (flushing any old context) before graph capturing, according to Nvidia. CUDA graph still has issues in some dynamic shape cases.
in the note section of the api doc for enqueueV2 (https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#a2f4429652736e8ef6e19f433400108c7),
it seems to allude that dynamic shapes can work if you call enqueueV2() once before graph capture? is that similar to what you are doing?
If the shape changes, the cuda graph needs to be recaptured, which is not desirable because the capturing would happen during inference.
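For readers following the warm-up discussion above, the general CUDA stream-capture pattern being described is sketched below. This is only an illustration under stated assumptions (a prepared IExecutionContext, bindings, and stream), not the PR's actual code, and error handling is collapsed to bool returns.

#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: warm-up enqueue, synchronize, capture the same enqueue, instantiate once, replay.
bool CaptureAndReplay(nvinfer1::IExecutionContext* trt_context,
                      void** buffers, cudaStream_t stream,
                      cudaGraphExec_t& graph_exec) {
  // Warm-up run outside the capture so TensorRT finishes any lazy initialization.
  if (!trt_context->enqueueV2(buffers, stream, nullptr)) return false;
  if (cudaStreamSynchronize(stream) != cudaSuccess) return false;

  // Capture the same enqueue into a graph.
  cudaGraph_t graph = nullptr;
  if (cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed) != cudaSuccess) return false;
  if (!trt_context->enqueueV2(buffers, stream, nullptr)) return false;
  if (cudaStreamEndCapture(stream, &graph) != cudaSuccess) return false;

  // Instantiate the executable graph once; the captured graph can then be destroyed.
  bool ok = cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0) == cudaSuccess;
  cudaGraphDestroy(graph);
  if (!ok) return false;

  // Later runs replay the executable graph instead of re-enqueueing kernel by kernel.
  if (cudaGraphLaunch(graph_exec, stream) != cudaSuccess) return false;
  return cudaStreamSynchronize(stream) == cudaSuccess;
}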
int fused_nodes_size = fused_nodes.size();
cuda_graphs_.reserve(fused_nodes_size);
for (int node_idx = 0; node_idx < fused_nodes_size; node_idx++) {
  const auto* fused_node = fused_nodes[node_idx];
In general, how does the TRT EP handle control flow nodes? I fear we must explicitly not support using cuda graphs for models with control flow nodes, as the graph captured for one input may not be the same graph required for another input (because of the dynamic graph branching).
agreed. we should explicitly exclude graphs with loops/conditionals.
We may highlight the constraints for dynamic shape cases in the documentation, so that users can choose whether to enable cuda graph or not.
it would be nice to enforce the constraints (no dynamic shapes and no dynamic graphs) in code rather than punting to users/documentation. Let's see if there's a reasonable balance that can be achieved here.
This is a little bit tricky. A dynamic shape model is okay, but when the input shapes of the incoming data change, the cuda graph needs to be recaptured, so the check has to be done at runtime. There is an API to update an executable graph, but I haven't seen any API that can check an existing cuda graph's profile, and we can't afford to update the graph for every enqueue.
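The "API to update an executable graph" mentioned here is presumably cudaGraphExecUpdate (CUDA 10.2+, pre-CUDA-12 signature). A rough sketch of the trade-off it offers is below; this is only an illustration of why per-enqueue updates are unattractive, not something this PR implements.

#include <cuda_runtime.h>

// Sketch: try to patch an existing executable graph with a freshly captured graph;
// fall back to re-instantiation if the topology changed (e.g. new shapes picked
// different kernels). Doing this on every enqueue would defeat the purpose of graphs.
bool RefreshExecutableGraph(cudaGraphExec_t& graph_exec, cudaGraph_t new_capture) {
  cudaGraphNode_t error_node = nullptr;
  cudaGraphExecUpdateResult update_result;
  if (cudaGraphExecUpdate(graph_exec, new_capture, &error_node, &update_result) == cudaSuccess &&
      update_result == cudaGraphExecUpdateSuccess) {
    return true;  // cheap in-place update succeeded
  }
  // Topology changed: destroy and re-instantiate the executable graph.
  cudaGraphExecDestroy(graph_exec);
  return cudaGraphInstantiate(&graph_exec, new_capture, nullptr, nullptr, 0) == cudaSuccess;
}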
}
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(*cuda_graph, graph, NULL, NULL, 0);
}
what happens to graph? do you need to destroy it? else we leak memory?
@@ -96,6 +97,9 @@ struct TensorrtFuncState {
  bool int8_calibration_cache_available;
  bool dla_enable;
  int dla_core;
  bool cuda_graph_enable;
  cudaGraphExec_t* cuda_graph = nullptr;
why is there both cuda_graph and cuda_graph_instance? cudaGraphExec_t is already a pointer, so why do we need the pointer to pointer?
i also find the naming confusing, since we use the cuda_graph variable to refer to an executable graph in some places and to captured graphs in other places.
do the cudaGraphExec_t's need to be destroyed? do we need to use unique_ptrs here?
cuda_graph (maybe cuda_graph_ptr is a more appropriate name) is used to indicate whether a graph has already been captured for the subgraph. If the CUDA graph is already there, graph capturing will be skipped during inference, so we only capture the graph once.
yes, but there is also some naming confusion between a captured graph and an executable graph (the one instantiated from a captured graph) that can be launched.
i would like the naming to carry that distinction, so executables should probably have exec or instance in the name.
// Run TRT inference
if (trt_state->cuda_graph_enable) {
  if (*cuda_graph == nullptr) {
i think this entire path isn't thread safe as cuda graphs and associated api's don't seem to be thread safe.
Yes, if cuda graphs are enabled, then Run() will no longer be thread-safe, and calls to Run() need to be serialized either by the caller, or ORT itself should perform the graph replay within a critical section.
we have a lock in inference and compute() is serialized already.
ah yes tensorrt_mu_ already protects the compute_func body.
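A minimal sketch of the serialization being discussed is below. The mutex name here is only a stand-in; in the TRT EP the existing tensorrt_mu_ lock around the compute_func body already provides this guarantee.

#include <cuda_runtime.h>
#include <mutex>

std::mutex graph_replay_mutex;  // stand-in for the EP's existing lock

// Serialize graph replay so concurrent Run() calls do not race on the CUDA graph APIs.
void ReplayGraphSerialized(cudaGraphExec_t graph_exec, cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(graph_replay_mutex);
  cudaGraphLaunch(graph_exec, stream);
  cudaStreamSynchronize(stream);
}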
add tests that exercise these new code paths?
CUDA_CALL_THROW(cudaStreamEndCapture(stream, &graph));
CUDA_CALL_THROW(cudaStreamSynchronize(stream));
CUDA_CALL_THROW(cudaGraphInstantiate(*cuda_graph_ptr, graph, NULL, NULL, 0));
CUDA_CALL_THROW(cudaGraphDestroy(graph));
safer to use unique_ptr right?
otherwise in error/exception paths you may not hit this destroy.
                  "TensorRT engine has operations that are not allowed in CUDA graph capture mode. ",
                  "Please disable trt_cuda_graph_enable.");
}
CUDA_CALL_THROW(cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed));
If certain operators can't be supported by the TensorRT and CUDA EPs, does the CPU EP come into play? If so, the same comment as this: #9978 (comment)
The cuda graph capturing in this PR is only for TensorRT subgraphs. Unsupported ops can still fall back to CUDA/CPU EPs.
Understood now, thanks.
One question though: let us say the model looks like this: TensorRT subgraph 1 -> CPU op -> TensorRT subgraph 2, and you are capturing the graphs for both TRT subgraphs. Will the necessary synchronization logic before the CPU op still happen with the cuda graph setup?
I think the case is the same with and without cuda graph, isn't it? The CPU needs to wait until TRT subgraph 1 produces its result. It would be interesting if there were a better cuda graph setup that could avoid the wait.
size_t fused_nodes_size = fused_nodes.size();
std::vector<std::unique_ptr<cudaGraphExec_t>> executable_cuda_graphs(fused_nodes_size);
for (size_t node_idx = 0; node_idx < fused_nodes_size; node_idx++) {
  executable_cuda_graphs[node_idx] = std::make_unique<cudaGraphExec_t>();
since cudaGraph_t and cudaGraphExec_t are already pointers, i think we can't use make_unique/unique_ptr like this?
we need to see if there's a better way to manage the memory of the cudaGraphExec and cudaGraph; see the sketch below for one option.
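One possible way to keep RAII ownership despite cudaGraph_t and cudaGraphExec_t already being pointer typedefs is to wrap the pointed-to handles in unique_ptrs with custom deleters. The sketch below is only an illustration of that option, not necessarily what the PR should adopt.

#include <cuda_runtime.h>
#include <memory>
#include <type_traits>

// Deleters call the matching destroy function, so error/exception paths still clean up.
struct CudaGraphDeleter {
  void operator()(cudaGraph_t g) const {
    if (g) cudaGraphDestroy(g);
  }
};
struct CudaGraphExecDeleter {
  void operator()(cudaGraphExec_t e) const {
    if (e) cudaGraphExecDestroy(e);
  }
};

using UniqueCudaGraph = std::unique_ptr<std::remove_pointer_t<cudaGraph_t>, CudaGraphDeleter>;
using UniqueCudaGraphExec = std::unique_ptr<std::remove_pointer_t<cudaGraphExec_t>, CudaGraphExecDeleter>;

// Example: instantiate an owned executable graph from a freshly captured graph.
inline UniqueCudaGraphExec InstantiateOwned(cudaGraph_t captured) {
  UniqueCudaGraph graph_owner(captured);  // the captured graph is freed on every exit path
  cudaGraphExec_t exec = nullptr;
  cudaGraphInstantiate(&exec, graph_owner.get(), nullptr, nullptr, 0);
  return UniqueCudaGraphExec(exec);
}

A std::vector<UniqueCudaGraphExec> could then replace the std::vector<std::unique_ptr<cudaGraphExec_t>> shown in the diff above.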
CUDA graph can improve TRT inference latency by reducing kernel launch overhead. In particular, models with small batch sizes and many TRT kernels could benefit from it.
This PR adds CUDA graph support to the TRT EP.