Enable cuda graph in TensorRT EP #10423
Conversation
if (!trt_context->enqueueV2(&buffers[0], stream, nullptr)) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "TensorRT EP execution context enqueue failed.");
}
cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
Should the CUDA calls be wrapped in CUDA_CALL_THROW() to deal with any CUDA call errors?
@@ -482,6 +482,7 @@ typedef struct OrtTensorRTProviderOptions {
  int trt_engine_decryption_enable;             // enable engine decryption. Default 0 = false, nonzero = true
  const char* trt_engine_decryption_lib_path;   // specify engine decryption library path
  int trt_force_sequential_engine_build;        // force building TensorRT engine sequentially. Default 0 = false, nonzero = true
  int trt_cuda_graph_enable;                    // enable cuda graph. Default 0 = false, nonzero = true
Shouldn't the struct get versioned with the addition of a new option?
I thought we agreed we shouldn't be updating provider structs in the C API anymore (for the very reason that Hari brings up about versioning), and should instead only be updating the opaque struct OrtTensorRTProviderOptionsV2.
+@chilo-ms FYI
Yeah, I think it should become OrtTensorRTProviderOptionsV3 if V2 has shipped with the previous ORT release (This was my understanding)
Yeah, I think it should become OrtTensorRTProviderOptionsV3 if V2 has shipped with the previous ORT release (This was my understanding)
we don't need to update the version of the opaque struct when adding fields right? since it's only accessed via api and not directly. if the newly added field can't be represented as a string, then we would need to add another api to access those.
IIRC from the time we were discussing it, we need to version it for any mutation (addition or removal of fields). If we add support for a new field without versioning it, doesn't the UpdateTensorRTProviderOptions API behave differently in ORT 1.10 (where the V2 struct won't support the new field) and in ORT 1.11 (where the V2 struct will support the new field) ?
We had a consensus on using key-value strings for provider options that can be represented as strings, but we didn't explicitly say we support versioning. We do require error reporting for undocumented config keys.
@stevenlix, here is my "enable timing cache" PR; you can reference it to add a new field to the opaque struct OrtTensorRTProviderOptionsV2.
No versioning of struct is required beyond V2. The availability or the unavailability of APIs to manipulate the V2 struct provides the versioning.
"The availability or the unavailability of APIs to manipulate the V2 struct provides the versioning." - The APIs that exist to manipulate the V2 struct will be the same in ORT 1.10 and ORT 1.11, won't they? It is just that the UpdateTensorRTProviderOptions() API will additionally support one more key (enable_cuda_graph) in 1.11 (which obviously won't be supported in the released 1.10).
yes, one uses CreateTensorRTProviderOptions() and UpdateTensorRTProviderOptions() to create and update the struct. The second API deals with strings, and will recognize a new string key "enable_cuda_graph" in ORT 1.11. The API signatures don't change.
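For illustration, the string-key flow described above could look roughly like the sketch below. This is not code from the PR; the exact key name (shown here as "trt_cuda_graph_enable") is an assumption, and the OrtStatus* returns are ignored for brevity where real code should check them.

#include <onnxruntime_c_api.h>

void EnableCudaGraphOnTrtEp(OrtSessionOptions* session_options) {
  const OrtApi* api = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  // The V2 struct is opaque: it is only created and mutated through the C API.
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  api->CreateTensorRTProviderOptions(&trt_options);

  // A newly supported option is just another string key; an unrecognized key
  // surfaces as an error status instead of requiring a struct/ABI change.
  const char* keys[] = {"trt_cuda_graph_enable"};  // hypothetical key for this PR
  const char* values[] = {"1"};
  api->UpdateTensorRTProviderOptions(trt_options, keys, values, 1);

  api->SessionOptionsAppendExecutionProvider_TensorRT_V2(session_options, trt_options);
  api->ReleaseTensorRTProviderOptions(trt_options);
}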
talked to @chilo-ms offline. He is going to make a PR for OrtTensorRTProviderOptionsV2 that is separate from the timing cache PR. After that PR is merged, I will add the cuda_graph option to OrtTensorRTProviderOptionsV2.
cudaGraph_t graph;
*cuda_graph = &(trt_state->cuda_graph_instance);
// warm up for cuda graph capturing
if (!trt_context->enqueueV2(&buffers[0], stream, nullptr)) {
Does enqueueV2() synchronize with the GPU before returning? If not, we may have to wait for the warm-up tasks queued on the stream to finish before the stream capture...
why is this warm up even needed?
seems like it's for handling a known issue with dynamic shapes? (please add a comment)
Warm-up is needed to do initialization (flushing any old context) before graph capturing, according to Nvidia. CUDA graph still has issues in some dynamic shape cases.
in the note section of the api doc for enqueueV2 (https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#a2f4429652736e8ef6e19f433400108c7),
it seems to allude that dynamic shapes can work if you call enqueueV2() once before graph capture? is that similar to what you are doing?
If the shape changes, the cuda graph needs to be recaptured, which is not desirable because the capturing would happen during inference.
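For readers following the warm-up discussion above, the general CUDA stream-capture pattern being described is sketched below. This is only an illustration under stated assumptions (a prepared IExecutionContext, bindings, and stream), not the PR's actual code, and error handling is collapsed to bool returns.

#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: warm-up enqueue, synchronize, capture the same enqueue, instantiate once, replay.
bool CaptureAndReplay(nvinfer1::IExecutionContext* trt_context,
                      void** buffers, cudaStream_t stream,
                      cudaGraphExec_t& graph_exec) {
  // Warm-up run outside the capture so TensorRT finishes any lazy initialization.
  if (!trt_context->enqueueV2(buffers, stream, nullptr)) return false;
  if (cudaStreamSynchronize(stream) != cudaSuccess) return false;

  // Capture the same enqueue into a graph.
  cudaGraph_t graph = nullptr;
  if (cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed) != cudaSuccess) return false;
  if (!trt_context->enqueueV2(buffers, stream, nullptr)) return false;
  if (cudaStreamEndCapture(stream, &graph) != cudaSuccess) return false;

  // Instantiate the executable graph once; the captured graph can then be destroyed.
  bool ok = cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0) == cudaSuccess;
  cudaGraphDestroy(graph);
  if (!ok) return false;

  // Later runs replay the executable graph instead of re-enqueueing kernel by kernel.
  if (cudaGraphLaunch(graph_exec, stream) != cudaSuccess) return false;
  return cudaStreamSynchronize(stream) == cudaSuccess;
}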
int fused_nodes_size = fused_nodes.size();
cuda_graphs_.reserve(fused_nodes_size);
for (int node_idx = 0; node_idx < fused_nodes_size; node_idx++) {
  const auto* fused_node = fused_nodes[node_idx];
In general, how does the TRT EP handle control flow nodes? I fear we must explicitly not support using cuda graphs for models with control flow nodes, as the graph captured for one input may not be the same graph required for another input (because of the dynamic graph branching).
agreed. we should explicitly exclude graphs with loops/conditionals.
We may highlight the constraints for dynamic shape cases in the documentation, so that users can choose whether to enable cuda graph or not.
it would be nice to enforce the constraints (no dynamic shapes and no dynamic graphs) in code rather than punting to users/documentation. Let's see if there's a reasonable balance that can be achieved here.
This is a little bit tricky. A dynamic shape model is okay, but when the input shapes of the incoming data change, the cuda graph needs to be recaptured, so the check has to be done at runtime. There is an API to update an executable graph, but I haven't seen any API that can check an existing cuda graph's profile, and we can't afford to update the graph for every enqueue.
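The "API to update an executable graph" mentioned here is presumably cudaGraphExecUpdate (CUDA 10.2+, pre-CUDA-12 signature). A rough sketch of the trade-off it offers is below; this is only an illustration of why per-enqueue updates are unattractive, not something this PR implements.

#include <cuda_runtime.h>

// Sketch: try to patch an existing executable graph with a freshly captured graph;
// fall back to re-instantiation if the topology changed (e.g. new shapes picked
// different kernels). Doing this on every enqueue would defeat the purpose of graphs.
bool RefreshExecutableGraph(cudaGraphExec_t& graph_exec, cudaGraph_t new_capture) {
  cudaGraphNode_t error_node = nullptr;
  cudaGraphExecUpdateResult update_result;
  if (cudaGraphExecUpdate(graph_exec, new_capture, &error_node, &update_result) == cudaSuccess &&
      update_result == cudaGraphExecUpdateSuccess) {
    return true;  // cheap in-place update succeeded
  }
  // Topology changed: destroy and re-instantiate the executable graph.
  cudaGraphExecDestroy(graph_exec);
  return cudaGraphInstantiate(&graph_exec, new_capture, nullptr, nullptr, 0) == cudaSuccess;
}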
}
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(*cuda_graph, graph, NULL, NULL, 0);
}
what happens to graph? do you need to destroy it? else we leak memory?
@@ -96,6 +97,9 @@ struct TensorrtFuncState {
  bool int8_calibration_cache_available;
  bool dla_enable;
  int dla_core;
  bool cuda_graph_enable;
  cudaGraphExec_t* cuda_graph = nullptr;
why is there both cuda_graph and cuda_graph_instance? cudaGraphExec_t is already a pointer, so why do we need the pointer to pointer?
i also find the naming confusing, since we use the cuda_graph variable to refer to an executable graph in some places and to captured graphs in other places.
do the cudaGraphExec_t's need to be destroyed? do we need to use unique_ptrs here?
cuda_graph (maybe cuda_graph_ptr is a more appropriate name) is used to indicate whether a graph has already been captured for the subgraph. If the CUDA graph is already there, graph capturing will be skipped during inference, so we only capture the graph once.
yes, but there is also some naming confusion between a captured graph and an executable graph (the one instantiated from a captured graph) that can be launched.
i would like the naming to carry that distinction, so executables should probably have exec or instance in the name.
// Run TRT inference
if (trt_state->cuda_graph_enable) {
  if (*cuda_graph == nullptr) {
i think this entire path isn't thread safe as cuda graphs and associated api's don't seem to be thread safe.
Yes, if cuda graphs are enabled, then Run() will no longer be thread-safe, and calls to Run() need to be serialized either by the caller, or ORT itself should perform the graph replay within a critical section.
we have a lock in inference and compute() is serialized already.
ah yes tensorrt_mu_ already protects the compute_func body.
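A minimal sketch of the serialization being discussed is below. The mutex name here is only a stand-in; in the TRT EP the existing tensorrt_mu_ lock around the compute_func body already provides this guarantee.

#include <cuda_runtime.h>
#include <mutex>

std::mutex graph_replay_mutex;  // stand-in for the EP's existing lock

// Serialize graph replay so concurrent Run() calls do not race on the CUDA graph APIs.
void ReplayGraphSerialized(cudaGraphExec_t graph_exec, cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(graph_replay_mutex);
  cudaGraphLaunch(graph_exec, stream);
  cudaStreamSynchronize(stream);
}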
add tests that exercise these new code paths?
CUDA_CALL_THROW(cudaStreamEndCapture(stream, &graph));
CUDA_CALL_THROW(cudaStreamSynchronize(stream));
CUDA_CALL_THROW(cudaGraphInstantiate(*cuda_graph_ptr, graph, NULL, NULL, 0));
CUDA_CALL_THROW(cudaGraphDestroy(graph));
safer to use unique_ptr right?
otherwise in error/exception paths you may not hit this destroy.
                  "TensorRT engine has operations that are not allowed in CUDA graph capture mode. ",
                  "Please disable trt_cuda_graph_enable.");
}
CUDA_CALL_THROW(cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed));
If certain operators can't be supported by the TensorRT and CUDA EPs, does the CPU EP come into play? If so, the same comment as this: #9978 (comment)
The cuda graph capturing in this PR is only for TensorRT subgraphs. Unsupported ops can still fall back to CUDA/CPU EPs.
Understood now, thanks.
One question though: let us say the model looks like this: TensorRT subgraph 1 -> CPU op -> TensorRT subgraph 2, and you are capturing the graphs for both TRT subgraphs. Will the necessary synchronization logic before the CPU op still happen with the cuda graph setup?
I think the case is the same with and without cuda graph, isn't it? The CPU needs to wait until TRT subgraph 1 produces its result. It would be interesting if there were a better cuda graph setup that could avoid the wait.
size_t fused_nodes_size = fused_nodes.size();
std::vector<std::unique_ptr<cudaGraphExec_t>> executable_cuda_graphs(fused_nodes_size);
for (size_t node_idx = 0; node_idx < fused_nodes_size; node_idx++) {
  executable_cuda_graphs[node_idx] = std::make_unique<cudaGraphExec_t>();
since cudaGraph_t and cudaGraphExec_t are already pointers, i think we can't use make_unique/unique_ptr like this?
we need to see if there's a better way to manage the memory of the cudaGraphExec and cudaGraph; see the sketch below for one option.
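One possible way to keep RAII ownership despite cudaGraph_t and cudaGraphExec_t already being pointer typedefs is to wrap the pointed-to handles in unique_ptrs with custom deleters. The sketch below is only an illustration of that option, not necessarily what the PR should adopt.

#include <cuda_runtime.h>
#include <memory>
#include <type_traits>

// Deleters call the matching destroy function, so error/exception paths still clean up.
struct CudaGraphDeleter {
  void operator()(cudaGraph_t g) const {
    if (g) cudaGraphDestroy(g);
  }
};
struct CudaGraphExecDeleter {
  void operator()(cudaGraphExec_t e) const {
    if (e) cudaGraphExecDestroy(e);
  }
};

using UniqueCudaGraph = std::unique_ptr<std::remove_pointer_t<cudaGraph_t>, CudaGraphDeleter>;
using UniqueCudaGraphExec = std::unique_ptr<std::remove_pointer_t<cudaGraphExec_t>, CudaGraphExecDeleter>;

// Example: instantiate an owned executable graph from a freshly captured graph.
inline UniqueCudaGraphExec InstantiateOwned(cudaGraph_t captured) {
  UniqueCudaGraph graph_owner(captured);  // the captured graph is freed on every exit path
  cudaGraphExec_t exec = nullptr;
  cudaGraphInstantiate(&exec, graph_owner.get(), nullptr, nullptr, 0);
  return UniqueCudaGraphExec(exec);
}

A std::vector<UniqueCudaGraphExec> could then replace the std::vector<std::unique_ptr<cudaGraphExec_t>> shown in the diff above.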
CUDA graph can improve TRT inference latency by reducing kernel launch overhead. In particular, models with small batch sizes and many TRT kernels could benefit from it.
This PR adds CUDA graph support to the TRT EP.