Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable cuda graph in TensorRT EP #10423

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Enable cuda graph in TensorRT EP #10423

wants to merge 9 commits into from

Conversation

stevenlix
Copy link
Contributor

Cuda graph can improve TRT inference latency by reducing kernel launch time. Particularly the models with small batch size and many TRT kernels could benefit from it.
This PR adds cuda graph support in TRT EP.

if (!trt_context->enqueueV2(&buffers[0], stream, nullptr)) {
return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "TensorRT EP execution context enqueue failed.");
}
cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the CUDA calls have CUDA_CALL_THROW() to deal with CUDA call errors if any ?

@@ -482,6 +482,7 @@ typedef struct OrtTensorRTProviderOptions {
int trt_engine_decryption_enable; // enable engine decryption. Default 0 = false, nonzero = true
const char* trt_engine_decryption_lib_path; // specify engine decryption library path
int trt_force_sequential_engine_build; // force building TensorRT engine sequentially. Default 0 = false, nonzero = true
int trt_cuda_graph_enable; // enable cuda graph. Default 0 = false, nonzero = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the struct get versioned with the addition of a new option ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed we shouldn't be updating provider structs in c api anymore (for the very reason that Hari brings up about versioning)
and instead only updating the opaque struct OrtTensorRTProviderOptionsV2
+@chilo-ms FYI

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it should become OrtTensorRTProviderOptionsV3 if V2 has shipped with the previous ORT release (This was my understanding)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it should become OrtTensorRTProviderOptionsV3 if V2 has shipped with the previous ORT release (This was my understanding)

we don't need to update the version of the opaque struct when adding fields right? since it's only accessed via api and not directly. if the newly added field can't be represented as a string, then we would need to add another api to access those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC from the time we were discussing it, we need to version it for any mutation (addition or removal of fields). If we add support for a new field without versioning it, doesn't the UpdateTensorRTProviderOptions API behave differently in ORT 1.10 (where the V2 struct won't support the new field) and in ORT 1.11 (where the V2 struct will support the new field) ?

Copy link
Contributor

@chilo-ms chilo-ms Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a consensus on using key-value strings for provider options that can be represented by string but we didn't explicitly say we support versioning. But we require error reporting for undocumented config keys.

@stevenlix, here is my "enable timing cache" PR, you can reference it to add new field to opaque struct OrtTensorRTProviderOptionsV2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No versioning of struct is required beyond V2. The availability or the unavailability of APIs to manipulate the V2 struct provides the versioning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The availability or the unavailability of APIs to manipulate the V2 struct provides the versioning." - What APIs exists to manipulate the V2 struct will be the same in ORT 1.10 and ORT 1.11 won't they ? It is just that the UpdateTensorRTProviderOptions() API will additionally support one more key (enable_cuda_graph) in 1.11 (which obviously won't be supported in the released 1.10).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, one uses CreateTensorRTProviderOptions(), UpdateTensorRTProviderOptions() to create and update the struct.
The second api deals with strings, and will recognize a new string key "enable_cuda_graph" in ort 1.11
api signatures don't change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked to @chilo-ms offline. He is going to make a separate PR from the timing cache PR for OrtTensorRTProviderOptionsV2. After the PR merged, I will add cuda_graph option in OrtTensorRTProviderOptionsV2.

cudaGraph_t graph;
*cuda_graph = &(trt_state->cuda_graph_instance);
//warm up for cuda graph capturing
if (!trt_context->enqueueV2(&buffers[0], stream, nullptr)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does enqueueV2() synchronize with the GPU before returning ? If not, we may have to wait for the warm-up tasks queued on the stream to finish before the stream capture...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this warm up even needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like it's for handling a known issue with dynamic shapes? (please add a comment)

Copy link
Contributor Author

@stevenlix stevenlix Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warm-up is needed to do initialization (flushing any old context) before graph capturing according to Nvidia. CUDA graph still has issue in some dynamic shape cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the note section of the api doc for enqueuev2 https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#a2f4429652736e8ef6e19f433400108c7
seems to allude to dynamic shapes can work if you call enqueuev2() once before graph capture? is it similar to what you are doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If shape changed, cuda graph needs to be recaptured, which is not desired because the capturing happens in inference.

int fused_nodes_size = fused_nodes.size();
cuda_graphs_.reserve(fused_nodes_size);
for (int node_idx = 0; node_idx < fused_nodes_size; node_idx++) {
const auto* fused_node = fused_nodes[node_idx];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, how does the TRT EP handle control flow nodes ? I fear we must explicitly not support using cuda graphs for models with control flow nodes as the graph captured for one input may not the same graph required for another input (because of the dynamic graph branching).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. we should explicitly exclude graphs with loops/conditionals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may highlight the constrains for dynamic shape cases in document, so that users can choose to enable cuda graph or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to enforce constraints (doesn't support dynamic shapes and dynamic graphs) in code rather than punting to user/documentation. Let's see if there's a reasonable balance that can be achieved here.

Copy link
Contributor Author

@stevenlix stevenlix Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a little bit tricky. Dynamic shape model is okay, but when input shapes of incoming data change, cuda graph needs to be recaptured, so the check has to be done in runtime. There is an API to update executable graph, but I haven't seen any APIs that can check existing cuda graph's profile, and we can't afford to update graph for every enqueue.

}
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(*cuda_graph, graph, NULL, NULL, 0);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens to graph? do you need to destroy it? else we leak memory?

@@ -96,6 +97,9 @@ struct TensorrtFuncState {
bool int8_calibration_cache_available;
bool dla_enable;
int dla_core;
bool cuda_graph_enable;
cudaGraphExec_t* cuda_graph = nullptr;
Copy link
Member

@jywu-msft jywu-msft Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is there both cuda_graph and cuda_graph_instance?
cudaGraphExec_t is already a pointer. why do we need the pointer to pointer?
i also find the naming confusing, since we use cuda_graph variable to refer to an executable graph in some places and graphs in other places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the cudaGraphExec_t's need to be destroyed? do we need to use unique_ptrs here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda_graph (maybe cuda_graph_ptr is a more propriate name) is used to indicate if graph has been captured for the subgraph. If cuda graph has been there, graph capturing will be skipped in inference. So we only capture the graph once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but there is also some naming confusion between a captured graph and an executable graph (that one instantiates from a captured graph) that can be launched.
i would like to have the naming carry that distinction. so executables maybe should either have exec or instance in the name.

// Run TRT inference
if (trt_state->cuda_graph_enable)
{
if (*cuda_graph == nullptr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this entire path isn't thread safe as cuda graphs and associated api's don't seem to be thread safe.

Copy link
Member

@hariharans29 hariharans29 Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if cuda graphs are enabled, then Run() will no longer be thread-safe and calls to Run() needs to be serialized either by the caller or ORT itself should perform the graph replay within a critical section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have lock in the inference and the compute() is serialized already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes tensorrt_mu_ already protects the compute_func body.

@jywu-msft
Copy link
Member

add tests that exercise these new code paths?

CUDA_CALL_THROW(cudaStreamEndCapture(stream, &graph));
CUDA_CALL_THROW(cudaStreamSynchronize(stream));
CUDA_CALL_THROW(cudaGraphInstantiate(*cuda_graph_ptr, graph, NULL, NULL, 0));
CUDA_CALL_THROW(cudaGraphDestroy(graph));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

safer to use unique_ptr right?
otherwise in error/exception paths you may not hit this destroy.

"TensorRT engine has operations that are not allowed in CUDA graph capture mode. ",
"Please disable trt_cuda_graph_enable.");
}
CUDA_CALL_THROW(cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If certain operators can't be supported by the TensorRT and CUDA EPs, does the CPU EP come into play ? If so, the same comment as this- #9978 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cuda graph capturing in this PR is only for TensorRT subgraphs. Unsupported ops can still fall back to CUDA/CPU EPs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood now, thanks.

One question though: Let us say the model looks like this: TensorRT subgraph 1 -> CPU op -> TensorRT subgraph 2 and you are capturing the graphs for both the TRT subgraphs, will the necessary synchronization logic before the CPU op happen even with the cuda graph setup ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the case is the same with and without cuda graph, isn't it? CPU needs to wait until TRT subgraph1 produces its result. That's interesting if there is a better cuda graph setup that can avoid the wait.

@hariharans29 hariharans29 mentioned this pull request Jan 31, 2022
size_t fused_nodes_size = fused_nodes.size();
std::vector<std::unique_ptr<cudaGraphExec_t>> executable_cuda_graphs(fused_nodes_size);
for (size_t node_idx = 0; node_idx < fused_nodes_size; node_idx++) {
executable_cuda_graphs[node_idx] = std::make_unique<cudaGraphExec_t>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since cudaGraph_t and cudaGraphExec_t are already pointers, i think we can't use make_unique/unique pointers like this ?
need to see if there's a better way to manage memory of the cudaGraphExec and cudaGraph

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants