Support for Dynamic Batch Size in CUDA Graph Inference with TensorRT #3798

Closed
OuyangChao opened this issue on Apr 15, 2024 · 5 comments
Labels: triaged (Issue has been triaged by maintainers)

@OuyangChao

I'm currently exploring TensorRT for inference tasks and aiming to optimize performance using CUDA graph. One of the requirements for my application is to support dynamic batch sizes during inference. While TensorRT provides dynamic shape support, I couldn't find sufficient information on how to incorporate this feature into CUDA graph inference.

I would appreciate any guidance, documentation, or examples demonstrating how to implement dynamic batch size support in CUDA graph inference with TensorRT.

Thank you for your assistance!

@zerollzeng (Collaborator) commented on Apr 18, 2024

See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-graphs

Basically, when you change the input shapes, you have to re-capture the graph, because some internal state changes.
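
For reference, a minimal sketch of that re-capture pattern in C++, roughly following the developer-guide section linked above. The input tensor name ("input"), its dimensions, the per-batch-size cache, and the helper name are illustrative assumptions, and the input/output device addresses are assumed to have already been bound with setTensorAddress().

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <unordered_map>

// Illustrative cache: one instantiated CUDA graph per batch size.
static std::unordered_map<int, cudaGraphExec_t> graphCache;

cudaGraphExec_t getOrCaptureGraph(nvinfer1::IExecutionContext* context,
                                  cudaStream_t stream, int batch)
{
    auto it = graphCache.find(batch);
    if (it != graphCache.end())
        return it->second;  // already captured for this shape

    // New shape: set it and run enqueueV3() once outside of capture so that
    // TensorRT can finish any deferred internal updates for the new shape.
    context->setInputShape("input", nvinfer1::Dims4{batch, 3, 224, 224});
    context->enqueueV3(stream);
    cudaStreamSynchronize(stream);

    // Re-capture the graph for this shape.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV3(stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature
    cudaGraphDestroy(graph);

    graphCache[batch] = graphExec;
    return graphExec;
}

// Inference for a given batch size then becomes:
//   cudaGraphLaunch(getOrCaptureGraph(context, stream, batch), stream);
//   cudaStreamSynchronize(stream);
```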

@zerollzeng self-assigned this on Apr 18, 2024
@zerollzeng added the triaged (Issue has been triaged by maintainers) label on Apr 18, 2024
@zerollzeng (Collaborator)

Therefore, the best practice is to use one execution context per captured graph, and to share memory across the contexts with createExecutionContextWithoutDeviceMemory(). Will that help?
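
A minimal sketch of that arrangement in C++, assuming the contexts are never executed concurrently (see the locking discussion below); the helper name is illustrative and error handling is omitted:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

// Create several execution contexts that all reuse one activation buffer.
std::vector<nvinfer1::IExecutionContext*> makeContextsSharingMemory(
    nvinfer1::ICudaEngine* engine, int numContexts, void** sharedMem)
{
    // One scratch allocation, sized for the engine's worst case.
    cudaMalloc(sharedMem, engine->getDeviceMemorySize());

    std::vector<nvinfer1::IExecutionContext*> contexts;
    for (int i = 0; i < numContexts; ++i)
    {
        // The context allocates no activation memory of its own...
        auto* ctx = engine->createExecutionContextWithoutDeviceMemory();
        // ...so point it at the shared scratch buffer instead.
        ctx->setDeviceMemory(*sharedMem);
        contexts.push_back(ctx);
    }
    return contexts;
}
```

Each context can then capture its own CUDA graph for one shape, while the activation scratch memory is shared between them.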

@lix19937

Does sharing memory across the contexts need a mutex/lock? @zerollzeng

@ttyio (Collaborator) commented on Jul 2, 2024

> Does sharing memory across the contexts need a mutex/lock? @zerollzeng

Yes, the user needs to make sure there is no concurrent execution when two contexts share the same memory. Some reduction kernels can have race conditions when they write to the same memory, so the behavior would be undefined.
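
For illustration, one way to satisfy that requirement is to serialize launches with a host-side lock that is held until the stream has drained; launchSerialized and launchMutex are hypothetical names, not part of TensorRT:

```cpp
#include <cuda_runtime_api.h>
#include <mutex>

// Hypothetical helper: serialize graph launches so that two contexts sharing
// one device-memory block never execute concurrently.
static std::mutex launchMutex;

void launchSerialized(cudaGraphExec_t graphExec, cudaStream_t stream)
{
    std::lock_guard<std::mutex> lock(launchMutex);
    cudaGraphLaunch(graphExec, stream);
    // Keep holding the lock until the GPU work is done; releasing it right
    // after the (asynchronous) launch would still allow another context to
    // start writing into the shared memory while this one is running.
    cudaStreamSynchronize(stream);
}
```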

@ttyio (Collaborator) commented on Jul 2, 2024

Closing since this should already be solved, thanks all!

@ttyio closed this as completed on Jul 2, 2024