I'm currently exploring TensorRT for inference and aiming to optimize performance with CUDA Graphs. One requirement for my application is support for dynamic batch sizes during inference. While TensorRT provides dynamic shape support, I couldn't find sufficient information on how to combine that feature with CUDA Graph inference.
I would appreciate any guidance, documentation, or examples demonstrating how to implement dynamic batch size support in CUDA graph inference with TensorRT.
Thank you for your assistance!
The docs say: "Therefore, the best practice is to use one execution context per captured graph, and to share memory across the contexts with createExecutionContextWithoutDeviceMemory()." Will that help here?
Does sharing memory across the contexts require a mutex/lock? @zerollzeng
Yes, the user needs to make sure there is no concurrent execution when two contexts share the same memory. Some reduction kernels may have race conditions when they write to the same memory, so the behavior would be undefined.
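Putting those two answers together, here is a rough sketch (not a verified implementation) of how this could look in C++: one execution context and one captured CUDA graph per expected batch size, all contexts sharing a single scratch allocation via createExecutionContextWithoutDeviceMemory(). The tensor names, shapes, and batch sizes are placeholders, and the setInputShape/setTensorAddress/enqueueV3 calls assume TensorRT 8.5 or newer.

```cpp
// Hedged sketch only: one execution context + one captured CUDA graph per batch
// size, with all contexts sharing a single device-memory arena. Tensor names,
// shapes, and batch sizes are placeholders; assumes TensorRT >= 8.5.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <map>
#include <vector>

struct GraphEntry
{
    nvinfer1::IExecutionContext* context{nullptr};
    cudaGraphExec_t graphExec{nullptr};
};

std::map<int, GraphEntry> buildGraphs(nvinfer1::ICudaEngine& engine,
                                      void* sharedScratch,  // engine.getDeviceMemorySize() bytes
                                      void* inputDev, void* outputDev,
                                      cudaStream_t stream,
                                      std::vector<int> const& batchSizes)
{
    std::map<int, GraphEntry> graphs;
    for (int bs : batchSizes)
    {
        GraphEntry e;
        // Context created without its own scratch memory; every context reuses sharedScratch.
        e.context = engine.createExecutionContextWithoutDeviceMemory();
        e.context->setDeviceMemory(sharedScratch);

        // Pin the dynamic batch dimension for this context ("input"/"output" are placeholder names).
        e.context->setInputShape("input", nvinfer1::Dims4{bs, 3, 224, 224});
        e.context->setTensorAddress("input", inputDev);
        e.context->setTensorAddress("output", outputDev);

        // Warm-up enqueue outside capture: the first enqueue can do shape-dependent
        // setup work that cannot be captured into a graph.
        e.context->enqueueV3(stream);
        cudaStreamSynchronize(stream);

        // Capture the steady-state enqueue into a CUDA graph and instantiate it.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        e.context->enqueueV3(stream);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&e.graphExec, graph, 0);  // CUDA 12 signature; CUDA 11 uses the 5-argument form
        cudaGraphDestroy(graph);

        graphs[bs] = e;
    }
    return graphs;
}

// Launch the graph that matches the incoming batch size. Because every context
// shares one scratch buffer, launches must not run concurrently -- keep them on
// one stream or guard them with a lock, as discussed above.
void infer(std::map<int, GraphEntry>& graphs, int batchSize, cudaStream_t stream)
{
    cudaGraphLaunch(graphs.at(batchSize).graphExec, stream);
}
```

Two caveats on this sketch: a captured graph replays the same device addresses, so new input data has to be copied into the fixed inputDev buffer before each launch; and depending on the TensorRT version, each context may need its own optimization profile rather than all of them using profile 0.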