Update the test checking for cooperative kernels in conditional nodes. #9869
Conversation
Now we conditionally xfail only when a CUDA driver version less than 12.6 is installed. CUDA 12.6 fixes this issue; before it, cooperative kernels could not be used within the body of a conditional node. We also provide a better error message so users know that the fix is to upgrade to CUDA 12.6. Signed-off-by: Daniel Galvez <[email protected]>
galv force-pushed from 87ee9ed to ce7477f

Signed-off-by: galv <[email protected]>
```python
except RuntimeError as err:
    if "CUDA error: invalid argument" in str(err):
        raise RuntimeError(
            "CUDA Graph capture failed. It is likely that you are calling a cooperative kernel in your RNN-T or TDT prediction network. Cooperative kernels are not allowed inside the bodies of CUDA Graph conditional nodes until CUDA 12.6. Please update to CUDA 12.6. File an issue if that still does not work."
        )
```
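The error-translation pattern above can be sketched in isolation. In this sketch, `capture_with_hint` and `failing_capture` are hypothetical stand-ins for the real `torch.cuda.graph` capture path; only the string matching mirrors the PR's code.

```python
# Sketch: translating the opaque "invalid argument" RuntimeError raised during
# CUDA Graph capture into an actionable message. `capture_fn` is a hypothetical
# placeholder for the real capture call.

def capture_with_hint(capture_fn):
    try:
        return capture_fn()
    except RuntimeError as err:
        if "CUDA error: invalid argument" in str(err):
            raise RuntimeError(
                "CUDA Graph capture failed. Cooperative kernels are not allowed "
                "inside CUDA Graph conditional nodes before CUDA 12.6; please "
                "upgrade to CUDA 12.6."
            ) from err
        raise  # unrelated errors propagate unchanged

def failing_capture():
    # Simulates the opaque error torch surfaces for cudaErrorInvalidValue.
    raise RuntimeError("CUDA error: invalid argument")

try:
    capture_with_hint(failing_capture)
except RuntimeError as e:
    print("CUDA 12.6" in str(e))  # True
```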
We only support CUDA 12.5 with published PyTorch containers; see the support matrix: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html.
Make sure to support running without cuda_graphs decoding by default; for later versions we can turn cuda_graphs on by default once the containers support it.
Update NeMo/nemo/core/utils/cuda_python_utils.py, line 21 in 8880c37:

```python
__CUDA_PYTHON_MINIMUM_VERSION_CUDA_GRAPH_CONDITIONAL_NODES_SUPPORTED__ = (12, 3)  # 12030
```
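Since the constant above is a version tuple, gating on a driver version is a lexicographic tuple comparison. The sketch below uses illustrative helper names (not NeMo's actual API); the 12.6 threshold is taken from this PR's description of when the cooperative-kernel fix landed.

```python
# Sketch: version gates as tuple comparisons. Python compares tuples
# element-by-element, so (12, 5) >= (12, 3) and (12, 5) < (12, 6) both
# work directly. Helper names are hypothetical.

MINIMUM_CONDITIONAL_NODES = (12, 3)        # mirrors the constant above
COOPERATIVE_IN_CONDITIONAL_NODES = (12, 6)  # fix version per this PR

def supports_conditional_nodes(driver_version):
    return driver_version >= MINIMUM_CONDITIONAL_NODES

def supports_cooperative_kernels_in_conditional_nodes(driver_version):
    return driver_version >= COOPERATIVE_IN_CONDITIONAL_NODES

print(supports_conditional_nodes((12, 5)))                          # True
print(supports_cooperative_kernels_in_conditional_nodes((12, 5)))   # False
```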
Hmm, I think @galv wants to preserve 12.3 as a requirement, but to fail in the rare cases when cooperative kernels are selected. In that case I would suggest applying fallback behavior like this:
`if self.cuda_graphs_mode is self.CudaGraphsMode.FULL_GRAPH:`
(the same code also appears in tdt_loop_labels_computer.py)
```python
if self.cuda_graphs_mode is self.CudaGraphsMode.FULL_GRAPH:
    try:
        self._full_graph_compile()
    except RuntimeError:
        # fallback to graphs without while loops
        self.cuda_graphs_mode = self.CudaGraphsMode.NO_WHILE_LOOPS
        self._partial_graphs_compile()
```
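The suggested fallback can be exercised end to end with stub compile methods. The `Decoder` class and its flags below are illustrative stand-ins for the real NeMo decoding computers, not their actual interface.

```python
# Sketch of the suggested fallback: attempt full-graph compilation, and on
# RuntimeError drop to graphs without while loops (no conditional nodes).
# Compile methods are stubs standing in for the real implementations.

from enum import Enum

class CudaGraphsMode(Enum):
    FULL_GRAPH = "full_graph"
    NO_WHILE_LOOPS = "no_while_loops"

class Decoder:
    def __init__(self, full_graph_fails=False):
        self.cuda_graphs_mode = CudaGraphsMode.FULL_GRAPH
        self._full_graph_fails = full_graph_fails  # simulates a cooperative-kernel failure

    def _full_graph_compile(self):
        if self._full_graph_fails:
            raise RuntimeError("CUDA error: invalid argument")

    def _partial_graphs_compile(self):
        pass  # partial graphs avoid conditional nodes entirely

    def compile(self):
        if self.cuda_graphs_mode is CudaGraphsMode.FULL_GRAPH:
            try:
                self._full_graph_compile()
            except RuntimeError:
                # fall back to graphs without while loops
                self.cuda_graphs_mode = CudaGraphsMode.NO_WHILE_LOOPS
                self._partial_graphs_compile()

d = Decoder(full_graph_fails=True)
d.compile()
print(d.cuda_graphs_mode)  # CudaGraphsMode.NO_WHILE_LOOPS
```

One caveat the discussion below raises: a blanket `except RuntimeError` also swallows failures that have nothing to do with cooperative kernels.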
Yeah, that's an interesting possibility.
One of the big challenges is that the error returned by torch is not very precise. It's just a RuntimeError corresponding to "invalid argument" (cudaErrorInvalidValue), which is not precise enough for us to tell that the problem is specifically a cooperative kernel within a conditional node's body graph. And unfortunately we cannot check whether this is the case, because the conditional node API does not currently expose a way to get the body graph(s) of a conditional node.
Anyway, I suppose that if the error was not caused by a cooperative kernel but by something else, there is a good chance the error will get thrown again by the partial graphs implementation. But that's still not a guarantee!
IMO sounds like it's worth a shot to be able to move forward here
```diff
@@ -630,7 +631,7 @@ def _partial_graphs_compile(self):
     with (
         torch.cuda.stream(stream_for_graph),
         torch.inference_mode(),
-        torch.cuda.graph(
+        checked_graph(
```
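For context on what the diff swaps in: a plausible shape for `checked_graph` is a context manager that passes capture through unchanged but augments the opaque cudaErrorInvalidValue message. This is an assumption about its structure, not the actual NeMo helper, and the simulation below raises the error directly rather than entering a real `torch.cuda.graph` capture.

```python
# Sketch: a checked_graph-style context manager. In real code the `yield`
# would sit inside `with torch.cuda.graph(cuda_graph):`; here we only model
# the error-augmentation behavior.

from contextlib import contextmanager

@contextmanager
def checked_graph():
    try:
        yield
    except RuntimeError as err:
        if "CUDA error: invalid argument" in str(err):
            raise RuntimeError(
                "CUDA Graph capture failed; cooperative kernels inside "
                "conditional node bodies require CUDA 12.6."
            ) from err
        raise  # unrelated errors propagate unchanged

try:
    with checked_graph():
        raise RuntimeError("CUDA error: invalid argument")
except RuntimeError as e:
    print("CUDA 12.6" in str(e))  # True
```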
Do we need `checked_graph` for partial graphs (without conditional nodes)?
This is a very good point. No, we don't.
But I am considering adopting your fallback suggestion anyway, in which case we wouldn't use `checked_graph` here.
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.