Update the test checking for cooperative kernels in conditional nodes. #9869
Conversation
Now we conditionally xfail only when a CUDA driver version less than 12.6 is installed. CUDA 12.6 fixes this issue; before it, cooperative kernels could not be used within the body of a conditional node. We also provide a better error message so users know that the fix is to upgrade to CUDA 12.6. Signed-off-by: Daniel Galvez <[email protected]>
galv force-pushed from 87ee9ed to ce7477f

Signed-off-by: galv <[email protected]>
```python
except RuntimeError as err:
    if "CUDA error: invalid argument" in str(err):
        raise RuntimeError(
            "CUDA Graph capture failed. It is likely that you are calling a cooperative kernel in your RNN-T or TDT prediction network. Cooperative kernels are not allowed inside the bodies of CUDA Graph conditional nodes until CUDA 12.6. Please update to CUDA 12.6. File an issue if that still does not work."
        )
```
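The error-translation pattern above can be sketched in isolation. In this sketch, `capture_with_hint` and `failing_capture` are hypothetical stand-ins for the real `torch.cuda.graph` capture path; only the string matching mirrors the PR's code.

```python
# Sketch: translating the opaque "invalid argument" RuntimeError raised during
# CUDA Graph capture into an actionable message. `capture_fn` is a hypothetical
# placeholder for the real capture call.

def capture_with_hint(capture_fn):
    try:
        return capture_fn()
    except RuntimeError as err:
        if "CUDA error: invalid argument" in str(err):
            raise RuntimeError(
                "CUDA Graph capture failed. Cooperative kernels are not allowed "
                "inside CUDA Graph conditional nodes before CUDA 12.6; please "
                "upgrade to CUDA 12.6."
            ) from err
        raise  # unrelated errors propagate unchanged

def failing_capture():
    # Simulates the opaque error torch surfaces for cudaErrorInvalidValue.
    raise RuntimeError("CUDA error: invalid argument")

try:
    capture_with_hint(failing_capture)
except RuntimeError as e:
    print("CUDA 12.6" in str(e))  # True
```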
We only support CUDA 12.5 with published PyTorch containers; see the support matrix: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html.
Make sure to support running without cuda_graphs decoding by default; for later versions we can turn cuda_graphs on by default once the containers support it.
Update NeMo/nemo/core/utils/cuda_python_utils.py, line 21 in 8880c37:

```python
__CUDA_PYTHON_MINIMUM_VERSION_CUDA_GRAPH_CONDITIONAL_NODES_SUPPORTED__ = (12, 3)  # 12030
```
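Since the constant above is a version tuple, gating on a driver version is a lexicographic tuple comparison. The sketch below uses illustrative helper names (not NeMo's actual API); the 12.6 threshold is taken from this PR's description of when the cooperative-kernel fix landed.

```python
# Sketch: version gates as tuple comparisons. Python compares tuples
# element-by-element, so (12, 5) >= (12, 3) and (12, 5) < (12, 6) both
# work directly. Helper names are hypothetical.

MINIMUM_CONDITIONAL_NODES = (12, 3)        # mirrors the constant above
COOPERATIVE_IN_CONDITIONAL_NODES = (12, 6)  # fix version per this PR

def supports_conditional_nodes(driver_version):
    return driver_version >= MINIMUM_CONDITIONAL_NODES

def supports_cooperative_kernels_in_conditional_nodes(driver_version):
    return driver_version >= COOPERATIVE_IN_CONDITIONAL_NODES

print(supports_conditional_nodes((12, 5)))                          # True
print(supports_cooperative_kernels_in_conditional_nodes((12, 5)))   # False
```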
Hmm, I think @galv wants to preserve 12.3 as a requirement, but to fail in the rare cases when cooperative kernels are selected. In that case I would suggest applying fallback behavior like this:
`if self.cuda_graphs_mode is self.CudaGraphsMode.FULL_GRAPH:`
(the same code also appears in tdt_loop_labels_computer.py)
```python
if self.cuda_graphs_mode is self.CudaGraphsMode.FULL_GRAPH:
    try:
        self._full_graph_compile()
    except RuntimeError:
        # fallback to graphs without while loops
        self.cuda_graphs_mode = self.CudaGraphsMode.NO_WHILE_LOOPS
        self._partial_graphs_compile()
```
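The suggested fallback can be exercised end to end with stub compile methods. The `Decoder` class and its flags below are illustrative stand-ins for the real NeMo decoding computers, not their actual interface.

```python
# Sketch of the suggested fallback: attempt full-graph compilation, and on
# RuntimeError drop to graphs without while loops (no conditional nodes).
# Compile methods are stubs standing in for the real implementations.

from enum import Enum

class CudaGraphsMode(Enum):
    FULL_GRAPH = "full_graph"
    NO_WHILE_LOOPS = "no_while_loops"

class Decoder:
    def __init__(self, full_graph_fails=False):
        self.cuda_graphs_mode = CudaGraphsMode.FULL_GRAPH
        self._full_graph_fails = full_graph_fails  # simulates a cooperative-kernel failure

    def _full_graph_compile(self):
        if self._full_graph_fails:
            raise RuntimeError("CUDA error: invalid argument")

    def _partial_graphs_compile(self):
        pass  # partial graphs avoid conditional nodes entirely

    def compile(self):
        if self.cuda_graphs_mode is CudaGraphsMode.FULL_GRAPH:
            try:
                self._full_graph_compile()
            except RuntimeError:
                # fall back to graphs without while loops
                self.cuda_graphs_mode = CudaGraphsMode.NO_WHILE_LOOPS
                self._partial_graphs_compile()

d = Decoder(full_graph_fails=True)
d.compile()
print(d.cuda_graphs_mode)  # CudaGraphsMode.NO_WHILE_LOOPS
```

One caveat the discussion below raises: a blanket `except RuntimeError` also swallows failures that have nothing to do with cooperative kernels.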
Yeah, that's an interesting possibility.
One of the big challenges is that the error returned by torch is not very precise. It's just a RuntimeError corresponding to "invalid argument" (cudaErrorInvalidValue), which is not precise enough for us to tell that the problem is specifically a cooperative kernel within a conditional node's body graph. And unfortunately we cannot check whether this is the case, because the conditional node API does not currently expose a way to get the body graph(s) of a conditional node.
Anyway, I suppose that if the error was not caused by a cooperative kernel but by something else, there is a good chance the error will get thrown again by the partial graphs implementation. But that's still not a guarantee!
IMO sounds like it's worth a shot to be able to move forward here
```diff
@@ -630,7 +631,7 @@ def _partial_graphs_compile(self):
     with (
         torch.cuda.stream(stream_for_graph),
         torch.inference_mode(),
-        torch.cuda.graph(
+        checked_graph(
```
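For context on what the diff swaps in: a plausible shape for `checked_graph` is a context manager that passes capture through unchanged but augments the opaque cudaErrorInvalidValue message. This is an assumption about its structure, not the actual NeMo helper, and the simulation below raises the error directly rather than entering a real `torch.cuda.graph` capture.

```python
# Sketch: a checked_graph-style context manager. In real code the `yield`
# would sit inside `with torch.cuda.graph(cuda_graph):`; here we only model
# the error-augmentation behavior.

from contextlib import contextmanager

@contextmanager
def checked_graph():
    try:
        yield
    except RuntimeError as err:
        if "CUDA error: invalid argument" in str(err):
            raise RuntimeError(
                "CUDA Graph capture failed; cooperative kernels inside "
                "conditional node bodies require CUDA 12.6."
            ) from err
        raise  # unrelated errors propagate unchanged

try:
    with checked_graph():
        raise RuntimeError("CUDA error: invalid argument")
except RuntimeError as e:
    print("CUDA 12.6" in str(e))  # True
```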
Do we need `checked_graph` for partial graphs (without conditional nodes)?
This is a very good point. No, we don't.
But I am considering adopting your fallback suggestion anyway, in which case we wouldn't use `checked_graph` here.
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.