[FEA] Print out the size of the batch currently being processed for GPU OOM. #11732

Closed
firestarman opened this issue Nov 19, 2024 · 0 comments · Fixed by #11733
@firestarman
Collaborator

When we hit GPU OOMs, we usually see error stacks like the ones below. They only tell us where the OOM happened.

For error triage, it would be better to also know the size of the batch being processed at the time of the OOM.

com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory
        at ai.rapids.cudf.Table.contiguousSplit(Native Method)
        at ai.rapids.cudf.Table.contiguousSplit(Table.java:2766)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.$anonfun$splitSpillableInHalfByRows$4(RmmRapidsRetryIterator.scala:681)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
@firestarman added the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels on Nov 19, 2024
@mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Nov 19, 2024
firestarman added a commit that referenced this issue Nov 25, 2024
…11733)

closes #11732

This PR adds support for printing the current attempt object being processed
when an OOM happens in the retry block.
This is intended to make OOM issues easier to triage.
---------

Signed-off-by: Firestarman <[email protected]>
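
The idea described in the commit message can be illustrated with a minimal Java sketch. It is not the plugin's actual code: `RetryWithContext`, `FakeGpuOom`, `Batch`, and `withAttemptContext` are hypothetical names invented for illustration, and the sketch attaches the attempt description to the rethrown exception rather than reproducing how the plugin prints it.

```java
import java.util.function.Function;

// Hypothetical sketch: none of these names come from spark-rapids.
class RetryWithContext {

    // Stand-in for a GPU OOM exception (assumption, not the real GpuRetryOOM).
    static class FakeGpuOom extends RuntimeException {
        FakeGpuOom(String message) {
            super(message);
        }

        FakeGpuOom(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Stand-in for the "attempt" object; it only carries a row count and byte size.
    static final class Batch {
        final long numRows;
        final long sizeInBytes;

        Batch(long numRows, long sizeInBytes) {
            this.numRows = numRows;
            this.sizeInBytes = sizeInBytes;
        }

        @Override
        public String toString() {
            return "Batch(rows=" + numRows + ", bytes=" + sizeInBytes + ")";
        }
    }

    // Runs body on the attempt. If the body fails with an OOM, rethrows it with
    // the attempt's description appended, keeping the original error as the cause.
    static <T, R> R withAttemptContext(T attempt, Function<T, R> body) {
        try {
            return body.apply(attempt);
        } catch (FakeGpuOom oom) {
            throw new FakeGpuOom(
                oom.getMessage() + " (current attempt: " + attempt + ")", oom);
        }
    }

    public static void main(String[] args) {
        Batch batch = new Batch(1000, 4096);
        try {
            withAttemptContext(batch, b -> {
                throw new FakeGpuOom("GPU OutOfMemory");
            });
        } catch (FakeGpuOom e) {
            // The message now says where it failed and what was being processed.
            System.out.println(e.getMessage());
        }
    }
}
```

With a wrapper like this, the error carries the batch's row count and byte size alongside the stack trace, which is the information the issue asks for.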
@sameerz added the "reliability" label (Features to improve reliability or bugs that severely impact the reliability of the plugin) and removed the "feature request" label (New feature or request) on Dec 4, 2024