Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry with smaller split on CudfColumnSizeOverflowException #9085

Merged
merged 2 commits into from
Aug 25, 2023

Conversation

mythrocks
Copy link
Collaborator

Depends on rapidsai/cudf#13911.

When a CUDF operation causes a column's size to exceed the valid range for CUDF columns (i.e. cudf::size_type), CUDF will throw an exception.

Prior to this commit, the RmmRapidsRetryIterator does not attempt retries with smaller splits in this case. Instead, the overflow is treated as a generic exception.

This commit allows the RmmRapidsRetryIterator to recognize the exception specific to the overflow case (i.e. CudfColumnSizeOverflowException), and attempt a split-retry.

Note: This error condition is difficult to reproduce. The catch/retry is a best effort attempt not to fail the entire task.

@mythrocks mythrocks self-assigned this Aug 21, 2023
@mythrocks mythrocks force-pushed the column-overflow-retry branch from 122370c to e64056a Compare August 21, 2023 16:27
@mythrocks mythrocks added the bug Something isn't working label Aug 21, 2023
@mythrocks mythrocks requested review from revans2 and abellina August 21, 2023 21:28
abellina
abellina previously approved these changes Aug 21, 2023
Copy link
Collaborator

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a very minor nit, otherwise this looks good to me.

@mythrocks
Copy link
Collaborator Author

(CI will continue to fail on this PR, until rapidsai/cudf#13911 is merged, and available in spark-rapids-jni.)

revans2
revans2 previously approved these changes Aug 23, 2023
@revans2
Copy link
Collaborator

revans2 commented Aug 23, 2023

build

@mythrocks
Copy link
Collaborator Author

mythrocks commented Aug 23, 2023

Examining the test failure now.
Edit: It's definitely related to the change in this PR.

@mythrocks
Copy link
Collaborator Author

(I'm not sure how the whole world got added as reviewers for this PR. I'll request the original reviewers for review.)

@mythrocks
Copy link
Collaborator Author

Build

@mythrocks mythrocks requested a review from revans2 August 24, 2023 18:38
@abellina
Copy link
Collaborator

Look at git, it's counting a bunch of other commits here. My guess is that via some sort of rebase the PR ended up counting all of that in the diff, hence it thinks there are changes to a pom file (which brings a whole group of reviewers with). You could try rebasing your changes on top of branch-23.10 HEAD and force pushing. I know that's a no-no, but it will clear up the history for this change.

@mythrocks
Copy link
Collaborator Author

mythrocks commented Aug 24, 2023

My guess is that via some sort of rebase the PR...

Argh. Let me try doing a rebase and a force-push.
Edit: Yep, that appears to have worked. Sorry for the noise. I'm not sure where that rebase came from.
Also, the conversations/review comments seem to be intact.

Depends on rapidsai/cudf#13911.

When a CUDF operation causes a column's size to exceed the valid range
for CUDF columns (i.e. cudf::size_type), CUDF will throw an exception.

Prior to this commit, the `RmmRapidsRetryIterator` does not attempt retries
with smaller splits, in this case. Instead, the overflow is treated as
a generic exception.

This commit allows the RmmRapidsRetryIterator to recognize the exception
specific to the overflow case (i.e. `CudfColumnSizeOverflowException`),
and attempt a split-retry.

Note: This error condition is difficult to reproduce. The catch/retry is
a "best effort" attempt not to fail the entire task.

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks force-pushed the column-overflow-retry branch from db98518 to ad48d46 Compare August 24, 2023 20:14
@mythrocks
Copy link
Collaborator Author

Build

@mythrocks
Copy link
Collaborator Author

Build

@mythrocks mythrocks merged commit 9b91a80 into NVIDIA:branch-23.10 Aug 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants