Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a NPE issue in GpuRand #11647

Merged
merged 1 commit into from
Oct 28, 2024
Merged

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Oct 23, 2024

close #11646

curXORShiftRandomSeed is marked as transient, so it will be null on executors without retry-restore context, leading to this NPE.
This fix removes the transient for curXORShiftRandomSeed, seed and previousPartition that will be used by the computation on executors.

I verified it by the customer case, and it works well. The fix is simple so i don't add any tests.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

@sameerz sameerz added the bug Something isn't working label Oct 23, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really want to know what situation this was in where we missed the checkpoint/restore on retry.

Nondeterministic expressions in Spark have an initialize API

https://github.com/apache/spark/blob/5e9dec77769707647a587dafe88c8f7e58bb0084/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L498

This is how Spark initializes the partition. Because of this the Exec nodes that non-deterministic expressions are allowed to run in are limited. This is so that the plan is guaranteed to not run the non-deterministic expression multiple times for a single row (like in the case of a join), and so that this function can be called properly.

Our goal is to have the checkpoint restore code in all places that these types of expressions could run. If we missed that I want to understand what happened so we can fix that too.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will just file a follow on issue to investigate this. Because we probably want to look at this in more depth.

@revans2
Copy link
Collaborator

revans2 commented Oct 23, 2024

I spoke with @jlowe and I think we really want to understand this better. #11649

The problem is that if a retry happens and it is not in a checkpoint/restore, then we will technically get data corruption. It is not 100% data corruption because it is a random, so we get a slightly different random number compared to Spark on the CPU, which is the only reason I am not blocking this from going in. But I really want to understand the situation where this happened.

@firestarman firestarman merged commit b653ce2 into NVIDIA:branch-24.12 Oct 28, 2024
47 checks passed
@firestarman firestarman deleted the fix-rand-npe branch October 28, 2024 02:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] NullPointerException in GpuRand
3 participants