Fix degenerate conditional nested loop join detection [databricks] #11268
Signed-off-by: Jason Lowe <[email protected]>
build
CI failed in test_right_broadcast_nested_loop_join_without_condition_empty, which exposed that we were not properly handling empty build-side batches in unconditional nested outer loop joins. Previously we hacked around it by adding an always-true condition, but this adds support for it and removes the hack.
build
joinTime = joinTime)

localJoinType match {
  case LeftOuter if spillableBuiltBatch.numRows == 0 =>
Do we need to worry about a full outer join?
In the future yes, but we do not support FullOuter joins for broadcasted nested loop joins. Support for that is tracked by #3269.
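For readers following the `LeftOuter` case above, here is a minimal sketch of the expected semantics when the build side is empty. It models an unconditional left outer nested loop join over plain Scala collections; `EmptyBuildSideSketch` and `leftOuterCross` are hypothetical names for illustration and are not part of spark-rapids.

```scala
// Hypothetical illustration only; not the spark-rapids implementation.
object EmptyBuildSideSketch {
  // Unconditional left outer nested loop join over in-memory sequences.
  // `stream` plays the role of the left (streamed) side and `build` the
  // right (built/broadcast) side.
  def leftOuterCross[L, R](stream: Seq[L], build: Seq[R]): Seq[(L, Option[R])] =
    if (build.isEmpty) {
      // Empty build side: every stream row is still emitted once, with the
      // build-side value missing (the analogue of null columns).
      stream.map(l => (l, Option.empty[R]))
    } else {
      // Non-empty build side with no condition: plain cross product.
      for (l <- stream; r <- build) yield (l, Some(r): Option[R])
    }
}
```

The point of the special case is that a plain cross product against an empty build side would yield no rows at all, whereas a left outer join must preserve every row from the stream side.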
CI failure appears to be related to rapidsai/cudf#16426.
build
Some regex cases (https://github.com/NVIDIA/spark-rapids/blob/branch-24.08/jenkins/spark-premerge-build.sh#L223-L224) seem to hang forever in the scala213 CI; the executor stops producing any further logs.
Since the above is the last case in CI and it cannot be reproduced locally, I'm going to merge this to unblock other changes. Thanks.
Fixes #11266. The failure on Databricks occurred because the aggregate was pushed through the join, which resulted in a non-empty output from the join. The fix from #11244 was flawed: it treated a join as unconditional only if the condition was true and the output was empty (i.e., a row-count-only join), but the second requirement isn't necessary. A nested loop join is unconditional whenever its join condition is always true, regardless of the output the join produces.
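The difference between the two detection rules can be sketched as follows. This is a simplified, self-contained model, not the actual spark-rapids code: `Expr`, `BoolLiteral`, `OtherExpr`, and both `isUnconditional*` helpers are hypothetical names, and treating a missing condition as always true is an assumption of the sketch.

```scala
// Hypothetical, simplified expression model; not Spark's Expression classes.
sealed trait Expr
final case class BoolLiteral(value: Boolean) extends Expr
final case class OtherExpr(description: String) extends Expr

object JoinConditionSketch {
  // Assumption: an absent condition or a literal `true` counts as always true.
  private def isAlwaysTrue(condition: Option[Expr]): Boolean = condition match {
    case None                    => true
    case Some(BoolLiteral(true)) => true
    case _                       => false
  }

  // Rule as described for #11244: additionally required that the join
  // produce no output columns (a row-count-only join).
  def isUnconditionalOld(condition: Option[Expr], outputColumns: Seq[String]): Boolean =
    isAlwaysTrue(condition) && outputColumns.isEmpty

  // Rule as described in this PR: only the condition matters.
  def isUnconditional(condition: Option[Expr]): Boolean =
    isAlwaysTrue(condition)
}
```

Under this model, a join with an always-true condition and a non-empty output (as happens when an aggregate is pushed through the join) is missed by the old rule but detected by the new one.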