[FEA] Support ExistenceJoin #589
Comments
mini repro:
Driver Log message:
We might be able to do this without any changes from cuDF. As was stated before, ExistenceJoin is not a regular join type, so cuDF might push back a bit if we try to implement it there. Happily, ExistenceJoin really just produces the left output columns, unchanged, and adds an "exists" boolean column. That exists column is the same thing Spark uses to filter the left table on a semi-join. As such, we can produce the gather map for a left semi-join and, instead of gathering the results, check whether each left row would have been gathered. I agree that in general it would be ideal if left-semi or left-anti could optionally spit out a boolean column instead of a gather map, but we should be able to do it without any help if needed.
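To make the relationship between the two join types concrete, here is a minimal sketch in plain Python. Lists stand in for table columns, the function names are hypothetical, and null handling is ignored; it only illustrates the semantics described above and is not Spark or cuDF code.

```python
def left_semi_join(left_keys, right_keys):
    """Keep only the left rows whose key has at least one match on the right."""
    right = set(right_keys)
    return [k for k in left_keys if k in right]


def existence_join(left_keys, right_keys):
    """Keep every left row, unchanged, and pair it with an 'exists' flag."""
    right = set(right_keys)
    return [(k, k in right) for k in left_keys]


left = [1, 2, 3, 4]
right = [2, 4, 4, 5]

semi = left_semi_join(left, right)    # rows that match
exist = existence_join(left, right)   # all rows, plus the boolean column

# Filtering the existence-join output on its flag gives back the semi-join.
assert [k for k, flag in exist if flag] == semi
```

The existence join never drops or duplicates left rows, which is why its output can be derived from the same match information a semi-join already computes.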
I think we can generate the existence column from the semi-join gather map like this:
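As a rough illustration of that idea, the gather map of a left semi-join can be turned into the exists column by scattering true into an all-false column. This is a plain-Python sketch with lists standing in for cuDF columns; the function names are hypothetical and none of this is the cuDF or spark-rapids API.

```python
def semi_join_gather_map(left_keys, right_keys):
    """Stand-in for a left semi-join: indices of left rows that have a match."""
    right = set(right_keys)
    return [i for i, k in enumerate(left_keys) if k in right]


def exists_from_gather_map(gather_map, num_left_rows):
    """Scatter True into an all-False column at every index in the gather map."""
    exists = [False] * num_left_rows
    for row in gather_map:
        exists[row] = True
    return exists


left = ["a", "b", "c", "d"]
right = ["d", "b", "b", "x"]

gather_map = semi_join_gather_map(left, right)
exists = exists_from_gather_map(gather_map, len(left))

# The left columns are passed through unchanged; only the flag column is new.
result = list(zip(left, exists))
```

The right-hand table is only used to compute the gather map; none of its data is gathered into the output.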
Allow matching the short TreeNode string against a regex. This makes it possible to verify that the test exhibits an ExistenceJoin. Contributes to NVIDIA#589. Signed-off-by: Gera Shegalov <[email protected]>
- Allow matching the short TreeNode string against a regex.
- Ensure that the added test exhibits a query plan with an ExistenceJoin.
Contributes to #589
Signed-off-by: Gera Shegalov <[email protected]>
This PR implements an iterator for ExistenceJoin.
1. This PR computes ExistenceJoin by executing a left semi join via cuDF. The lhs GatherMap scatters `true` into a Boolean column of lhs.numRows entries, all initially `false`. The rhs data is not gathered.
2. The PR also fixes regex matching against SparkPlan node strings. The previously used simple String mentions ExistenceJoin only in the CPU plan; the GPU plan does not print the ExistenceJoin type as part of the Join exec string.
Closes #589
Signed-off-by: Gera Shegalov <[email protected]>
…IDIA#589) Signed-off-by: spark-rapids automation <[email protected]>
Is your feature request related to a problem? Please describe.
Spark will sometimes use an optimized join type called an ExistenceJoin. It is not a join type you can request directly, but instead an optimized version of the IN or EXISTS operators that lets Spark remove a subquery. This shows up in TPC-DS query 10, and probably others. It is very similar to a left semi join, but instead of filtering out rows that don't match, it adds a new column with a boolean true or false to indicate whether the join would have matched.
Describe the solution you'd like
I think we might need some support from cudf for this.
Describe alternatives you've considered
There isn't a lot here. I think we need some help from cudf.