Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support ExistenceJoin #589

Closed
2 tasks done
revans2 opened this issue Aug 19, 2020 · 3 comments · Fixed by #4796
Closed
2 tasks done

[FEA] Support ExistenceJoin #589

revans2 opened this issue Aug 19, 2020 · 3 comments · Fixed by #4796
Assignees
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request P0 Must have for release performance A performance related task/issue

Comments

@revans2
Copy link
Collaborator

revans2 commented Aug 19, 2020

Is your feature request related to a problem? Please describe.
Spark will some times use an optimized join type called an ExistenceJoin. It is not a direct join type you can just do but instead an optimized version of the IN or EXISTS operators that lets spark remove a subquery. This shows up in TPC-DS query 10, and probably others. It is very similar to a left semi join, but instead of filtering out columns that don't match instead it adds a new column with a boolean true or false to indicate if the join would have matched.

Describe the solution you'd like
I think we might need some support from cudf for this.

Describe alternatives you've considered
There isn't a lot here. I think we need some help from cudf.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Aug 19, 2020
@sameerz sameerz added P2 Not required for release and removed ? - Needs Triage Need team to review and classify labels Aug 25, 2020
@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Feb 18, 2021
@viadea
Copy link
Collaborator

viadea commented Nov 3, 2021

mini repro:

select
  count(*)
from
  customer c
where
  exists (select * from store_sales
          where c.c_customer_sk = ss_customer_sk) 
          AND
   (exists (select * from web_sales
            where c.c_customer_sk = ws_bill_customer_sk) or
    exists (select * from catalog_sales
            where c.c_customer_sk = cs_ship_customer_sk))

Driver Log message:

scala> spark.sql(query).explain()
21/11/03 23:19:56 WARN GpuOverrides:
          !Exec <SortMergeJoinExec> cannot run on GPU because ExistenceJoin(exists#225) currently is not supported
            @Expression <AttributeReference> c_customer_sk#0 could run on GPU
            @Expression <AttributeReference> cs_ship_customer_sk#156 could run on GPU
            !Exec <SortMergeJoinExec> cannot run on GPU because ExistenceJoin(exists#224) currently is not supported
              @Expression <AttributeReference> c_customer_sk#0 could run on GPU
              @Expression <AttributeReference> ws_bill_customer_sk#85 could run on GPU

@revans2
Copy link
Collaborator Author

revans2 commented Nov 9, 2021

We might be able to do this without any changes from CUDF. ExistenceJoin is not a regular join type, as was stated before. As such cudf might push back a bit if we try to implement it there. Happily ExistanceJoin really just produces the left output columns, unchanged, and adds in an "exists" boolean column. The exists boolean columns is the same thing that is used by Spark to filter the left table on a semi-join.

As such we can produce the gather map for a left semi-join, and instead of gathering the results, we can look to see if the column would have been gathered.

I agree that in general if we can get left-semi or left-anti to optionally spit out a boolean column instead of a gather map, that would be ideal, but we should be able to do it without any help if needed.

@gerashegalov gerashegalov self-assigned this Dec 1, 2021
@jlowe
Copy link
Contributor

jlowe commented Jan 14, 2022

I think we can generate the existence column from the semi-join gather map like this:

  1. Create a boolean ColumnVector of FALSE values with as many rows as the left table
  2. Build a Table with that column vector
  3. Use Table.scatter to scatter a TRUE boolean scalar to the above table using the semi join gather map as the scatter map

@jlowe jlowe added the performance A performance related task/issue label Jan 14, 2022
gerashegalov added a commit to gerashegalov/spark-rapids that referenced this issue Feb 5, 2022
Allow matching short TreeNode string against a regex. Enables to make
sure that the test exhibits ExistenceJoin

Contributes to NVIDIA#589

Signed-off-by: Gera Shegalov <[email protected]>
gerashegalov added a commit that referenced this issue Feb 8, 2022
)

- Allow matching short TreeNode string against a regex. 
- Ensure that the added test exhibits a query plan with an ExistenceJoin

Contributes to #589

Signed-off-by: Gera Shegalov <[email protected]>
@sameerz sameerz added this to the Feb 28 - Mar 18 milestone Feb 26, 2022
@sameerz sameerz added P0 Must have for release and removed P2 Not required for release labels Mar 1, 2022
gerashegalov added a commit that referenced this issue Mar 9, 2022
)

This PR implements an iterator for ExistenceJoin 

1. This PR computes ExistenceJoin by executing a left semijoin via cuDF. The lhs GatherMap scatters `true` into a Boolean column with all lhs.numRows  being initially`false` . The rhs data is not gathered. 

1. The PR also fixes regex matching against SparkPlan node strings. The previously used simple String mentions ExistenceJoin only in the CPU plan but does not print ExistenceJoin type as part of the Join exec string in the GPU plan. 

Closes #589
 
Signed-off-by: Gera Shegalov <[email protected]>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#589)

Signed-off-by: spark-rapids automation <[email protected]>

Signed-off-by: spark-rapids automation <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request P0 Must have for release performance A performance related task/issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants