[FEA] Support ExistenceJoin #589

revans2 · 2020-08-19T15:51:24Z

Is your feature request related to a problem? Please describe.
Spark will some times use an optimized join type called an ExistenceJoin. It is not a direct join type you can just do but instead an optimized version of the IN or EXISTS operators that lets spark remove a subquery. This shows up in TPC-DS query 10, and probably others. It is very similar to a left semi join, but instead of filtering out columns that don't match instead it adds a new column with a boolean true or false to indicate if the join would have matched.

Describe the solution you'd like
I think we might need some support from cudf for this.

Describe alternatives you've considered
There isn't a lot here. I think we need some help from cudf.

add an integration test exhibiting ExistenceJoin Add preliminary test and test framework changes for ExistanceJoin #4697
implement a gatherer for ExistenceJoin Implement ExistenceJoin Iterator using an auxiliary left semijoin #4796

The text was updated successfully, but these errors were encountered:

viadea · 2021-11-03T23:47:38Z

mini repro:

select
  count(*)
from
  customer c
where
  exists (select * from store_sales
          where c.c_customer_sk = ss_customer_sk) 
          AND
   (exists (select * from web_sales
            where c.c_customer_sk = ws_bill_customer_sk) or
    exists (select * from catalog_sales
            where c.c_customer_sk = cs_ship_customer_sk))

Driver Log message:

scala> spark.sql(query).explain()
21/11/03 23:19:56 WARN GpuOverrides:
          !Exec <SortMergeJoinExec> cannot run on GPU because ExistenceJoin(exists#225) currently is not supported
            @Expression <AttributeReference> c_customer_sk#0 could run on GPU
            @Expression <AttributeReference> cs_ship_customer_sk#156 could run on GPU
            !Exec <SortMergeJoinExec> cannot run on GPU because ExistenceJoin(exists#224) currently is not supported
              @Expression <AttributeReference> c_customer_sk#0 could run on GPU
              @Expression <AttributeReference> ws_bill_customer_sk#85 could run on GPU

revans2 · 2021-11-09T15:56:30Z

We might be able to do this without any changes from CUDF. ExistenceJoin is not a regular join type, as was stated before. As such cudf might push back a bit if we try to implement it there. Happily ExistanceJoin really just produces the left output columns, unchanged, and adds in an "exists" boolean column. The exists boolean columns is the same thing that is used by Spark to filter the left table on a semi-join.

As such we can produce the gather map for a left semi-join, and instead of gathering the results, we can look to see if the column would have been gathered.

I agree that in general if we can get left-semi or left-anti to optionally spit out a boolean column instead of a gather map, that would be ideal, but we should be able to do it without any help if needed.

jlowe · 2022-01-14T19:08:27Z

I think we can generate the existence column from the semi-join gather map like this:

Create a boolean ColumnVector of FALSE values with as many rows as the left table
Build a Table with that column vector
Use Table.scatter to scatter a TRUE boolean scalar to the above table using the semi join gather map as the scatter map

Allow matching short TreeNode string against a regex. Enables to make sure that the test exhibits ExistenceJoin Contributes to NVIDIA#589 Signed-off-by: Gera Shegalov <[email protected]>

) - Allow matching short TreeNode string against a regex. - Ensure that the added test exhibits a query plan with an ExistenceJoin Contributes to #589 Signed-off-by: Gera Shegalov <[email protected]>

) This PR implements an iterator for ExistenceJoin 1. This PR computes ExistenceJoin by executing a left semijoin via cuDF. The lhs GatherMap scatters `true` into a Boolean column with all lhs.numRows being initially`false` . The rhs data is not gathered. 1. The PR also fixes regex matching against SparkPlan node strings. The previously used simple String mentions ExistenceJoin only in the CPU plan but does not print ExistenceJoin type as part of the Join exec string in the GPU plan. Closes #589 Signed-off-by: Gera Shegalov <[email protected]>

…IDIA#589) Signed-off-by: spark-rapids automation <[email protected]> Signed-off-by: spark-rapids automation <[email protected]>

revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Aug 19, 2020

sameerz added P2 Not required for release and removed ? - Needs Triage Need team to review and classify labels Aug 25, 2020

sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Feb 18, 2021

viadea mentioned this issue Nov 3, 2021

[BUG] Some TPC-DS queries are failing using the latest v21.10 RAPIDS jars #3913

Closed

gerashegalov self-assigned this Dec 1, 2021

jlowe added the performance A performance related task/issue label Jan 14, 2022

gerashegalov mentioned this issue Feb 5, 2022

Add preliminary test and test framework changes for ExistanceJoin #4697

Merged

sameerz linked a pull request Feb 7, 2022 that will close this issue

Add preliminary test and test framework changes for ExistanceJoin #4697

Merged

sameerz removed a link to a pull request Feb 7, 2022

Add preliminary test and test framework changes for ExistanceJoin #4697

Merged

gerashegalov mentioned this issue Feb 16, 2022

Implement ExistenceJoin Iterator using an auxiliary left semijoin #4796

Merged

sameerz added this to the Feb 28 - Mar 18 milestone Feb 26, 2022

sameerz added P0 Must have for release and removed P2 Not required for release labels Mar 1, 2022

gerashegalov closed this as completed in #4796 Mar 9, 2022

nartal1 mentioned this issue Jun 29, 2023

[BUG] Add ExistenceJoin to supported joinType NVIDIA/spark-rapids-tools#400

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Support ExistenceJoin #589

[FEA] Support ExistenceJoin #589

revans2 commented Aug 19, 2020 •

edited by gerashegalov

Loading

viadea commented Nov 3, 2021

revans2 commented Nov 9, 2021

jlowe commented Jan 14, 2022

[FEA] Support ExistenceJoin #589

[FEA] Support ExistenceJoin #589

Comments

revans2 commented Aug 19, 2020 • edited by gerashegalov Loading

viadea commented Nov 3, 2021

revans2 commented Nov 9, 2021

jlowe commented Jan 14, 2022

revans2 commented Aug 19, 2020 •

edited by gerashegalov

Loading