[FEA] Support out of core joins #20

revans2 · 2020-05-28T20:26:27Z

Is your feature request related to a problem? Please describe.
Some joins can be very large, and with data skew it becomes difficult to do what we do now, which is to hold one of the join tables in memory while streaming through the other side in batches.

This issue is to detect when that approach is not going to work and switch over to a sort merge join when needed, preferably using #19 for the sorting.
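For readers unfamiliar with the fallback being proposed, here is a minimal sketch of a sort merge inner join in plain Python. It is not the plugin's implementation: the function name and the (key, value) row layout are assumptions made for illustration, and the in-memory `sorted()` calls stand in for whatever out-of-core sort #19 would provide.

```python
def sort_merge_inner_join(left, right):
    """Inner-join two lists of (key, value) pairs by key.

    Hypothetical helper: a real out-of-core version would replace sorted()
    with an external sort that can spill, and would read the sorted inputs
    in batches instead of holding them as lists.
    """
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

Because both inputs are consumed in key order, only the current run of equal keys from each side has to be resident at once, which is what makes this shape attractive when neither side fits in memory.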

sameerz · 2020-10-13T20:25:03Z

We should see what Blazing's approach for out of core joins is.

revans2 · 2021-02-18T18:22:00Z

We are working with cudf on getting something that can just return the gather maps

rapidsai/cudf#6480

and then we can try and chunk the output result with a little help from

rapidsai/cudf#7408
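To illustrate the idea of returning gather maps and chunking the output, here is a rough sketch in plain Python. It does not use the cudf API; the function names, the tuple row layout, and the `chunk_rows` parameter are all assumptions for illustration only.

```python
from collections import defaultdict


def inner_join_gather_maps(left_keys, right_keys):
    """Return (left_map, right_map): parallel lists of matching row indices.

    Hypothetical helper. No output rows are built here, only index pairs.
    """
    index = defaultdict(list)
    for r, key in enumerate(right_keys):
        index[key].append(r)
    left_map, right_map = [], []
    for l, key in enumerate(left_keys):
        for r in index.get(key, ()):
            left_map.append(l)
            right_map.append(r)
    return left_map, right_map


def gather_in_chunks(left_rows, right_rows, left_map, right_map, chunk_rows):
    """Materialize the join output chunk_rows rows at a time."""
    for start in range(0, len(left_map), chunk_rows):
        end = start + chunk_rows
        yield [
            left_rows[l] + right_rows[r]
            for l, r in zip(left_map[start:end], right_map[start:end])
        ]
```

The gather maps hold only two integers per output row, so they are usually much smaller than the joined rows themselves. Slicing the maps lets the caller bound the size of each output batch even when the join result is far larger than its inputs.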

jlowe · 2021-05-05T19:39:54Z

@revans2 do you consider this addressed by the recent join work or are there still items left to address?

revans2 · 2021-05-05T19:50:01Z

@jlowe I am not ready to say that yet. It is really close, and I would not have a problem if we wanted to close this and just track the follow-on work for a sort merge join fallback with #2252.

This is required to be able to support FullOuter join as out of core, and also to be able to support large join data on both sides. Right now all of the joins except cross join require that we can fit all of the data for at least one side in a single batch. In some cases you can work around this with more partitioning, but not in all cases.
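The single-batch constraint described above can be seen in a simplified sketch of a streamed hash join, written in plain Python. This is not the plugin's code: the function name and the (key, value) row layout are hypothetical.

```python
from collections import defaultdict


def streamed_hash_inner_join(build_rows, stream_batches):
    """Join an in-memory build side against a stream of batches.

    Hypothetical helper. The whole build side is loaded into one hash
    table up front, which is the step that fails when neither side fits
    in memory.
    """
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    for batch in stream_batches:
        # Each stream batch is joined independently and can be released
        # as soon as its output has been produced.
        yield [
            (key, bv, sv)
            for key, sv in batch
            for bv in table.get(key, ())
        ]
```

An inner join can emit results for each stream batch on its own. A full outer join also has to report build rows that matched nothing in any batch, so it would need extra state that survives across all batches, which is one reason a different strategy is attractive there.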

jlowe · 2021-05-05T20:00:52Z

Since it's not done I'd rather leave it open to track. I suspected there was some extra work left, and now that you've linked what it is here, I'm good. Thanks!

revans2 · 2021-05-06T17:39:16Z

Sorry, I pointed to the wrong issue. It should be #2354. I thought I had hit the button to file it but hadn't, and my search for "sort merge" turned up the wrong issue.

sameerz · 2021-05-11T17:18:11Z

Based on PR #2310, for hash joins, if we can materialize the gather map, we can likely complete the out of core join.

sameerz · 2021-05-11T20:28:34Z

Closing based on follow up work tracked in #2354

revans2 added the "feature request" (New feature or request), "? - Needs Triage" (Need team to review and classify), and "SQL" (part of the SQL/Dataframe plugin) labels May 28, 2020

This was referenced Sep 8, 2020

tpc-ds query72 run failed OOM with sf=10000 [BUG] #669

Closed

[BUG] TPC-ds 14a and 14b failed to run #650

Closed

sameerz removed the "? - Needs Triage" (Need team to review and classify) label Oct 13, 2020

revans2 added the "cudf_dependency" (An issue or PR with this label depends on a new feature in cudf) label Feb 18, 2021

revans2 self-assigned this Apr 29, 2021

revans2 added this to the Apr 26 - May 7 milestone Apr 29, 2021

revans2 mentioned this issue Apr 29, 2021

Allow batching the output of a join #2310

Merged

jlowe added the "epic" (Issue that encompasses a significant feature or body of work) label May 5, 2021

revans2 mentioned this issue May 6, 2021

[FEA] Support a sort merge join as a fallback on the GPU #2354

Closed

sameerz removed this from the Apr 26 - May 7 milestone May 11, 2021

sameerz closed this as completed May 11, 2021

sameerz removed the "epic" (Issue that encompasses a significant feature or body of work) label May 11, 2021

sameerz added this to the May 10 - May 21 milestone May 11, 2021

abellina mentioned this issue Oct 13, 2022

[FEA] Investigate how to handle memory explosion #6785

Closed
