[FEA] Support out of core joins #20

Closed
revans2 opened this issue May 28, 2020 · 8 comments
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request SQL part of the SQL/Dataframe plugin

Comments

@revans2
Collaborator

revans2 commented May 28, 2020

Is your feature request related to a problem? Please describe.
Some joins are very large, and with data skew that makes our current approach difficult: we hold one of the join tables in memory while streaming through the other side in batches.

This issue is to detect when that approach is not going to work and switch over to a sort merge join when needed, preferably using #19 for the sorting.
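
As an illustration only, the current approach and the proposed fallback decision can be sketched in plain Python. This is not the plugin's code: `streamed_hash_join`, `choose_join_strategy`, and the byte-budget parameters are hypothetical names, and rows are modeled as dictionaries rather than GPU columns.

```python
from collections import defaultdict


def streamed_hash_join(build_rows, stream_batches, key):
    """Inner join that holds build_rows in memory and streams the other side.

    Yields one list of joined rows per streamed batch.
    """
    # Build side: must fit in memory all at once.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Stream side: processed one batch at a time.
    for batch in stream_batches:
        joined = []
        for row in batch:
            for match in table.get(row[key], ()):
                joined.append({**match, **row})
        yield joined


def choose_join_strategy(build_size_bytes, memory_budget_bytes):
    """Hypothetical check: fall back when the build side will not fit."""
    if build_size_bytes <= memory_budget_bytes:
        return "hash"
    return "sort_merge"
```

With skewed data, a single partition's build side can exceed the budget even when the average partition is small, which is the case the fallback is meant to catch.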

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify SQL part of the SQL/Dataframe plugin labels May 28, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 13, 2020
@sameerz
Collaborator

sameerz commented Oct 13, 2020

We should see what Blazing's approach for out of core joins is.

wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Oct 29, 2020
wjxiz1992 added a commit to wjxiz1992/spark-rapids that referenced this issue Oct 29, 2020
@revans2 revans2 added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Feb 18, 2021
@revans2
Collaborator Author

revans2 commented Feb 18, 2021

We are working with cudf on getting something that can just return the gather maps

rapidsai/cudf#6480

and then we can try and chunk the output result with a little help from

rapidsai/cudf#7408
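
To make the idea concrete, here is a rough sketch of a join that returns only gather maps (row-index pairs) and a separate step that materializes the output in chunks. The function names and the `chunk_rows` parameter are assumptions made for illustration; they are not cudf APIs, and plain Python lists stand in for device columns.

```python
def inner_join_gather_maps(left_keys, right_keys):
    """Return (left_map, right_map): parallel lists of matching row indices."""
    positions = {}
    for j, k in enumerate(right_keys):
        positions.setdefault(k, []).append(j)

    left_map, right_map = [], []
    for i, k in enumerate(left_keys):
        for j in positions.get(k, ()):
            left_map.append(i)
            right_map.append(j)
    return left_map, right_map


def gather_in_chunks(left_rows, right_rows, left_map, right_map, chunk_rows):
    """Materialize the join output at most chunk_rows rows at a time."""
    for start in range(0, len(left_map), chunk_rows):
        end = start + chunk_rows
        yield [
            (left_rows[i], right_rows[j])
            for i, j in zip(left_map[start:end], right_map[start:end])
        ]
```

The maps hold only indices, so they are much smaller than the joined rows, and the size of each output batch is bounded by `chunk_rows` regardless of how much the join expands.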

@revans2 revans2 self-assigned this Apr 29, 2021
@revans2 revans2 added this to the Apr 26 - May 7 milestone Apr 29, 2021
@jlowe
Member

jlowe commented May 5, 2021

@revans2 do you consider this addressed by the recent join work or are there still items left to address?

@revans2
Collaborator Author

revans2 commented May 5, 2021

@jlowe I am not ready to say that yet. It is really close, and I would not have a problem if we wanted to close this and just track the follow-on work for a sort merge join fallback with #2252.

That fallback is required to support FullOuter join out of core, and also to support large join data on both sides. Right now all of the joins except cross join require that all of the data for at least one side fits in a single batch. In some cases you can work around this with more partitioning, but not in all cases.
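
For readers unfamiliar with the technique, a minimal inner sort merge join can be sketched as below. It is a generic textbook version, not the planned plugin implementation: `sort_merge_inner_join` is a hypothetical name, and in-memory lists of `(key, value)` pairs stand in for sorted, possibly spilled, batches.

```python
def sort_merge_inner_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Yields (key, left_value, right_value) for every matching pair.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for a in range(i, i_end):
                for b in range(j, j_end):
                    yield (lk, left[a][1], right[b][1])
            i, j = i_end, j_end
```

Both inputs are consumed front to back, and only the current run of equal keys has to be resident at once, so neither side needs to fit in a single batch. A heavily skewed key still produces a large run, which is one reason more partitioning does not help in every case.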

@jlowe jlowe added the epic Issue that encompasses a significant feature or body of work label May 5, 2021
@jlowe
Member

jlowe commented May 5, 2021

Since it's not done I'd rather leave it open to track. I suspected there was some extra work left, and now that you've linked what it is here, I'm good. Thanks!

@revans2
Collaborator Author

revans2 commented May 6, 2021

Sorry, I pointed to the wrong issue. It should be #2354. I thought I had hit the button to file it, but I had not, and my search for "sort merge" turned up the wrong issue.

@sameerz sameerz removed this from the Apr 26 - May 7 milestone May 11, 2021
@sameerz
Collaborator

sameerz commented May 11, 2021

Based on PR #2310, for hash joins, if we can materialize the gather map, we can likely complete the out of core join.

@sameerz
Collaborator

sameerz commented May 11, 2021

Closing based on follow-up work tracked in #2354.

@sameerz sameerz closed this as completed May 11, 2021
@sameerz sameerz removed the epic Issue that encompasses a significant feature or body of work label May 11, 2021
@sameerz sameerz added this to the May 10 - May 21 milestone May 11, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <[email protected]>
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this issue Jan 16, 2024
* polish the debug log to list the hash code
---------

Signed-off-by: Firestarman <[email protected]>
binmahone pushed a commit to binmahone/spark-rapids that referenced this issue Jun 25, 2024
…hen-perf

case when improvement: avoid copy_if_else

3 participants