-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Support out of core joins #20
Comments
We should see what Blazing's approach for out of core joins is. |
Update databricks user doc
We are working with cudf on getting something that can just return the gather maps and then we can try and chunk the output result with a little help from |
@revans2 do you consider this addressed by the recent join work or are there still items left to address? |
@jlowe I am not ready to say that yet. It is really close and I would not have a problem if we wanted to close this and just track the follow on work for a sort merge join fallback with #2252 This is required to be able to support FullOuter join as out of core, and also to be able to support large join data on both sides. Right now all of the joins except cross join require that we can fit all of the data for at least one side in a single batch. In some cases you can work around this with more partitioning, but not in all cases. |
Since it's not done I'd rather leave it open to track. I suspected there was some extra work left, and now that you've linked what it is here, I'm good. Thanks! |
Sorry I pointed to the wrong issue. It should be #2354. I thought I had hit the button to file it but didn't and my search for "sort merge" turned up the wrong issue. |
Based on PR #2310, for hash joins, if we can materialize the gather map, we can likely complete the out of core join. |
Closing based on follow up work tracked in #2354 |
Signed-off-by: spark-rapids automation <[email protected]>
* polish the debug log to list the hash code --------- Signed-off-by: Firestarman <[email protected]>
…hen-perf case when improvement: avoid copy_if_else
Is your feature request related to a problem? Please describe.
Some joins can be very large and with data skew can make it difficult to do what we do now where we hold one of the join tables in memory while streaming through the other side in batches.
This is to detect when that is not going to work and switch over to a sort merge join when needed preferably using #19 for the sorting.
The text was updated successfully, but these errors were encountered: