[FEA] Support a sort merge join as a fallback on the GPU #2354
Labels: duplicate, feature request, reliability
This work should happen after the basic out-of-core join (#20), which just allows chunking the output of an exploding join, is implemented.
Hash joins are always going to be faster than sort merge joins because there is no need to sort the data. But sort merge joins are really the only option in some cases when we want to do a join where the data is large on both sides. Also, a FullOuter join cannot work unless all of the data for a given key is part of the current join; a sort merge join can handle that, whereas a hash join cannot unless all of the data for both sides is in memory.
The idea is that we start off trying to do a regular hash join. We read in the build side table until it hits a size limit (probably the target batch size). If the build table is too large then we switch over to doing a sort merge join.
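A minimal sketch of that fallback decision is below. It only illustrates the control flow: `ColumnarBatch` is the standard Spark type, but the size function and the two join implementations are passed in as parameters because none of these names are real spark-rapids entry points.

```scala
import org.apache.spark.sql.vectorized.ColumnarBatch
import scala.collection.mutable.ArrayBuffer

// Control-flow sketch only. `sizeOf`, `hashJoin`, and `sortMergeJoin` are
// hypothetical stand-ins for the real GPU implementations.
def joinWithFallback(
    buildIter: Iterator[ColumnarBatch],
    streamIter: Iterator[ColumnarBatch],
    targetBatchSize: Long,
    sizeOf: ColumnarBatch => Long,
    hashJoin: (Seq[ColumnarBatch], Iterator[ColumnarBatch]) => Iterator[ColumnarBatch],
    sortMergeJoin: (Iterator[ColumnarBatch], Iterator[ColumnarBatch]) => Iterator[ColumnarBatch]
): Iterator[ColumnarBatch] = {
  val buffered = ArrayBuffer[ColumnarBatch]()
  var bufferedBytes = 0L
  // Read the build side until it is exhausted or it passes the size limit
  // (probably the target batch size).
  while (bufferedBytes <= targetBatchSize && buildIter.hasNext) {
    val batch = buildIter.next()
    buffered += batch
    bufferedBytes += sizeOf(batch)
  }
  if (bufferedBytes <= targetBatchSize) {
    // The build table fit under the limit: do a regular hash join.
    hashJoin(buffered.toSeq, streamIter)
  } else {
    // The build table is too large: switch to a sort merge join, feeding the
    // batches we already read/cached back in ahead of the rest.
    sortMergeJoin(buffered.iterator ++ buildIter, streamIter)
  }
}
```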
In the case of a FullOuter join we also need to batch the stream side (because there really is no stream side). Similarly, we would read in the stream side table until we have all of the data or we hit a size limit. If we go over the size limit, then we switch over to a sort merge join.
To switch to a sort merge join we will create a sorting iterator for each side (build side and stream side), and insert into it any data that we have already read/cached. After that, on each pass:

1. Get a batch from each side and look at the join keys of the last row in each batch.
2. Do a lower bound on both batches with both sets of keys to see how much of each batch we can keep for the join. Really it should be the lower bound of the lower of the two keys, but instead of trying to compare keys on the CPU we can just let the GPU do the lower bounds and compare the offsets.
3. Split each batch at the lowest offset for either set of keys.
4. Save the data that is not ready (everything above that cutoff) so it can be combined with the next batch we read, and do the join on the data that is ready (everything below the cutoff).
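A rough sketch of one such split step is below. `lastRowKeys`, `lowerBound`, and `splitAt` are hypothetical stand-ins for GPU operations (passed in as parameters rather than invented API calls); the point is only to show where the cutoffs come from and what gets carried over to the next batch.

```scala
import org.apache.spark.sql.vectorized.ColumnarBatch

// Rows in `ready` can be joined now; `leftover` is saved and concatenated
// with the next batch read from that side.
case class SideSplit(ready: ColumnarBatch, leftover: ColumnarBatch)

// One sort-merge step over a sorted batch from each side. All helpers are
// hypothetical: `lastRowKeys` pulls the join keys of the final row of a batch,
// `lowerBound` returns the offset of the first row >= the given keys, and
// `splitAt` slices a batch at a row offset into (below, at-and-above).
def splitForJoin(
    buildBatch: ColumnarBatch,
    streamBatch: ColumnarBatch,
    lastRowKeys: ColumnarBatch => ColumnarBatch,
    lowerBound: (ColumnarBatch, ColumnarBatch) => Int,
    splitAt: (ColumnarBatch, Int) => (ColumnarBatch, ColumnarBatch)
): (SideSplit, SideSplit) = {
  val buildLast = lastRowKeys(buildBatch)
  val streamLast = lastRowKeys(streamBatch)
  // Instead of comparing the two last keys on the CPU, do a lower bound with
  // both key sets on both batches and take the smaller offset on each side.
  // The smaller offset is the cutoff for the lower of the two last keys.
  val buildCut = math.min(lowerBound(buildBatch, buildLast),
                          lowerBound(buildBatch, streamLast))
  val streamCut = math.min(lowerBound(streamBatch, buildLast),
                           lowerBound(streamBatch, streamLast))
  val (buildReady, buildLeftover) = splitAt(buildBatch, buildCut)
  val (streamReady, streamLeftover) = splitAt(streamBatch, streamCut)
  (SideSplit(buildReady, buildLeftover), SideSplit(streamReady, streamLeftover))
}
```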
In some cases we might end up with an empty table on one side or the other. That is OK; we just do the join and go on. If we end up with an empty table on both sides, that means we have a single key with enough rows to take up an entire batch. Just skip the join, continue to read the next batch, and hopefully we then have enough memory/rows to make it work.
If we start to run into a lot of cases where keys do not fit in a single batch, we might be able to add some other optimizations. For all of the joins except FullOuter, if we have the full set of rows for a key on the build side, we can do the join without waiting for all of the rows for that key to show up on the stream side. But then we also have to make sure that we do not throw out that build side data until all of the corresponding keys have been seen on the stream side.
For InnerLike joins we can also do the equivalent of a cross join for a given key: do a join of each batch on the left hand side paired with each batch on the right hand side.
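A purely illustrative sketch of that per-key idea, where `joinPair` is a hypothetical stand-in for a single GPU inner join of two batches on the join keys:

```scala
import org.apache.spark.sql.vectorized.ColumnarBatch

// For one oversized key, join every build-side batch with every stream-side
// batch, like a cross join over the batches of that key.
def joinOversizedKey(
    buildBatches: Seq[ColumnarBatch],
    streamBatches: Seq[ColumnarBatch],
    joinPair: (ColumnarBatch, ColumnarBatch) => ColumnarBatch
): Seq[ColumnarBatch] = {
  for {
    b <- buildBatches
    s <- streamBatches
  } yield joinPair(b, s)
}
```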
But both of these optimizations are complicated and hopefully unnecessary.