Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added in basic support for broadcast nested loop join #296

Merged
merged 3 commits into from
Jun 29, 2020

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Jun 26, 2020

I would appreciate some reviews on this.

It is a part of #265 but is missing CartesianExec which is an implementation that shows up if one of the tables is too larger to be broadcast and it is an inner or cross join with no equality comparison.

It adds in support for Cross equality joins, that in those cases are the same as an Inner join.

It also add in support for BroadcastNestedLoopJoin on Cross and Inner joins. The biggest issue is the amount of memory that could be used by a Cross join this big.

I plan on trying to use the current memory size of each table (left and right) to decide if we should play some games with memory. If the size is too large then we break the tables down into smaller pieces.

i.e.

left_size_per_row = left_size_memory/left_rows
right_size_per_row = right_size_memory/right_rows

if (((left_size_per_row + right_size_per_row) * left_rows * right_rows) > target_batch_size) {
  split the tables.  Preferable just split the stream table, but if we cannot, then split both and loop.
}

This would still not fix all cases, because we could broadcast something really large and blow up from just trying to hold it in memory.

@revans2 revans2 added the SQL part of the SQL/Dataframe plugin label Jun 26, 2020
@revans2 revans2 self-assigned this Jun 26, 2020
@revans2
Copy link
Collaborator Author

revans2 commented Jun 26, 2020

build

@jlowe
Copy link
Contributor

jlowe commented Jun 26, 2020

Took a quick glance, seems fine to me. As you noted, would be great to explore sharing a lot of the boilerplate build type handling, output distribution, etc. that is common with the existing hash join.

@revans2 revans2 changed the title [WIP] Added in basic support for broadcast nested loop join Added in basic support for broadcast nested loop join Jun 29, 2020
@revans2
Copy link
Collaborator Author

revans2 commented Jun 29, 2020

build

@revans2
Copy link
Collaborator Author

revans2 commented Jun 29, 2020

I updated the code so BroadcastnestedLoopJoin is off by default. I could not find a clean way to reuse anything between it and the other join implementations. It would take about as much code to make it common as it saved, so I have left them separate. We might revisit it again in the future if someone has a better idea on how to do it (I tried a trait to mix in).

I also filed a follow on issue #302 to try and fix some of the memory issues and let us turn this on by default.

@revans2 revans2 merged commit 41980f0 into NVIDIA:branch-0.2 Jun 29, 2020
@revans2 revans2 deleted the b_n_l_j branch June 29, 2020 16:58
@sameerz sameerz added this to the Jun 22 - Jul 2 milestone Jul 2, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
[auto-merge] bot-auto-merge-branch-22.06 to branch-22.08 [skip ci] [bot]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SQL part of the SQL/Dataframe plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants