Added in basic support for broadcast nested loop join #296

revans2 · 2020-06-26T21:38:00Z

I would appreciate some reviews on this.

It is part of #265 but is missing CartesianExec, the implementation that shows up when one of the tables is too large to be broadcast and the join is an inner or cross join with no equality comparison.

It adds support for Cross joins that have an equality condition, which in that case are the same as an Inner join.
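To illustrate why a Cross join with an equality condition can be treated as an Inner join, here is a minimal sketch in plain Python. It is not code from the plugin; the helper names (`cross_join`, `inner_join`) and the sample rows are made up for illustration.

```python
from itertools import product


def cross_join(left, right):
    """Pair every left row with every right row (Cartesian product)."""
    return [(l, r) for l, r in product(left, right)]


def inner_join(left, right, left_key, right_key):
    """Hash-based inner equi-join: index the right side, probe with the left."""
    index = {}
    for r in right:
        index.setdefault(right_key(r), []).append(r)
    return [(l, r) for l in left for r in index.get(left_key(l), [])]


# Hypothetical sample data: (id, value) rows.
left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]

# A cross join filtered by an equality condition...
cross_with_equality = [(l, r) for l, r in cross_join(left, right) if l[0] == r[0]]

# ...yields the same rows as an inner join on that key.
inner = inner_join(left, right, lambda l: l[0], lambda r: r[0])
```

Both results contain the same three matched pairs, so an engine is free to plan the cheaper equi-join instead of materializing the full product.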

It also adds support for BroadcastNestedLoopJoin on Cross and Inner joins. The biggest issue is the amount of memory that a Cross join this big could use.

I plan on trying to use the current memory size of each table (left and right) to decide if we should play some games with memory. If the size is too large then we break the tables down into smaller pieces.

i.e.

left_size_per_row = left_size_memory/left_rows
right_size_per_row = right_size_memory/right_rows

if (((left_size_per_row + right_size_per_row) * left_rows * right_rows) > target_batch_size) {
  // Split the tables. Preferably split only the stream table; if that is not enough, split both and loop.
}

This would still not fix all cases, because we could broadcast something really large and blow up from just trying to hold it in memory.
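The size check and splitting idea above could be sketched roughly as follows. This is only an illustration of the plan described here, not the plugin's implementation; the function names, the stream/build parameter naming, and the chunking policy (one stream row per chunk when both sides must be split) are assumptions.

```python
def estimated_cross_join_bytes(stream_bytes, stream_rows, build_bytes, build_rows):
    """Estimate output size: (bytes per joined row) * (number of joined rows)."""
    if stream_rows == 0 or build_rows == 0:
        return 0
    pair_bytes = stream_bytes / stream_rows + build_bytes / build_rows
    return pair_bytes * stream_rows * build_rows


def plan_chunks(stream_bytes, stream_rows, build_bytes, build_rows, target_batch_bytes):
    """Return (stream_chunk_rows, build_chunk_rows) so that each chunk pair's
    estimated cross-join output stays within target_batch_bytes when possible."""
    if stream_rows == 0 or build_rows == 0:
        return stream_rows, build_rows  # empty output, nothing to split

    pair_bytes = stream_bytes / stream_rows + build_bytes / build_rows
    if pair_bytes * stream_rows * build_rows <= target_batch_bytes:
        return stream_rows, build_rows  # fits as-is

    # Preferred: split only the stream table and keep the build table whole.
    stream_chunk = int(target_batch_bytes // (pair_bytes * build_rows))
    if stream_chunk >= 1:
        return stream_chunk, build_rows

    # Even one stream row against the whole build table is too big:
    # split both sides and loop over the chunk pairs.
    build_chunk = max(1, int(target_batch_bytes // pair_bytes))
    return 1, build_chunk
```

For example, with a 1000-byte, 100-row stream table and a 400-byte, 50-row build table, a 9000-byte target leads to stream chunks of 10 rows with the build table left whole. As noted above, this does nothing for the case where the broadcast table itself is too large to hold in memory.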

revans2 · 2020-06-26T21:38:10Z

build

...in/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinExec.scala

jlowe · 2020-06-26T21:51:25Z

Took a quick glance, seems fine to me. As you noted, would be great to explore sharing a lot of the boilerplate build type handling, output distribution, etc. that is common with the existing hash join.

revans2 · 2020-06-29T16:19:37Z

build

revans2 · 2020-06-29T16:22:06Z

I updated the code so BroadcastNestedLoopJoin is off by default. I could not find a clean way to reuse anything between it and the other join implementations. It would take about as much code to make it common as it would save, so I have left them separate. We might revisit this in the future if someone has a better idea of how to do it (I tried a trait to mix in).

I also filed a follow-on issue, #302, to try to fix some of the memory issues and let us turn this on by default.

Added in basic support for broadcast nested loop join

0062558

revans2 added the SQL part of the SQL/Dataframe plugin label Jun 26, 2020

revans2 self-assigned this Jun 26, 2020

revans2 commented Jun 26, 2020

...in/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinExec.scala

revans2 mentioned this pull request Jun 29, 2020

[FEA] Better Memory Management for BroadcastNestedLoopJoin #302

Closed

revans2 added 2 commits June 29, 2020 11:15

Addressed review comments

5a26a7a

Merge branch 'branch-0.2' into b_n_l_j

3af73d0

revans2 changed the title ~~[WIP] Added in basic support for broadcast nested loop join~~ Added in basic support for broadcast nested loop join Jun 29, 2020

jlowe approved these changes Jun 29, 2020

revans2 merged commit 41980f0 into NVIDIA:branch-0.2 Jun 29, 2020

revans2 deleted the b_n_l_j branch June 29, 2020 16:58

sameerz added this to the Jun 22 - Jul 2 milestone Jul 2, 2020

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Added in basic support for broadcast nested loop join (NVIDIA#296)

1170ca0

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Added in basic support for broadcast nested loop join (NVIDIA#296)

91158d5

pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request May 12, 2022

Fix isort/black/flake8 issues (NVIDIA#296)

903a494

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023

Merge pull request NVIDIA#296 from NVIDIA/bot-auto-merge-branch-22.06

b3b3717

[auto-merge] bot-auto-merge-branch-22.06 to branch-22.08 [skip ci] [bot]
