Sub-partitioning supports repartitioning the input data multiple times #7996

Merged: 5 commits into NVIDIA:branch-23.06 on Apr 10, 2023

Conversation

@firestarman (Collaborator) commented on Apr 3, 2023:

closes #7911

This adds support for repartitioning the input data multiple times with different hash seeds during sub-partitioning, for cases where the initial partition count is not large enough to over-partition the data. It also calculates the actual partition count needed for the over-partitioning.
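
For context, a minimal, self-contained Scala sketch of the idea behind the different seeds (partitionOf is a hypothetical helper, not the plugin's code): rehashing the same keys with a new seed produces a different partition mapping, which is what lets a second pass break up a skewed partition.

    import scala.util.hashing.MurmurHash3

    // Hypothetical helper: map a key to a partition index for a given seed.
    def partitionOf(key: String, numPartitions: Int, seed: Int): Int = {
      val hash = MurmurHash3.stringHash(key, seed)
      java.lang.Math.floorMod(hash, numPartitions)
    }

    // Repartitioning the same keys with a different seed produces a different
    // mapping, so rows that piled up in one partition under the first seed are
    // generally spread out again under the second.
    val keys = Seq("a1", "a2", "a3", "a4")
    val firstPass  = keys.map(k => partitionOf(k, numPartitions = 4, seed = 42))
    val secondPass = keys.map(k => partitionOf(k, numPartitions = 4, seed = 43))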

@firestarman (Collaborator, Author) commented:

build

* Always return the current seed.
* This is intended to share the same seed across sub-partitioners.
*/
override final def nextSeed: Int = seedGenerator.currentSeed
Collaborator:

nit: This is confusing. I would prefer it if we had a way to pass the seed directly into a sub-partitioner instead.
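
(As a purely illustrative sketch of that suggestion, with hypothetical names that are not the plugin's API: the seed becomes an explicit constructor argument, so sharing one seed across the build- and stream-side sub-partitioners is visible at the call site rather than hidden in a shared generator.)

    // Illustrative only: the seed is passed in directly rather than read from a
    // shared seed generator.
    class SimpleSubPartitioner(numPartitions: Int, hashSeed: Int) {
      def partitionIndexOf(rowHash: Int): Int =
        java.lang.Math.floorMod(rowHash ^ hashSeed, numPartitions)
    }

    // Both sides are constructed with the same explicit seed.
    val seed = 100
    val buildSide  = new SimpleSubPartitioner(numPartitions = 16, hashSeed = seed)
    val streamSide = new SimpleSubPartitioner(numPartitions = 16, hashSeed = seed)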

@firestarman (Author) replied:

Done

}

private[this] def needRepartition(parts: SubPartitionBuffer): Boolean = {
// FIXME Is it good enough to ask for repartitioning when there exists any sub
Collaborator:

I think we would want to stop for a given batch if we try to partition the batch and it didn't change. Meaning we tried to partition it N ways and we got back N-1 empty batches and 1 batch with all of the data in it.
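
(A tiny sketch of that stopping condition, assuming per-sub-partition row counts are available; the name is hypothetical:)

    // Stop repartitioning a batch if splitting it N ways left all of its rows in
    // a single sub-partition: splitting it again the same way will not help.
    def repartitionMadeProgress(subPartitionRowCounts: Seq[Long]): Boolean =
      subPartitionRowCounts.count(_ > 0) > 1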

That said, I think we only need at most two passes through the data. The algorithm I see would be something like the following (a rough Scala sketch follows the list).

  • If the build side < target batch size
    • do the join as normal
  • else
    • partition the build/stream sides according to the configured number of splits
    • group build partitions to try and be just under target batch size and do the join for all build partitions that met this goal
    • for each build partition > target batch size
      • num_partitions = ceil(build size / target batch size)
      • partition the build and stream batches using the new seed and the calculated number of partitions
      • group partitions together again like before and do the join no matter how large the size is.
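
A rough Scala sketch of the plan above, assuming we only track each build partition's size in bytes (all names are hypothetical; the grouping and join steps are elided):

    final case class SubPartition(sizeBytes: Long /* plus the actual batches */)

    // For each build partition still larger than the target batch size, compute
    // the seed and partition count for its second (and final) repartition pass.
    // Partitions already under the target would be grouped and joined as normal.
    def planSecondPass(
        buildParts: Seq[SubPartition],
        targetBatchSize: Long,
        firstSeed: Int): Seq[(SubPartition, Int, Int)] = {
      buildParts
        .filter(_.sizeBytes > targetBatchSize)
        .map { p =>
          val numPartitions = math.ceil(p.sizeBytes.toDouble / targetBatchSize).toInt
          (p, firstSeed + 1, numPartitions) // any seed different from the first pass
        }
    }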

@firestarman (Author) replied:

Done.

@sameerz added the label reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) on Apr 7, 2023
@firestarman requested a review from revans2 on April 7, 2023 09:06
@firestarman (Collaborator, Author) commented:

build

@revans2 (Collaborator) left a comment:

Looks good to me, but I would like to have at least one other person look at this too.

@abellina (Collaborator) left a comment:

I just had a non-blocking comment.

pair = Some(new PartitionPair(buildBatch, streamBatches))
}
}
} else if (bigBuildBatches.nonEmpty) {
Collaborator:

The way this seems to be working is that big batches are handled later, after the original sub-partitioner iterators are flushed. Do I understand that correctly? If so, I wonder if it makes sense to stop pulling from the sub-partitioner and temporarily give priority to the big batches, because that should alleviate the need to spill all the batches in bigBuildBatches.

Collaborator:

Yes, the big batches come later. That is what I suggested we do above in #7996 (comment).

They come later because it allows us to know their size and repartition them in a single pass, instead of possibly needing to split the data more than twice. I think we might be able to do something similar to what you want, but we would still have to process the entire build side first. Once we have pulled in the build side and repartitioned it the first time, we would know which batches need to be repartitioned a second time and how many sub-partitions each would need. But then we have to hand the same plan over to the stream-side partitioner: we would have to tell it the seed for the first partition pass and how many partitions to use, along with which partitions need a second pass and the corresponding seeds and partition counts for each of them.
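
(For illustration, such a shared plan might look roughly like this; none of these names exist in the plugin.)

    // Illustrative shape of a shared repartition plan: the first-pass seed and
    // partition count, plus, for each first-pass partition that is still too big,
    // the seed and partition count to use for its second pass.
    final case class RepartitionPlan(
        firstSeed: Int,
        firstNumPartitions: Int,
        secondPass: Map[Int, (Int, Int)]) // partition index -> (seed, numPartitions)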

It would be doable, but it is a fairly large change, and it would be a performance improvement when we are in a memory-constrained situation. So we would want to have a few queries that exhibit this behavior that we could use to benchmark it.

That all sounds rather involved, so I think it would be best to just file a follow-on issue to look at it, and we can then prioritize it in the backlog according to what management decides.

Collaborator:

This sounds good to me. I filed #8057

@firestarman merged commit 19a658f into NVIDIA:branch-23.06 on Apr 10, 2023
@firestarman deleted the multi-sub-part branch on April 10, 2023 05:02
abellina pushed a commit to abellina/spark-rapids that referenced this pull request Apr 14, 2023
Successfully merging this pull request may close these issues: Support re-partitioning large data multiple times and each time with a different seed.