[FEA] Follow on: hash join sub-partitioning could prioritize larger batches in multi-level sub-partitioning scheme #8057

abellina · 2023-04-07T14:46:08Z

I am filing this issue to capture @revans2 comment in @firestarman's PR for multi-level sub-partitioning #7996.

Paraphrasing Bobby's comment:

The build side sub partitioner is drained and big batches handled later here (#7996) because it allows us to know their size and repartition them in a single pass instead of needing to possibly split it more than twice. In order to not leave the largest batches for last (likely incurring spill costs), we would still have to process the entire build side first. Once we have pulled in the build side and repartitioned it the first time, we could then know which batches we would want to repartition a second time and how many sub-partitions we would need. But then we have to hand the same plan over to the steam side partitioner. We would have to be able to tell it what the seed is for the first partition pass and how many partitions to use, along with what partitions would need a second pass and the corresponding seeds and number of partitions for each of them.

This is a fairly large change, and it would be a performance improvement when we are in a memory constrained situation. So we would want to have a few queries that exhibited this behavior that we could use to benchmark it.

abellina added feature request New feature or request ? - Needs Triage Need team to review and classify performance A performance related task/issue reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Apr 7, 2023

abellina mentioned this issue Apr 7, 2023

Sub-partitioning supports repartitioning the input data multiple times #7996

Merged

firestarman self-assigned this Apr 10, 2023

firestarman mentioned this issue Apr 12, 2023

[FEA] Improve the sub-partitioning algorithm for large/skewed hash joins #7832

Closed

7 tasks

mattahrens removed the ? - Needs Triage Need team to review and classify label Apr 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Follow on: hash join sub-partitioning could prioritize larger batches in multi-level sub-partitioning scheme #8057

[FEA] Follow on: hash join sub-partitioning could prioritize larger batches in multi-level sub-partitioning scheme #8057

abellina commented Apr 7, 2023

[FEA] Follow on: hash join sub-partitioning could prioritize larger batches in multi-level sub-partitioning scheme #8057

[FEA] Follow on: hash join sub-partitioning could prioritize larger batches in multi-level sub-partitioning scheme #8057

Comments

abellina commented Apr 7, 2023