Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Follow on: hash join sub-partitioning could prioritize larger batches in multi-level sub-partitioning scheme #8057

Open
abellina opened this issue Apr 7, 2023 · 0 comments
Assignees
Labels
feature request New feature or request performance A performance related task/issue reliability Features to improve reliability or bugs that severly impact the reliability of the plugin

Comments

@abellina
Copy link
Collaborator

abellina commented Apr 7, 2023

I am filing this issue to capture @revans2 comment in @firestarman's PR for multi-level sub-partitioning #7996.

Paraphrasing Bobby's comment:

The build side sub partitioner is drained and big batches handled later here (#7996) because it allows us to know their size and repartition them in a single pass instead of needing to possibly split it more than twice. In order to not leave the largest batches for last (likely incurring spill costs), we would still have to process the entire build side first. Once we have pulled in the build side and repartitioned it the first time, we could then know which batches we would want to repartition a second time and how many sub-partitions we would need. But then we have to hand the same plan over to the steam side partitioner. We would have to be able to tell it what the seed is for the first partition pass and how many partitions to use, along with what partitions would need a second pass and the corresponding seeds and number of partitions for each of them.

This is a fairly large change, and it would be a performance improvement when we are in a memory constrained situation. So we would want to have a few queries that exhibited this behavior that we could use to benchmark it.

@abellina abellina added feature request New feature or request ? - Needs Triage Need team to review and classify performance A performance related task/issue reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Apr 7, 2023
@firestarman firestarman self-assigned this Apr 10, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Apr 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request performance A performance related task/issue reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

No branches or pull requests

3 participants