Fix NestedLoopJoin performance regression #12531
Merged
ozankabak merged 6 commits into apache:main from synnada-ai:fix/nested-loop-join-performance-regression on Sep 20, 2024
Commits (6)
e0f9ca7  Optimize apply_join_filter_to_indices calls (alihan-synnada)
c91d60b  Optimize join indices calculation (alihan-synnada)
253e207  Cache join indices (alihan-synnada)
5670aba  Update datafusion/physical-plan/src/joins/nested_loop_join.rs (ozankabak)
bd2def0  Fix missing flag for adjust_indices_by_join_type (alihan-synnada)
6a26fa6  Fix SQL logic test (alihan-synnada)
Conversations
In the case of a 25-row build side there are 200k arrays, for 500 rows about 4 million, and so on (I suppose we don't need that much data on the build side for these arrays to reach GBs in size). I understand that we will still have to create intermediate batches to apply the filter and produce output batches, but I suppose that starting from some point the size of these caches will become meaningful.
I guess we can do away with the cache or make it optional. If we remove the cache, we could create the indices and apply the filter in chunks, similar to before. If we pass in a range to calculate the indices for, instead of creating `right_batch.num_rows()` chunks, we can control the size of the intermediate batches too. Something like `(0..output_row_count).chunks(CHUNK_SIZE)` should do the trick, now that we create the indices by mapping the current row index. I believe this can bring the performance without the cache down to a level similar to before the regression, maybe even better. I'll run a few benchmarks with this setup without a cache and update the benchmarks table.
The chunks approach didn't change the performance, but it helped reduce the sizes of the intermediate batches. The 10% performance hit without a cache comes from the way the arrays are constructed, and I couldn't find a faster approach for now. I suggest we go with the cached approach for now. Once the issue that enables NLJ to emit massive batches is implemented, we can choose between the cached and chunked approaches depending on NLJ's output size. I'll open an issue about it.
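As a sketch of the cached-vs-chunked selection mentioned here, a strategy switch keyed on the estimated output row count could look roughly like the following; the enum, function name, and threshold are all assumptions for illustration, not part of this PR.

```rust
/// Hypothetical strategy selector, for illustration only: below some output
/// size the cached join indices (this PR's approach) win; above it, chunked
/// index construction bounds intermediate memory. The threshold is a guess.
#[derive(Debug, PartialEq)]
enum IndexStrategy {
    Cached,
    Chunked,
}

const CACHED_OUTPUT_ROW_LIMIT: usize = 1 << 20; // assumed, not benchmarked

fn choose_index_strategy(build_rows: usize, probe_rows: usize) -> IndexStrategy {
    if build_rows.saturating_mul(probe_rows) <= CACHED_OUTPUT_ROW_LIMIT {
        IndexStrategy::Cached
    } else {
        IndexStrategy::Chunked
    }
}

fn main() {
    // 25 build rows against an 8192-row probe batch stays small enough to cache.
    assert_eq!(choose_index_strategy(25, 8192), IndexStrategy::Cached);
    // A much larger cross product would fall back to chunked construction.
    assert_eq!(choose_index_strategy(500_000, 8192), IndexStrategy::Chunked);
}
```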
Sounds good. I will merge this soon to avoid performance issues in any upcoming release, unless there is more feedback. We seem to gain 20% performance relative to before with the caches, and we can migrate to a cached-vs-chunked approach, chosen by output batch size, in the future.
Thank you for checking this option.