-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify merge internals and reduce overhead #9516
Simplify merge internals and reduce overhead #9516
Conversation
…ely, and take advantage to always precompute key cols with identical names.
…me duplicate loops.
…ctionality into one class.
Is this PR still intended for 21.12 or should it be retargeted to |
It's going to 22.02. I moved it on the project board, I just haven't retargeted the branch since I was waiting for #9178 to get merged since that introduces merge conflicts that needed to get resolved and at the time the 21.12->22.02 forward merge wasn't activated yet. I'll resolve conflicts and update the target today. |
@gpucibot merge |
This PR is a pretty thorough rewrite of the internals of merging. There is a ton of complexity imposed by matching all the different edge cases allowed by the pandas API, but I've tried to unify the logic for different code paths as much as possible. I've also added checks for a number of edge cases that were not previously being handled. I see about a 10% performance improvement for merges on small to medium data sizes from this PR (as expected, there's no change for large data where most time is spent in C++). There's also a substantial reduction in total code that should make it easier to address issues going forward. I'm still not entirely happy with the complexity of the result and I think that further simplification should be possible, but I think this is a sufficiently large step forward to be worth pushing forward in this state, especially if it helps enable other changes to joining.