Simplify merge internals and reduce overhead #9516

Merged
40 commits merged into rapidsai:branch-22.02 from refactor/merging_part1
Nov 17, 2021


vyasr
Contributor

@vyasr vyasr commented Oct 25, 2021

This PR is a fairly thorough rewrite of the merge internals. Matching all the edge cases allowed by the pandas API imposes a lot of complexity, but I've tried to unify the logic across the different code paths as much as possible. I've also added checks for a number of edge cases that were not previously handled. I see about a 10% performance improvement for merges on small to medium data sizes (as expected, there's no change for large data, where most of the time is spent in C++). There's also a substantial reduction in total code, which should make it easier to address issues going forward. I'm still not entirely happy with the complexity of the result, and I think further simplification should be possible, but this is a large enough step forward to be worth merging in its current state, especially if it helps enable other changes to joining.
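
For readers unfamiliar with the edge cases mentioned above, here is a minimal sketch of three ways the pandas API lets a caller specify join keys. It uses pandas rather than cuDF so that it runs without a GPU. The frame and column names are invented for illustration, and none of this is code from the PR itself.

```python
import pandas as pd

# Hypothetical example frames; names are illustrative only.
left = pd.DataFrame({"key": [1, 2, 3], "a": [10, 20, 30]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [200, 300, 400]})

# Case 1: the key column has the same name on both sides.
by_on = left.merge(right, on="key", how="inner")

# Case 2: the key columns have different names on each side,
# so both key columns appear in the result.
right_renamed = right.rename(columns={"key": "rkey"})
by_left_right_on = left.merge(
    right_renamed, left_on="key", right_on="rkey", how="inner"
)

# Case 3: a column on the left is matched against the index on the right.
right_indexed = right.set_index("key")
by_index = left.merge(
    right_indexed, left_on="key", right_index=True, how="inner"
)

print(by_on)
print(by_left_right_on)
print(by_index)
```

All three calls match the same rows (keys 2 and 3), but the result columns differ, which is the sort of variation a merge implementation has to account for.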

vyasr added 30 commits October 20, 2021 21:19
…ely, and take advantage to always precompute key cols with identical names.
@github-actions github-actions bot added CMake CMake build issue conda libcudf Affects libcudf (C++/CUDA) code. labels Nov 12, 2021
@charlesbluca
Member

Is this PR still intended for 21.12 or should it be retargeted to branch-22.02?

@vyasr
Contributor Author

vyasr commented Nov 15, 2021

It's going to 22.02. I moved it on the project board but haven't retargeted the branch yet: I was waiting for #9178 to be merged, since it introduces merge conflicts that need to be resolved, and at the time the 21.12->22.02 forward merge wasn't activated yet. I'll resolve the conflicts and update the target today.

@vyasr vyasr requested review from a team as code owners November 16, 2021 19:32
@vyasr vyasr requested review from hyperbolic2346 and karthikeyann and removed request for a team November 16, 2021 19:32
@vyasr vyasr requested a review from a team as a code owner November 16, 2021 19:33
@github-actions github-actions bot added the Java Affects Java cuDF API. label Nov 16, 2021
@vyasr vyasr changed the base branch from branch-21.12 to branch-22.02 November 16, 2021 19:35
@vyasr vyasr removed request for a team, hyperbolic2346 and karthikeyann November 16, 2021 19:36
@vyasr vyasr removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue conda Java Affects Java cuDF API. labels Nov 16, 2021
@shwina
Contributor

shwina commented Nov 17, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 17e6f5b into rapidsai:branch-22.02 Nov 17, 2021
@vyasr vyasr deleted the refactor/merging_part1 branch January 14, 2022 18:02
Labels
3 - Ready for Review: Ready for review by team
improvement: Improvement / enhancement to an existing function
non-breaking: Non-breaking change
Performance: Performance related issue
Python: Affects Python cuDF API.
3 participants