Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Python and Cython internals for groupby aggregation #7818

Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Apr 1, 2021

This PR makes some improvements to the groupby/aggregation code that I identified while working on #7731. The main purpose is to make the code logic easier to follow and reduce some unnecessary complexity; I see minor but measurable performance improvements (2-5% for small datasets) as well, but those are mostly just side effects here. Specifically, it makes the following changes:

  1. Inlines the logic for dropping unsupported aggregations. The old function was repetitive and necessitated looping over the aggregations twice, whereas the new approach drops unwanted aggregations on the fly so it only loops once. The new code also makes it so that you only construct a C aggregation object once.
  2. Merges the logic from _AggregationFactory into Aggregation, and removes the constructor for Aggregation. The one downside here is that the Cython Aggregation object's constructor no longer places it in a valid state; however, in practice the object is always constructed via either the make_aggregation function or its various factories, and the object's constructor was only every used in _drop_unsupported_aggs anyway. The benefit is we remove the fragmentation between these two classes, making the code much more readable, and the Aggregation class actually serves a purpose now beyond just providing a single property kind that is only used once: it is now the primary way that other Cython files interact with aggregations. This also means that in most places other Cython modules don't need to work with unique_ptr[aggregation] as much anymore (although they do still have to move Aggregation.c_obj for performance reasons). make_aggregation now returns the Cython class instead of the underlying C++ one.
  3. Modified all the "allowed aggregations" sets to use the uppercase names of the aggregations. In addition to simplifying the code a tiny bit, this helps reduce confusion between the aggregation names used in Python for pandas compatibility and the libcudf names (for instance, idxmin vs argmin, now ARGMIN).
  4. Explicitly defines all the aggregations on a groupby. I discussed this briefly with @shwina, the change has pros and cons. The benefit is that all of these methods are properly documented now, there's less magic (the binding of methods to a class after its definition can be confusing for less experienced Python developers and has a lot of potential gotchas), and we can use the simpler string-based agg definition wherever possible. The downside is that we now have to define all of these methods. I think the change is definitely an improvement, but I'm happy to change it back if anyone can suggest a better alternative. In the long run we probably need to find a way to share both code and docstrings more effectively between all aggregations (DataFrame, Series, and GroupBy).

vyasr added 30 commits March 24, 2021 12:29
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython looks good.

@vyasr vyasr requested a review from shwina April 13, 2021 16:42
@vyasr vyasr requested a review from shwina April 14, 2021 19:28
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Very nice cleanup, @vyasr!

@karthikeyann
Copy link
Contributor

rerun tests

@karthikeyann
Copy link
Contributor

Why are there only 4 checks showing in this PR?

@cwharris cwharris changed the title [REVIEW] Refactor Python and Cython internals for groupby aggregation Refactor Python and Cython internals for groupby aggregation Apr 15, 2021
@cwharris
Copy link
Contributor

rerun tests

@vyasr
Copy link
Contributor Author

vyasr commented Apr 15, 2021

Why are there only 4 checks showing in this PR?

@karthikeyann all CI was blocked yesterday due to connectivity issues. The queue is slowly being emptied, I'll merge in #7968 once that's finalized then trigger tests again.

@vyasr vyasr requested a review from a team as a code owner April 15, 2021 18:54
@vyasr vyasr requested a review from davidwendt April 15, 2021 18:54
@vyasr vyasr force-pushed the refactor/cleanup_aggregation_code branch from 25794bc to affc9c9 Compare April 15, 2021 18:55
@vyasr vyasr removed request for a team and davidwendt April 15, 2021 18:55
@vyasr
Copy link
Contributor Author

vyasr commented Apr 15, 2021

I can't merge the changes from #7968 without triggering C++ reviews, so I'll just wait for that to get merged into branch-0.20 and merge that down after.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 16, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8a666a0 into rapidsai:branch-0.20 Apr 16, 2021
@vyasr vyasr added this to the cuDF Python Refactoring milestone Jul 22, 2021
@vyasr vyasr deleted the refactor/cleanup_aggregation_code branch January 14, 2022 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants