Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Refactor Python and Cython internals for groupby aggregation (#7818)
This PR makes some improvements to the groupby/aggregation code that I identified while working on #7731. The main purpose is to make the code logic easier to follow and reduce some unnecessary complexity; I see minor but measurable performance improvements (2-5% for small datasets) as well, but those are mostly just side effects here. Specifically, it makes the following changes: 1. Inlines the logic for dropping unsupported aggregations. The old function was repetitive and necessitated looping over the aggregations twice, whereas the new approach drops unwanted aggregations on the fly so it only loops once. The new code also makes it so that you only construct a C aggregation object once. 2. Merges the logic from `_AggregationFactory` into `Aggregation`, and removes the constructor for `Aggregation`. The one downside here is that the Cython `Aggregation` object's constructor no longer places it in a valid state; however, in practice the object is always constructed via either the `make_aggregation` function or its various factories, and the object's constructor was only every used in `_drop_unsupported_aggs` anyway. The benefit is we remove the fragmentation between these two classes, making the code much more readable, and the `Aggregation` class actually serves a purpose now beyond just providing a single property `kind` that is only used once: it is now the primary way that other Cython files interact with aggregations. This also means that in most places other Cython modules don't need to work with `unique_ptr[aggregation]` as much anymore (although they do still have to move `Aggregation.c_obj` for performance reasons). `make_aggregation` now returns the Cython class instead of the underlying C++ one. 3. Modified all the "allowed aggregations" sets to use the uppercase names of the aggregations. In addition to simplifying the code a tiny bit, this helps reduce confusion between the aggregation names used in Python for pandas compatibility and the libcudf names (for instance, `idxmin` vs `argmin`, now `ARGMIN`). 4. Explicitly defines all the aggregations on a groupby. I discussed this briefly with @shwina, the change has pros and cons. The benefit is that all of these methods are properly documented now, there's less magic (the binding of methods to a class after its definition can be confusing for less experienced Python developers and has a lot of potential gotchas), and we can use the simpler string-based agg definition wherever possible. The downside is that we now have to define all of these methods. I think the change is definitely an improvement, but I'm happy to change it back if anyone can suggest a better alternative. In the long run we probably need to find a way to share both code and docstrings more effectively between all aggregations (DataFrame, Series, and GroupBy). Authors: - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Ashwin Srinath (https://github.com/shwina) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #7818
- Loading branch information