
[REVIEW] Support groupby operations for decimal dtypes #7731

Merged: 39 commits merged into rapidsai:branch-0.19 from fix/issue7687_part2 on Apr 1, 2021

Conversation

@vyasr (Contributor) commented Mar 25, 2021

This PR resolves #7687. It also does a bit of cleanup of the internals of the code base. There is more that I would like to do, but in the interest of quickly resolving the issue I'll probably punt anything that I don't directly have to touch for this change to a future PR.

I still need help determining why a few aggregations aren't working. The problems fall into two groups:

  1. The var and std aggregations currently don't fail, but they always return columns filled with NULLs. I found the implementation of the dispatch for these methods in variance.cu/compound.cuh, and at least nominally it seems like these methods are not currently supported because the corresponding enable_if_t is based on whether the type satisfies std::is_arithmetic, which decimal types do not. However, I'm not sure whether the problem is that this classification is incorrect and these types are actually supported by libcudf, or if there really isn't an implementation; I tried to find one, but there are a lot of different files related to aggregation and I'm sure I didn't find all of them. If we simply don't have an implementation, I can remove these from the list of valid aggregations.
  2. The mean, quantile, and median aggregations all raise a RuntimeError from binaryop.hpp: "Input must have fixed_point data_type." I've traced the error to the Cython GroupBy.aggregate method, specifically the line where it calls through to the underlying c_obj's aggregate method. The call stack in C++ is pretty deep after that, though, and I haven't yet been able to pinpoint whether the failure is a missing cast somewhere (i.e. libcudf thinks that the column is a floating point type when it's really not) or if the problem lies elsewhere.
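
To make these failure modes concrete, here is a minimal sketch of what I am seeing. The values, precision, and scale are made up for illustration, and building the decimal column through pyarrow assumes that `Series.from_arrow` accepts decimal arrays in the current build:

```python
from decimal import Decimal

import pyarrow as pa
import cudf

# Illustrative data only; precision/scale chosen arbitrarily for this sketch.
key = cudf.Series([1, 1, 2])
val = cudf.Series.from_arrow(
    pa.array(
        [Decimal("1.23"), Decimal("4.56"), Decimal("7.89")],
        type=pa.decimal128(7, 2),
    )
)
df = cudf.DataFrame({"key": key, "val": val})

df.groupby("key").agg("sum")   # supported: returns per-group decimal sums
df.groupby("key").agg("var")   # problem 1: result column comes back all-NULL
df.groupby("key").agg("mean")  # problem 2: RuntimeError from binaryop.hpp
```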

Update
Thanks to @codereport, I've now marked all of the above as unsupported operations. After some discussion with other devs I've also handled the other extended types. I still need to write tests, but I think this PR is ready for review in its current form so reviewers can identify whether I've missed anything in the implementation.

@vyasr vyasr requested a review from a team as a code owner March 25, 2021 20:59
@vyasr vyasr requested review from cwharris and isVoid March 25, 2021 20:59
@vyasr vyasr self-assigned this Mar 25, 2021
@github-actions bot added the Python (Affects Python cuDF API) label Mar 25, 2021
@vyasr vyasr added the 2 - In Progress, Cython, improvement, non-breaking, and Python labels and removed the Python label Mar 25, 2021
@codereport (Contributor) commented Mar 25, 2021

@vyasr None of the aggregations mentioned in 1) or 2) are currently supported for cudf::group_by with Decimal32/64 in libcudf.

This is the main tracking issue for all of the fixed_point work (I kept it up to date, at least while I was spending most of my time on fixed_point). The two hash/sort group_by PRs are here: #7169 and #7190. It looks like MIN, MAX, SUM, and COUNT are the only ops that are currently supported. More work will need to be done in order to support the other ops, especially ops like MEAN, which are hard-coded to only work with floating-point accumulator types (i.e. the variables that store the temporary values).

@vyasr (Contributor, Author) commented Mar 25, 2021

Thanks @codereport! That's helpful. I've updated the list of supported operations in this PR.

@vyasr vyasr marked this pull request as ready for review March 31, 2021 02:16
@vyasr vyasr force-pushed the fix/issue7687_part2 branch from 804e623 to 286b686 on March 31, 2021 02:20
@vyasr (Contributor, Author) commented Mar 31, 2021

After tracing this pretty deep down the stack, @codereport helped me identify that these operations are actually not supported on CUDA 10.x due to a compiler bug. The C++ tests for this already exist and are explicitly disabled. In the interest of testing this to the extent possible, I've added a conditional skip with pytest based on the nvcc version available.

@kkraus14 (Collaborator) commented
> After tracing this pretty deep down the stack, @codereport helped me identify that these operations are actually not supported on CUDA 10.x due to a compiler bug. The C++ tests for this already exist and are explicitly disabled. In the interest of testing this to the extent possible, I've added a conditional skip with pytest based on the nvcc version available.

Instead of checking the nvcc version we should be able to use the rmm._cuda module to get the CUDA version and skip on 10.1 / 10.2.
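
Something along these lines should work as a sketch. I'm assuming here that rmm exposes the runtime version via `rmm._cuda.gpu.runtimeGetVersion()`, so the exact helper name may need adjusting:

```python
import pytest
from rmm._cuda import gpu  # assumed location of the CUDA version helpers

# runtimeGetVersion() is assumed to return an int such as 10020 for CUDA 10.2
# or 11000 for CUDA 11.0.
_CUDA_RUNTIME_VERSION = gpu.runtimeGetVersion()

requires_cuda_11 = pytest.mark.skipif(
    _CUDA_RUNTIME_VERSION < 11000,
    reason="Decimal mean/quantile/median groupby aggregations are disabled "
    "on CUDA 10.x due to a compiler bug",
)


@requires_cuda_11
def test_groupby_decimal_mean():
    ...
```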

@vyasr vyasr removed the 2 - In Progress label Mar 31, 2021
@vyasr vyasr requested a review from kkraus14 March 31, 2021 16:38
Review thread on python/cudf/cudf/tests/test_groupby.py (outdated, resolved)
@kkraus14 kkraus14 added the 5 - Ready to Merge label and removed the 3 - Ready for Review label Mar 31, 2021
@kkraus14 (Collaborator) commented
@gpucibot merge

@rapids-bot rapids-bot bot merged commit 24f3016 into rapidsai:branch-0.19 Apr 1, 2021
rapids-bot bot pushed a commit that referenced this pull request Apr 16, 2021
This PR makes some improvements to the groupby/aggregation code that I identified while working on #7731. The main purpose is to make the code logic easier to follow and reduce some unnecessary complexity; I see minor but measurable performance improvements (2-5% for small datasets) as well, but those are mostly just side effects here. Specifically, it makes the following changes:

1. Inlines the logic for dropping unsupported aggregations. The old function was repetitive and necessitated looping over the aggregations twice, whereas the new approach drops unwanted aggregations on the fly so it only loops once. The new code also makes it so that you only construct a C aggregation object once.
2. Merges the logic from `_AggregationFactory` into `Aggregation`, and removes the constructor for `Aggregation`. The one downside here is that the Cython `Aggregation` object's constructor no longer places it in a valid state; however, in practice the object is always constructed via either the `make_aggregation` function or its various factories, and the object's constructor was only ever used in `_drop_unsupported_aggs` anyway. The benefit is that we remove the fragmentation between these two classes, making the code much more readable, and the `Aggregation` class actually serves a purpose now beyond just providing a single property `kind` that is only used once: it is now the primary way that other Cython files interact with aggregations (a rough sketch of this pattern follows the list). This also means that in most places other Cython modules don't need to work with `unique_ptr[aggregation]` as much anymore (although they do still have to move `Aggregation.c_obj` for performance reasons). `make_aggregation` now returns the Cython class instead of the underlying C++ one.
3. Modified all the "allowed aggregations" sets to use the uppercase names of the aggregations. In addition to simplifying the code a tiny bit, this helps reduce confusion between the aggregation names used in Python for pandas compatibility and the libcudf names (for instance, `idxmin` vs `argmin`, now `ARGMIN`).
4. Explicitly defines all the aggregations on a groupby. I discussed this briefly with @shwina, the change has pros and cons. The benefit is that all of these methods are properly documented now, there's less magic (the binding of methods to a class after its definition can be confusing for less experienced Python developers and has a lot of potential gotchas), and we can use the simpler string-based agg definition wherever possible. The downside is that we now have to define all of these methods. I think the change is definitely an improvement, but I'm happy to change it back if anyone can suggest a better alternative. In the long run we probably need to find a way to share both code and docstrings more effectively between all aggregations (DataFrame, Series, and GroupBy).
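
To make point 2 above concrete, here is a rough plain-Python sketch of the resulting factory shape. The names mirror the description, but the bodies are illustrative stand-ins rather than the actual Cython code that wraps libcudf aggregation objects:

```python
class _NativeAggregation:
    """Stand-in for the underlying libcudf aggregation object."""

    def __init__(self, kind: str):
        self.kind = kind


class Aggregation:
    def __init__(self):
        # The bare constructor deliberately leaves the object empty; valid
        # instances are produced by the factory classmethods below.
        self.c_obj = None

    @property
    def kind(self) -> str:
        return self.c_obj.kind

    @classmethod
    def sum(cls) -> "Aggregation":
        agg = cls()
        agg.c_obj = _NativeAggregation("SUM")
        return agg

    @classmethod
    def min(cls) -> "Aggregation":
        agg = cls()
        agg.c_obj = _NativeAggregation("MIN")
        return agg


def make_aggregation(op: str) -> Aggregation:
    # Returns the wrapper rather than the raw libcudf object.
    return getattr(Aggregation, op)()


assert make_aggregation("sum").kind == "SUM"
```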

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Ashwin Srinath (https://github.com/shwina)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #7818
@vyasr vyasr deleted the fix/issue7687_part2 branch January 14, 2022 18:04
Successfully merging this pull request may close: [FEA] Support groupby aggregations w/ Decimal columns (#7687)