
[REVIEW] Support groupby operations for decimal dtypes #7731

Merged: 39 commits merged into rapidsai:branch-0.19 from fix/issue7687_part2 on Apr 1, 2021

Conversation

@vyasr (Contributor) commented Mar 25, 2021

This PR resolves #7687. It also does a bit of cleanup of the internals of the code base. There is more that I would like to do, but in the interest of quickly resolving the issue I'll probably punt anything that I don't directly have to touch for this change to a future PR.

I still need help determining why a few aggregations aren't working. The problems fall into two groups:

  1. The var and std aggregations currently don't fail, but they always return columns filled with NULLs. I found the implementation of the dispatch for these methods in variance.cu/compound.cuh, and at least nominally it seems like these methods are not currently supported because the corresponding enable_if_t is based on whether the type satisfies std::is_arithmetic, which decimal types do not. However, I'm not sure whether the problem is that this classification is incorrect and these types are actually supported by libcudf, or if there really isn't an implementation; I tried to find one, but there are a lot of different files related to aggregation and I'm sure I didn't find all of them. If we simply don't have an implementation, I can remove these from the list of valid aggregations.
  2. The mean, quantile, and median aggregations all raise a RuntimeError from binaryop.hpp: "Input must have fixed_point data_type." I've traced the error to the Cython GroupBy.aggregate method, specifically the line where it calls through to the underlying c_obj's aggregate method. The call stack in C++ is pretty deep after that, though, and I haven't yet been able to pinpoint whether the failure is a missing cast somewhere (i.e. libcudf thinks that the column is a floating point type when it's really not) or if the problem lies elsewhere.
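
To make these failure modes concrete, here is a minimal sketch of what I am seeing. The values, precision, and scale are made up for illustration, and building the decimal column through pyarrow assumes that `Series.from_arrow` accepts decimal arrays in the current build:

```python
from decimal import Decimal

import pyarrow as pa
import cudf

# Illustrative data only; precision/scale chosen arbitrarily for this sketch.
key = cudf.Series([1, 1, 2])
val = cudf.Series.from_arrow(
    pa.array(
        [Decimal("1.23"), Decimal("4.56"), Decimal("7.89")],
        type=pa.decimal128(7, 2),
    )
)
df = cudf.DataFrame({"key": key, "val": val})

df.groupby("key").agg("sum")   # supported: returns per-group decimal sums
df.groupby("key").agg("var")   # problem 1: result column comes back all-NULL
df.groupby("key").agg("mean")  # problem 2: RuntimeError from binaryop.hpp
```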

Update
Thanks to @codereport, I've now marked all of the above as unsupported operations. After some discussion with other devs I've also handled the other extended types. I still need to write tests, but I think this PR is ready for review in its current form so reviewers can identify whether I've missed anything in the implementation.

@vyasr vyasr requested a review from a team as a code owner March 25, 2021 20:59
@vyasr vyasr requested review from cwharris and isVoid March 25, 2021 20:59
@vyasr vyasr self-assigned this Mar 25, 2021
@github-actions bot added the Python (Affects Python cuDF API) label Mar 25, 2021
@vyasr vyasr added the 2 - In Progress, Cython, improvement, non-breaking, and Python labels and removed the Python label Mar 25, 2021
@codereport (Contributor) commented Mar 25, 2021

@vyasr None of the aggregations mentioned in 1) or 2) are currently supported for cudf::group_by with Decimal32/64 in libcudf.

This is the main tracking issue for all of the fixed_point work (I kept it up to date, at least while I was spending most of my time on fixed_point). The two hash/sort group_by PRs are here: #7169 and #7190. It looks like MIN, MAX, SUM, and COUNT are the only ops that are currently supported. More work will need to be done in order to support the other ops, especially ops like MEAN, which are hard-coded to only work with floating-point accumulator types (i.e. the variables that store the temporary values).

@vyasr (Contributor, Author) commented Mar 25, 2021

Thanks @codereport! That's helpful. I've updated the list of supported operations in this PR.

@vyasr vyasr marked this pull request as ready for review March 31, 2021 02:16
@vyasr vyasr force-pushed the fix/issue7687_part2 branch from 804e623 to 286b686 on March 31, 2021 02:20
@vyasr (Contributor, Author) commented Mar 31, 2021

After tracing this pretty deep down the stack, @codereport helped me identify that these operations are actually not supported on CUDA 10.x due to a compiler bug. The C++ tests for this already exist and are explicitly disabled. In the interest of testing this to the extent possible, I've added a conditional skip with pytest based on the nvcc version available.

@kkraus14 (Collaborator) commented
> After tracing this pretty deep down the stack, @codereport helped me identify that these operations are actually not supported on CUDA 10.x due to a compiler bug. The C++ tests for this already exist and are explicitly disabled. In the interest of testing this to the extent possible, I've added a conditional skip with pytest based on the nvcc version available.

Instead of checking the nvcc version we should be able to use the rmm._cuda module to get the CUDA version and skip on 10.1 / 10.2.
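
Something along these lines should work as a sketch. I'm assuming here that rmm exposes the runtime version via `rmm._cuda.gpu.runtimeGetVersion()`, so the exact helper name may need adjusting:

```python
import pytest
from rmm._cuda import gpu  # assumed location of the CUDA version helpers

# runtimeGetVersion() is assumed to return an int such as 10020 for CUDA 10.2
# or 11000 for CUDA 11.0.
_CUDA_RUNTIME_VERSION = gpu.runtimeGetVersion()

requires_cuda_11 = pytest.mark.skipif(
    _CUDA_RUNTIME_VERSION < 11000,
    reason="Decimal mean/quantile/median groupby aggregations are disabled "
    "on CUDA 10.x due to a compiler bug",
)


@requires_cuda_11
def test_groupby_decimal_mean():
    ...
```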

@vyasr vyasr removed the 2 - In Progress label Mar 31, 2021
@vyasr vyasr requested a review from kkraus14 March 31, 2021 16:38
Review thread on python/cudf/cudf/tests/test_groupby.py (outdated, resolved)
@kkraus14 kkraus14 added the 5 - Ready to Merge label and removed the 3 - Ready for Review label Mar 31, 2021
@kkraus14 (Collaborator) commented
@gpucibot merge

@rapids-bot rapids-bot bot merged commit 24f3016 into rapidsai:branch-0.19 Apr 1, 2021
rapids-bot bot pushed a commit that referenced this pull request Apr 16, 2021
This PR makes some improvements to the groupby/aggregation code that I identified while working on #7731. The main purpose is to make the code logic easier to follow and reduce some unnecessary complexity; I see minor but measurable performance improvements (2-5% for small datasets) as well, but those are mostly just side effects here. Specifically, it makes the following changes:

1. Inlines the logic for dropping unsupported aggregations. The old function was repetitive and necessitated looping over the aggregations twice, whereas the new approach drops unwanted aggregations on the fly so it only loops once. The new code also makes it so that you only construct a C aggregation object once.
2. Merges the logic from `_AggregationFactory` into `Aggregation`, and removes the constructor for `Aggregation`. The one downside here is that the Cython `Aggregation` object's constructor no longer places it in a valid state; however, in practice the object is always constructed via either the `make_aggregation` function or its various factories, and the object's constructor was only ever used in `_drop_unsupported_aggs` anyway. The benefit is that we remove the fragmentation between these two classes, making the code much more readable, and the `Aggregation` class actually serves a purpose now beyond just providing a single property `kind` that is only used once: it is now the primary way that other Cython files interact with aggregations (a rough sketch of this pattern follows the list). This also means that in most places other Cython modules don't need to work with `unique_ptr[aggregation]` as much anymore (although they do still have to move `Aggregation.c_obj` for performance reasons). `make_aggregation` now returns the Cython class instead of the underlying C++ one.
3. Modified all the "allowed aggregations" sets to use the uppercase names of the aggregations. In addition to simplifying the code a tiny bit, this helps reduce confusion between the aggregation names used in Python for pandas compatibility and the libcudf names (for instance, `idxmin` vs `argmin`, now `ARGMIN`).
4. Explicitly defines all the aggregations on a groupby. I discussed this briefly with @shwina, the change has pros and cons. The benefit is that all of these methods are properly documented now, there's less magic (the binding of methods to a class after its definition can be confusing for less experienced Python developers and has a lot of potential gotchas), and we can use the simpler string-based agg definition wherever possible. The downside is that we now have to define all of these methods. I think the change is definitely an improvement, but I'm happy to change it back if anyone can suggest a better alternative. In the long run we probably need to find a way to share both code and docstrings more effectively between all aggregations (DataFrame, Series, and GroupBy).
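
To make point 2 above concrete, here is a rough plain-Python sketch of the resulting factory shape. The names mirror the description, but the bodies are illustrative stand-ins rather than the actual Cython code that wraps libcudf aggregation objects:

```python
class _NativeAggregation:
    """Stand-in for the underlying libcudf aggregation object."""

    def __init__(self, kind: str):
        self.kind = kind


class Aggregation:
    def __init__(self):
        # The bare constructor deliberately leaves the object empty; valid
        # instances are produced by the factory classmethods below.
        self.c_obj = None

    @property
    def kind(self) -> str:
        return self.c_obj.kind

    @classmethod
    def sum(cls) -> "Aggregation":
        agg = cls()
        agg.c_obj = _NativeAggregation("SUM")
        return agg

    @classmethod
    def min(cls) -> "Aggregation":
        agg = cls()
        agg.c_obj = _NativeAggregation("MIN")
        return agg


def make_aggregation(op: str) -> Aggregation:
    # Returns the wrapper rather than the raw libcudf object.
    return getattr(Aggregation, op)()


assert make_aggregation("sum").kind == "SUM"
```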

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Ashwin Srinath (https://github.com/shwina)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #7818
@vyasr vyasr deleted the fix/issue7687_part2 branch January 14, 2022 18:04
Successfully merging this pull request may close: [FEA] Support groupby aggregations w/ Decimal columns (#7687)