Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support groupby aggregations w/ Decimal columns #7687

Closed
randerzander opened this issue Mar 23, 2021 · 0 comments · Fixed by #7731
Closed

[FEA] Support groupby aggregations w/ Decimal columns #7687

randerzander opened this issue Mar 23, 2021 · 0 comments · Fixed by #7731
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@randerzander
Copy link
Contributor

Example:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03], 'id': [0, 1, 2]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df.groupby('id').sum()

Result: Note that the dec_val column is dropped entirely.


  | val
-- | --

0.03
0.01
0.02
@randerzander randerzander added feature request New feature or request Python Affects Python cuDF API. labels Mar 23, 2021
@vyasr vyasr self-assigned this Mar 24, 2021
@rapids-bot rapids-bot bot closed this as completed in #7731 Apr 1, 2021
rapids-bot bot pushed a commit that referenced this issue Apr 1, 2021
This PR resolves #7687. It also does a bit of cleanup of the internals of the code base. There is more that I would like to do, but I'll probably punt everything to a future PR that I don't directly have to touch for this change in the interest of quickly resolving the issue.

I still need help determining why a few aggregations aren't working. The problems fall into two groups:
1. The `var` and `std` aggregations currently don't fail, but they always return columns filled with NULLs. I found the implementation of the dispatch for these methods in `variance.cu`/`compound.cuh`, and at least nominally it seems like these methods _are not_ currently supported because the corresponding `enable_if_t` is based on whether the type satisfies `std::is_arithmetic`, which decimal types will not. However, I'm not sure whether the problem is that this classification is incorrect and these types are actually supported by `libcudf`, or if there really isn't an implementation; I tried to find one, but there are a lot of different files related to aggregation and I'm sure I didn't find all of them. If we simply don't have an implementation, I can remove these from the list of valid aggregations.
2. The `mean`, `quantile`, and `median` aggregations all raise a `RuntimeError` from `binaryop.hpp`: "Input must have fixed_point data_type." I've traced the error to the Cython `GroupBy.aggregate` method, specifically the line where it calls through to the underlying `c_obj`'s `aggregate` method. The call stack in C++ is pretty deep after that, though, and I haven't yet been able to pinpoint whether the failure is a missing cast somewhere (i.e. `libcudf` thinks that the column is a floating point type when it's really not) or if the problem lies elsewhere.

**Update**
Thanks to @codereport, I've now marked all the above as unsupported operations. After some discussion with other devs I've also handled the other extended types. I still need to write tests, but I think this PR is ready for review in its current form to identify if I've missed anything in the implementation.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Keith Kraus (https://github.com/kkraus14)

URL: #7731
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants