[FEA] Groupby transform functions #4522
Comments
cc @shwina does your groupby refactor cover any of these? Specifically: https://github.com/aerdem4/rapids-kaggle-utils/blob/master/cu_utils/transform.py#L5-L41 For rolling per group I imagine we'll need libcudf support 😄 |
I may be missing something here, but aren't When used with
In [5]: a = pd.DataFrame({'a': [1, 1, 2, 1, 3], 'b': [1, 1, 1, 2, 3], 'c': [1, 1, 3, 4, 5], 'd': ['a', 'b', 'c', 'd', 'e']})
In [6]: a
Out[6]:
   a  b  c  d
0  1  1  1  a
1  1  1  1  b
2  2  1  3  c
3  1  2  4  d
4  3  3  5  e
In [7]: a.groupby('a').agg('max')
Out[7]:
   b  c  d
a
1  2  4  d
2  1  3  c
3  3  5  e
In [8]: a.groupby('a').transform('max')
Out[8]:
   b  c  d
0  2  4  d
1  2  4  d
2  1  3  c
3  2  4  d
4  3  5  e
Is this the desired functionality? |
Indeed, the desired functionality is to broadcast the aggregations. As a workaround, we can also do groupby agg and merge afterwards. |
Without express support for transformations from libcudf, I'd +1 the |
We recently received three more function requests in addition to what we have so far:
|
@shwina if this requires libcudf support please add more detail and appropriate labels. Thanks! |
Groupby aggregations are typically reduction-like functions that result in a single row per group. The result of a groupby transformation, on the other hand, is the same size as the group. As an example, here's the difference between doing a
Not all transformations are a combination of aggregation+broadcast. Transformations can also be UDFs that are applied to each column in a group. Here's a contrived example in which the UDF fills nulls in each column in a group with the maximum value of that column:
Finally, groupby rolling is also considered a groupby-transform. |
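To make the distinction above concrete, here is a minimal pandas sketch. The frame and the column names `key` and `val` are invented for illustration and are not the example from the original comment. It shows an aggregation (one row per group), the same aggregation applied as a transform (one row per input row), and a UDF-style transform that fills nulls with the per-group maximum.

```python
import pandas as pd

# Hypothetical data: two groups, one null in group 1.
df = pd.DataFrame({"key": [1, 1, 2, 2], "val": [1.0, None, 3.0, 5.0]})

# Aggregation: reduces each group to a single row.
agg = df.groupby("key")["val"].max()

# Transform with an aggregation: the per-group result is
# broadcast back to every row of the group.
broadcast = df.groupby("key")["val"].transform("max")

# Transform with a UDF: fill nulls in each group with that group's max.
filled = df.groupby("key")["val"].transform(lambda s: s.fillna(s.max()))

print(agg)
print(broadcast)
print(filled)
```

The aggregation has two rows (one per key), while both transforms have four rows and keep the index of `df`.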
@harrism Yes, libcudf support is needed; otherwise we have very expensive workarounds. To explain, a groupby transform is basically a fused aggregation + broadcast, where the aggregation is calculated per group but the value is output per row. For example, with input table
|
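The "fused aggregation + broadcast" semantics can be sketched in plain Python. This only illustrates the expected output shape; it is not how libcudf implements (or would implement) the operation, and the function name `groupby_transform_max` is hypothetical.

```python
def groupby_transform_max(keys, values):
    """Return, for every input row, the max of `values` within that row's group."""
    # Aggregation step: one value per distinct key.
    group_max = {}
    for k, v in zip(keys, values):
        if k not in group_max or v > group_max[k]:
            group_max[k] = v
    # Broadcast step: one value per input row, in input order.
    return [group_max[k] for k in keys]


keys = [1, 1, 2, 1, 3]
values = [1, 1, 3, 4, 5]
result = groupby_transform_max(keys, values)
print(result)
```

The output has the same length and order as the input, which is what distinguishes a transform from a plain aggregation.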
Had some further offline discussions about this today and one of the driving use cases is to allow doing something like:
In order to enable this nicely we need to make sure to retain the index in the transform operation. |
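As a hedged illustration of why index retention matters, the pandas sketch below assigns a transform result back onto the original frame. The data, the non-default index, and the column names are made up for this example. The assignment aligns on the index, so it only gives the intended result if the transform output carries the same index as the input.

```python
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame(
    {"key": ["x", "y", "x", "y"], "val": [1, 2, 3, 4]},
    index=[10, 20, 30, 40],
)

# The transform result keeps the index of the input rows...
val_max = df.groupby("key")["val"].transform("max")

# ...so column assignment lines up row by row.
df["val_max"] = val_max
print(df)
```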
This describes my use case exactly. Is there an action pending on this feature request? |
Development hasn't started on this feature. One thing to note: Pandas is order maintaining when using
|
Thanks for the reply. I am working on a cuDF concept for replacing a clunky SQL electric load forecasting model. My working dataset is 600M rows x 10 columns and I am stuck at needing the transform functionality. I'm not clear if the workaround links at the top of this thread are applicable to my use case. Can a function be written to accomplish the example from your Jul 13 post using a GPU? Any other suggestions of topics I can investigate?
Working Pandas example:
1 - starting df = ['account', 'year', 'month', 'day', 'minute', 'kwh_use']
2 - working pandas line: df['kwh_max'] = df.groupby(['account', 'year', 'month', 'day'])['kwh_use'].transform(max)
3 - desired result df = ['account', 'year', 'month', 'day', 'minute', 'kwh_use', 'kwh_max'] |
@pretzelpy the workaround is to use a groupby + merge:
|
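A minimal sketch of a groupby + merge workaround, written against pandas with the column names from the question above. The sample values are invented, and the sketch assumes cuDF accepts the same `groupby`, `rename`, and `merge` calls; it is not the snippet originally posted in this thread.

```python
import pandas as pd

# Hypothetical sample of the load data described in the question.
df = pd.DataFrame(
    {
        "account": [1, 1, 1, 2],
        "year": [2020, 2020, 2020, 2020],
        "month": [1, 1, 1, 1],
        "day": [1, 1, 2, 1],
        "minute": [0, 15, 0, 0],
        "kwh_use": [1.0, 3.0, 2.0, 5.0],
    }
)

keys = ["account", "year", "month", "day"]

# Step 1: aggregate to one row per group.
group_max = (
    df.groupby(keys, as_index=False)["kwh_use"]
    .max()
    .rename(columns={"kwh_use": "kwh_max"})
)

# Step 2: broadcast the aggregate back to every row via a left merge.
result = df.merge(group_max, on=keys, how="left")
print(result)
```

In pandas a left merge keeps the row order of the left frame. Do not assume the same ordering on the GPU; if row order matters, carry an explicit row-id column through the merge and sort on it afterwards.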
Do you guys intend for RAPIDS to reach feature parity with packages like pandas? Or are there fundamental limits in architecture that limit where a GPU can accelerate? |
We will continue to increase the feature parity of cuDF and Pandas, but generally our approach is to listen to user feedback and provide functionality that users need to effectively solve their problems. A prime example of this is we recently added support for
Certain things, like iterating over column values / DataFrame rows, we cannot efficiently support on the GPU (even though it's quite inefficient to do in Pandas as well!), so we clearly error instead. Also, we cannot handle true |
This issue has been labeled |
Closes #4522
This PR adds support for doing groupby aggregations via the `transform()` API, where the result of the aggregation is broadcasted to the size of the group. Note that more general transformations are not supported at this time.
Authors:
- Ashwin Srinath (https://github.com/shwina)
Approvers:
- Michael Wang (https://github.com/isVoid)
- GALI PREM SAGAR (https://github.com/galipremsagar)
URL: #10005
Is your feature request related to a problem? Please describe.
We are missing some groupby transform functions on cudf.
Describe the solution you'd like
We want to be able to call the transform functions that are available on Pandas.
Describe alternatives you've considered
As a workaround, we implemented Numba (CUDA) functions and use them within apply_grouped. You can see the functions here: https://github.com/aerdem4/rapids-kaggle-utils
Additional context
Example usage of these functions:
https://www.kaggle.com/aerdem4/ion-lofo-importance-on-gpu-via-rapids-xgboost
https://www.kaggle.com/aerdem4/m5-lofo-importance-on-gpu-via-rapids-xgboost