[FEA] Groupby transform functions #4522

aerdem4 · 2020-03-16T16:15:37Z

Is your feature request related to a problem? Please describe.
We are missing some groupby transform functions on cudf.

Describe the solution you'd like
We want to be able to call the transform functions that are available on Pandas.

Describe alternatives you've considered
As a workaround, we implemented Numba (CUDA) functions and use them within apply_grouped. You can see the functions here: https://github.com/aerdem4/rapids-kaggle-utils

Additional context
Example usage of these functions:
https://www.kaggle.com/aerdem4/ion-lofo-importance-on-gpu-via-rapids-xgboost
https://www.kaggle.com/aerdem4/m5-lofo-importance-on-gpu-via-rapids-xgboost

kkraus14 · 2020-03-16T16:18:40Z

cc @shwina does your groupby refactor cover any of these?

Specifically: https://github.com/aerdem4/rapids-kaggle-utils/blob/master/cu_utils/transform.py#L5-L41

For rolling per group I imagine we'll need libcudf support 😄

shwina · 2020-03-16T16:54:16Z

I may be missing something here, but aren't min, max and mean aggregations and not transforms?

When used with transform rather than agg, Pandas basically broadcasts the result of the aggregation to the size of the group:

In [5]: a = pd.DataFrame({'a': [1, 1, 2, 1, 3], 'b': [1, 1, 1, 2, 3], 'c': [1, 1, 3, 4, 5], 'd': ['a', 'b', 'c', 'd', 'e']})

In [6]: a
Out[6]:
   a  b  c  d
0  1  1  1  a
1  1  1  1  b
2  2  1  3  c
3  1  2  4  d
4  3  3  5  e

In [7]: a.groupby('a').agg('max')
Out[7]:
   b  c  d
a
1  2  4  d
2  1  3  c
3  3  5  e

In [8]: a.groupby('a').transform('max')
Out[8]:
   b  c  d
0  2  4  d
1  2  4  d
2  1  3  c
3  2  4  d
4  3  5  e

Is this the desired functionality?

aerdem4 · 2020-03-16T19:29:36Z

Indeed, the desired functionality is to broadcast the aggregations. As a workaround, we can also do groupby agg and merge afterwards.

shwina · 2020-03-16T19:54:06Z

Without express support for transformations from libcudf, I'd +1 the agg+loc or agg+merge at least for now.
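For reference, here is a minimal sketch of the agg+loc idea mentioned above. It is written against pandas so it runs without a GPU; whether each call behaves identically in cuDF is an assumption, not something verified here.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 1, 3], "b": [1, 1, 1, 2, 3]})

# Step 1: aggregate to one value per group, indexed by the group key.
group_max = df.groupby("a")["b"].max()

# Step 2: look up every row's key in the aggregated result. This is the
# "broadcast": the per-group value is repeated once per row of the group.
looked_up = group_max.loc[df["a"].to_numpy()]

# Step 3: the lookup is indexed by key, so re-attach the original row index.
broadcast = pd.Series(looked_up.to_numpy(), index=df.index, name="b")

# For comparison, the built-in pandas transform.
expected = df.groupby("a")["b"].transform("max")
```

The result has one entry per input row, in input order, which is what `transform('max')` returns in pandas.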

aerdem4 · 2020-04-18T19:45:58Z

We have recently received three more function requests in addition to what we have so far:

groupby rolling_std transform
groupby nunique transform
groupby count transform
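To make the three requests concrete, this is roughly what they look like in pandas today. The column names `g` and `v` and the window size of 2 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["x", "x", "x", "y", "y"],
    "v": [1.0, 1.0, 3.0, 2.0, 2.0],
})
grouped = df.groupby("g")["v"]

out = pd.DataFrame({
    # Number of distinct values in the row's group, repeated per row.
    "nunique": grouped.transform("nunique"),
    # Number of non-null values in the row's group, repeated per row.
    "count": grouped.transform("count"),
    # Rolling standard deviation over 2 rows, restarted inside each group.
    "rolling_std": grouped.transform(lambda s: s.rolling(2).std()),
})
```

Each output column has the same length and index as `df`, so it can be assigned straight back as a new column. The first row of every group is NaN in `rolling_std` because the window is not yet full.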

harrism · 2020-04-20T02:01:55Z

@shwina if this requires libcudf support please add more detail and appropriate labels. Thanks!

shwina · 2020-04-20T14:45:00Z

Groupby aggregations are typically reduction-like functions that produce a single row per group. The result of a groupby transformation, on the other hand, has the same number of rows as the group it was computed from.

As an example, here's the difference between a max aggregation and a max transform in Pandas. Note that Pandas "broadcasts" the result to the size of the group.

In [10]: a = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [1, 2, 3, 4]})

In [11]: a
Out[11]:
   a  b
0  1  1
1  1  2
2  2  3
3  3  4

In [12]: a.groupby('a').agg('max')
Out[12]:
   b
a
1  2
2  3
3  4

In [13]: a.groupby('a').transform('max')
Out[13]:
   b
0  2
1  2
2  3
3  4

Not all transformations are a combination of aggregation+broadcast. Transformations can also be UDFs that are applied to each column in a group. Here's a contrived example in which the UDF fills nulls in each column in a group with the maximum value of that column:

In [32]: a = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3], 'b': [1, 2, None, 3, None, 4], 'c': [2, 3, None, 4, None, 5]})

In [33]: a
Out[33]:
   a    b    c
0  1  1.0  2.0
1  1  2.0  3.0
2  1  NaN  NaN
3  2  3.0  4.0
4  2  NaN  NaN
5  3  4.0  5.0

In [34]: a.groupby('a').transform(lambda x: x.fillna(x.max()))
Out[34]:
     b    c
0  1.0  2.0
1  2.0  3.0
2  2.0  3.0
3  3.0  4.0
4  3.0  4.0
5  4.0  5.0

Finally, groupby rolling is also considered a groupby-transform.

kkraus14 · 2020-04-20T14:45:34Z

@harrism Yes, libcudf support is needed; otherwise we are left with very expensive workarounds.

To explain: groupby transform is basically a fused aggregation + broadcast, where the aggregation is calculated per group but the value is output per row. For example, with input table df:

a	b
1	10
1	11
1	12
2	13
2	14
2	15

df.groupby(['a']).transform('max')

a	b
1	12
1	12
1	12
2	15
2	15
2	15

df.groupby(['a']).transform('nunique')

a	b
1	3
1	3
1	3
2	3
2	3
2	3
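To illustrate what "aggregation + broadcast" means mechanically, here is a rough NumPy sketch of the two steps for the max case. This is only an illustration of the idea, not how libcudf implements (or would implement) it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [10, 11, 12, 13, 14, 15]})

# Map each row's key to a dense group id in [0, n_groups).
codes, uniques = pd.factorize(df["a"])

# Aggregation: scatter-max every value into the slot of its group.
group_max = np.full(len(uniques), np.iinfo(np.int64).min, dtype=np.int64)
np.maximum.at(group_max, codes, df["b"].to_numpy())

# Broadcast: gather each row's group result back out, in original row order.
df["b_max"] = group_max[codes]
```

Because the gather step indexes with the per-row group ids, the output keeps the input row order without any sort or join.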

kkraus14 · 2020-07-13T16:38:53Z

Had some further offline discussions about this today, and one of the driving use cases is to allow doing something like:

df['newcol'] = df.groupby(['oldcol'])['othercol'].transform('max')

In order to enable this nicely we need to make sure to retain the index in the transform operation.
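A small pandas sketch of why the index matters here. The data and the non-default index values are made up; only the column names come from the line above.

```python
import pandas as pd

df = pd.DataFrame(
    {"oldcol": ["x", "y", "x", "y"], "othercol": [1, 5, 3, 2]},
    index=[10, 20, 30, 40],
)

# transform returns a Series that carries df's own index ...
result = df.groupby(["oldcol"])["othercol"].transform("max")

# ... so column assignment lines up row by row.
df["newcol"] = result

# A plain aggregation is indexed by the group key instead, so aligning it
# against df's row index finds no matching labels at all.
agg = df.groupby(["oldcol"])["othercol"].max()
misaligned = agg.reindex(df.index)
```

Column assignment in pandas aligns on the index, which is why a transform that dropped or reset the index could not be assigned back this simply.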

pretzelpy · 2020-09-23T00:47:27Z

> Had some further offline discussions about this today, and one of the driving use cases is to allow doing something like:
> df['newcol'] = df.groupby(['oldcol'])['othercol'].transform('max')
> In order to enable this nicely we need to make sure to retain the index in the transform operation.

This describes my use case exactly.

Is there an action pending on this feature request?

kkraus14 · 2020-09-23T01:28:11Z

> This describes my use case exactly.
>
> Is there an action pending on this feature request?

Development hasn't started on this feature. One thing to note: to make things more complicated, Pandas maintains the original row order when using groupby.transform:

import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3, 1], 'b': [1, 2, 3, 4]})
pdf.groupby('a').transform('max')

pretzelpy · 2020-09-23T22:54:09Z

> This describes my use case exactly.
> Is there an action pending on this feature request?
>
> Development hasn't started on this feature. One thing to note: to make things more complicated, Pandas maintains the original row order when using groupby.transform:
> import pandas as pd
>
> pdf = pd.DataFrame({'a': [1, 2, 3, 1], 'b': [1, 2, 3, 4]})
> pdf.groupby('a').transform('max')
>    b
> 0  4
> 1  2
> 2  3
> 3  4

Thanks for the reply. I am working on a cuDF concept for replacing a clunky SQL electric load forecasting model. My working dataset is 600M rows x 10 columns, and I am stuck at needing the transform functionality. I'm not clear whether the workaround links at the top of this thread are applicable to my use case. Can a function be written to accomplish the example from your Jul 13 post on a GPU? Any other suggestions for topics I can investigate?

Working Pandas Example:

1 - starting df = [ 'account', 'year', 'month', 'day', 'minute', 'kwh_use']

2 - working pandas line: df['max_kwh'] = df.groupby(['account', 'year', 'month', 'day'])['kwh_use'].transform('max')

3 - desired result df = [ 'account', 'year', 'month', 'day', 'minute', 'kwh_use', 'max_kwh' ]

kkraus14 · 2020-09-23T23:00:47Z

@pretzelpy the workaround is to use a groupby + merge:

import cudf

gdf = cudf.DataFrame({'a': [1, 2, 3, 1], 'b': [1, 2, 3, 4]})
aggs = gdf.groupby('a', as_index=False)['b'].max()

result = gdf[['a']].merge(aggs, on=['a'], how='inner')
# Note that result is not guaranteed to maintain the order of `gdf`
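If the original row order matters, one option is to tag each row with its position before the merge and sort on that tag afterwards. The sketch below uses pandas so it runs without a GPU; that the same calls carry over to cuDF unchanged is an assumption. `row_id` is a helper column name made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 1], "b": [1, 2, 3, 4]})
aggs = df.groupby("a", as_index=False)["b"].max()

# Tag every row with its original position before the merge can reorder it.
keys = df[["a"]].assign(row_id=np.arange(len(df)))
merged = keys.merge(aggs, on="a", how="inner")

# Sort on the tag to recover the input order, then line the index up with df.
# (This assumes df has the default 0..n-1 index.)
merged = merged.sort_values("row_id").reset_index(drop=True)
df["b_max"] = merged["b"]
```

The extra sort is part of what makes this workaround more expensive than a native transform.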

pretzelpy · 2020-09-26T03:13:18Z

> @pretzelpy the workaround is to use a groupby + merge:
>
> import cudf
>
> gdf = cudf.DataFrame({'a': [1, 2, 3, 1], 'b': [1, 2, 3, 4]})
> aggs = gdf.groupby('a', as_index=False)['b'].max()
>
> result = gdf[['a']].merge(aggs, on=['a'], how='inner')
> # Note that result is not guaranteed to maintain the order of `gdf`

Do you guys intend for RAPIDS to reach feature parity with packages like pandas? Or are there fundamental limits in architecture that limit where a GPU can accelerate?
I am astonished by the performance of cuDF; it's unreal. Thank you.

kkraus14 · 2020-09-26T03:24:43Z

> Do you guys intend for RAPIDS to reach feature parity with packages like pandas?

We will continue to increase the feature parity of cuDF and Pandas, but generally our approach is to listen to user feedback and provide functionality that users need to effectively solve their problems. A prime example is that we recently added support for List dtype columns and are working on Struct, Map, and Decimal types as well, based on user feedback.

> Or are there fundamental limits in architecture that limit where a GPU can accelerate?

Certain things, like iterating over column values or DataFrame rows, cannot be supported efficiently on the GPU (even though it's quite inefficient to do in Pandas as well!), so we raise a clear error instead. Also, we cannot handle true object-type columns: in Pandas an object column is effectively a column of pointers to Python objects, and those Python objects cannot reside in GPU memory.
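As a small illustration of the row-iteration point, here is the same computation written both ways in pandas. The `rate` column and the values are made up; only `kwh_use` comes from the use case earlier in the thread.

```python
import pandas as pd

df = pd.DataFrame({"kwh_use": [1.5, 2.0, 0.5], "rate": [0.10, 0.12, 0.08]})

# Row-by-row: every row becomes a Python object and is handled one at a time.
row_wise = [row.kwh_use * row.rate for row in df.itertuples()]

# Columnar: a single operation over whole columns, with no Python-level loop.
df["cost"] = df["kwh_use"] * df["rate"]
```

Both give the same numbers, but only the columnar form is the kind of operation that maps naturally onto a GPU.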

github-actions · 2021-03-14T19:14:07Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions · 2021-03-14T19:14:09Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Closes #4522

This PR adds support for doing groupby aggregations via the `transform()` API, where the result of the aggregation is broadcasted to the size of the group. Note that more general transformations are not supported at this time.

Authors:
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Michael Wang (https://github.com/isVoid)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10005
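Some UDF-style transformations can be rewritten as an aggregation transform followed by an ordinary column operation. As a sketch, here is the fill-nulls-with-group-max example from earlier in the thread written both ways in pandas; whether the rewritten form runs unchanged in cuDF is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 3],
    "b": [1, 2, None, 3, None, 4],
    "c": [2, 3, None, 4, None, 5],
})

# UDF form: a general transformation applied to each column of each group.
udf_result = df.groupby("a").transform(lambda x: x.fillna(x.max()))

# Rewritten form: broadcast the per-group max with an aggregation transform,
# then fill the nulls element-wise from that broadcast result.
group_max = df.groupby("a").transform("max")
agg_result = df[["b", "c"]].fillna(group_max)
```

The second form only needs the aggregation flavor of `transform()`, which is the part the PR describes as supported.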

aerdem4 added the Needs Triage and feature request labels Mar 16, 2020

kkraus14 added the Python label and removed the Needs Triage label Mar 16, 2020

kkraus14 added the libcudf label Apr 20, 2020

github-actions bot added the inactive-90d label Mar 14, 2021

github-actions bot added the inactive-30d label Mar 14, 2021

beckernick added this to the Pandas API Alignment and Coverage milestone Aug 2, 2021

shwina self-assigned this Nov 22, 2021

shwina mentioned this issue Jan 10, 2022

Add groupby.transform (only support for aggregations) #10005

Merged

rapids-bot bot closed this as completed in #10005 Jan 20, 2022
