[FEA] Groupby scans and segmented shift operations do not preserve ordering with original data #8714

Closed
beckernick opened this issue Jul 12, 2021 · 4 comments · Fixed by #8720
Assignees
Labels
feature request, Python

Comments

@beckernick
Member

beckernick commented Jul 12, 2021

If I use groupby.cumsum, another scan, or a segmented shift operation from Python for feature engineering, the results come back sorted by key (rows keep their original order within each key). This makes it challenging to add the result back to the original dataframe as a new column, a common use case when creating lagged or scan-based features. I'd like these operations to return results in the original row order.

As it stands, adding results from shift (segmented shift) and cumsum|max|etc. operations as new columns in the original dataframe could require one boolean masking + setitem operation per unique key in the groupby, which would not scale well.

import cudf
import numpy as np

np.random.seed(12)
nrows = 100

df = cudf.DataFrame({
    "key":[0] * (nrows//2) + [1] * (nrows//2),
    "val1": range(nrows),
}).sample(nrows).reset_index(drop=True) # shuffle data

df.head(8)
	key	val1
0	1	70
1	1	86
2	0	18
3	1	91
4	1	74
5	1	97
6	0	43
7	0	48

df.groupby("key").val1.shift(1).head()
key
0    <NA>
0      18
0      43
0      48
0       6
Name: val1, dtype: int64

df.groupby("key").val1.cumsum()
key
0      18
0      61
0     109
0     115
0     157
     ... 
1    3475
1    3553
1    3612
1    3664
1    3725
Name: val1, Length: 100, dtype: int64

# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-21.08:
cucim                     21.08.00a210712 cuda_11.2_py38_g8da6ca9_11    rapidsai-nightly
cudf                      21.08.00a210711 cuda_11.2_py38_g8320a15cda_263    rapidsai-nightly
cudf_kafka                21.08.00a210711 py38_g8320a15cda_263    rapidsai-nightly
cugraph                   21.08.00a210712 py38_g1a636029_76    rapidsai-nightly
cuml                      21.08.00a210712 cuda11.2_py38_gc9abba1a4_113    rapidsai-nightly
cusignal                  21.08.00a210712 py37_gb704464_21    rapidsai-nightly
cuspatial                 21.08.00a210712 py38_g8c31c2c_22    rapidsai-nightly
custreamz                 21.08.00a210711 py38_g8320a15cda_263    rapidsai-nightly
cuxfilter                 21.08.00a210712 py38_g652bf1c_16    rapidsai-nightly
dask-cuda                 21.08.00a210712         py38_33    rapidsai-nightly
dask-cudf                 21.08.00a210711 py38_g8320a15cda_263    rapidsai-nightly
libcucim                  21.08.00a210712 cuda11.2_g8da6ca9_11    rapidsai-nightly
libcudf                   21.08.00a210712 cuda11.2_g0b9ea0176c_264    rapidsai-nightly
libcudf_kafka             21.08.00a210711 g8320a15cda_263    rapidsai-nightly
libcugraph                21.08.00a210712 cuda11.2_g1a636029_76    rapidsai-nightly
libcuml                   21.08.00a210712 cuda11.2_gc9abba1a4_113    rapidsai-nightly
libcumlprims              21.08.00a210605 cuda11.2_g8d4e6b0_2    rapidsai-nightly
libcuspatial              21.08.00a210712 cuda11.2_g8c31c2c_22    rapidsai-nightly
librmm                    21.08.00a210712 cuda11.2_geb2b991_34    rapidsai-nightly
libxgboost                1.4.2dev.rapidsai21.08      cuda11.2_0    rapidsai-nightly
py-xgboost                1.4.2dev.rapidsai21.08  cuda11.2py38_0    rapidsai-nightly
rapids                    21.08.00a210709 cuda11.2_py38_g6430beb_23    rapidsai-nightly
rapids-xgboost            21.08.00a210709 cuda11.2_py38_g6430beb_23    rapidsai-nightly
rmm                       21.08.00a210712 cuda_11.2_py38_geb2b991_34    rapidsai-nightly
ucx                       1.9.0+gcd9efd3       cuda11.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.21.0a210712   py38_gcd9efd3_29    rapidsai-nightly
xgboost                   1.4.2dev.rapidsai21.08  cuda11.2py38_0    rapidsai-nightly
@beckernick added the feature request, Needs Triage, and Python labels Jul 12, 2021
@beckernick
Member Author

@shwina do you think this would require support from the libcudf layer?

@beckernick added the libcudf label and removed the Needs Triage label Jul 12, 2021
@isVoid
Contributor

isVoid commented Jul 12, 2021

I remember groupby.shift was implemented before the discussion I had with @shwina about the order of return values. Later features (e.g. groupby.fillna) preserve the order of the index and were done without libcudf support.

@shwina
Contributor

shwina commented Jul 12, 2021

This can (and should) be done outside of libcudf, which, as @isVoid mentioned, we are already doing for other operations.
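One way to picture handling this outside the grouped kernel: keep the permutation used to sort rows by key, run the scan over each contiguous run of equal keys, and scatter each result back to the row it came from. The pure-Python sketch below illustrates that idea only; it is not cuDF's implementation, and the helper name `grouped_cumsum_original_order` is hypothetical.

```python
from itertools import accumulate, groupby


def grouped_cumsum_original_order(keys, values):
    """Per-key cumulative sum, returned in the original row order."""
    # sorted() is stable, so rows keep their relative order within each key.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    result = [None] * len(keys)
    # Each run of equal keys in the sorted order is one group.
    for _, run in groupby(order, key=lambda i: keys[i]):
        positions = list(run)
        running = accumulate(values[p] for p in positions)
        # Scatter each running total back to the row it came from.
        for p, total in zip(positions, running):
            result[p] = total
    return result


# First eight rows of the example dataframe in this issue.
keys = [1, 1, 0, 1, 1, 1, 0, 0]
vals = [70, 86, 18, 91, 74, 97, 43, 48]
print(grouped_cumsum_original_order(keys, vals))
# [70, 156, 18, 247, 321, 418, 61, 109]
```

The values for key 0 (18, 61, 109) match the first rows of the key-sorted cumsum output reported above; only their positions differ.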

@beckernick
Member Author

Thanks for the insight. Re-tagging 👍

@beckernick removed the libcudf label Jul 12, 2021
@isVoid self-assigned this Jul 12, 2021
@beckernick added this to the Time Series Analysis milestone Jul 14, 2021
@rapids-bot closed this as completed in #8720 Aug 4, 2021
rapids-bot pushed a commit that referenced this issue Aug 4, 2021
Closes #8714 

This PR makes transform-like ops return results in the same order as their inputs. For example, with `groupby.shift`:

```python
In [21]: df.head(8)
Out[21]:
   key  val1
0    1    70
1    1    86
2    0    18
3    1    91
4    1    74
5    1    97
6    0    43
7    0    48

In [22]: df.groupby('key').shift(1).head(8)
Out[22]:
   val1
0  <NA>
1    70
2  <NA>
3    86
4    91
5    74
6    18
7    43
```

This affects `groupby.scan` and `groupby.shift`.
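As a reference for the intended semantics, here is a minimal pure-Python sketch of an order-preserving grouped shift. It is an illustration only, not the code in this PR: the function name `grouped_shift_original_order` is hypothetical, `None` stands in for `<NA>`, and only positive `periods` are handled.

```python
def grouped_shift_original_order(keys, values, periods=1, fill=None):
    """Shift values within each key by `periods` rows, keeping input row order."""
    if periods < 1:
        raise ValueError("this sketch only handles periods >= 1")
    history = {}  # key -> values seen so far for that key, in row order
    shifted = []
    for key, value in zip(keys, values):
        seen = history.setdefault(key, [])
        # Emit the value `periods` rows back in the same group, if there is one.
        shifted.append(seen[-periods] if len(seen) >= periods else fill)
        seen.append(value)
    return shifted


# Same eight rows as the df.head(8) output above.
keys = [1, 1, 0, 1, 1, 1, 0, 0]
vals = [70, 86, 18, 91, 74, 97, 43, 48]
print(grouped_shift_original_order(keys, vals))
# [None, 70, None, 86, 91, 74, 18, 43]
```

Row for row, this matches the `Out[22]` column shown above.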

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - Ashwin Srinath (https://github.com/shwina)

URL: #8720