[FEA] Groupby scans and segmented shift operations do not preserve ordering with original data #8714

beckernick · 2021-07-12T18:02:54Z

If I use groupby.cumsum, other scans, or segmented shift operations from Python for feature engineering, my results are sorted by key (but in the original order within each key). This can make it challenging to add the resulting data back into the original dataframe as a column, a common use case when creating lagged or scan-based features. I'd like to be able to return results in the original row order.

As it stands, adding results from shift (segmented shift) and cumsum|max|etc. operations as new columns in the original dataframe could require one boolean masking + setitem operation per unique key in the groupby, which would not scale well as the number of keys grows.

import cudf
import numpy as np

np.random.seed(12)
nrows = 100

df = cudf.DataFrame({
    "key":[0] * (nrows//2) + [1] * (nrows//2),
    "val1": range(nrows),
}).sample(nrows).reset_index(drop=True) # shuffle data

df.head(8)
	key	val1
0	1	70
1	1	86
2	0	18
3	1	91
4	1	74
5	1	97
6	0	43
7	0	48

df.groupby("key").val1.shift(1).head()
key
0    <NA>
0      18
0      43
0      48
0       6
Name: val1, dtype: int64

df.groupby("key").val1.cumsum()
key
0      18
0      61
0     109
0     115
0     157
     ... 
1    3475
1    3553
1    3612
1    3664
1    3725
Name: val1, Length: 100, dtype: int64
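
For comparison, the requested row-order-preserving behavior can be sketched in plain Python, without cuDF. The helper names `grouped_shift` and `grouped_cumsum` are hypothetical and exist only for illustration; the data is the first eight rows of the shuffled frame shown above.

```python
def grouped_shift(keys, vals, fill=None):
    """Previous value within the same key (shift by 1), in original row order."""
    last = {}  # most recent value seen so far for each key
    out = []
    for k, v in zip(keys, vals):
        out.append(last.get(k, fill))  # first row of each key gets the fill value
        last[k] = v
    return out


def grouped_cumsum(keys, vals):
    """Running sum per key, returned in original row order."""
    totals = {}  # running total for each key
    out = []
    for k, v in zip(keys, vals):
        totals[k] = totals.get(k, 0) + v
        out.append(totals[k])
    return out


# First eight rows from df.head(8) above.
keys = [1, 1, 0, 1, 1, 1, 0, 0]
vals = [70, 86, 18, 91, 74, 97, 43, 48]

print(grouped_shift(keys, vals))
print(grouped_cumsum(keys, vals))
```

Because each output element lines up with its input row, the result could be assigned back as a new column directly, with no per-key masking.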

# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-21.08:
cucim                     21.08.00a210712 cuda_11.2_py38_g8da6ca9_11    rapidsai-nightly
cudf                      21.08.00a210711 cuda_11.2_py38_g8320a15cda_263    rapidsai-nightly
cudf_kafka                21.08.00a210711 py38_g8320a15cda_263    rapidsai-nightly
cugraph                   21.08.00a210712 py38_g1a636029_76    rapidsai-nightly
cuml                      21.08.00a210712 cuda11.2_py38_gc9abba1a4_113    rapidsai-nightly
cusignal                  21.08.00a210712 py37_gb704464_21    rapidsai-nightly
cuspatial                 21.08.00a210712 py38_g8c31c2c_22    rapidsai-nightly
custreamz                 21.08.00a210711 py38_g8320a15cda_263    rapidsai-nightly
cuxfilter                 21.08.00a210712 py38_g652bf1c_16    rapidsai-nightly
dask-cuda                 21.08.00a210712         py38_33    rapidsai-nightly
dask-cudf                 21.08.00a210711 py38_g8320a15cda_263    rapidsai-nightly
libcucim                  21.08.00a210712 cuda11.2_g8da6ca9_11    rapidsai-nightly
libcudf                   21.08.00a210712 cuda11.2_g0b9ea0176c_264    rapidsai-nightly
libcudf_kafka             21.08.00a210711 g8320a15cda_263    rapidsai-nightly
libcugraph                21.08.00a210712 cuda11.2_g1a636029_76    rapidsai-nightly
libcuml                   21.08.00a210712 cuda11.2_gc9abba1a4_113    rapidsai-nightly
libcumlprims              21.08.00a210605 cuda11.2_g8d4e6b0_2    rapidsai-nightly
libcuspatial              21.08.00a210712 cuda11.2_g8c31c2c_22    rapidsai-nightly
librmm                    21.08.00a210712 cuda11.2_geb2b991_34    rapidsai-nightly
libxgboost                1.4.2dev.rapidsai21.08      cuda11.2_0    rapidsai-nightly
py-xgboost                1.4.2dev.rapidsai21.08  cuda11.2py38_0    rapidsai-nightly
rapids                    21.08.00a210709 cuda11.2_py38_g6430beb_23    rapidsai-nightly
rapids-xgboost            21.08.00a210709 cuda11.2_py38_g6430beb_23    rapidsai-nightly
rmm                       21.08.00a210712 cuda_11.2_py38_geb2b991_34    rapidsai-nightly
ucx                       1.9.0+gcd9efd3       cuda11.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.21.0a210712   py38_gcd9efd3_29    rapidsai-nightly
xgboost                   1.4.2dev.rapidsai21.08  cuda11.2py38_0    rapidsai-nightly


beckernick · 2021-07-12T18:03:32Z

@shwina do you think this would require support from the libcudf layer?

isVoid · 2021-07-12T20:51:59Z

I remember groupby.shift was implemented prior to the discussion I had with @shwina about the order of return values. Later features (e.g. groupby.fillna) preserve the order of the index and were done without libcudf support.

shwina · 2021-07-12T20:53:07Z

This can (and should) be done outside of libcudf. As @isVoid mentioned, we are doing that already.

beckernick · 2021-07-12T21:15:52Z

Thanks for the insight. Re-tagging 👍

Closes #8714

This PR makes transform-like ops return results with orders matching that of inputs. For example: `groupby.shift`

```python
In [21]: df.head(8)
Out[21]:
   key  val1
0    1    70
1    1    86
2    0    18
3    1    91
4    1    74
5    1    97
6    0    43
7    0    48

In [22]: df.groupby('key').shift(1).head(8)
Out[22]:
   val1
0  <NA>
1    70
2  <NA>
3    86
4    91
5    74
6    18
7    43
```

This would affect `groupby.scan` and `groupby.shift`.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Ashwin Srinath (https://github.com/shwina)

URL: #8720
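
One general way to turn key-sorted groupby output into input-ordered output is to keep the permutation produced by the sort and scatter the results back through it. The following is a plain-Python sketch of that idea only; it is not taken from PR #8720, the names `cumsum_sorted_by_key` and `restore_order` are hypothetical, and the actual cuDF implementation may differ.

```python
from itertools import accumulate, groupby


def cumsum_sorted_by_key(keys, vals):
    """Per-key running sum with results grouped by key (the pre-fix ordering).

    Also returns `order`, where order[pos] is the original row index of the
    value at sorted position `pos`.
    """
    # sorted() is stable, so rows keep their original order within each key.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    result = []
    for _, rows in groupby(order, key=lambda i: keys[i]):
        result.extend(accumulate(vals[i] for i in rows))
    return order, result


def restore_order(order, sorted_result):
    """Scatter key-sorted results back to their original row positions."""
    out = [None] * len(order)
    for pos, row in enumerate(order):
        out[row] = sorted_result[pos]
    return out


keys = [1, 1, 0, 1, 1, 1, 0, 0]
vals = [70, 86, 18, 91, 74, 97, 43, 48]

order, by_key = cumsum_sorted_by_key(keys, vals)
print(by_key)                        # grouped by key
print(restore_order(order, by_key))  # aligned with the input rows
```

The scatter is a single pass over all rows, so its cost does not depend on the number of unique keys.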

beckernick added feature request New feature or request Needs Triage Need team to review and classify Python Affects Python cuDF API. labels Jul 12, 2021

beckernick mentioned this issue Jul 12, 2021

[BUG] Groupby scans and segmented shift should preserve original index #8715

Closed

beckernick added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jul 12, 2021

beckernick removed the libcudf Affects libcudf (C++/CUDA) code. label Jul 12, 2021

isVoid self-assigned this Jul 12, 2021

isVoid mentioned this issue Jul 12, 2021

Make groupby transform-like op order match original data order #8720

Merged

beckernick added this to the Time Series Analysis milestone Jul 14, 2021

rapids-bot bot closed this as completed in #8720 Aug 4, 2021

isVoid mentioned this issue Aug 4, 2021

[FEA] Allow "no-op" to a sequence column in groupby transform-like operations #8951

Closed
