[FEA] Groupby shift #7183

Closed
beckernick opened this issue Jan 21, 2021 · 6 comments · Fixed by #8131
Labels: feature request, libcudf, Python

Comments

@beckernick
Member

beckernick commented Jan 21, 2021

I'd like to be able to use shift on groupby Series and DataFrame objects. Today, I can do this in pandas but not cudf.

import pandas as pd
import dask.dataframe as dd
import cudf

pdf = pd.DataFrame({
    "a": [0,1,0,1,1,0],
    "b": range(6),
    "c": ["a","b","c","d","e","f"]
})
gdf = cudf.from_pandas(pdf)
print(pdf.groupby("a").b.shift(1))
gdf.groupby("a").b.shift(1)
0    NaN
1    NaN
2    0.0
3    1.0
4    3.0
5    2.0
Name: b, dtype: float64
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-2e03f810ad06> in <module>
     11 
     12 print(pdf.groupby("a").b.shift(1))
---> 13 gdf.groupby("a").b.shift(1)

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-20210120/lib/python3.7/site-packages/cudf/core/groupby/groupby.py in __getattribute__(self, key)
     61     def __getattribute__(self, key):
     62         try:
---> 63             return super().__getattribute__(key)
     64         except AttributeError:
     65             if key in libgroupby._GROUPBY_AGGS:

AttributeError: 'SeriesGroupBy' object has no attribute 'shift'
conda list | grep "rapids\|blazing\|dask\|distr\|pandas"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-20210120:
blazingsql                0.18.0a0                 pypi_0    pypi
cudf                      0.18.0a210120   cuda_10.2_py37_g02e25b6f3d_183    rapidsai-nightly
cuml                      0.18.0a210120   cuda10.2_py37_g816bb6506_79    rapidsai-nightly
dask                      2021.1.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.1.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.18.0a201211           py37_39    http://conda-mirror.gpuci.io/rapidsai-nightly
dask-cudf                 0.18.0a210120   py37_g02e25b6f3d_183    http://conda-mirror.gpuci.io/rapidsai-nightly
distributed               2021.1.0         py37h89c1867_1    conda-forge
faiss-proc                1.0.0                      cuda    http://conda-mirror.gpuci.io/rapidsai-nightly
libcudf                   0.18.0a210120   cuda10.2_g02e25b6f3d_183    rapidsai-nightly
libcuml                   0.18.0a210120   cuda10.2_g816bb6506_79    rapidsai-nightly
libcumlprims              0.18.0a201203   cuda10.2_gff080f3_0    http://conda-mirror.gpuci.io/rapidsai-nightly
librmm                    0.18.0a210120   cuda10.2_gce99588_23    rapidsai-nightly
pandas                    1.1.5            py37hdc94413_0    conda-forge
rmm                       0.18.0a210120   cuda_10.2_py37_gce99588_23    http://conda-mirror.gpuci.io/rapidsai-nightly
ucx                       1.9.0+gcd9efd3       cuda10.2_0    http://conda-mirror.gpuci.io/rapidsai-nightly
ucx-proc                  1.0.0                       gpu    http://conda-mirror.gpuci.io/rapidsai-nightly
ucx-py                    0.18.0a210120   py37_gcd9efd3_10    http://conda-mirror.gpuci.io/rapidsai-nightly
@beckernick added the feature request and Needs Triage labels Jan 21, 2021
@kkraus14 added the Python and libcudf labels and removed the Needs Triage label Jan 27, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@randerzander
Contributor

Still a valid [FEA]

@minhlong94

This is a very useful feature. Should be implemented. +1

@taureandyernv
Contributor

taureandyernv commented Mar 31, 2021

@harrism @beckernick , this feature request was mentioned in https://stackoverflow.com/questions/66863973/cudf-an-alternative-of-pandas-groupby-shift. Definitely seems like there is demand :)

@kkraus14
Collaborator

This is planned for 0.20.

@harrism assigned isVoid and unassigned karthikeyann Mar 31, 2021
rapids-bot bot pushed a commit that referenced this issue Apr 26, 2021
Part 1 (libcudf side) of #7183 

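This PR adds the `groupby::shift` API, which performs group-based shifts. The main difference between regular `shift` and `groupby::shift` is that values are clipped at group boundaries: a value never moves into a neighboring group, and `<NA>` (or the fill value, if one is given) is introduced at the group boundaries instead. Example: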

```
key = [1, 1, 1, 1, 2, 2, 2]
val = [3, 4, 5, 6, 7, 8, 9]
offset = 2
fill_value = <NA> # No fill for boundary values
result = [<NA>, <NA>, 3, 4, <NA>, <NA>, 7]
```
```
key = [1, 1, 1, 1, 2, 2, 2]
val = [3, 4, 5, 6, 7, 8, 9]
offset = 2
fill_value = 42 # Fill 42 for boundary values
result = [42, 42, 3, 4, 42, 42, 7]
```
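
To make the semantics of the two examples above concrete, here is a minimal pure-Python sketch. It is not the libcudf or cuDF implementation: the function name `groupby_shift` is hypothetical, `None` stands in for `<NA>`, and rows are returned in their original positions (as pandas does).

```python
from collections import defaultdict


def groupby_shift(keys, values, offset, fill_value=None):
    """Shift `values` by `offset` rows within each group of `keys`.

    Hypothetical reference helper: None stands in for <NA>, and every
    result stays at the position of the row it belongs to.
    """
    # Collect the row positions belonging to each key, in order of appearance.
    rows_by_key = defaultdict(list)
    for row, key in enumerate(keys):
        rows_by_key[key].append(row)

    # Start with the fill value everywhere; overwrite rows that have a source.
    result = [fill_value] * len(values)
    for rows in rows_by_key.values():
        for pos, row in enumerate(rows):
            src = pos - offset  # position inside the group to read from
            if 0 <= src < len(rows):  # the source must stay inside the group
                result[row] = values[rows[src]]
    return result


key = [1, 1, 1, 1, 2, 2, 2]
val = [3, 4, 5, 6, 7, 8, 9]
print(groupby_shift(key, val, 2))                 # no fill value
print(groupby_shift(key, val, 2, fill_value=42))  # fill boundaries with 42
```

A negative `offset` shifts in the other direction, so the fill values then land at the end of each group rather than the start.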


Implementation notes:
The current implementation is based on `copy_if_else`, where `lhs` is the segmented values iterator with an offset, and `rhs` is a constant iterator to the fill scalar.
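
The select-based idea in the implementation notes can be sketched in plain Python. This only illustrates the technique and is not the libcudf code: it assumes the values are already sorted by group and that a list of group offsets (one entry per group start, plus the total length) is available, and the names `segmented_shift` and `group_offsets` are hypothetical.

```python
from bisect import bisect_right


def segmented_shift(sorted_values, group_offsets, offset, fill_value=None):
    """Element-wise select between a shifted value (lhs) and a fill (rhs).

    `sorted_values` must be grouped contiguously. `group_offsets` holds the
    start index of every group followed by len(sorted_values).
    """
    out = []
    for i in range(len(sorted_values)):
        # Locate the group that contains row i.
        g = bisect_right(group_offsets, i) - 1
        start, end = group_offsets[g], group_offsets[g + 1]
        src = i - offset  # row the shifted value would come from
        # Predicate: take lhs only if the source row lies in the same group.
        use_lhs = start <= src < end
        out.append(sorted_values[src] if use_lhs else fill_value)
    return out


# Two groups: rows 0..3 and rows 4..6.
print(segmented_shift([3, 4, 5, 6, 7, 8, 9], [0, 4, 7], 2, fill_value=42))
```

Because each output element depends only on its own index, the offset, and the group bounds, this formulation maps naturally onto a data-parallel select, which is presumably why an element-wise primitive fits here.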

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Mike Wendt (https://github.com/mike-wendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Keith Kraus (https://github.com/kkraus14)
  - Karthikeyan (https://github.com/karthikeyann)
  - Mark Harris (https://github.com/harrism)

URL: #7910
rapids-bot bot pushed a commit that referenced this issue May 26, 2021
Closes #7183, follow-up to #7910

This PR:
- refactors the existing libcudf `groupby::shift` API, which previously took only a single column, to accept multiple columns.
- adds Cython and Python bindings for `groupby.shift`. Example Python usage:

```
>>> df = cudf.DataFrame({"a":[1,2,1,2,2], "b":["x", "y", "z", "42", "7"]})
>>> df.groupby("a").shift(1)
      b
a      
1  <NA>
1     x
2  <NA>
2     y
2    42
```

Minor refactors:
- adds `use_thread` parameter to `dataset_generator.rand_dataframe` to expose thread pool config.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Robert Maynard (https://github.com/robertmaynard)
  - Ashwin Srinath (https://github.com/shwina)
  - Keith Kraus (https://github.com/kkraus14)
  - Karthikeyan (https://github.com/karthikeyann)
  - Christopher Harris (https://github.com/cwharris)

URL: #8131