[FEA] Groupby shift #7183
Comments
This issue has been labeled
Still a valid [FEA]
This is a very useful feature. Should be implemented. +1
@harrism @beckernick, this feature request was mentioned in https://stackoverflow.com/questions/66863973/cudf-an-alternative-of-pandas-groupby-shift. Definitely seems like there is demand :)
This is planned for 0.20.
Part 1 (libcudf side) of #7183

This PR adds the `groupby::shift` API, which performs group-based shifts. The main difference between the regular `shift` and `groupby::shift` is that values are clipped and `<NA>` is introduced at group boundaries.

Example:

```
key = [1, 1, 1, 1, 2, 2, 2]
val = [3, 4, 5, 6, 7, 8, 9]
offset = 2
fill_value = <NA>  # No fill for boundary values
result = [<NA>, <NA>, 3, 4, <NA>, <NA>, 7]
```

```
key = [1, 1, 1, 1, 2, 2, 2]
val = [3, 4, 5, 6, 7, 8, 9]
offset = 2
fill_value = 42  # Fill 42 for boundary values
result = [42, 42, 3, 4, 42, 42, 7]
```

Implementation notes: The current implementation is based on `copy_if_else`, where `lhs` is the segmented values iterator with an offset, and `rhs` is a constant iterator to the fill scalar.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Mike Wendt (https://github.com/mike-wendt)
- Nghia Truong (https://github.com/ttnghia)
- Keith Kraus (https://github.com/kkraus14)
- Karthikeyan (https://github.com/karthikeyann)
- Mark Harris (https://github.com/harrism)

URL: #7910
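For readers who want to check the boundary behaviour by hand, here is a minimal pure-Python sketch of the semantics described in this PR. It is not the libcudf implementation (which the PR says is built on `copy_if_else`); the function name `groupby_shift_reference` is hypothetical, `None` stands in for `<NA>`, and keeping the input row order is an assumption of this sketch.

```python
from collections import defaultdict


def groupby_shift_reference(keys, vals, offset, fill_value=None):
    """Hypothetical reference for a group-wise shift (not the libcudf code).

    Each value moves `offset` positions forward inside its own group.
    Slots with no source row inside the group receive `fill_value`
    (None plays the role of <NA> here).
    """
    # Collect the row indices of every group, in order of appearance.
    rows_by_key = defaultdict(list)
    for row, key in enumerate(keys):
        rows_by_key[key].append(row)

    # Start with every slot filled; overwrite the ones that have a source.
    result = [fill_value] * len(vals)
    for rows in rows_by_key.values():
        for pos, row in enumerate(rows):
            src = pos - offset  # position inside the group to read from
            if 0 <= src < len(rows):
                result[row] = vals[rows[src]]
    return result


key = [1, 1, 1, 1, 2, 2, 2]
val = [3, 4, 5, 6, 7, 8, 9]

no_fill = groupby_shift_reference(key, val, offset=2)
with_fill = groupby_shift_reference(key, val, offset=2, fill_value=42)
print(no_fill)    # [None, None, 3, 4, None, None, 7]
print(with_fill)  # [42, 42, 3, 4, 42, 42, 7]
```

Both printed lists match the two examples in the PR description. In this sketch a negative `offset` shifts values toward the start of each group, so the fill lands at the end of the group instead.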
Closes #7183, follow-up of #7910

This PR:
- refactors the existing libcudf `groupby::shift` API, which only takes a single column, to accept multiple columns.
- adds Cython and Python bindings for `groupby.shift`.

Example Python usage:

```
>>> df = cudf.DataFrame({"a":[1,2,1,2,2], "b":["x", "y", "z", "42", "7"]})
>>> df.groupby("a").shift(1)
      b
a
1  <NA>
1     x
2  <NA>
2     y
2    42
```

Minor refactors:
- adds a `use_thread` parameter to `dataset_generator.rand_dataframe` to expose the thread pool config.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Robert Maynard (https://github.com/robertmaynard)
- Ashwin Srinath (https://github.com/shwina)
- Keith Kraus (https://github.com/kkraus14)
- Karthikeyan (https://github.com/karthikeyann)
- Christopher Harris (https://github.com/cwharris)

URL: #8131
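To make the output of the Python example in this PR easier to follow, the sketch below reproduces it in plain Python without a GPU. It is only an illustration: `groupby_shift_table` is a hypothetical helper, not part of cudf or libcudf, `None` stands in for `<NA>`, and returning rows grouped by sorted key is an assumption taken from the output shown in the PR description.

```python
def groupby_shift_table(keys, columns, offset, fill_value=None):
    """Hypothetical multi-column group-wise shift (illustration only).

    `columns` maps column names to value lists. Every column is shifted
    with the same offset. Rows are returned grouped by sorted key, which
    mirrors the output shown in the PR description.
    """
    # Row indices per group, in order of appearance.
    rows_by_key = {}
    for row, key in enumerate(keys):
        rows_by_key.setdefault(key, []).append(row)

    out_keys = []
    out_cols = {name: [] for name in columns}
    for key in sorted(rows_by_key):
        rows = rows_by_key[key]
        for pos in range(len(rows)):
            src = pos - offset  # source position inside this group
            in_group = 0 <= src < len(rows)
            out_keys.append(key)
            for name, values in columns.items():
                out_cols[name].append(values[rows[src]] if in_group else fill_value)
    return out_keys, out_cols


out_keys, out_cols = groupby_shift_table(
    keys=[1, 2, 1, 2, 2],
    columns={"b": ["x", "y", "z", "42", "7"]},
    offset=1,
)
for k, b in zip(out_keys, out_cols["b"]):
    print(k, b)
```

The loop prints the pairs `1 None`, `1 x`, `2 None`, `2 y`, `2 42`, which line up with the rows of the `df.groupby("a").shift(1)` output in the PR description.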
I'd like to be able to use `shift` on groupby Series and DataFrame objects. Today, I can do this in pandas but not cudf.