Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Groupby cumulative sum #1298

Closed
beckernick opened this issue Mar 26, 2019 · 31 comments · Fixed by #7759
Closed

[FEA] Groupby cumulative sum #1298

beckernick opened this issue Mar 26, 2019 · 31 comments · Fixed by #7759
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Mar 26, 2019

Is your feature request related to a problem? Please describe.
As a user, I'd like to get the cumulative sum of values within each group of a grouped dataframe.

Describe the solution you'd like
I'd like to call df.groupby(col).cumsum() and return a column of the same size containing the cumulative sum within each group.

Describe alternatives you've considered
The alternative to this is messy, requiring me to keep a running tally for each group as well as check the group identity of each value every time.

Additional context
This is blocked by #1269 .

@beckernick beckernick added feature request New feature or request Needs Triage Need team to review and classify labels Mar 26, 2019
@kkraus14 kkraus14 added the Python Affects Python cuDF API. label Mar 27, 2019
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 8, 2019
@kkraus14
Copy link
Collaborator

kkraus14 commented Apr 8, 2019

@harrism @jrhemstad Since this is specifically for the groupby I imagine this would require a special implementation for the hash based groupby.

@jrhemstad
Copy link
Contributor

This isn't possible with a hash-based groupby.

To implement this feature at the Python level, I'd do two things:

  1. Groupby w/o agg to group equal keys together and find each unique groups offset
  2. Segmented scan with contiguous unique keys for each group. A segmented scan feature doesn't currently exist in libcudf. [FEA] Add segmented scan feature #1365

@jrhemstad
Copy link
Contributor

@beckernick what do you expect the ordering of the values in the group to be? Currently this is non-deterministic, so from one run to the next, a groupby cumsum can easily give different results within each group.

@kkraus14
Copy link
Collaborator

@jrhemstad I would imagine this would only work on a sort based groupby where the order of keys is deterministic.

@jrhemstad
Copy link
Contributor

@jrhemstad I would imagine this would only work on a sort based groupby where the order of keys is deterministic.

But it's not. The order of the keys within a group is definitely not deterministic.

@beckernick
Copy link
Member Author

beckernick commented Apr 10, 2019

In my experience, one of the core use cases for groupby cumulative sums (and cumulative sums in general) is to execute them on data with a temporal component. If I'm trying to get the cumulative sum of sales over the course of a sorted time_of_day column (my grouping column) for downstream use, I would need the cumulative sum ordering to respect the grouping column's ordering. If the ordering of the values in the group is non-deterministic, I wouldn't be able to use this information. Does that answer your question @jrhemstad ?

EDIT: Removing this edit as it doesn't actually help clarify anything.

@kkraus14
Copy link
Collaborator

@jrhemstad I would imagine this would only work on a sort based groupby where the order of keys is deterministic.

But it's not. The order of the keys within a group is definitely not deterministic.

In a sort based groupby they would be... no?

@beckernick
Copy link
Member Author

@AK-ayush do you have any thoughts, given your feature request?

@jrhemstad
Copy link
Contributor

jrhemstad commented Apr 10, 2019

Keys: [1, 2, 1, 2, 1, 2]
Values: [0, 1, 2, 3, 4, 5]

Sorting by keys gives several possible orders...
Keys: [1, 1, 1, 2, 2, 2]
Values: [0, 2, 4, 1, 3, 5]
cumsum: [0, 0, 2] [0, 1, 4]

or

Keys: [1, 1, 1, 2, 2, 2]
Values: [2, 0, 4, 1, 5, 3]
cumsum: [0, 2, 2] [0, 1, 6]

or

Keys: [1, 1, 1, 2, 2, 2]
Values: [4, 0, 2, 3, 5, 1]
cumsum: [0, 4, 4] [0, 3, 8]

etc.

Therefore, the ordering of the values within a group are not deterministic. As such, the cumsum within groups is not deterministic.

@kkraus14
Copy link
Collaborator

kkraus14 commented Apr 10, 2019

@jrhemstad cumsum wouldn't be run in the same aggregation as the original groupby, it would look something like this:

Input:

color | date       | count
--------------------------
red   | 2019-01-01 | 3
blue  | 2019-01-01 | 5
red   | 2019-01-02 | 6
red   | 2019-01-02 | 4
blue  | 2019-01-03 | 7
blue  | 2019-01-03 | 8

df.groupby(['color', 'date'])['count'].sum():

color | date       | count
--------------------------
red   | 2019-01-01 | 3
red   | 2019-01-02 | 10
blue  | 2019-01-01 | 5
blue  | 2019-01-03 | 15

This has to be a sort based groupby otherwise cumsum function doesn't make sense:
df.groupby(['color', 'date'])['count'].sum().groupby(['Name'])['count'].cumsum():

color | date       | count
--------------------------
red   | 2019-01-01 | 3
red   | 2019-01-02 | 13
blue  | 2019-01-01 | 5
blue  | 2019-01-03 | 20

@jrhemstad
Copy link
Contributor

jrhemstad commented Apr 10, 2019

@kkraus14 yes that makes sense, but it's dependent on you "doing the right thing".

To be clear, at the C++ level, we have no way of preventing you from doing something like I outlined in this example: #1298 (comment)

If that's fine with you and @beckernick, then it works for me.

@kkraus14
Copy link
Collaborator

kkraus14 commented Apr 10, 2019

@jrhemstad I'd defer to @felipeblazing if that works for him, but we can easily only allow a cumsum call if the user uses a sort based groupby from the Python side.

@beckernick
Copy link
Member Author

beckernick commented Apr 10, 2019

I'm not sure I fully agree with @kkraus14. The cumsum API makes the most sense being called on one of the non-grouped columns, but I could imagine wanting to use it in the same aggregation. Expanding on your example:

# pandas
data['flag'] = range(len(data))
data['cumsum'] = data.groupby(['color', 'date'])['count'].cumsum()
data
	color	date	count	cumsum	flag
0	red	2019-01-01	3	3	0
1	blue	2019-01-01	5	5	1
2	red	2019-01-02	6	6	2
3	red	2019-01-02	4	10	3
4	blue	2019-01-03	7	7	4
5	blue	2019-01-03	8	15	5

It depends logically what I'm after. If I just want the cumulative sum within the group [color, date], then what Keith has above makes sense. But, if I want to the calculate cumulative sum within those groups but maintain the fact that there may be multiple rows within each group with other values I care about in connection with the cumulative sum, I would want the result in the snippet I just created above (hence why I created the flag variable).

@randerzander would be interested to get your thoughts as well.

@jrhemstad
Copy link
Contributor

@jrhemstad I'd defer to @felipeblazing if that works for him, but we can easily only allow a cumsum call if the user uses a sort based groupby from the Python side.

Simply requiring a sort-based groupby isn't sufficient, as my example showed.

@jrhemstad
Copy link
Contributor

jrhemstad commented Apr 10, 2019

@beckernick you have two instances of the key red 2019-01-02. We can't guarantee the order of those two keys' corresponding values from one run to the next.

@beckernick
Copy link
Member Author

beckernick commented Apr 10, 2019

Yep, I understand. That's what you were describing initially, I believe.

I'm not sure how common what I put in the snippet above is compared to just wanting Keith's example output is:

color | date       | count
--------------------------
red   | 2019-01-01 | 3
red   | 2019-01-02 | 13
blue  | 2019-01-01 | 5
blue  | 2019-01-03 | 20

If this is what most users want (and avoids the non-deterministic ordering problem), then it's probably fine. With that said, that's a rough end user API requiring a groupby, a summation, a second groupby, and then finally a cumsum. I think people are likely used to the calling df.groupby([some columns])[other_col].cumsum() and getting the result which preserves the same number of rows (my example pandas). But I'm not sure.

@jrhemstad
Copy link
Contributor

Yep, I understand. That's what you were describing initially, I believe.

Yeah, exactly. What order is Pandas using? The original order of the column?

I guess this could work if we used a stable sort for the sort-based groupby. @williamBlazing @felipeblazing , thoughts?

@beckernick
Copy link
Member Author

beckernick commented Apr 10, 2019

As far as I know, pandas uses the original order. That's actually the other reason I added the flag column to show that the ordering was consistent.

@harrism
Copy link
Member

harrism commented Apr 10, 2019

Sorting by keys gives several possible orders...

With a stable sort the order should always be the same given the same order of inputs. What kind of sort are we using? We should definitely be using a stable sort!

@jrhemstad
Copy link
Contributor

With a stable sort the order should always be the same given the same order of inputs. What kind of sort are we using? We should definitely be using a stable sort!

We're just using thrust::sort.

There was never any reason to use a stable sort previously.

@rgsl888prabhu rgsl888prabhu removed their assignment Oct 1, 2019
@devavret
Copy link
Contributor

Re-starting this as we now have the bandwidth to tackle this, I think we can do this with the following steps:

  1. Add a parameter to sorted_order() so that it can use thrust::stable_sort.
  2. Add a method to groupby called scan() that can do a list of cumulative aggregations.

@devavret
Copy link
Contributor

The additional scan() method is needed because the results of groupby.aggregate() are all mapped to a set of unique keys which it returns. CUMSUM however, returns result mapped to the original keys.

If we do this using a thrust::inclusive_scan_by_key() then we need the keys to be sorted. That means we'd produce the result in the form:

    color   date         count  cumsum
1   blue    2019-01-01   5      5   
4   blue    2019-01-03   7      7   
5   blue    2019-01-03   8      15  
0   red     2019-01-01   3      3   
2   red     2019-01-02   6      6   
3   red     2019-01-02   4      10  

and the keys would be sorted. Unlike, pandas' result where the keys stay where they are. Which brings me to ask: is that ok?

I can make it return in the original order of the keys but that would add another operation.

@jrhemstad
Copy link
Contributor

I can make it return in the original order of the keys but that would add another operation.

In situations like this, usually we've been okay with deviating from Pandas behavior. @kkraus14 @beckernick would be able to say for sure.

@kkraus14
Copy link
Collaborator

Now that we use a stable sort for sort based groupby we should be able to do cumulative operations correctly. Adding to 0.18.

rapids-bot bot pushed a commit that referenced this issue Mar 23, 2021
Adds support for groupby scan operations. 

Addresses part of 
#1298 cumsum
#1296 cumcount

- sum
- min
- max
- count

Authors:
  - Karthikeyan (@karthikeyann)
  - Michael Wang (@isVoid)

Approvers:
  - Vukasin Milovanovic (@vuule)
  - Jake Hemstad (@jrhemstad)
  - Nghia Truong (@ttnghia)
  - David (@davidwendt)

URL: #7387
rapids-bot bot pushed a commit that referenced this issue Apr 22, 2021
closes #1296 Groupby cumulative count 
closes #1298 Groupby cumulative sum 

- [x] Add cython code for groupby scan (cannot mix reduce aggs and scan aggs)
- [x] Add python code for groupby scan functions - cumsum, cummin, cummax, cumcount, groupby.agg()
- [x] unit tests

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #7759
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants