[FEA] Groupby cumulative sum #1298

beckernick · 2019-03-26T15:20:53Z

Is your feature request related to a problem? Please describe.
As a user, I'd like to get the cumulative sum of values within each group of a grouped dataframe.

Describe the solution you'd like
I'd like to call df.groupby(col).cumsum() and return a column of the same size containing the cumulative sum within each group.

Describe alternatives you've considered
The alternative to this is messy, requiring me to keep a running tally for each group as well as check the group identity of each value every time.
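For illustration, the manual workaround described above might look roughly like the following in plain Python. This is only a sketch: the helper name `grouped_cumsum` and the sample data are made up for the example and are not part of cuDF.

```python
from collections import defaultdict

def grouped_cumsum(keys, values):
    """Cumulative sum of `values` within each group of `keys`, in row order."""
    running = defaultdict(int)  # one running tally per group key
    out = []
    for key, value in zip(keys, values):
        running[key] += value   # check the row's group, then update its tally
        out.append(running[key])
    return out

keys = ["a", "b", "a", "b", "a"]
values = [1, 10, 2, 20, 3]
result = grouped_cumsum(keys, values)
print(result)  # [1, 10, 3, 30, 6]
```

The output has the same length as the input, which is the shape the requested `df.groupby(col).cumsum()` would return.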

Additional context
This is blocked by #1269.


kkraus14 · 2019-04-08T15:43:13Z

@harrism @jrhemstad Since this is specifically for the groupby I imagine this would require a special implementation for the hash based groupby.

jrhemstad · 2019-04-08T19:09:27Z

This isn't possible with a hash-based groupby.

To implement this feature at the Python level, I'd do two things:

1. Groupby w/o agg to group equal keys together and find each unique group's offset.
2. Segmented scan with contiguous unique keys for each group. A segmented scan feature doesn't currently exist in libcudf. [FEA] Add segmented scan feature #1365
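The two steps above can be sketched in plain Python as follows. This is an illustration of the idea only, not libcudf code: `group_offsets` and `segmented_inclusive_scan` are hypothetical helper names, and Python's `sorted` happens to be stable, which a GPU sort need not be.

```python
def group_offsets(sorted_keys):
    """Start index of each run of equal keys, plus a final end offset."""
    offsets = [i for i in range(len(sorted_keys))
               if i == 0 or sorted_keys[i] != sorted_keys[i - 1]]
    offsets.append(len(sorted_keys))
    return offsets

def segmented_inclusive_scan(values, offsets):
    """Inclusive prefix sum that restarts at every segment boundary."""
    out = []
    for start, end in zip(offsets, offsets[1:]):
        total = 0
        for v in values[start:end]:
            total += v
            out.append(total)
    return out

keys = [1, 2, 1, 2, 1, 2]
values = [0, 1, 2, 3, 4, 5]

# Step 1: bring equal keys together and find where each group starts.
order = sorted(range(len(keys)), key=lambda i: keys[i])
sorted_keys = [keys[i] for i in order]
sorted_values = [values[i] for i in order]
offsets = group_offsets(sorted_keys)

# Step 2: scan each contiguous segment independently.
scanned = segmented_inclusive_scan(sorted_values, offsets)
print(offsets)  # [0, 3, 6]
print(scanned)  # [0, 2, 6, 1, 4, 9]
```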

jrhemstad · 2019-04-10T16:19:30Z

@beckernick what do you expect the ordering of the values in the group to be? Currently this is non-deterministic, so from one run to the next, a groupby cumsum can easily give different results within each group.

kkraus14 · 2019-04-10T16:30:47Z

@jrhemstad I would imagine this would only work on a sort based groupby where the order of keys is deterministic.

jrhemstad · 2019-04-10T16:31:28Z

> @jrhemstad I would imagine this would only work on a sort based groupby where the order of keys is deterministic.

But it's not. The order of the keys within a group is definitely not deterministic.

beckernick · 2019-04-10T16:31:31Z

In my experience, one of the core use cases for groupby cumulative sums (and cumulative sums in general) is to execute them on data with a temporal component. If I'm trying to get the cumulative sum of sales over the course of a sorted time_of_day column (my grouping column) for downstream use, I would need the cumulative sum ordering to respect the grouping column's ordering. If the ordering of the values in the group is non-deterministic, I wouldn't be able to use this information. Does that answer your question @jrhemstad ?

EDIT: Removing this edit as it doesn't actually help clarify anything.

kkraus14 · 2019-04-10T16:34:53Z

> > @jrhemstad I would imagine this would only work on a sort based groupby where the order of keys is deterministic.
>
> But it's not. The order of the keys within a group is definitely not deterministic.

In a sort based groupby they would be... no?

beckernick · 2019-04-10T16:38:45Z

@AK-ayush do you have any thoughts, given your feature request?

jrhemstad · 2019-04-10T16:40:36Z

Keys: [1, 2, 1, 2, 1, 2]
Values: [0, 1, 2, 3, 4, 5]

Sorting by keys gives several possible orders...
Keys: [1, 1, 1, 2, 2, 2]
Values: [0, 2, 4, 1, 3, 5]
cumsum (exclusive): [0, 0, 2] [0, 1, 4]

or

Keys: [1, 1, 1, 2, 2, 2]
Values: [2, 0, 4, 1, 5, 3]
cumsum (exclusive): [0, 2, 2] [0, 1, 6]

or

Keys: [1, 1, 1, 2, 2, 2]
Values: [4, 0, 2, 3, 5, 1]
cumsum (exclusive): [0, 4, 4] [0, 3, 8]

etc.

Therefore, the ordering of the values within a group is not deterministic. As such, the cumsum within groups is not deterministic.
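To make the point concrete, here is a small sketch that enumerates every within-group order an unstable sort could legally produce for key 1 and computes the running sum for each. Note one assumption: it uses inclusive sums via `itertools.accumulate`, whereas the cumsum rows in the listing above match exclusive prefix sums, so the numbers differ but the conclusion is the same.

```python
from itertools import accumulate, permutations

group_values = [0, 2, 4]  # the values belonging to key 1 in the example

# Each permutation is one order the values could end up in after the sort.
results = {tuple(accumulate(p)) for p in permutations(group_values)}

for r in sorted(results):
    print(r)
print(len(results))  # 6 distinct cumulative sums for a single group
```

Only the last element (the group total) is the same in every result, which is why a plain `sum` aggregation is unaffected by the ordering while a cumulative sum is not.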

kkraus14 · 2019-04-10T16:53:33Z

@jrhemstad cumsum wouldn't be run in the same aggregation as the original groupby, it would look something like this:

Input:

color | date       | count
--------------------------
red   | 2019-01-01 | 3
blue  | 2019-01-01 | 5
red   | 2019-01-02 | 6
red   | 2019-01-02 | 4
blue  | 2019-01-03 | 7
blue  | 2019-01-03 | 8

df.groupby(['color', 'date'])['count'].sum():

color | date       | count
--------------------------
red   | 2019-01-01 | 3
red   | 2019-01-02 | 10
blue  | 2019-01-01 | 5
blue  | 2019-01-03 | 15

This has to be a sort based groupby, otherwise the cumsum function doesn't make sense:
df.groupby(['color', 'date'])['count'].sum().groupby(['color'])['count'].cumsum():

color | date       | count
--------------------------
red   | 2019-01-01 | 3
red   | 2019-01-02 | 13
blue  | 2019-01-01 | 5
blue  | 2019-01-03 | 20
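A plain-Python sketch of this two-stage approach, using the same input rows, might look like this. It only illustrates the logic and does not reflect how cuDF implements it. Because the first stage leaves exactly one row per (color, date), the second stage has no ties to break. The keys are visited in sorted order here, so blue comes before red, unlike the tables above.

```python
rows = [
    ("red", "2019-01-01", 3),
    ("blue", "2019-01-01", 5),
    ("red", "2019-01-02", 6),
    ("red", "2019-01-02", 4),
    ("blue", "2019-01-03", 7),
    ("blue", "2019-01-03", 8),
]

# Stage 1: sum per (color, date); afterwards every key is unique.
sums = {}
for color, date, count in rows:
    sums[(color, date)] = sums.get((color, date), 0) + count

# Stage 2: walk the unique keys in sorted order, keeping a running total per color.
running = {}
cumulative = []
for color, date in sorted(sums):
    running[color] = running.get(color, 0) + sums[(color, date)]
    cumulative.append((color, date, running[color]))

for row in cumulative:
    print(row)
```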

jrhemstad · 2019-04-10T17:04:14Z

@kkraus14 yes that makes sense, but it's dependent on you "doing the right thing".

To be clear, at the C++ level, we have no way of preventing you from doing something like I outlined in this example: #1298 (comment)

If that's fine with you and @beckernick, then it works for me.

kkraus14 · 2019-04-10T17:08:41Z

@jrhemstad I'd defer to @felipeblazing if that works for him, but we can easily only allow a cumsum call if the user uses a sort based groupby from the Python side.

beckernick · 2019-04-10T17:26:11Z

I'm not sure I fully agree with @kkraus14. The cumsum API makes the most sense being called on one of the non-grouped columns, but I could imagine wanting to use it in the same aggregation. Expanding on your example:

# pandas
data['flag'] = range(len(data))
data['cumsum'] = data.groupby(['color', 'date'])['count'].cumsum()
data
   color        date  count  cumsum  flag
0    red  2019-01-01      3       3     0
1   blue  2019-01-01      5       5     1
2    red  2019-01-02      6       6     2
3    red  2019-01-02      4      10     3
4   blue  2019-01-03      7       7     4
5   blue  2019-01-03      8      15     5

It depends logically on what I'm after. If I just want the cumulative sum within the group [color, date], then what Keith has above makes sense. But if I want to calculate the cumulative sum within those groups while keeping the individual rows, because each group may contain multiple rows with other values I care about in connection with the cumulative sum, I would want the result in the snippet I just created above (hence why I created the flag variable).

@randerzander would be interested to get your thoughts as well.

jrhemstad · 2019-04-10T17:28:33Z

> @jrhemstad I'd defer to @felipeblazing if that works for him, but we can easily only allow a cumsum call if the user uses a sort based groupby from the Python side.

Simply requiring a sort-based groupby isn't sufficient, as my example showed.

jrhemstad · 2019-04-10T17:30:05Z

@beckernick you have two instances of the key red 2019-01-02. We can't guarantee the order of those two keys' corresponding values from one run to the next.

beckernick · 2019-04-10T17:30:39Z

Yep, I understand. That's what you were describing initially, I believe.

I'm not sure how common what I put in the snippet above is compared to just wanting Keith's example output:

color | date       | count
--------------------------
red   | 2019-01-01 | 3
red   | 2019-01-02 | 13
blue  | 2019-01-01 | 5
blue  | 2019-01-03 | 20

If this is what most users want (and it avoids the non-deterministic ordering problem), then it's probably fine. With that said, that's a rough end user API requiring a groupby, a summation, a second groupby, and then finally a cumsum. I think people are likely used to calling df.groupby([some columns])[other_col].cumsum() and getting a result which preserves the same number of rows (my pandas example). But I'm not sure.

jrhemstad · 2019-04-10T17:31:48Z

> Yep, I understand. That's what you were describing initially, I believe.

Yeah, exactly. What order is Pandas using? The original order of the column?

I guess this could work if we used a stable sort for the sort-based groupby. @williamBlazing @felipeblazing, thoughts?

beckernick · 2019-04-10T17:36:21Z

As far as I know, pandas uses the original order. That's actually the other reason I added the flag column to show that the ordering was consistent.

harrism · 2019-04-10T22:39:20Z

> Sorting by keys gives several possible orders...

With a stable sort the order should always be the same given the same order of inputs. What kind of sort are we using? We should definitely be using a stable sort!
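One way to picture what a stable sort guarantees is as an ordinary sort that breaks ties between equal keys using the original row position. The sketch below uses the Keys/Values example from earlier in the thread. `stable_order` is a made-up helper name, and the explicit tie-break is redundant in Python (where `sorted` is already stable) but spells out the guarantee.

```python
import random

keys = [1, 2, 1, 2, 1, 2]
values = [0, 1, 2, 3, 4, 5]

def stable_order(keys, indices=None):
    """Row order sorted by key, with ties broken by original position."""
    if indices is None:
        indices = range(len(keys))
    return sorted(indices, key=lambda i: (keys[i], i))

order = stable_order(keys)
print([values[i] for i in order])  # [0, 2, 4, 1, 3, 5]

# Even if the rows are visited in a scrambled order, the tie-break on
# original position recovers exactly the same result.
scrambled = list(range(len(keys)))
random.Random(0).shuffle(scrambled)
print(stable_order(keys, scrambled) == order)  # True
```

With the order fixed this way, only the first of the candidate orderings listed earlier can occur, so the cumulative sum within each group becomes reproducible.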

jrhemstad · 2019-04-10T22:43:30Z

> With a stable sort the order should always be the same given the same order of inputs. What kind of sort are we using? We should definitely be using a stable sort!

We're just using thrust::sort.

There was never any reason to use a stable sort previously.

devavret · 2020-02-18T18:03:15Z

Restarting this now that we have the bandwidth to tackle it. I think we can do it with the following steps:

1. Add a parameter to sorted_order() so that it can use thrust::stable_sort.
2. Add a method to groupby called scan() that can do a list of cumulative aggregations.

devavret · 2020-02-18T18:55:44Z

The additional scan() method is needed because the results of groupby.aggregate() are all mapped to the set of unique keys that it returns. CUMSUM, however, returns a result mapped to the original keys.

If we do this using a thrust::inclusive_scan_by_key() then we need the keys to be sorted. That means we'd produce the result in the form:

    color   date         count  cumsum
1   blue    2019-01-01   5      5   
4   blue    2019-01-03   7      7   
5   blue    2019-01-03   8      15  
0   red     2019-01-01   3      3   
2   red     2019-01-02   6      6   
3   red     2019-01-02   4      10

and the keys would be sorted, unlike pandas' result, where the keys stay where they are. Which brings me to ask: is that ok?

I can make it return in the original order of the keys but that would add another operation.
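As a rough sketch of what that extra operation amounts to, the plain-Python example below scans in sorted key order and then scatters the results back to the original row positions. It stands in for the libcudf/thrust calls mentioned above rather than reproducing them, and the variable names are illustrative only.

```python
from itertools import accumulate, groupby

rows = [
    ("red", "2019-01-01", 3),
    ("blue", "2019-01-01", 5),
    ("red", "2019-01-02", 6),
    ("red", "2019-01-02", 4),
    ("blue", "2019-01-03", 7),
    ("blue", "2019-01-03", 8),
]
keys = [(color, date) for color, date, _ in rows]
counts = [count for _, _, count in rows]

# Stable argsort by key: equal keys keep their original relative order.
order = sorted(range(len(rows)), key=lambda i: keys[i])

# Inclusive scan by key over the sorted rows.
sorted_cumsum = []
for _, idx in groupby(order, key=lambda i: keys[i]):
    sorted_cumsum.extend(accumulate(counts[i] for i in idx))

# The additional operation: scatter each result back to its original row.
original_cumsum = [None] * len(rows)
for pos, i in enumerate(order):
    original_cumsum[i] = sorted_cumsum[pos]

print(sorted_cumsum)    # [5, 7, 15, 3, 6, 10]
print(original_cumsum)  # [3, 5, 6, 10, 7, 15]
```

The first list matches the sorted table above, and the second matches the row order of the pandas output shown earlier in the thread.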

jrhemstad · 2020-02-18T19:15:05Z

> I can make it return in the original order of the keys but that would add another operation.

In situations like this, usually we've been okay with deviating from Pandas behavior. @kkraus14 @beckernick would be able to say for sure.

kkraus14 · 2020-11-30T16:23:56Z

Now that we use a stable sort for sort based groupby we should be able to do cumulative operations correctly. Adding to 0.18.

@karthikeyann

Adds support for groupby scan operations. Addresses part of #1298 cumsum, #1296 cumcount.
- sum
- min
- max
- count

Authors:
- Karthikeyan (@karthikeyann)
- Michael Wang (@isVoid)

Approvers:
- Vukasin Milovanovic (@vuule)
- Jake Hemstad (@jrhemstad)
- Nghia Truong (@ttnghia)
- David (@davidwendt)

URL: #7387

closes #1296 Groupby cumulative count
closes #1298 Groupby cumulative sum
- [x] Add cython code for groupby scan (cannot mix reduce aggs and scan aggs)
- [x] Add python code for groupby scan functions - cumsum, cummin, cummax, cumcount, groupby.agg()
- [x] unit tests

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Vyas Ramasubramani (https://github.com/vyasr)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Keith Kraus (https://github.com/kkraus14)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #7759

beckernick added the "feature request" and "Needs Triage" labels Mar 26, 2019

kkraus14 added the "Python" label Mar 27, 2019

kkraus14 added the "libcudf" label and removed the "Needs Triage" label Apr 8, 2019

jrhemstad mentioned this issue Apr 8, 2019

[FEA] Add segmented scan feature #1365

Closed

rgsl888prabhu removed their assignment Oct 1, 2019

devavret mentioned this issue Feb 18, 2020

[FEA] Stable sort option in sorted_order() #4189

Closed

harrism mentioned this issue May 19, 2020

[FEA] SeriesGroupBy diff() and cumsum() #5215

Closed

devavret mentioned this issue Oct 12, 2020

[QST] cuDF way to do group by e.g. sum TD #6422

Closed

karthikeyann mentioned this issue Feb 16, 2021

Add groupby scan operations (sort groupby) #7387

Merged

karthikeyann mentioned this issue Mar 30, 2021

Add groupby scan aggregation to cudf #7759

Merged

3 tasks

rapids-bot bot closed this as completed in #7759 Apr 22, 2021
