-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Groupby cumulative sum #1298
Comments
@harrism @jrhemstad Since this is specifically for the groupby I imagine this would require a special implementation for the hash based groupby. |
This isn't possible with a hash-based groupby. To implement this feature at the Python level, I'd do two things:
|
@beckernick what do you expect the ordering of the values in the group to be? Currently this is non-deterministic, so from one run to the next, a groupby cumsum can easily give different results within each group. |
@jrhemstad I would imagine this would only work on a sort based groupby where the order of keys is deterministic. |
But it's not. The order of the keys within a group is definitely not deterministic. |
In my experience, one of the core use cases for groupby cumulative sums (and cumulative sums in general) is to execute them on data with a temporal component. If I'm trying to get the cumulative sum of EDIT: Removing this edit as it doesn't actually help clarify anything. |
In a sort based groupby they would be... no? |
@AK-ayush do you have any thoughts, given your feature request? |
Keys: Sorting by keys gives several possible orders... or Keys: or Keys: etc. Therefore, the ordering of the values within a group are not deterministic. As such, the |
@jrhemstad cumsum wouldn't be run in the same aggregation as the original groupby, it would look something like this: Input:
This has to be a sort based groupby otherwise
|
@kkraus14 yes that makes sense, but it's dependent on you "doing the right thing". To be clear, at the C++ level, we have no way of preventing you from doing something like I outlined in this example: #1298 (comment) If that's fine with you and @beckernick, then it works for me. |
@jrhemstad I'd defer to @felipeblazing if that works for him, but we can easily only allow a |
I'm not sure I fully agree with @kkraus14. The cumsum API makes the most sense being called on one of the non-grouped columns, but I could imagine wanting to use it in the same aggregation. Expanding on your example:
It depends logically what I'm after. If I just want the cumulative sum within the group [color, date], then what Keith has above makes sense. But, if I want to the calculate cumulative sum within those groups but maintain the fact that there may be multiple rows within each group with other values I care about in connection with the cumulative sum, I would want the result in the snippet I just created above (hence why I created the flag variable). @randerzander would be interested to get your thoughts as well. |
Simply requiring a sort-based groupby isn't sufficient, as my example showed. |
@beckernick you have two instances of the key |
Yep, I understand. That's what you were describing initially, I believe. I'm not sure how common what I put in the snippet above is compared to just wanting Keith's example output is:
If this is what most users want (and avoids the non-deterministic ordering problem), then it's probably fine. With that said, that's a rough end user API requiring a groupby, a summation, a second groupby, and then finally a cumsum. I think people are likely used to the calling |
Yeah, exactly. What order is Pandas using? The original order of the column? I guess this could work if we used a stable sort for the sort-based groupby. @williamBlazing @felipeblazing , thoughts? |
As far as I know, pandas uses the original order. That's actually the other reason I added the |
With a stable sort the order should always be the same given the same order of inputs. What kind of sort are we using? We should definitely be using a stable sort! |
We're just using There was never any reason to use a stable sort previously. |
Re-starting this as we now have the bandwidth to tackle this, I think we can do this with the following steps:
|
The additional If we do this using a
and the keys would be sorted. Unlike, pandas' result where the keys stay where they are. Which brings me to ask: is that ok? I can make it return in the original order of the keys but that would add another operation. |
In situations like this, usually we've been okay with deviating from Pandas behavior. @kkraus14 @beckernick would be able to say for sure. |
Now that we use a stable sort for sort based groupby we should be able to do cumulative operations correctly. Adding to 0.18. |
Adds support for groupby scan operations. Addresses part of #1298 cumsum #1296 cumcount - sum - min - max - count Authors: - Karthikeyan (@karthikeyann) - Michael Wang (@isVoid) Approvers: - Vukasin Milovanovic (@vuule) - Jake Hemstad (@jrhemstad) - Nghia Truong (@ttnghia) - David (@davidwendt) URL: #7387
closes #1296 Groupby cumulative count closes #1298 Groupby cumulative sum - [x] Add cython code for groupby scan (cannot mix reduce aggs and scan aggs) - [x] Add python code for groupby scan functions - cumsum, cummin, cummax, cumcount, groupby.agg() - [x] unit tests Authors: - Karthikeyan (https://github.com/karthikeyann) - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Keith Kraus (https://github.com/kkraus14) - Vyas Ramasubramani (https://github.com/vyasr) URL: #7759
Is your feature request related to a problem? Please describe.
As a user, I'd like to get the cumulative sum of values within each group of a grouped dataframe.
Describe the solution you'd like
I'd like to call
df.groupby(col).cumsum()
and return a column of the same size containing the cumulative sum within each group.Describe alternatives you've considered
The alternative to this is messy, requiring me to keep a running tally for each group as well as check the group identity of each value every time.
Additional context
This is blocked by #1269 .
The text was updated successfully, but these errors were encountered: