Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add groupby scan aggregation to cudf #7759
Add groupby scan aggregation to cudf #7759
Changes from 7 commits
26bafd0
babcdfc
2b00611
76ab556
1ebde51
4c5d876
4134e43
43cf580
474a179
25e74ef
8a44827
6b5c67f
8e45ad0
81ffe0a
6d3fad3
714742d
ea4ed2e
026bb4e
1259032
6c61806
a14d30f
12caa06
5c71bfe
25811b0
1450f2d
4a0f27d
596496f
1f5dadd
f02231f
5835f56
87f9a59
5145501
c3602f9
1d48585
f557406
0e09786
63897c1
5fbde6f
f9508fe
4c04b5f
b21dfaf
e3ae492
bac0de5
4c7f8d2
8d64ae5
5ebbca2
46e0545
26f1016
a7148a4
4dc1d79
0c18a16
794aad1
c9892b1
61e84b2
c51a7d4
35deb2f
ae88281
621216d
7f35bde
351afb6
67c6ebc
9cd725e
9f01b60
cb3f5a0
e96a795
2d304ae
f52fffb
affc9c9
925f303
18abe66
b37b1d2
0240825
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that we should be matching the behavior of Pandas on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas gives
cumcount
for entire groupby object. Not on individual columns.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be
groupby::count()
instead?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few cases to consider here:
df.groupby(...).cumcount()
df.groupby(...).agg('cumcount')
df.groupby(...).agg(['cumcount'])
Pandas returns a Series for (1) and (2), and a DataFrame for (3). We currently return a DataFrame in all cases.
(1) is easy for us to do: we define a
cumcount()
method and make sure we return aSeries
. (2) is less trivial for us: one of the first things we do is normalize the argument passed toagg()
into a dictionary. Thus, we lose information about whether the user passed a string, a list or a dictionary toagg()
.Handling (2) is also not necessarily relevant to this PR: we have the exact same problem for the
size
aggreegation as well.My suggestion is let's make sure we handle (1) correctly for now, and not worry about (2) in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you raise an issue about (2)?
Agreed lets tackle it in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shwina did you already write up the issue? If so, this conversation can be resolved.