-
Notifications
You must be signed in to change notification settings - Fork 927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use list of columns for methods in Groupby.pyx
#10419
Use list of columns for methods in Groupby.pyx
#10419
Conversation
…cessing of result frame.
…ent/ListOfColumnRefactor/groupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Vyas Ramasubramani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we're not quite there yet, but so close!
Co-authored-by: Vyas Ramasubramani <[email protected]>
…:isVoid/cudf into improvement/ListOfColumnRefactor/groupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume you address my one outstanding comment this LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion, and one question/suggestion-if-applicable. Overall this is ready to go, so I'll approve.
…:isVoid/cudf into improvement/ListOfColumnRefactor/groupby
rerun tests |
…/ListOfColumnRefactor/groupby
@isVoid The failed CI issue should be solved by rapidsai/rapids-cmake#168. I merged in |
@gpucibot merge |
Part of #10153
This PR changes the APIs in
groupby.pyx
to accept a list of columns as input, not a Frame. This change affects both keys and values. TheGroupby
object now only stores a list of columns in thekeys
attribute and other APIs (groups
,aggregate
,shift
,replace_nulls
) now only accept a list of columns as its value columns. Theaggregation
communication protocol has changed from a dictionary mapping column names to list of agg names to a list of list of agg names. See changes in_normalize_aggs
for detail.This PR also tries to simplify post-processing of
result
frame inagg
method now that we have a finer control in pure python.I gave an attempt to rewrite
aggregate_internal
andscan_internal
but ended up in futile because the unified aggregation object is a cdef type and precludes separating the aggregation filtering step outside of it's current place. Besides, I tried unifying aggregation and scan with cython fused type but didn't make it due to limitation of using fused type with c++ templated type in cython.Overall, the performance of
agg
call is on par with main branch. With -3%-13% performance diff depending on agg types.Raw Benchmark
Benchmark code