-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add handling for string by-columns in dask-cudf groupby #10830
Conversation
for c in self.obj.columns: | ||
if c != self.by: | ||
yield c | ||
|
||
@_dask_cudf_nvtx_annotate | ||
@_check_groupby_supported | ||
def count(self, split_every=None, split_out=1): | ||
return groupby_agg( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like these aggregations are still highly repetitive. Could we do this (perhaps in a follow-up PR) and reduce a few dozen lines?
# Internal helper method
def _make_groupby_agg(self, agg_name, split_every=None, split_out=1):
return groupby_agg(
self.obj,
self.by,
{c: agg_name for c in self._columns_not_in_by()},
split_every=split_every,
split_out=split_out,
sep=self.sep,
sort=self.sort,
as_index=self.as_index,
**self.dropna,
)
and then call it in each function?
@_dask_cudf_nvtx_annotate
@_check_groupby_supported
def count(self, split_every=None, split_out=1):
return self._make_groupby_agg("count", split_every, split_out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion may also apply to Series
groupby, so I'd do this in a later PR and keep the scope constrained here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me - happy to open up a separate PR handling this 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good aside from my suggestion to simplify/wrap the calls to groupby_agg
to avoid repetition. That can be done in a later PR. This addresses the core issue so I'm approving.
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10830 +/- ##
================================================
- Coverage 86.40% 86.32% -0.09%
================================================
Files 143 144 +1
Lines 22448 22654 +206
================================================
+ Hits 19396 19555 +159
- Misses 3052 3099 +47
Continue to review full report at Codecov.
|
@gpucibot merge |
Motivated by #10830 (comment), this PR attempts to consolidate some repetitive aspects of dask-cudf's groupby code with `_make_groupby_agg_call`, which replaces all `groupby_agg` calls made in groupby.py, which takes as input the few things that vary between calls. Note that while this doesn't depend on #10830, it will be much easier to review once that is merged in, as I have based my work off the initial consolidation efforts that were made there. cc @bdice Authors: - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Richard (Rick) Zamora (https://github.com/rjzamora) URL: #10835
Converts string
by
-columns to lists when calling aggregation methods, which expectGroupby.by
to be a list or tuple.We might be able to do this conversion when initializing the groupby object, just started off with this approach as it seems like upstream Dask is pretty careful not to overwrite the original
by
input if it's a string.Closes #10829