Add handling for nested dicts in dask-cudf groupby #9054
agg_array.append(
    aggs_renames.get(_make_name(col, agg, sep=sep), agg)
)
_meta.columns = pd.MultiIndex.from_arrays([col_array, agg_array])
I don't like that we have to do the aggregation renames for both `_meta` and the groupby result, but this is required so that we have the correct `final_columns` for the last step of `_finalize_gb_agg()`. It would be nice if we also supported nested dict aggregations in cuDF's groupby so that `_meta` would have the correct index without any additional steps in dask-cuDF.
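For illustration, here is a minimal sketch of the relabeling step the comment above describes, written in plain Python so it runs without a GPU. `make_name` is a hypothetical stand-in for the private `_make_name` helper in the diff, and the separator and sample `aggs` / `aggs_renames` values are assumptions, not taken from the PR.

```python
def make_name(*parts, sep="_"):
    """Hypothetical stand-in for the private `_make_name` helper in the diff."""
    return sep.join(str(p) for p in parts)


sep = "___"  # assumed separator, chosen only for this example

# Standardized aggregations: column -> list of aggregation names.
aggs = {"a": ["sum", "min"], "b": ["mean"]}

# Renames recorded earlier: flat "<column><sep><agg>" name -> user-chosen label.
aggs_renames = {"a___sum": "total", "b___mean": "avg"}

col_array, agg_array = [], []
for col, agg_list in aggs.items():
    for agg in agg_list:
        col_array.append(col)
        # Fall back to the aggregation name when no rename was requested.
        agg_array.append(aggs_renames.get(make_name(col, agg, sep=sep), agg))

# The diff passes these two arrays to pd.MultiIndex.from_arrays; pairing them
# up here shows the resulting (column, label) entries without needing pandas.
final_columns = list(zip(col_array, agg_array))
print(final_columns)
```

The same lookup has to run once for the metadata and once for the computed result, which is the duplication the comment objects to.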
I think @shwina said this could be done but would require some effort. Since pandas does not support nested dicts, it seemed like cuDF did not have to go down this path. We could be wrong, and if you feel strongly you should speak up.
> Since pandas does not support nested dicts it seemed like cuDF did not have to go down this path
If pandas doesn't support something ugly, I'd lean away from doing it in cudf for the sake of dask-cudf logic :)
Yeah I generally agree - there could be larger motivations to want nested renaming support for groupby in cuDF, but I don't think this case alone is a good enough reason to work on it
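For context on the pandas behavior this thread refers to, the sketch below contrasts a nested-dict spec with pandas named aggregation. It assumes a recent pandas release (1.0 or later, as far as I know), where the nested form is rejected with a "nested renamer is not supported" error. The sample frame and column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "a": [1, 2, 3]})

# Nested-dict renaming: recent pandas versions raise SpecificationError here.
# A broad except keeps the sketch independent of where that class is exported.
try:
    df.groupby("key").agg({"a": {"total": "sum"}})
    nested_supported = True
except Exception as err:
    nested_supported = False
    print(type(err).__name__, err)

# Named aggregation is the pandas-supported way to rename outputs.
named = df.groupby("key").agg(total=("a", "sum"), smallest=("a", "min"))
print(named)
```

Named aggregation produces flat output columns, whereas the nested-dict form handled in this PR keeps a (column, label) pair per aggregation.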
Codecov Report
Coverage diff for #9054 against branch-21.10 (base values not available):
Coverage   10.78%
Files      114
Lines      18716
Branches   0
Hits       2018
Misses     16698
Partials   0
@@ -367,6 +382,8 @@ def _is_supported(arg, supported: set):
    for col in arg:
        if isinstance(arg[col], list):
            _global_set = _global_set.union(set(arg[col]))
        elif isinstance(arg[col], dict):
Does order matter for `_global_set`? If it does, using `set` can sometimes change the order and give unexpected results.
Shouldn't matter here, since we only need `_global_set` to check if our aggs are a subset of `supported`. Ordering is more of a concern with `_redirect_aggs()`, since that function returns a copy of the aggs that's used for the remainder of the groupby. AFAIK that should be good.
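To make the subset argument concrete, here is a rough, standalone sketch of such a check. It is not the dask-cudf implementation: the function name is simplified, and the body of the `dict` branch is an assumption (the diff hunk above cuts off before showing it), taking the aggregation names to be the values of the nested dict.

```python
def is_supported(arg, supported):
    """Return True if every requested aggregation is in `supported`."""
    if isinstance(arg, dict):
        global_set = set()
        for col in arg:
            if isinstance(arg[col], list):
                global_set = global_set.union(set(arg[col]))
            elif isinstance(arg[col], dict):
                # Assumed: nested dicts map new_name -> aggregation.
                global_set = global_set.union(set(arg[col].values()))
            else:
                global_set.add(arg[col])
    elif isinstance(arg, list):
        global_set = set(arg)
    else:
        global_set = {arg}
    # Only membership matters for this test, so set ordering is irrelevant.
    return global_set.issubset(supported)


supported = {"sum", "min", "max", "mean"}
print(is_supported({"a": {"total": "sum"}, "b": ["min", "max"]}, supported))
print(is_supported({"a": {"mid": "median"}}, supported))
```

Because the result is a single boolean from `issubset`, iteration order inside the set never reaches the caller.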
Great, sounds good! For the most part the code looks pretty good to me!
rerun tests
@gpucibot merge
Closes #9017
Adds handling for nested dict (renamed) aggregations supplied to dask-cudf's groupby, by storing the new aggregation names when standardizing the `aggs` input and applying them in `_finalize_gb_agg()`.
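As a hedged sketch of the "storing the new aggregation names when standardizing" step, the function below flattens a nested-dict spec and records the renames for later use. The function name, the separator, and the key format of the rename map are hypothetical and may differ from what the PR actually does.

```python
def standardize_aggs(aggs, sep="___"):
    """Split a possibly nested aggs spec into flat lists and a rename map.

    Hypothetical helper: `aggs` maps column -> str, list of str, or
    {new_name: aggregation}.
    """
    flat, renames = {}, {}
    for col, spec in aggs.items():
        if isinstance(spec, dict):
            flat[col] = list(spec.values())
            for new_name, agg in spec.items():
                # Remember the label so a finalize step can apply it later.
                renames[f"{col}{sep}{agg}"] = new_name
        elif isinstance(spec, list):
            flat[col] = list(spec)
        else:
            flat[col] = [spec]
    return flat, renames


flat, renames = standardize_aggs(
    {"a": {"total": "sum", "smallest": "min"}, "b": ["mean"], "c": "max"}
)
print(flat)
print(renames)
```

Keeping the rename map separate lets the rest of the groupby run on plain aggregation names, with the labels swapped in only at the end.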