Add handling for string by-columns in dask-cudf groupby #10830

charlesbluca · 2022-05-11T18:02:38Z

Converts string by-columns to lists when calling aggregation methods, which expect Groupby.by to be a list or tuple.

We might be able to do this conversion when initializing the groupby object instead. I started with this approach because upstream Dask seems careful not to overwrite the original by input when it is a string.
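As a rough illustration of the call-time conversion described above, here is a minimal sketch. The names `normalize_by` and `columns_not_in_by` are hypothetical and are not taken from the dask-cudf source; the point is only that the original `by` value is left untouched and a list is built when an aggregation needs one.

```python
def normalize_by(by):
    """Return `by` as a new list without modifying the caller's value.

    Hypothetical helper for illustration; not the dask-cudf implementation.
    """
    if isinstance(by, (list, tuple)):
        return list(by)
    # A lone string (or other scalar key) names a single column.
    return [by]


def columns_not_in_by(columns, by):
    """List the columns that are not grouping keys."""
    keys = normalize_by(by)
    return [c for c in columns if c not in keys]


if __name__ == "__main__":
    print(normalize_by("a"))
    print(columns_not_in_by(["a", "b", "c"], "a"))
```

Wrapping the string in a list before any membership test matters: `"a" in "ab"` is a substring check in Python, so testing against the raw string could silently drop the wrong columns.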

Closes #10829

python/dask_cudf/dask_cudf/groupby.py

python/dask_cudf/dask_cudf/tests/test_groupby.py

bdice · 2022-05-11T19:07:45Z

python/dask_cudf/dask_cudf/groupby.py

+            for c in self.obj.columns:
+                if c != self.by:
+                    yield c
+
    @_dask_cudf_nvtx_annotate
    @_check_groupby_supported
    def count(self, split_every=None, split_out=1):
        return groupby_agg(


It looks like these aggregations are still highly repetitive. Could we do this (perhaps in a follow-up PR) and reduce a few dozen lines?

    # Internal helper method
    def _make_groupby_agg(self, agg_name, split_every=None, split_out=1):
        return groupby_agg(
            self.obj,
            self.by,
            {c: agg_name for c in self._columns_not_in_by()},
            split_every=split_every,
            split_out=split_out,
            sep=self.sep,
            sort=self.sort,
            as_index=self.as_index,
            **self.dropna,
        )

and then call it in each function?

    @_dask_cudf_nvtx_annotate
    @_check_groupby_supported
    def count(self, split_every=None, split_out=1):
        return self._make_groupby_agg("count", split_every, split_out)

This suggestion may also apply to Series groupby, so I'd do this in a later PR and keep the scope constrained here.

That makes sense to me - happy to open up a separate PR handling this 🙂
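For readers who want to try the pattern discussed in this thread outside of dask-cudf, the following is a self-contained toy sketch. `ToyGroupby` and the stand-in `groupby_agg` below are assumptions made for illustration only: the stand-in just records its arguments, and the real function takes more parameters (`sep`, `sort`, `as_index`, dropna options) that are omitted here.

```python
def groupby_agg(obj, by, aggs, split_every=None, split_out=1):
    """Stand-in for the real groupby_agg: returns the arguments it received."""
    return {
        "by": by,
        "aggs": aggs,
        "split_every": split_every,
        "split_out": split_out,
    }


class ToyGroupby:
    """Hypothetical class showing one shared helper behind every aggregation."""

    def __init__(self, columns, by):
        self.columns = list(columns)
        self.by = by  # stored exactly as passed, even if it is a string

    def _by_as_list(self):
        if isinstance(self.by, (list, tuple)):
            return list(self.by)
        return [self.by]

    def _columns_not_in_by(self):
        keys = self._by_as_list()
        return [c for c in self.columns if c not in keys]

    def _make_groupby_agg(self, agg_name, split_every=None, split_out=1):
        # Only the aggregation name varies between the public methods.
        return groupby_agg(
            self.columns,
            self._by_as_list(),
            {c: agg_name for c in self._columns_not_in_by()},
            split_every=split_every,
            split_out=split_out,
        )

    def count(self, split_every=None, split_out=1):
        return self._make_groupby_agg("count", split_every, split_out)

    def sum(self, split_every=None, split_out=1):
        return self._make_groupby_agg("sum", split_every, split_out)


if __name__ == "__main__":
    g = ToyGroupby(["a", "b", "c"], by="a")
    print(g.count())
```

Each public method shrinks to a one-line delegation, which is where the "few dozen lines" of savings mentioned in the review would come from.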

bdice

All looks good aside from my suggestion to simplify/wrap the calls to groupby_agg to avoid repetition. That can be done in a later PR. This addresses the core issue so I'm approving.

codecov · 2022-05-11T20:06:29Z

Codecov Report

Merging #10830 (d652157) into branch-22.06 (8d861ce) will decrease coverage by 0.08%.
The diff coverage is 94.98%.

@@               Coverage Diff                @@
##           branch-22.06   #10830      +/-   ##
================================================
- Coverage         86.40%   86.32%   -0.09%     
================================================
  Files               143      144       +1     
  Lines             22448    22654     +206     
================================================
+ Hits              19396    19555     +159     
- Misses             3052     3099      +47
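The percentages in the table above follow from hits divided by lines. The short check below is a sketch added for convenience, not part of the Codecov output; it reproduces the base and head coverage and the 0.08% decrease quoted in the summary line. The -0.09% shown in the table row is not reproduced from these rounded inputs and is presumably a rounding difference on Codecov's side.

```python
def coverage_pct(hits, lines):
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


base = coverage_pct(19396, 22448)  # branch-22.06
head = coverage_pct(19555, 22654)  # this PR

print(f"base {base:.2f}%  head {head:.2f}%  change {head - base:+.2f}%")
```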

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/string.py	`88.78% <ø> (-0.31%)`	⬇️
python/cudf/cudf/core/frame.py	`93.41% <ø> (ø)`
python/cudf/cudf/core/indexed_frame.py	`91.70% <ø> (ø)`
python/cudf/cudf/io/__init__.py	`100.00% <ø> (ø)`
python/cudf/cudf/testing/testing.py	`81.81% <50.00%> (+0.12%)`	⬆️
python/cudf/cudf/io/parquet.py	`90.83% <86.60%> (-1.87%)`	⬇️
python/cudf/cudf/core/index.py	`92.06% <88.88%> (-0.25%)`	⬇️
python/cudf/cudf/core/scalar.py	`89.01% <90.90%> (-0.31%)`	⬇️
python/cudf/cudf/core/dataframe.py	`93.78% <96.36%> (+0.08%)`	⬆️
python/cudf/cudf/__init__.py	`90.69% <100.00%> (+0.22%)`	⬆️
... and 30 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b204d0...d652157.

galipremsagar · 2022-05-11T22:54:15Z

@gpucibot merge

@bdice

Motivated by #10830 (comment), this PR attempts to consolidate some repetitive aspects of dask-cudf's groupby code with `_make_groupby_agg_call`, which replaces all `groupby_agg` calls made in groupby.py and takes as input the few things that vary between calls. Note that while this doesn't depend on #10830, it will be much easier to review once that is merged in, as I have based my work off the initial consolidation efforts that were made there.

cc @bdice

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #10835

Convert groupby by-column to list if it isn't already

ada7fe0

charlesbluca added the bug, 3 - Ready for Review, dask-cudf, and non-breaking labels May 11, 2022

charlesbluca requested a review from a team as a code owner May 11, 2022 18:02

github-actions bot added the Python label May 11, 2022

bdice requested changes May 11, 2022

bdice mentioned this pull request May 11, 2022

Revise 10 minutes notebook. #10738

Merged

charlesbluca added 3 commits May 11, 2022 11:46

Use helper method to generate groupby result columns

ed04f66

Add comments noting the change to test_groupby_basic

1dcf47f

Add nvtx annotation decorator to new helper method

08d8e7f

charlesbluca commented May 11, 2022

bdice reviewed May 11, 2022

bdice approved these changes May 11, 2022

Remove more repetition from agg dictionary construction

d652157

charlesbluca mentioned this pull request May 11, 2022

Consolidate dask-cudf groupby_agg calls in one place #10835

Merged

galipremsagar approved these changes May 11, 2022

galipremsagar added the 5 - Ready to Merge label and removed the 3 - Ready for Review label May 11, 2022

rapids-bot bot merged commit 16d9a92 into rapidsai:branch-22.06 May 11, 2022

charlesbluca deleted the fix-10829 branch July 19, 2022 14:26

vyasr added the dask label and removed the dask-cudf label Feb 23, 2024
