Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow string aggs for dask_cudf.CudfDataFrameGroupBy.aggregate #10222

Merged

Conversation

charlesbluca
Copy link
Member

I noticed that CudfDataFrameGroupBy.aggregate doesn't actually support passing aggregations as strings, for example something like

import cudf
import dask_cudf

gdf = cudf.DataFrame({'id4': 4*list(range(6)), 'id5': 4*list(reversed(range(6))), 'v3': 6*list(range(4))})
gddf = dask_cudf.from_cudf(gdf, npartitions=5)

gddf.groupby("id4").agg("mean")

Would actually end up using the upstream aggregate implementation. This is because:

  • CudfDataFrameGroupBy.aggregate does not convert string aggs to a dict before calling _is_supported on them
  • _is_supported only handles list / dict aggs, returning false otherwise

I've resolved this by adding string support to _is_supported, and moving the conversion of aggs to the internal groupby_agg.

It looks like this is exposing some failures for first and last groupby aggs, as tests that were originally using upstream Dask to compute these aggregations (I assume accidentally since these aggregations are listed as supported) are now using dask-cuDF and getting the wrong result.

@charlesbluca charlesbluca added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. non-breaking Non-breaking change labels Feb 4, 2022
@charlesbluca charlesbluca requested a review from a team as a code owner February 4, 2022 19:23
@github-actions
Copy link

github-actions bot commented Mar 6, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@charlesbluca charlesbluca changed the base branch from branch-22.04 to branch-22.06 March 22, 2022 13:59
@codecov
Copy link

codecov bot commented May 10, 2022

Codecov Report

Merging #10222 (a54ee42) into branch-22.06 (e0d94f3) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.06   #10222      +/-   ##
================================================
+ Coverage         86.28%   86.32%   +0.03%     
================================================
  Files               144      144              
  Lines             22654    22656       +2     
================================================
+ Hits              19548    19558      +10     
+ Misses             3106     3098       -8     
Impacted Files Coverage Δ
python/cudf/cudf/core/resample.py 88.97% <ø> (ø)
python/dask_cudf/dask_cudf/groupby.py 97.42% <100.00%> (+0.01%) ⬆️
python/dask_cudf/dask_cudf/tests/test_groupby.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 88.78% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/dask_cudf/dask_cudf/core.py 73.62% <0.00%> (+0.26%) ⬆️
python/cudf/cudf/core/column/numerical.py 96.17% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0d94f3...a54ee42. Read the comment docs.

@charlesbluca charlesbluca requested a review from a team as a code owner May 12, 2022 15:36
@charlesbluca charlesbluca requested review from vyasr and bdice May 12, 2022 15:36
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comments -- looks good to me!

@galipremsagar
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4ad1e51 into rapidsai:branch-22.06 May 13, 2022
@charlesbluca charlesbluca deleted the dask-dataframe-groupby-str branch July 19, 2022 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants