Improve coverage of dask-cudf's groupby aggregation, add tests for dropna support #10449

Merged


charlesbluca
Member

This PR does the following:

  • Make sure that all of dask-cudf's SUPPORTED_AGGS have an overriding method for upstream Dask's series / dataframe groupby methods
  • Add tests comparing dask-cudf's dropna support to upstream Dask's, as at the moment we are only comparing against cuDF
  • Fix the resulting failures of these changes (by properly parsing self.dropna in dask-cudf's groupby code)
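One way to check the first point mechanically is to compare each supported aggregation's method on the subclass with the one inherited from the upstream class. The sketch below is only illustrative: `UpstreamGroupBy`, `AcceleratedGroupBy`, `missing_overrides`, and the three-item `SUPPORTED_AGGS` tuple are hypothetical stand-ins, not the actual dask-cudf or Dask classes.

```python
# Hypothetical stand-ins; these are NOT the real Dask / dask-cudf classes.
SUPPORTED_AGGS = ("count", "mean", "sum")  # assumed subset, for illustration only


class UpstreamGroupBy:
    """Stand-in for the upstream groupby class."""

    def count(self):
        return "upstream count"

    def mean(self):
        return "upstream mean"

    def sum(self):
        return "upstream sum"


class AcceleratedGroupBy(UpstreamGroupBy):
    """Stand-in for a subclass that should override every supported aggregation."""

    def count(self):
        return "accelerated count"

    def mean(self):
        return "accelerated mean"

    # `sum` is deliberately left un-overridden to show what the check reports.


def missing_overrides(subclass, base, names):
    """Return the names that `subclass` still inherits unchanged from `base`."""
    return [
        name
        for name in names
        if getattr(subclass, name, None) is getattr(base, name, None)
    ]


print(missing_overrides(AcceleratedGroupBy, UpstreamGroupBy, SUPPORTED_AGGS))
# prints ['sum']
```

A check along these lines could run as a unit test, so that adding a name to the supported list without a matching override fails fast.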

As a side note, I think a broader rethinking of dask-cudf's groupby would pay off: we currently seem to have some "duplicate" tests, and we can't really tell whether groupby_agg was called for a supported aggregation.
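For readers less familiar with the dropna option mentioned in the description, the following pure-Python sketch shows the semantics being tested: with dropna=True, rows whose group key is null are excluded, and with dropna=False the null key forms its own group. The `group_sum` helper and the sample rows are made up for illustration and do not use cuDF or Dask.

```python
from collections import defaultdict


def group_sum(rows, key, value, dropna=True):
    """Sum `value` per distinct `key`; null (None) keys are dropped when dropna=True."""
    totals = defaultdict(int)
    for row in rows:
        group_key = row[key]
        if group_key is None and dropna:
            continue  # skip rows whose group key is null
        totals[group_key] += row[value]
    return dict(totals)


# Made-up sample data: two rows have a null group key.
rows = [
    {"a": 1, "x": 10},
    {"a": None, "x": 5},
    {"a": 1, "x": 2},
    {"a": None, "x": 1},
]

print(group_sum(rows, "a", "x", dropna=True))   # prints {1: 12}
print(group_sum(rows, "a", "x", dropna=False))  # prints {1: 12, None: 6}
```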

@charlesbluca charlesbluca added the bug (Something isn't working), 3 - Ready for Review (Ready for review by team), dask-cudf, and non-breaking (Non-breaking change) labels Mar 17, 2022
@charlesbluca charlesbluca requested a review from a team as a code owner March 17, 2022 14:59
@github-actions github-actions bot added the Python (Affects Python cuDF API.) label Mar 17, 2022
@charlesbluca
Member Author

rerun tests

@charlesbluca charlesbluca changed the base branch from branch-22.04 to branch-22.06 March 22, 2022 13:54
@charlesbluca
Member Author

rerun tests

@codecov

codecov bot commented Apr 11, 2022

Codecov Report

Merging #10449 (12c558b) into branch-22.06 (4913a9b) will decrease coverage by 0.12%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.06   #10449      +/-   ##
================================================
- Coverage         86.42%   86.30%   -0.13%     
================================================
  Files               143      143              
  Lines             22493    22631     +138     
================================================
+ Hits              19440    19532      +92     
- Misses             3053     3099      +46     
Impacted Files | Coverage Δ
python/dask_cudf/dask_cudf/groupby.py | 97.36% <100.00%> (+0.69%) ⬆️
python/dask_cudf/dask_cudf/tests/test_groupby.py | 100.00% <100.00%> (ø)
...thon/dask_cudf/dask_cudf/tests/test_distributed.py | 18.86% <0.00%> (-67.93%) ⬇️
python/cudf/cudf/io/parquet.py | 90.47% <0.00%> (-2.23%) ⬇️
python/dask_cudf/dask_cudf/backends.py | 85.51% <0.00%> (-0.94%) ⬇️
python/cudf/cudf/core/column/categorical.py | 89.37% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/core/column/decimal.py | 90.60% <0.00%> (-0.50%) ⬇️
python/cudf/cudf/core/column/lists.py | 91.66% <0.00%> (-0.42%) ⬇️
python/cudf/cudf/core/column/string.py | 88.78% <0.00%> (-0.31%) ⬇️
python/cudf/cudf/core/index.py | 92.06% <0.00%> (-0.25%) ⬇️
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@rjzamora rjzamora left a comment

Thanks for working on this @charlesbluca! Most of my comments are just minor style suggestions.

python/dask_cudf/dask_cudf/groupby.py (resolved)
python/dask_cudf/dask_cudf/groupby.py (outdated, resolved)
python/dask_cudf/dask_cudf/groupby.py (outdated, resolved)
Comment on lines 275 to 278
    if by == "d" or "d" in by:
        pytest.skip(
            "Dask CPU doesn't have support for dropna with categorical columns"
        )
Member

What is the reason to include those "by" parameters if we will always be skipping them? Were these tests only meant to be skipped when dropna=True?

Member Author

Good catch - IIRC they don't work when dropna=False, but I will try things out locally to double check

Member Author

Running the tests locally, it looks like there are a few separate (but likely related) issues in Dask CPU going on:

  • multi-column groupbys grouping on categorical columns don't work at all
  • dropna=False doesn't work on any multi-column groupbys
  • dropna=False doesn't work on single-column groupbys on categorical columns

I think maybe instead of skipping here, we should xfail the currently failing cases. That way we can track the ongoing work in Dask CPU and remove the xfail once these tests pass, without having to substantially change the test parameters. How do you feel about this? If you would rather just remove the categorical column cases for now and add a TODO to add them back once the Dask issue is resolved, I am okay with that 🙂
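The three failing categories listed in this comment can be written down as a small predicate that decides whether a given parameter combination should be marked as an expected failure. This is a sketch under assumptions, not the code that was merged: the helper name `dask_cpu_expected_to_fail` is hypothetical, "d" is taken to be the categorical column (as in the snippet under review), and the other column names are made up.

```python
def dask_cpu_expected_to_fail(by, dropna, categorical_columns=("d",)):
    """Return a reason string if Dask CPU is expected to fail for this case, else None.

    Encodes the three categories from the comment above; "d" as the
    categorical column is an assumption taken from the reviewed snippet.
    """
    keys = [by] if isinstance(by, str) else list(by)
    multi_column = len(keys) > 1
    on_categorical = any(key in categorical_columns for key in keys)

    if multi_column and on_categorical:
        return "multi-column groupby on a categorical column"
    if multi_column and dropna is False:
        return "dropna=False with a multi-column groupby"
    if on_categorical and dropna is False:
        return "dropna=False with a single categorical groupby column"
    return None


# Made-up parameter combinations for illustration.
for by, dropna in [("a", False), ("d", False), (["a", "b"], False), (["a", "d"], True)]:
    print(by, dropna, "->", dask_cpu_expected_to_fail(by, dropna))
```

In a pytest test, a non-None reason could be passed to `pytest.mark.xfail(reason=...)`, either through `pytest.param(..., marks=...)` at parametrization time or through `request.applymarker(...)` inside the test body. The parameter list then stays unchanged when the upstream issues are fixed, and only the predicate shrinks.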

Member

Sorry for missing this 26 days ago, @charlesbluca! I do think xfail makes a lot more sense than skipping.

@charlesbluca
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ee8cd59 into rapidsai:branch-22.06 May 10, 2022
@charlesbluca charlesbluca deleted the dask-cudf-dropna-coverage branch July 19, 2022 14:26
@vyasr vyasr added the dask (Dask issue) label and removed the dask-cudf label Feb 23, 2024
Labels
3 - Ready for Review (Ready for review by team), bug (Something isn't working), dask (Dask issue), non-breaking (Non-breaking change), Python (Affects Python cuDF API.)

4 participants