-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve coverage of dask-cudf's groupby aggregation, add tests for dropna
support
#10449
Improve coverage of dask-cudf's groupby aggregation, add tests for dropna
support
#10449
Conversation
rerun tests |
rerun tests |
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10449 +/- ##
================================================
- Coverage 86.42% 86.30% -0.13%
================================================
Files 143 143
Lines 22493 22631 +138
================================================
+ Hits 19440 19532 +92
- Misses 3053 3099 +46
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @charlesbluca! Most of my comments are just minor style suggestions.
if by == "d" or "d" in by: | ||
pytest.skip( | ||
"Dask CPU doesn't have support for dropna with categorical columns" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to include those "by" parameters if we will always be skipping them? Where these tests only meant to be skipped when dropna=True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - IIRC they don't work when dropna=False
, but I will try things out locally to double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running the tests locally, it looks like there's a few separate (but likely related) issues in Dask CPU going on:
- multi-column groupbys grouping on categorical columns don't work at all
dropna=False
doesn't work on any multi-column groupbysdropna=False
doesn't work on single-column groupbys on categorical columns
I think maybe instead of skipping here, we should xfail the currently failing cases - I suggest this so we can track the ongoing work in Dask CPU and remove the xfail once these tests are passing, without having to substantially change the test parameter. How do you feel about this? If you would rather just remove the categorical column cases for now and add a TODO
to add those in later on when the dask issue is resolved, I am okay with that 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for mising this 26 days ago @charlesbluca ! I do think xfail makes a lot more sense than skipping.
@gpucibot merge |
This PR does the following:
SUPPORTED_AGGS
have an overriding method for upstream Dask's series / dataframe groupby methodsdropna
support to upstream Dask's, as at the moment we are only comparing against cuDFself.dropna
in dask-cudf's groupby code)As a side note, I think that a larger rethinking of dask-cudf's groupby would pay off well, as currently it seems like we have some "duplicate" tests and aren't really able to discern if
groupby_agg
was called for a supported aggregation