-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor sorting APIs #9464
Refactor sorting APIs #9464
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, Did a first pass of review.
2b189e4
to
ea0d075
Compare
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9464 +/- ##
================================================
- Coverage 10.79% 10.66% -0.13%
================================================
Files 116 117 +1
Lines 18869 19719 +850
================================================
+ Hits 2036 2104 +68
- Misses 16833 17615 +782
Continue to review full report at Codecov.
|
Co-authored-by: GALI PREM SAGAR <[email protected]>
@@ -8732,12 +8732,12 @@ def test_explode(data, labels, ignore_index, p_index, label_to_explode): | |||
( | |||
cudf.DataFrame({"a": [10, 0, 2], "b": [-10, 10, 1]}), | |||
True, | |||
cudf.Series([1, 2, 0], dtype="int32"), | |||
cupy.array([1, 2, 0], dtype="int32"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @dantegd for visibility, we are making a breaking change as part of this PR, i.e., returning a cupy ndarray instead of a cudf.Series as the result of argsort
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we? Skimming the implementation of argsort
above, it looks like we're still returning a Series
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For DataFrame
& Index
this PR is returning cupy.ndarray
, for Series
, a Series
is being returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's a cupy.array now, see here where we're returning _get_sorted_inds(...).values
. In my opinion it's more intuitive when returning integer indexes like this to also have them not be associated with any index and instead be suitable for iloc indexing, and pd.DataFrame
doesn't support argsort
(pd.Series
does) so we have the flexibility to make such a choice. However we can revert this change if you two disagree or if this is break will cause problems for other RAPIDS libraries relying on the existing behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also worth checking with cuGraph, CC @rlratzel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads up @galipremsagar @vyasr , I just checked and this shouldn't affect the cuML codebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not using argsort
. I also tested this branch against cugraph locally and the python tests passed, so I think this change is safe for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If pd.DataFrame
doesn't support argsort
, can we:
- Remove it for
cudf.DataFrame
? - and/or not test it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately we can't remove it (at least not right now) because we rely on it in our implementation of merging (and possibly in other places, but that's the obvious one that I'm aware of). I suppose that we could choose not to test it and rely on the join(..., sort=True)
tests, but that would make our lives a little bit more difficult when it comes to identifying bugs.
@galipremsagar looks like we're good now that the other teams have approved the change. Unfortunately black won't let me make the formatting change that you requested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @vyasr
@gpucibot merge |
This PR refactors most sorting APIs of Frame and its subclasses. To support these changes, it also refactors the implementation of
take
.New Features:
Performance:
Deprecations/Removals/Breaking Changes:
indices
frompositions
for consistency with pandas. This is a breaking change. If reviewers think it's important to still support positions as a kwarg we could add a backwards compatibility layer. My thinking is that this is probably not a frequently used API, and where it is used it's almost always used with a positional argument so renaming the first argument is not a huge issue.There's one additional note that fits under a couple of the headings. While unifying implementations of argsort it made sense to change the behavior of DataFrame.argsort to return a cupy array instead of a Series. There's no corresponding pandas API so we have some freedom to choose the appropriate output, and I think an array makes more sense. However,
Column.values
is not that fast (yet, I plan to optimize soon), so it's actually slower right now to return the array than to return a Series constructed via_from_data
. I think this is OK for now, but if reviewers feel strongly about it I can change it back to returning a Series.