Refactor sorting APIs #9464

vyasr · 2021-10-18T23:26:11Z

This PR refactors most sorting APIs of Frame and its subclasses. To support these changes, it also refactors the implementation of take.

New Features:

DataFrame.nlargest/nsmallest now accept multiple columns. Previously this would fail unexpectedly.
BaseIndex.sort_values now accepts na_position to be consistent with other sorts.
DataFrame.argsort now accepts an (optional) by parameter to indicate what columns to order by.
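
To make the multi-column behavior concrete, here is a minimal pure-Python sketch of what ordering by several columns means for nlargest. It is not cuDF's implementation and does not run on the GPU; the `nlargest_rows` helper and the dict-of-lists table layout are assumptions made purely for illustration.

```python
def nlargest_rows(table, n, columns):
    """Return positions of the n largest rows, comparing `columns` in order.

    `table` is a hypothetical dict mapping column name -> list of values.
    Later columns are only consulted to break ties in earlier ones.
    """
    num_rows = len(next(iter(table.values())))
    # One tuple key per row, so a tie in the first column falls through
    # to the second column, and so on.
    ordered = sorted(
        range(num_rows),
        key=lambda i: tuple(table[c][i] for c in columns),
        reverse=True,
    )
    return ordered[:n]


table = {"a": [1, 3, 3, 2], "b": [9, 1, 5, 7]}
# Rows 1 and 2 tie on "a"; "b" decides that row 2 comes first.
top_two = nlargest_rows(table, 2, ["a", "b"])
```

Here `top_two` holds row positions rather than values, which keeps the sketch in line with the argsort-then-take structure this PR describes.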

Performance:

DataFrame nlargest/nsmallest are up to 10x faster for small inputs.
take is significantly faster for all classes. For instance, I see about a 2x speedup for Series.
DataFrame.sort_values is ~10% faster for small inputs.

Deprecations/Removals/Breaking Changes:

Arguments to take other than numerical indices are deprecated. In particular, boolean masks are deprecated and will no longer be supported in the future. This matches pandas behavior and allows us to simplify our code.
The parameter for take has been renamed from positions to indices for consistency with pandas. This is a breaking change. If reviewers think it's important to still support positions as a kwarg, we could add a backwards compatibility layer. My thinking is that this is probably not a frequently used API, and where it is used it's almost always called with a positional argument, so renaming the first argument is not a huge issue.
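
For readers migrating code, the following pure-Python sketch shows the two changes side by side: converting a boolean mask into integer indices, and what a backwards compatibility layer for the old keyword could look like. This is a hypothetical stand-in, not cuDF's take; the `take` and `mask_to_indices` functions, the warning text, and the choice to raise on boolean input (modeling the state after the deprecation period ends) are all assumptions.

```python
import warnings


def take(values, indices=None, *, positions=None):
    """Gather values[i] for each i in indices (hypothetical stand-in)."""
    if positions is not None:
        # Hypothetical shim: accept the old keyword name, but warn.
        warnings.warn(
            "'positions' is deprecated, use 'indices' instead",
            FutureWarning,
            stacklevel=2,
        )
        if indices is not None:
            raise TypeError("pass only one of 'indices' or 'positions'")
        indices = positions
    if indices is None:
        raise TypeError("take() missing required argument 'indices'")
    indices = list(indices)
    # Assumed end state: boolean masks are rejected outright.
    if any(isinstance(i, bool) for i in indices):
        raise TypeError("boolean masks are not supported; pass integer indices")
    return [values[i] for i in indices]


def mask_to_indices(mask):
    """Convert a boolean mask into the integer positions where it is True."""
    return [i for i, keep in enumerate(mask) if keep]


values = ["w", "x", "y", "z"]
mask = [True, False, True, False]
picked = take(values, mask_to_indices(mask))
```

Callers that pass the first argument positionally are unaffected by the rename, which is the case the description above expects to be most common.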

There's one additional note that fits under a couple of the headings. While unifying the implementations of argsort, it made sense to change DataFrame.argsort to return a cupy array instead of a Series. There's no corresponding pandas API, so we have some freedom to choose the appropriate output, and I think an array makes more sense. However, Column.values is not that fast yet (I plan to optimize it soon), so right now it's actually slower to return the array than to return a Series constructed via _from_data. I think this is OK for now, but if reviewers feel strongly about it I can change it back to returning a Series.
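
As a rough illustration of the argsort semantics discussed here, the sketch below returns plain row positions with no index attached, optionally ordering by a subset of columns. It is pure Python rather than cuDF or cupy; the `argsort` helper and the dict-of-lists layout are assumptions. The data mirrors the DataFrame used in the test case changed by this PR.

```python
def argsort(table, by=None, ascending=True):
    """Return the row positions that would sort `table` by the `by` columns.

    `table` is a hypothetical dict of column name -> list of values.
    When `by` is None, all columns are used in their stored order.
    """
    columns = list(table) if by is None else list(by)
    num_rows = len(next(iter(table.values())))
    return sorted(
        range(num_rows),
        key=lambda i: tuple(table[c][i] for c in columns),
        reverse=not ascending,
    )


table = {"a": [10, 0, 2], "b": [-10, 10, 1]}
order = argsort(table)                 # [1, 2, 0]
order_by_b = argsort(table, by=["b"])  # [0, 2, 1]
# The positions can be fed straight into a positional gather.
sorted_a = [table["a"][i] for i in order]
```

Because the result is a bare sequence of positions, it can be used for positional indexing without any label alignment step.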

galipremsagar

Overall looks good. Did a first pass of review.

python/cudf/cudf/core/series.py

python/cudf/cudf/tests/test_sorting.py

codecov · 2021-10-25T23:27:33Z

Codecov Report

Merging #9464 (ca4faad) into branch-21.12 (ab4bfaa) will decrease coverage by 0.12%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-21.12    #9464      +/-   ##
================================================
- Coverage         10.79%   10.66%   -0.13%     
================================================
  Files               116      117       +1     
  Lines             18869    19719     +850     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17615     +782

Impacted Files	Coverage Δ
python/dask_cudf/dask_cudf/sorting.py	`92.90% <0.00%> (-1.21%)`	⬇️
python/cudf/cudf/io/csv.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/hdf.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/orc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_version.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/abc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/api/types.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/dlpack.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
... and 66 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e632f97...ca4faad. Read the comment docs.

Co-authored-by: GALI PREM SAGAR <[email protected]>

python/cudf/cudf/core/frame.py

python/cudf/cudf/core/_base_index.py

python/cudf/cudf/core/dataframe.py

python/cudf/cudf/core/indexed_frame.py

python/cudf/cudf/core/multiindex.py

galipremsagar · 2021-11-01T09:32:37Z

python/cudf/cudf/tests/test_dataframe.py

@@ -8732,12 +8732,12 @@ def test_explode(data, labels, ignore_index, p_index, label_to_explode):
        (
            cudf.DataFrame({"a": [10, 0, 2], "b": [-10, 10, 1]}),
            True,
-            cudf.Series([1, 2, 0], dtype="int32"),
+            cupy.array([1, 2, 0], dtype="int32"),


cc: @dantegd for visibility, we are making a breaking change as part of this PR, i.e., returning a cupy ndarray instead of a cudf.Series as the result of argsort calls.

Are we? Skimming the implementation of argsort above, it looks like we're still returning a Series.

For DataFrame & Index this PR is returning cupy.ndarray, for Series, a Series is being returned.

Yeah, it's a cupy.array now; see here where we're returning _get_sorted_inds(...).values. In my opinion, when returning integer indexes like this it's more intuitive for them not to be associated with any index and instead be suitable for iloc indexing. pd.DataFrame doesn't support argsort (pd.Series does), so we have the flexibility to make such a choice. However, we can revert this change if you two disagree or if this break will cause problems for other RAPIDS libraries relying on the existing behavior.

Probably also worth checking with cuGraph, CC @rlratzel.

Thanks for the heads up @galipremsagar @vyasr, I just checked and this shouldn't affect the cuML codebase.

We're not using argsort. I also tested this branch against cugraph locally and the python tests passed, so I think this change is safe for us.

If pd.DataFrame doesn't support argsort, can we:

Remove it for cudf.DataFrame?

and/or not test it?

Unfortunately we can't remove it (at least not right now) because we rely on it in our implementation of merging (and possibly in other places, but that's the obvious one that I'm aware of). I suppose that we could choose not to test it and rely on the join(..., sort=True) tests, but that would make our lives a little bit more difficult when it comes to identifying bugs.
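
One plausible way a sorted merge can depend on argsort is sketched below: join first, then compute the sorting positions over the key column and gather the rows in that order. This is only an illustration of the dependency described above, not cuDF's merge implementation; `argsort_rows`, `inner_join`, and the tuple-based row layout are all hypothetical.

```python
def argsort_rows(rows, key_positions):
    """Return positions that sort `rows` by the fields at `key_positions`."""
    return sorted(
        range(len(rows)),
        key=lambda i: tuple(rows[i][k] for k in key_positions),
    )


def inner_join(left, right, sort=False):
    """Join two lists of (key, value) pairs; output rows are (key, lval, rval)."""
    lookup = {}
    for key, rval in right:
        lookup.setdefault(key, []).append(rval)
    joined = [
        (key, lval, rval)
        for key, lval in left
        for rval in lookup.get(key, [])
    ]
    if sort:
        # Sorting is expressed as argsort followed by a positional gather.
        order = argsort_rows(joined, key_positions=[0])
        joined = [joined[i] for i in order]
    return joined


left = [(3, "c"), (1, "a"), (2, "b")]
right = [(2, "B"), (3, "C"), (1, "A")]
unsorted_result = inner_join(left, right)
sorted_result = inner_join(left, right, sort=True)
```

In a structure like this, a bug in argsort would only surface indirectly through the sorted join, which is the debugging difficulty the comment above points to.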

vyasr · 2021-11-02T15:55:15Z

@galipremsagar looks like we're good now that the other teams have approved the change. Unfortunately black won't let me make the formatting change that you requested.

galipremsagar

LGTM @vyasr

vyasr · 2021-11-02T18:52:44Z

@gpucibot merge

vyasr added the labels 3 - Ready for Review (Ready for review by team), code quality, Python (Affects Python cuDF API.), improvement (Improvement / enhancement to an existing function), and breaking (Breaking change) Oct 18, 2021

vyasr added this to the Pandas API Alignment and Coverage milestone Oct 18, 2021

vyasr self-assigned this Oct 18, 2021

vyasr requested a review from a team as a code owner October 18, 2021 23:26

vyasr requested review from cwharris and rgsl888prabhu October 18, 2021 23:26

galipremsagar reviewed Oct 20, 2021

vyasr added 17 commits October 25, 2021 14:18

Unify argsort implementations.

85151e0

Standardize implementation of take.

785c7ed

Deprecate parameters and remove unnecessary impls.

c98bc02

Remove MultiIndex take implementation.

f8ad3b8

Share sort_values between Series and DataFrame.

13e9911

Standardize argsort to return cupy arrays except for Series.

c1c7ef9

Fix MultiIndex.__getitem__ for slices.

3d14373

Unify index sort_values implementations.

6bbcaa4

Remove BaseIndex.gpu_values.

79cb909

Unify implementations of nsmallest and nlargest.

8fe0038

Make all argsort signatures consistent with pandas.

4b405a0

Put back groupby overrides for this PR.

128a00c

Don't construct list of columns unnecessarily.

d89fe29

Add DataFrame tests of multiple column sorts.

e2d57ba

Remove now superfluous take implementation.

a71c408

Rename positions to indices for pandas consistency.

1c2549e

Allow deserialization to return RangeIndex.

3fac33b

Parameterize the columns.

ea0d075

vyasr force-pushed the refactor/sorting branch from 2b189e4 to ea0d075 Compare October 25, 2021 21:28

Update python/cudf/cudf/core/series.py

3e412bc

Co-authored-by: GALI PREM SAGAR <[email protected]>

vyasr requested a review from galipremsagar October 29, 2021 20:42

galipremsagar requested changes Nov 1, 2021

vyasr added 2 commits November 1, 2021 17:15

Address PR reviews.

b06a43e

Merge remote-tracking branch 'origin/branch-21.12' into refactor/sorting

ca4faad

vyasr requested a review from galipremsagar November 2, 2021 15:54

galipremsagar approved these changes Nov 2, 2021

galipremsagar added the label 5 - Ready to Merge (Testing and reviews complete, ready to merge) and removed the label 3 - Ready for Review (Ready for review by team) Nov 2, 2021

rapids-bot bot merged commit 2ecebe1 into rapidsai:branch-21.12 Nov 2, 2021

vyasr mentioned this pull request Jan 7, 2022

[FEA] Move the take() method to Frame #3981

Closed

vyasr deleted the refactor/sorting branch January 14, 2022 18:01

galipremsagar Nov 1, 2021

shwina Nov 1, 2021

galipremsagar Nov 1, 2021

vyasr Nov 1, 2021

vyasr Nov 2, 2021

dantegd Nov 2, 2021

rlratzel Nov 2, 2021

shwina Nov 2, 2021

vyasr Nov 2, 2021