Refactor sorting APIs #9464

Merged
merged 21 commits into from
Nov 2, 2021

@vyasr vyasr commented Oct 18, 2021

This PR refactors most sorting APIs of Frame and its subclasses. To support these changes, it also refactors the implementation of take.

New Features:

  • DataFrame.nlargest/nsmallest now accept multiple columns. Previously this would fail unexpectedly.
  • BaseIndex.sort_values now accepts na_position to be consistent with other sorts.
  • DataFrame.argsort now accepts an (optional) by parameter to indicate what columns to order by.
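To make the multi-column nlargest semantics concrete, here is a minimal CPU-only sketch in plain Python. It is not cuDF code: `nlargest_rows` is a hypothetical helper, and it assumes the pandas-style convention that rows are ranked on the first column with later columns breaking ties.

```python
# Illustration only: plain-Python stand-in, not cuDF's implementation.
def nlargest_rows(rows, n, columns):
    """Return the n largest rows, ranking by `columns` in order.

    Rows are compared on the first column; later columns break ties.
    """
    def key(row):
        return tuple(row[name] for name in columns)

    # sorted() is stable even with reverse=True, so fully tied rows
    # keep their original relative order.
    return sorted(rows, key=key, reverse=True)[:n]


rows = [
    {"a": 1, "b": 5},
    {"a": 3, "b": 1},
    {"a": 3, "b": 4},
    {"a": 2, "b": 9},
]
top_two = nlargest_rows(rows, 2, ["a", "b"])
# top_two == [{"a": 3, "b": 4}, {"a": 3, "b": 1}]
```

The two rows tied on `a` are ordered by `b`, which is the case that needs more than one column.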

Performance:

  • DataFrame nlargest/nsmallest are up to 10x faster for small inputs.
  • take is significantly faster for all classes. For instance, I see about a 2x speedup for Series.
  • DataFrame.sort_values is ~10% faster for small inputs.

Deprecations/Removals/Breaking Changes:

  • Arguments to take other than numerical indices are deprecated. In particular, boolean masks are deprecated and will no longer be supported in the future. This matches pandas behavior and allows us to simplify our code.
  • The parameter for take has been renamed to indices from positions for consistency with pandas. This is a breaking change. If reviewers think it's important to still support positions as a kwarg we could add a backwards compatibility layer. My thinking is that this is probably not a frequently used API, and where it is used it's almost always used with a positional argument so renaming the first argument is not a huge issue.
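If a backwards compatibility layer for positions were wanted, it could look roughly like the sketch below. This is a plain-Python illustration under stated assumptions, not the code in this PR: the `take` function here operates on a list, and the warning messages are made up for the example. It also shows one way the boolean-mask deprecation could be surfaced.

```python
import warnings


def take(values, indices=None, *, positions=None):
    """Hypothetical list-based stand-in for a take-like API (not cuDF)."""
    if positions is not None:
        if indices is not None:
            raise TypeError("pass either 'indices' or 'positions', not both")
        # Old keyword: still accepted, but flagged for removal.
        warnings.warn(
            "'positions' is deprecated; use 'indices' instead",
            FutureWarning,
            stacklevel=2,
        )
        indices = positions
    if indices is None:
        raise TypeError("take() missing required argument: 'indices'")

    indices = list(indices)
    if any(isinstance(i, bool) for i in indices):
        # Deprecated path: treat the argument as a boolean mask.
        warnings.warn(
            "boolean masks are deprecated in take; pass integer indices",
            FutureWarning,
            stacklevel=2,
        )
        if len(indices) != len(values):
            raise ValueError("boolean mask must match the length of values")
        indices = [pos for pos, keep in enumerate(indices) if keep]
    return [values[i] for i in indices]


picked = take([10, 20, 30], [2, 0])  # [30, 10]
```

Making positions keyword-only keeps positional callers, which are the common case, completely unaffected by the rename.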

There's one additional note that fits under a couple of the headings. While unifying implementations of argsort it made sense to change the behavior of DataFrame.argsort to return a cupy array instead of a Series. There's no corresponding pandas API so we have some freedom to choose the appropriate output, and I think an array makes more sense. However, Column.values is not that fast yet (I plan to optimize it soon), so it's actually slower right now to return the array than to return a Series constructed via _from_data. I think this is OK for now, but if reviewers feel strongly about it I can change it back to returning a Series.
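As a rough illustration of why a bare array of positions is a natural return type, the sketch below computes an argsort over a dict of columns in plain Python. It is a CPU-only stand-in, not the cuDF implementation: the `argsort` helper, its `by` and `ascending` parameters, and the dict-of-lists table are assumptions made for the example.

```python
# Illustration only: plain-Python stand-in, not cuDF's implementation.
def argsort(columns, by=None, ascending=True):
    """Return the row positions that would sort the table.

    `columns` maps column names to equal-length lists. `by` selects the
    columns to order by (all columns, in order, when omitted).
    """
    names = list(columns) if by is None else list(by)
    n_rows = len(next(iter(columns.values())))

    def key(pos):
        return tuple(columns[name][pos] for name in names)

    return sorted(range(n_rows), key=key, reverse=not ascending)


table = {"a": [10, 0, 2], "b": [-10, 10, 1]}
order = argsort(table)                # [1, 2, 0]
order_by_b = argsort(table, by=["b"])  # [0, 2, 1]

# The result is purely positional, so it can be used directly for
# position-based selection without consulting any index labels.
sorted_a = [table["a"][pos] for pos in order]  # [0, 2, 10]
```

Because the output carries no labels, it behaves the same way regardless of what index the original frame had.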

@vyasr vyasr added 3 - Ready for Review Ready for review by team code quality Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function breaking Breaking change labels Oct 18, 2021
@vyasr vyasr self-assigned this Oct 18, 2021
@vyasr vyasr requested a review from a team as a code owner October 18, 2021 23:26
Contributor

@galipremsagar galipremsagar left a comment

Overall looks good. Did a first pass of review.


codecov bot commented Oct 25, 2021

Codecov Report

Merging #9464 (ca4faad) into branch-21.12 (ab4bfaa) will decrease coverage by 0.12%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-21.12    #9464      +/-   ##
================================================
- Coverage         10.79%   10.66%   -0.13%     
================================================
  Files               116      117       +1     
  Lines             18869    19719     +850     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17615     +782     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 66 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@vyasr vyasr requested a review from galipremsagar October 29, 2021 20:42
@@ -8732,12 +8732,12 @@ def test_explode(data, labels, ignore_index, p_index, label_to_explode):
         (
             cudf.DataFrame({"a": [10, 0, 2], "b": [-10, 10, 1]}),
             True,
-            cudf.Series([1, 2, 0], dtype="int32"),
+            cupy.array([1, 2, 0], dtype="int32"),
Contributor

cc: @dantegd for visibility, we are making a breaking change as part of this PR, i.e., returning a cupy ndarray instead of a cudf.Series as the result of argsort calls.

Contributor

Are we? Skimming the implementation of argsort above, it looks like we're still returning a Series.

Contributor

For DataFrame & Index this PR is returning cupy.ndarray, for Series, a Series is being returned.

Contributor Author

Yeah, it's a cupy.array now, see here where we're returning _get_sorted_inds(...).values. In my opinion, when returning integer indexes like this it's more intuitive for them not to be associated with any index and instead be suitable for iloc indexing. pd.DataFrame doesn't support argsort (pd.Series does), so we have the flexibility to make such a choice. However, we can revert this change if you two disagree or if this break will cause problems for other RAPIDS libraries relying on the existing behavior.

Contributor Author

Probably also worth checking with cuGraph, CC @rlratzel.

Member

Thanks for the heads up @galipremsagar @vyasr, I just checked and this shouldn't affect the cuML codebase.

Contributor

We're not using argsort. I also tested this branch against cuGraph locally and the Python tests passed, so I think this change is safe for us.

Contributor

If pd.DataFrame doesn't support argsort, can we:

  1. Remove it for cudf.DataFrame?
  2. and/or not test it?

Contributor Author

@vyasr vyasr Nov 2, 2021

Unfortunately we can't remove it (at least not right now) because we rely on it in our implementation of merging (and possibly in other places, but that's the obvious one that I'm aware of). I suppose that we could choose not to test it and rely on the join(..., sort=True) tests, but that would make our lives a little bit more difficult when it comes to identifying bugs.

@vyasr vyasr requested a review from galipremsagar November 2, 2021 15:54
Contributor Author

vyasr commented Nov 2, 2021

@galipremsagar looks like we're good now that the other teams have approved the change. Unfortunately black won't let me make the formatting change that you requested.

Contributor

@galipremsagar galipremsagar left a comment

LGTM @vyasr

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 2, 2021
Contributor Author

vyasr commented Nov 2, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2ecebe1 into rapidsai:branch-21.12 Nov 2, 2021
@vyasr vyasr deleted the refactor/sorting branch January 14, 2022 18:01
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.