Allow custom sort functions for dask-cudf `sort_values` #9789

charlesbluca · 2021-11-29T18:26:41Z

Similar to dask/dask#8345, this PR generalizes the sorting function called on each partition in the last step of dask-cudf's sort_values, along with the kwargs that are supplied to it. This allows sort_values to be extended to support more complex ascending / null position handling.

The context for this PR is a desire to simplify the sorting algorithm used by dask-sql; since it only really differs from dask-cudf's sorting algorithm in that it uses a custom sorting function, it seems like it would be easier to allow for that extension upstream rather than duplicate code in dask-sql.
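To illustrate the kind of per-partition function this is meant to enable, here is a minimal, library-free sketch. It is not dask-cudf or dask-sql code: a partition is modeled as a plain list of dicts, and `sort_rows` and its `nulls_first` argument are hypothetical names chosen for the example. Only the idea of handing a custom function plus kwargs (`sort_function` / `sort_function_kwargs`) to sort_values comes from this PR.

```python
from functools import cmp_to_key


def sort_rows(rows, by, ascending, nulls_first):
    """Hypothetical per-partition sort with per-column direction and null placement.

    rows        : list of dicts, a stand-in for one partition
    by          : list of column names
    ascending   : list of bools, one per column
    nulls_first : list of bools, one per column
    """

    def compare(a, b):
        for col, asc, nf in zip(by, ascending, nulls_first):
            x, y = a[col], b[col]
            if x is None and y is None:
                continue  # tie on this column, look at the next one
            if x is None:
                return -1 if nf else 1
            if y is None:
                return 1 if nf else -1
            if x != y:
                order = -1 if x < y else 1
                return order if asc else -order
        return 0

    return sorted(rows, key=cmp_to_key(compare))


partition = [
    {"a": 2, "b": 1},
    {"a": None, "b": 5},
    {"a": 1, "b": None},
    {"a": 1, "b": 3},
]

# "a" ascending with nulls first, then "b" descending with nulls last
result = sort_rows(
    partition, by=["a", "b"], ascending=[True, False], nulls_first=[True, False]
)
```

A function with this shape is what would be supplied as the custom sort function, with the extra arguments passed through as its kwargs.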

codecov · 2021-11-29T20:07:27Z

Codecov Report

Merging #9789 (bc9291c) into branch-22.02 (967a333) will decrease coverage by 0.04%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #9789      +/-   ##
================================================
- Coverage         10.49%   10.44%   -0.05%     
================================================
  Files               119      119              
  Lines             20305    20476     +171     
================================================
+ Hits               2130     2139       +9     
- Misses            18175    18337     +162

Impacted Files	Coverage Δ
python/custreamz/custreamz/kafka.py	`29.16% <0.00%> (-0.63%)`	⬇️
python/dask_cudf/dask_cudf/sorting.py	`92.66% <0.00%> (-0.25%)`	⬇️
python/dask_cudf/dask_cudf/core.py	`70.85% <0.00%> (-0.17%)`	⬇️
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/series.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/utils.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/dtypes.py	`0.00% <0.00%> (ø)`
... and 20 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d5ec7f...bc9291c. Read the comment docs.

charlesbluca · 2021-12-01T15:46:46Z

rerun tests

charlesbluca · 2021-12-08T15:30:24Z

rerun tests

github-actions · 2022-01-08T22:04:23Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

rjzamora

The general change here makes sense to me, thanks for working on this @charlesbluca !

My main comment/suggestion is to avoid "breaking" API changes by moving the default-handling logic into sorting.sort_values.

python/dask_cudf/dask_cudf/core.py

rjzamora · 2022-01-10T16:48:29Z

python/dask_cudf/dask_cudf/sorting.py

-    df4 = df3.map_partitions(
-        M.sort_values, by, ascending=ascending, na_position=na_position
-    )
+    df4 = df3.map_partitions(sort_function, **sort_function_kwargs)


Something feels off here. We are requiring that the user specify sort_function, but the API makes it seem optional. I worry that we are now silently ignoring ascending and na_position (and maybe even by?).

What if down-stream users are implementing code with sorting.sort_values directly? I don't think that is good/recommended practice, but the API we are changing seems "public" to me (making this a breaking change).

Perhaps a simpler (non-breaking) solution would be to remove most of the changes from DataFrame.sort_values, pass through sort_function and sort_function_kwargs into here, and implement the sort_function/sort_function_kwargs default logic here (in sorting.sort_values). Does this seem reasonable?

That makes sense and is a valid concern - my only comment is that we ideally still want to allow for custom sorting functions in the npartitions == 1 case that is handled directly in DataFrame.sort_values, so I think it might also make sense to move the following logic:

    if self.npartitions == 1:
        df = self.map_partitions(sort_function, **sort_kwargs)

into sorting.sort_values as well, unless there's a reason that's not immediately obvious to me why we would want to keep the single partition case separate?
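A rough sketch of the layout discussed in this thread, assuming a pure-Python model rather than the actual dask-cudf code: the internal sort_values builds its defaults from by / ascending / na_position, lets `sort_function_kwargs` override them, and handles the single-partition case itself. `default_sort`, `_shuffle`, and `sort_by_distance` are hypothetical stand-ins (the real per-partition default is `M.sort_values`, and the real shuffle repartitions rows into ordered ranges).

```python
def default_sort(rows, by, ascending=True, na_position="last"):
    """Stand-in for the default per-partition sort (single column, for brevity)."""
    present = sorted(
        (r for r in rows if r[by] is not None),
        key=lambda r: r[by],
        reverse=not ascending,
    )
    missing = [r for r in rows if r[by] is None]
    return missing + present if na_position == "first" else present + missing


def _shuffle(partitions, by):
    # Placeholder only: assumes the input is already range-partitioned on `by`.
    return partitions


def sort_values(partitions, by, ascending=True, na_position="last",
                sort_function=None, sort_function_kwargs=None):
    """Hypothetical model of an internal sort_values that owns the default handling."""
    # Defaults come from the regular arguments, so they are never silently dropped.
    sort_kwargs = {"by": by, "ascending": ascending, "na_position": na_position}
    if sort_function is None:
        sort_function = default_sort
    if sort_function_kwargs is not None:
        sort_kwargs.update(sort_function_kwargs)

    # Single partition: no shuffle needed, but the custom function still applies.
    if len(partitions) == 1:
        return [sort_function(partitions[0], **sort_kwargs)]

    shuffled = _shuffle(partitions, by)
    return [sort_function(part, **sort_kwargs) for part in shuffled]


def sort_by_distance(rows, by, target, ascending=True, **_unused):
    """Example custom function: order rows by distance of `by` from `target`."""
    return sorted(rows, key=lambda r: abs(r[by] - target), reverse=not ascending)


single = [[{"x": 5}, {"x": 1}, {"x": 4}]]
custom_result = sort_values(
    single, "x", sort_function=sort_by_distance, sort_function_kwargs={"target": 4}
)
```

With this arrangement, callers of the internal function that never pass `sort_function` keep the previous behavior, which is the non-breaking property asked for above.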

Also noting that this is a concern for the upstream implementation as well, so depending on what we decide on here I will open a follow-up PR to address this in Dask.

> Also noting that this is a concern for the upstream implementation as well, so depending on what we decide on here I will open a follow-up PR to address this in Dask.

Good point! I definitely like the simplification you made here. So it probably makes sense to do something similar upstream.

python/dask_cudf/dask_cudf/tests/test_sort.py

rjzamora

Thanks for revising this @charlesbluca! Everything looks great to me.

charlesbluca · 2022-01-14T15:11:44Z

@gpucibot merge

This PR moves the handling of custom sorting functions to `shuffle.sort_values`, so that usages of the internal `sort_values` function will not have to manually specify a default `sort_function` and `sort_function_kwargs`. This originated as a concern raised in the downstream implementation, rapidsai/cudf#9789.

Allow custom sorting functions for dask-cudf sort_values

c2e0fec

charlesbluca added the 2 - In Progress, dask-cudf, improvement, and non-breaking labels Nov 29, 2021

github-actions bot added the Python label Nov 29, 2021

Add custom sort function test

ca8e497

charlesbluca added the 3 - Ready for Review label and removed the 2 - In Progress label Nov 29, 2021

charlesbluca marked this pull request as ready for review November 29, 2021 19:00

charlesbluca requested a review from a team as a code owner November 29, 2021 19:00

charlesbluca requested review from rjzamora and galipremsagar November 30, 2021 15:03

galipremsagar approved these changes Nov 30, 2021

charlesbluca mentioned this pull request Dec 1, 2021

Use upstream Dask for complex sorting operations dask-contrib/dask-sql#336

Merged

charlesbluca added 2 commits December 9, 2021 10:32

Merge remote-tracking branch 'upstream/branch-22.02' into custom-sort…

6550072

…-functions

Merge remote-tracking branch 'upstream/branch-22.02' into custom-sort…

4770ef0

…-functions

github-actions bot added the inactive-30d label Jan 8, 2022

rjzamora reviewed Jan 10, 2022

charlesbluca added 3 commits January 10, 2022 14:12

Merge remote-tracking branch 'upstream/branch-22.02' into custom-sort…

ed2785d

…-functions

Move custom sort function logic to internal sort_values

e54f1bf

Use correct sort kwargs for map_partitions call

bc9291c

rjzamora approved these changes Jan 14, 2022

rapids-bot bot merged commit ca77542 into rapidsai:branch-22.02 Jan 14, 2022

charlesbluca mentioned this pull request Jan 14, 2022

Move custom sort function logic to internal sort_values dask/dask#8571

Merged

charlesbluca deleted the custom-sort-functions branch July 19, 2022 14:26

vyasr added the dask label and removed the dask-cudf label Feb 23, 2024
