-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow custom sort functions for dask-cudf sort_values
#9789
Allow custom sort functions for dask-cudf sort_values
#9789
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9789 +/- ##
================================================
- Coverage 10.49% 10.44% -0.05%
================================================
Files 119 119
Lines 20305 20476 +171
================================================
+ Hits 2130 2139 +9
- Misses 18175 18337 +162
Continue to review full report at Codecov.
|
rerun tests |
rerun tests |
This PR has been labeled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general change here makes sense to me, thanks for working on this @charlesbluca !
My main comment/suggestion is to avoid "breaking" API changes by moving the default-handling logic into sorting.sort_values
.
df4 = df3.map_partitions( | ||
M.sort_values, by, ascending=ascending, na_position=na_position | ||
) | ||
df4 = df3.map_partitions(sort_function, **sort_function_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something feels off here. We are requiring that the user specify sort_function
, but the API makes it seem optional. I worry that we are now silently ignoring acsending
and na_position
(and maybe even by
?).
What if down-stream users are implementing code with sorting.sort_values
directly? I don't think that is good/recommended practice, but the API we are changing seems "public" to me (making this a breaking change).
Perhaps a simpler (non-breaking) solution would be to remove most of the changes from DataFrame.sort_values
, pass through sort_function
and sort_function_kwargs
into here, and implement the sort_function
/sort_function_kwargs
default logic here (in sorting.sort_values
). Does this seem reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense and is a valid concern - my only comment is that we ideally still want to allow for custom sorting functions in the npartitions == 1
case that is handled directly in DataFrame.sort_values
, so I think it might also make sense to move the following logic:
if self.npartitions == 1:
df = self.map_partitions(sort_function, **sort_kwargs)
into sorting.sort_values
as well, unless there's a reason that's not immediately obvious to me why we would want to keep the single partition case separate?
Also noting that this is also a concern for the upstream implementation of this, so depending on what we decide on here I will open up a follow up PR to address this in Dask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also noting that this is also a concern for the upstream implementation of this, so depending on what we decide on here I will open up a follow up PR to address this in Dask.
Good point! I definitely like the simplification you made here. So it probably makes sense to do something similar upstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for revising this @charlesbluca! Everything looks great to me.
@gpucibot merge |
This PR moves the handling of custom sorting functions to `shuffle.sort_values`, so that usages of the internal `sort_values` function will not have to manually specify a default `sort_function` and `sort_function_kwargs`. This originated as a concern in the downstream implementation of this in rapidsai/cudf#9789
Similar to dask/dask#8345, this PR allows the sorting function called on each partition in last step of dask-cudf's
sort_values
to be generalized, along with the kwargs that are supplied to it. This allowssort_values
to be extended to support more complex ascending / null position handling.The context for this PR is a desire to simplify the sorting algorithm used by dask-sql; since it only really differs from dask-cudf's sorting algorithm in that it uses a custom sorting function, it seems like it would be easier to allow for that extension upstream rather than duplicate code in dask-sql.