-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Add a new cuDF option stable_sort that provides ordering guarantees for otherwise nondeterministic APIs #12236
Comments
FWIW, I think we should also consider the (ongoing) maintenance cost of implementing this scheme. We would double the test-time run in a naive setup since we'd need to do everything twice (once with, once without order preservation). I read through a bunch of the past discussions, I didn't see really any compelling use-cases for order-preserving API that would not be better served by using algorithmic approaches at the user-level that don't rely on order-preservation. I agree that sometimes you have to think a bit harder, but ... |
I'm assuming this comment was triggered by the discussion on #12939 (comment). I think it boils down to what degree we promise to match pandas, which I think will vary depending on how people use cudf. My 2c: in the cases where we are truly claiming to be a drop-in replacement, that includes ordering guarantees (even ones that pandas may have introduced unintentionally but is now maintaining in perpetuity anyway). |
I see that Vyas has already covered |
Just FYI, #13657 is adding a workaround for non-deterministic behavior in |
@vyasr I'm unsure what the scope of this issue includes at this point. We have implemented stable forms of |
I agree. Let's close this for now and open more targeted issues as we find specific APIs that are missing ordering guarantees in cudf.pandas. |
Is your feature request related to a problem? Please describe.
Currently various cuDF APIs (examples include
merge
,groupby
, anddrop_duplicates
) rely on libcudf APIs that do not promise stable ordering. libcudf will not (and should not) ever be forced to make such guarantees since requiring ordering hamstrings potential optimizations and can always be achieved by calling code with the addition of suitable index columns and sorts. However, the requirements of cuDF Python are different. In particular, for pandas users transitioning to cuDF the lack of stability may be confusing at best and workflow-breaking at worst. Since there are multiple APIs that may exhibit this API divergence, and these APIs may be invoked many times in any particular piece of code, the responsibility should not fall to users to wrap those calls in appropriate sorting logic every time.Describe the solution you'd like
cuDF should leverage the options framework introduced in #11193 to add a new option
stable_sort
that, when True, will change the behavior of all cuDF APIs to guarantee that the input order will be preserve. The default value should beFalse
to force users to opt in to performance-inhibiting changes. The implementation of this option will depend on the API; for instance, withmerge
it will require consistent ordering in both tables, whereas fordrop_duplicates
it only affects one. I would recommend implementing the option as_stable_sort
at first to indicate that it is internal, then adding support to one algorithm at a time, then exposing the option publicly asstable_sort
once we have sufficient algorithmic coverage.Describe alternatives you've considered
We have rejected similar requests in the past, in part because of the lack of clarity on how this should be handled in cuDF Python vs. libcudf, as well as because we did not wish to slow down all calls to these APIs by default. The use of
cudf.options
at the Python level provides an elegant and well-scoped solution to the problem that avoids these problems.Additional context
Addressing this issue will close #1781 and #5286.
The text was updated successfully, but these errors were encountered: