Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add a new cuDF option stable_sort that provides ordering guarantees for otherwise nondeterministic APIs #12236

Closed
vyasr opened this issue Nov 23, 2022 · 7 comments
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request Python Affects Python cuDF API.

Comments

@vyasr
Copy link
Contributor

vyasr commented Nov 23, 2022

Is your feature request related to a problem? Please describe.
Currently various cuDF APIs (examples include merge, groupby, and drop_duplicates) rely on libcudf APIs that do not promise stable ordering. libcudf will not (and should not) ever be forced to make such guarantees since requiring ordering hamstrings potential optimizations and can always be achieved by calling code with the addition of suitable index columns and sorts. However, the requirements of cuDF Python are different. In particular, for pandas users transitioning to cuDF the lack of stability may be confusing at best and workflow-breaking at worst. Since there are multiple APIs that may exhibit this API divergence, and these APIs may be invoked many times in any particular piece of code, the responsibility should not fall to users to wrap those calls in appropriate sorting logic every time.

Describe the solution you'd like
cuDF should leverage the options framework introduced in #11193 to add a new option stable_sort that, when True, will change the behavior of all cuDF APIs to guarantee that the input order will be preserve. The default value should be False to force users to opt in to performance-inhibiting changes. The implementation of this option will depend on the API; for instance, with merge it will require consistent ordering in both tables, whereas for drop_duplicates it only affects one. I would recommend implementing the option as _stable_sort at first to indicate that it is internal, then adding support to one algorithm at a time, then exposing the option publicly as stable_sort once we have sufficient algorithmic coverage.

Describe alternatives you've considered
We have rejected similar requests in the past, in part because of the lack of clarity on how this should be handled in cuDF Python vs. libcudf, as well as because we did not wish to slow down all calls to these APIs by default. The use of cudf.options at the Python level provides an elegant and well-scoped solution to the problem that avoids these problems.

Additional context
Addressing this issue will close #1781 and #5286.

@vyasr vyasr added feature request New feature or request 0 - Backlog In queue waiting for assignment pandas Python Affects Python cuDF API. labels Nov 23, 2022
@wence-
Copy link
Contributor

wence- commented Mar 14, 2023

FWIW, I think we should also consider the (ongoing) maintenance cost of implementing this scheme. We would double the test-time run in a naive setup since we'd need to do everything twice (once with, once without order preservation).

I read through a bunch of the past discussions, I didn't see really any compelling use-cases for order-preserving API that would not be better served by using algorithmic approaches at the user-level that don't rely on order-preservation. I agree that sometimes you have to think a bit harder, but ...

@vyasr
Copy link
Contributor Author

vyasr commented Mar 15, 2023

I'm assuming this comment was triggered by the discussion on #12939 (comment). I think it boils down to what degree we promise to match pandas, which I think will vary depending on how people use cudf. My 2c: in the cases where we are truly claiming to be a drop-in replacement, that includes ordering guarantees (even ones that pandas may have introduced unintentionally but is now maintaining in perpetuity anyway).

@bdice
Copy link
Contributor

bdice commented Jun 6, 2023

FWIW, #11656 made drop_duplicates have stable ordering. That removes part of the scope of the issue (#5286 is resolved). 🚀

@galipremsagar
Copy link
Contributor

I see that Vyas has already covered groupby in this issue. So making note here that a python side solution to guarantee order with sort=False in groupby is being added for the time being: #13551, this will kick in only when the pandas compatibility(mode.pandas_compatible) mode option is enabled.

@galipremsagar
Copy link
Contributor

Just FYI, #13657 is adding a workaround for non-deterministic behavior in MultiIndex.loc because of join/merge.

@bdice
Copy link
Contributor

bdice commented Jun 27, 2024

@vyasr I'm unsure what the scope of this issue includes at this point. We have implemented stable forms of drop_duplicates and groupby. Is join/merge the only remaining piece that is unstable compared to pandas? We have already closed the two linked issues that this issue says it will close. A task list would be helpful, and maybe we can close this in favor of more narrowly scoped issues.

@vyasr
Copy link
Contributor Author

vyasr commented Jun 27, 2024

I agree. Let's close this for now and open more targeted issues as we find specific APIs that are missing ordering guarantees in cudf.pandas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request Python Affects Python cuDF API.
Projects
Status: Done
Development

No branches or pull requests

4 participants