Should internal usages of sorting with numpy use kind="stable"? #53558

Open
mroeschke opened this issue Jun 8, 2023 · 3 comments
Labels
Compat pandas objects compatability with Numpy or Python functions Needs Discussion Requires discussion from core team before further action

Comments

@mroeschke
Member

mroeschke commented Jun 8, 2023

There are several places where we call np.sort/np.argsort/etc. internally (i.e. not cases where users can specify a sorting kind, as in sort_values) and rely on the default, unstable kind="quicksort".

In numpy 1.25, it appears that CPUs that can use AVX will get a modified quicksort. This recently broke some tests in our numpy dev build (xref #53548) where we were testing these unstable sorting results.
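For illustration, here is a minimal sketch of the difference being discussed. The array is made up for this example and is not taken from the failing tests. With duplicate values, kind="stable" pins down the order of the ties, while the default kind only guarantees that the values end up sorted:

```python
import numpy as np

# Made-up data with ties: three 1s and three 2s.
values = np.array([2, 1, 2, 1, 2, 1])

# kind="stable" keeps equal elements in their original relative order,
# so the positions of the 1s (1, 3, 5) come before those of the 2s (0, 2, 4).
stable_order = np.argsort(values, kind="stable")
print(stable_order)  # [1 3 5 0 2 4]

# The default kind="quicksort" makes no promise about the order of ties,
# so the exact indices may differ between numpy versions or CPUs.
default_order = np.argsort(values)

# Both orders sort the values correctly; only the tie order is unspecified.
assert (values[default_order] == values[stable_order]).all()
```

A test that asserts on the exact indices in default_order is therefore relying on behavior numpy does not guarantee.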

Is it worth transitioning to a stable sorting algorithm internally for consistency?

Alternatively, could we dynamically switch to a stable sorting algorithm only when duplicate values are being sorted?
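A rough sketch of what that dynamic switch might look like, assuming a 1-D array without NaNs. The helper name argsort_stable_on_ties is hypothetical and does not exist in pandas or numpy:

```python
import numpy as np


def argsort_stable_on_ties(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: argsort that only pays for a stable sort on ties.

    Assumes a 1-D array without NaNs (NaN != NaN would hide NaN ties).
    """
    # First pass with the default (unstable) kind.
    order = np.argsort(values)
    sorted_values = values[order]

    # In sorted order, any duplicates sit next to each other.
    has_ties = sorted_values.size > 1 and bool(
        (sorted_values[1:] == sorted_values[:-1]).any()
    )
    if has_ties:
        # The tie order from the first pass is unspecified, so redo it stably.
        order = np.argsort(values, kind="stable")
    return order


print(argsort_stable_on_ties(np.array([3, 1, 3, 1])))  # [1 3 0 2]
```

Note the tradeoff in this sketch: detecting duplicates needs the sorted data, so inputs with ties are sorted twice. Whether that beats always using kind="stable" would need to be measured.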

@mroeschke mroeschke added the Compat pandas objects compatability with Numpy or Python functions label Jun 8, 2023
@WillAyd
Member

WillAyd commented Jun 14, 2023

Sorry I missed the first half of the call today where this was discussed; I see the result was an agreement to move to a stable sort. Do we know the performance implications of that, though? It seems like it opens up the possibility of performance bottlenecks, so I would be hesitant to commit to that.

@mroeschke
Member Author

> Do we know the performance implications of that though?

Not definitively, but most internal uses of numpy sorting are sorting numerical factors as one part of multiple operations, so it seems unlikely that the sort would be the bottleneck in the main operation.

Additionally, during the call it seemed that the consistency of results is worth the potential performance cost.
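One way to get a rough feel for the performance question on a given machine is a quick timing comparison. This is only a sketch: the array size, the number of distinct codes, and the variable names are assumptions chosen for illustration, not a benchmark from the pandas test suite, and results will vary by numpy version, dtype, and CPU.

```python
import timeit

import numpy as np

# Made-up stand-in for "numerical factors": integer codes with many duplicates.
rng = np.random.default_rng(0)
codes = rng.integers(0, 1_000, size=200_000)


def best_time(kind: str) -> float:
    # Best of three single runs, in seconds, to reduce timing noise.
    return min(
        timeit.repeat(lambda: np.argsort(codes, kind=kind), number=1, repeat=3)
    )


timings = {kind: best_time(kind) for kind in ("quicksort", "stable")}
for kind, seconds in timings.items():
    print(f"kind={kind!r}: {seconds * 1e3:.2f} ms")
```

Comparing the two numbers against the cost of the surrounding operation gives a better sense of whether the sort could realistically become the bottleneck.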

@Gabriel-p

I just spent a full day figuring out why pandas was giving me different results for the same array, and today I found I'd been burnt by this issue: #39877.

I 100% support kind="stable" being the default. Anything else is entirely unintuitive.


4 participants