-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unstable (lex)sort_to_indices #4545
Comments
@tustvold in this particular use case it is 1 sorted column and |
Yes, I would consider this a bug that sort_to_indices is not stable. This should be a simple case of updating the comparator function to fallback to comparing the indices if equal. Hopefully this won't regress performance too significantly. I will have a play Edit: I can't see an easy way to fix this without it significantly regressing performance... |
My naive approach was also to do the same with double sort, thinking if its possible to create a stable sort key but preserve single sorting. like |
I mean a short-term hack would be to include a row number column, and include this in the sort. The performance would not be ideal though... |
for the test above I was able to make correct behavior with comparator
this looks like less expensive than double sort |
if you think this is something we can go with, I'll create a PR |
I experimented with this and other approaches, see tustvold@b9adfea The problem is this regresses performance by a factor of 4... |
stable sort is expectedly more expensive than unstable as alternatives can we consider: |
I feel like someone (maybe @sundy-li ) originally changed the sort in arrow to be unstable precisely for performance reasons I agree with @comphead that we should not force a slower stable sort on everyone and instead let users decide for themselves if they want faster unstable or slower stable sorts |
any way how to let users to decide? I havent found the arrow-rs config or runtime parameters to define the behavior based on user preferences UPD: can we use different features in arrow-rs for stable and unstable sort? |
Another aspect of this is that is sort_to_indices were stable, it would be relatively straightforward to use it to implement lexsort without needing to use DynComparator... 🤔 |
Update here is that @tustvold reports:
Thus I am not sure what to do with this issue for now |
Probably the better is to move this stable sort implementations to datafusion level? |
Datafusion is a pretty heavy dependency to bring in for only a stable sort. Is the stability something that can maybe be added to |
Would arrow-row work for your use-case? I.e. https://docs.rs/arrow-row/latest/arrow_row/#lexsort But using sort instead of sort_unstable. We could consider adding stable variants, my major reservation is paying for the additional codegen |
I think so? I'll give it a try... |
Describe the bug
Arrow ordering uses the unstable sort which doesn't guarantee sort order between same elements, this causing undesired behavior for systems that expects stable sort
To Reproduce
The test outputs array indices for 1 and this is correct, in sync with input data
For n = 21 which shouldn't change the output, but instead indices are messed
Expected behavior
Additional context
Related to #2871
The text was updated successfully, but these errors were encountered: