-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add radix sort #44230
Add radix sort #44230
Conversation
The build error is |
Yeah, this is exciting. I'll link to my own implementation that lives in the InlineStrings.jl package. Some notable changes I made to the SortingAlgorithms.jl implementation:
For InlineStrings.jl, there are fixed-size string types from 1-byte all the way up to 255-bytes, so there were a lot of permutations of performance to check; here's one sample graph of the |
@quinnj's thresholds (64 bits -> 500 & 128 bits -> 2 million) line up okay with the thresholds I've chosen. I expect that the thresholds and heuristics in this PR could be tuned much more preciesly. If you would like to play around with thresholds, everything runs as is with Revise, I think it's just not building from source with the PR. @quinnj, are there any changes you would recommend here based on your experience with InlineStrings? |
0abb3d3
to
a22b120
Compare
Hooray! Everything but buildkite passed on 7b665ef |
I'm giving up on understanding the ci errors and switching to blind bisection. I'm going to start by rebasing to the latest master and reverting everything. |
2fc70fb
to
548f561
Compare
Great! Looks good to me too. It could make sense to use the same threshold for switching to insertion sort (currently 40) and for checking Adding a check for reverse sorting also sounds like a good idea to me, as long as we use a high enough threshold to keep the overhead < 1%. I imagine this kind of pattern can happen in real life if you sort data in one direction and then in the other one, and making this up to 7× slower than it could would be too bad. BTW, it would be nice to keep these benchmarks somewhere. Maybe you make a PR against BaseBenchmarks after this one is merged? |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
For adjusting the minimum presorted input threshold, I find your reasoning quite convincing. I like how you provide a theoretical consideration (1) and then back it up with a practical example of why this matters (2: instead of three cases with a bad middle, there are only two cases). For reverse sorted input, this makes the algorithm marginally more complicated, though the runtime cost-benefit is huge. This also makes a nice pattern where very short (< 40) reverse-sorted input is pathological for insertion sort, medium-length (40 < len < 500) is handled decently by radix sort, and long (> 500) is handled perfectly with a reverse sorted check. Intuitively, as length increases, special-case detection improves. I added it. |
The benchmarks here are pretty similar to the benchmarks already in BaseBenchmarks/sort. The difference is that these span more input distributions, more types, more lengths, and longer lengths, and those at BaseBenchmarks cover sorting functions other than |
Okay, this looks good to me too. Thanks so much, @nalimilan, @oscardssmith, and everyone else who has had conversations and helped along the way! |
thanks so much for setting this through! |
Thanks @LilithHafner! Don't hesitate if you feel like extending radix sort to |
Brava! Excellent work here. Thought on the issorted check: we could introduce a |
Thanks! I'm having trouble coming up with an implementation of |
A little rough benchmarking seems to find that this can be substantially more efficient in some cases: function sortcheck(v::Vector, lt=isless)
cmp::UInt8 = 0b11
if !isempty(v)
y = v[1]
@inbounds for i = 2:length(v)
x = v[i]
cmp &= 2*!lt(x, y) + !lt(y, x)
iszero(cmp) && break
y = x
end
end
!iszero(cmp)
end The worst case for julia> v = [fill(0, 1000); -1; 1];
julia> @btime issorted($v) || issorted($v, rev=true)
990.154 ns (0 allocations: 0 bytes)
false
julia> @btime sortcheck($v)
678.273 ns (0 allocations: 0 bytes)
false In this case two |
Sure enough, this slightly incorrect 8x unrolling of the inner loop is 2x faster: function sortcheck(v::Vector, lt=isless)
cmp::UInt8 = 0b11
if !isempty(v)
y = v[1]
@inbounds for i = 2:8:length(v)
for j = 0:7
x = v[i+j]
cmp &= 2*!lt(x, y) + !lt(y, x)
y = x
end
iszero(cmp) && break
end
end
!iszero(cmp)
end If the end of the loop was handled correctly, this could work. Paying attention to memory alignment might improve the speed more. I do wish the compiler could figure that transformation out based on the fact that the inner loop has no side effects and once |
The compiler problem is very similar to vectorizing |
I'm more concerned with runtime on entirely unsorted input than on input with long sorted prefixes. On random data, the latest sortcheck is fairly expensive: @belapsed((sort!(x); sort(x; rev=true)), setup=(x=rand(40)))/2 #1.6455882352941177e-7
@belapsed issorted(x) setup=(x=rand(40)) #3.382e-9 2%
@belapsed sortcheck(x) setup=(x=rand(40)) #1.292778335005015e-8 8% |
I was trying to make something faster than two |
I'm sorting twice and dividing time by 2 because otherwise it would be repeatedly resorting sorted data, giving an unrealistically fast sorting time. On random data on my system, I'm measuring @belapsed((sort!(x); sort(x; rev=true)), setup=(x=rand(40)))/2 #1.684660633484163e-7
@belapsed issorted(x) setup=(x=rand(40)) #3.383e-9. 2%
@belapsed issorted(x) || issorted(x, Reverse) setup=(x=rand(40)) #4.9079079079079075e-9 3%
@belapsed sortcheck(x) setup=(x=rand(40)) #1.2937813440320962e-8 8%
@belapsed sort!(x) setup=(x=rand(40)) #1.1508637873754152e-7 70% |
I've implemented radix sort and integrated it into the Julia sorting architecture. Micro-benchmarking reports a substantial speedup in many cases and minor regressions in a few cases. I hope that with a bit of collaboration, we can minimize those regressions as well.
This grows out of a discussion on discourse and a PR to SortingAlgorithm.jl. It is inspired in part by the radix sort algorithm in SortingAlgorithms.jl. This is different from the radix sort algorithm in SortingAlgorithms.jl because it is substantially faster in most cases and more easily extensible.
This PR is based on a fast radix sort for unsigned integers in increasing order. To sort other types and orders—currently all signed and unsigned integers, Float32, Float64, and Char, for Forward and Reverse orderings are supported—we have a function
Sort.Serial.serialize
that converts objects to unsigned integers which are sorted and then the original objects are reconstructed viaSort.Serial.deserialize
. These two functions come with a traitSort.Serial.Serializable
that indicates whether a type is serializable for a given ordering. If a package defines a custom type that can be sorted via radix sort, it may either a) do nothing and we will continue to dispatch to a comparison sort at compile time or b) define these three functions and we will dispatch to radix sort when appropriate.I have left out many features and minor optimizations that have come up in discussion surrounding this algorithm, preferring simplicity over minor performance benefits. Those optimizations can be added now, later, or never.
A small sample of benchmarking vs. mar 1 master (3a5dc09)
(run 2 times, speedups computed by dividing minimum time on master by minimum time on ffcc043)
Problems & todos
@benchmark sort!(x) setup=(x=rand(1000))
is 300ns slower for radix sort than quick sort.It should immediately dispatch to quick sort based only on length, so it is unclear why there
is such a substantial slowdown. This is probably related to the 100% slowdown for sorting
single floats. (addressed in Add radix sort #44230 (comment))
I've run millions of tests with SortMark.jl, but this still needs efficient comprehensive
testing in Base, and more hand crafted tests may still find bugs.
Sort.Serial is a bit of a name conflict with the existing package Serialization
Perhaps we should support passing in the temporary array. Is that performance feature used anywhere?
The heuristics right now are pretty primitive. Perhaps they can be improved or tuned.
Potentially rework integration with
fpsort!
Perhaps we should integrate this with sortperm, right now it is incompatible, but with some changes, it could work.