Faster radix sort #48459
Conversation
Unrolling the loop in Base.Sort.radix_sort_pass! increases speed by about 50%.
Thanks!
50% speedup for 20 lines of code is definitely worth it! I wish the Julia compiler could do this automatically, and I'd like to add a …

I'm having trouble reproducing your performance results. On my computer (2019 MacBook Air), pasting your script into a Julia REPL results in the following output:
I also tried directly measuring calls to `sort!`, but found no change with this script:
using BenchmarkTools, Random
function benchmark(k = 8)
for n in 2:k-2
println(n)
v = rand(10^n)
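# setup reshuffles v before every sample; with evals=1 each timing sorts a fresh,
# ↳ unsorted copy, and samples shrinks as n grows to keep total runtime bounded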
@btime sort!($v) setup=(rand!($v)) evals=1 gctrial=false gcsample=false samples=10^(k-n);
v = rand(Int, 10^n)
@btime sort!($v) setup=(rand!($v)) evals=1 gctrial=false gcsample=false samples=10^(k-n);
end
end
benchmark()
@eval Base.Sort function radix_sort_pass!(t, lo, hi, offset, counts, v, shift, chunk_size)
mask = UInt(1) << chunk_size - 1 # mask is defined in the pass so that the compiler
@inbounds begin # ↳ knows its shape
# counts[2:mask+2] will store the number of elements that fall into each bucket.
# if chunk_size = 8, counts[2] is bucket 0x00 and counts[257] is bucket 0xff.
counts .= 0
for k in lo:hi
x = v[k] # lookup the element
i = (x >> shift)&mask + 2 # compute its bucket's index for this pass
counts[i] += 1 # increment that bucket's count
end
counts[1] = lo # set target index for the first bucket
cumsum!(counts, counts) # set target indices for subsequent buckets
# counts[1:mask+1] now stores indices where the first member of each bucket
# belongs, not the number of elements in each bucket. We will put the first element
# of bucket 0x00 in t[counts[1]], the next element of bucket 0x00 in t[counts[1]+1],
# and the last element of bucket 0x00 in t[counts[2]-1].
# loop unrolled 4x
k = lo
while k <= hi - 4
Base.Cartesian.@nexprs 4 _ -> begin
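# @nexprs 4 f splices f(1), f(2), f(3), f(4) at compile time; the index is ignored
# ↳ here (_), so this begin block is emitted four times back to back with no
# ↳ loop-control branch in between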
x = v[k] # lookup the element
i = (x >> shift)&mask + 1 # compute its bucket's index for this pass
j = counts[i] # lookup the target index
t[j + offset] = x # put the element where it belongs
counts[i] = j + 1 # increment the target index for the next
k += 1 # ↳ element in this bucket
end
end
while k <= hi
x = v[k] # lookup the element
i = (x >> shift)&mask + 1 # compute its bucket's index for this pass
j = counts[i] # lookup the target index
t[j + offset] = x # put the element where it belongs
counts[i] = j + 1 # increment the target index for the next
k += 1 # ↳ element in this bucket
end
end
end
benchmark()
Output:
[benchmark output/figure not preserved: "Time it takes to …"]
I doubt this will reveal much of note before JuliaCI/BaseBenchmarks.jl#305 merges, but giving it a go anyway: @nanosoldier
I did some more benchmarks using my test script on different hardware. It seems the speedup is not really reproducible. Maybe there is something wrong with my system? I think this should not be merged, unless someone else can reproduce the performance gain.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
The extra icache pressure can make this undesirable, even if it performs well in benchmarks, so it is not always better for the compiler to do this automatically where the user did not ask for it.
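One rough way to put a number on that code-size tradeoff (a sketch, not from the thread; unrolled_pass! and simple_pass! are hypothetical stand-ins for the two variants of the function) is to count the native instructions the compiler emits for each:
using InteractiveUtils  # provides code_native

function native_lines(f, types)
    io = IOBuffer()
    code_native(io, f, types; debuginfo=:none)
    count(==('\n'), String(take!(io)))  # lines of assembly as a crude size proxy
end

# e.g. compare native_lines(unrolled_pass!, argtypes) with native_lines(simple_pass!, argtypes)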
Thank you everyone. I reran the tests with different RAM configurations to close in on the reason for these performance discrepancies. And memory speeds made all the difference:
Is there any way to exploit this without hurting the performance on other hardware? I am sad to pass up this significant speedup opportunity, even though it is the right thing to do in light of the performance regressions for more common hardware configurations.
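A crude way to gauge how memory-bound a given machine is (a sketch, not from the discussion) is to time a large out-of-place copy; an LSD radix sort pass has a similar streaming-read/scattered-write profile, so machines with faster RAM leave more headroom for the unrolled inner loop:
using BenchmarkTools

src = rand(UInt64, 10^8)   # ~800 MB, well past any cache
dst = similar(src)
t = @belapsed copyto!($dst, $src)
println("effective copy bandwidth ≈ ", round(2 * sizeof(src) / t / 1e9, digits=1), " GB/s")  # read + write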
Not a plausible near-term solution, but perhaps we could have AI-tuned compiler heuristics that are trained separately on each detectable platform (they would have to be AI-tuned because there are more platforms than there are humans willing and able to tune compiler heuristics).
It would have been nice to get some performance numbers on more modern hardware (and more AMD CPUs), to see whether this suboptimal performance without the unrolled loop is more than a fluke. But without that, I'm going to close the PR. Writing an autotuning sort library is not my intention. ;) If the standard library ever includes multithreaded (sort) functions, this would be resolved automatically, because it is easier to saturate the memory bandwidth using multiple threads. But maybe another algorithm with more cache-friendly memory accesses than LSD radix sort would be faster in that case anyway.
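As a toy illustration of that last point (a sketch, not part of the PR; a real multithreaded sort would look quite different), sorting two halves on separate threads and then merging keeps more memory traffic in flight at once:
function two_thread_sort!(v::Vector{T}) where T
    n = length(v)
    mid = n ÷ 2
    task = Threads.@spawn sort!(view(v, 1:mid))  # left half on another thread
    sort!(view(v, mid+1:n))                      # right half on this thread
    wait(task)
    out = Vector{T}(undef, n)                    # merge the two sorted halves
    i, j = 1, mid + 1
    for k in 1:n
        if j > n || (i <= mid && v[i] <= v[j])
            out[k] = v[i]; i += 1
        else
            out[k] = v[j]; j += 1
        end
    end
    copyto!(v, out)
    return v
end

# usage (start Julia with -t 2 or more): v = rand(10^7); two_thread_sort!(v); issorted(v)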
I think I can help with that a bit:
I've just added a …
@LSchwerdt, thank you for identifying this possible room for improvement, making a PR, and being receptive to feedback from others with different hardware that shed a different light on the performance. This sort of thing doesn't always pan out, but I encourage you to try again if you run across another opportunity for improvement.
Unrolling the loop in Base.Sort.radix_sort_pass! increases performance by about 50%.
(on my Ryzen 3900x)
This makes the code somewhat less clean, but in my opinion the performance benefit is worth it.
Disclaimer: I did not fully build Julia including this PR (yet). The change is very small and I did test the modified function, but I do not want to set up everything to build Julia if the PR might be rejected based on a choice of code clarity vs. performance.
Here is a simple test script to try out the changed function in isolation.
This is the output:
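For readers who want to exercise the modified function in isolation without the full benchmark, here is a minimal sketch of a single-pass correctness check (not the author's script, and not the output referred to above); it assumes the radix_sort_pass! signature and counts layout quoted earlier in the thread, which is internal and may change between Julia versions:
using Random  # rand with a range is defined in the Random stdlib

v = rand(UInt(0):UInt(255), 1_000)   # keys that fit in a single 8-bit chunk
t = similar(v)                       # scratch buffer the pass scatters into
counts = zeros(Int, 2^8 + 1)         # 257 slots, enough for the counts indexing in the quoted code
lo, hi, offset, shift, chunk_size = 1, length(v), 0, 0, 8

Base.Sort.radix_sort_pass!(t, lo, hi, offset, counts, v, shift, chunk_size)
@assert t == sort(v)                 # one pass over the only chunk fully sorts the keys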