
Faster radix sort #48459

Closed
wants to merge 3 commits
Conversation

@LSchwerdt (Contributor) commented Jan 30, 2023

Unrolling the loop in `Base.Sort.radix_sort_pass!` increases performance by about 50% (on my Ryzen 3900X).

This makes the code somewhat less clean, but in my opinion the performance benefit is worth it.
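For readers unfamiliar with `Base.Cartesian.@nexprs`, the macro pastes N copies of its body into the code at compile time, which is the unrolling mechanism used here. Below is a minimal, self-contained sketch of the same 4x-unroll-plus-remainder pattern; it is illustrative only, not the PR's code, and `sum_unrolled` is a made-up name:

```julia
# Sketch of 4x loop unrolling via Base.Cartesian.@nexprs (illustrative only).
function sum_unrolled(v::Vector{Int})
    s = 0
    k = 1
    n = length(v)
    while k <= n - 4                      # main loop, unrolled 4x
        Base.Cartesian.@nexprs 4 _ -> begin
            s += v[k]                     # body duplicated 4 times at compile time
            k += 1
        end
    end
    while k <= n                          # remainder loop handles the tail elements
        s += v[k]
        k += 1
    end
    return s
end
```

The remainder loop is what keeps the transformation correct when the length is not a multiple of the unroll factor, exactly as in the modified `radix_sort_pass!` below.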

Disclaimer: I did not fully build Julia including this PR (yet). The change is very small and I did test the modified function, but I do not want to set up everything to build Julia if the PR might be rejected based on a choice of code clarity vs. performance.

Here is a simple test script to try out the changed function in isolation.
This is the output:

```
2000/20000
...
20000/20000

base:     5.063 seconds
unrolled: 3.052 seconds
speedup factor: 1.66
Test Passed
```

Unrolling loop in `Base.Sort.radix_sort_pass!` increases speed by about 50%.
@giordano giordano added performance Must go faster sorting Put things in order labels Jan 30, 2023
@LilithHafner (Member)

Thanks!

> This makes the code somewhat less clean, but in my opinion the performance benefit is worth it.

50% speedup for 20 lines of code is definitely worth it!

I wish the Julia compiler could do this automatically. I'd like to add a `# TODO: remove this once the compiler is smart enough to do it automatically` comment, but in the interim (which might be a very long time) I support this PR for its immediate performance gains.

I'm having trouble reproducing your performance results. On my computer (2019 MacBook Air), pasting your script into a Julia REPL results in the following output:

```
base:     7.235 seconds
unrolled: 6.829 seconds
speedup factor: 1.06
```

I also tried directly measuring calls to `sort!`, but found no change with this script:
```julia
using BenchmarkTools, Random

function benchmark(k = 8)
    for n in 2:k-2
        println(n)
        v = rand(10^n)
        @btime sort!($v) setup=(rand!($v)) evals=1 gctrial=false gcsample=false samples=10^(k-n);
        v = rand(Int, 10^n)
        @btime sort!($v) setup=(rand!($v)) evals=1 gctrial=false gcsample=false samples=10^(k-n);
    end
end

benchmark()
```

```julia
@eval Base.Sort function radix_sort_pass!(t, lo, hi, offset, counts, v, shift, chunk_size)
    mask = UInt(1) << chunk_size - 1  # mask is defined in pass so that the compiler
    @inbounds begin                   #  ↳ knows its shape
        # counts[2:mask+2] will store the number of elements that fall into each bucket.
        # if chunk_size = 8, counts[2] is bucket 0x00 and counts[257] is bucket 0xff.
        counts .= 0
        for k in lo:hi
            x = v[k]                  # lookup the element
            i = (x >> shift)&mask + 2 # compute its bucket's index for this pass
            counts[i] += 1            # increment that bucket's count
        end
        counts[1] = lo                # set target index for the first bucket
        cumsum!(counts, counts)       # set target indices for subsequent buckets
        # counts[1:mask+1] now stores indices where the first member of each bucket
        # belongs, not the number of elements in each bucket. We will put the first element
        # of bucket 0x00 in t[counts[1]], the next element of bucket 0x00 in t[counts[1]+1],
        # and the last element of bucket 0x00 in t[counts[2]-1].

        # loop unrolled 4x
        k = lo
        while k <= hi - 4
            Base.Cartesian.@nexprs 4 _ -> begin
                x = v[k]                  # lookup the element
                i = (x >> shift)&mask + 1 # compute its bucket's index for this pass
                j = counts[i]             # lookup the target index
                t[j + offset] = x         # put the element where it belongs
                counts[i] = j + 1         # increment the target index for the next
                k += 1                    #  ↳ element in this bucket
            end
        end
        while k <= hi
            x = v[k]                  # lookup the element
            i = (x >> shift)&mask + 1 # compute its bucket's index for this pass
            j = counts[i]             # lookup the target index
            t[j + offset] = x         # put the element where it belongs
            counts[i] = j + 1         # increment the target index for the next
            k += 1                    #  ↳ element in this bucket
        end
    end
end

benchmark()
```

Output:

```
2
  1.779 μs (1 allocation: 896 bytes)
  1.532 μs (1 allocation: 896 bytes)
3
  15.251 μs (2 allocations: 10.12 KiB)
  15.976 μs (2 allocations: 10.12 KiB)
4
  180.667 μs (3 allocations: 86.36 KiB)
  199.008 μs (3 allocations: 86.36 KiB)
5
  1.866 ms (3 allocations: 789.48 KiB)
  2.106 ms (3 allocations: 789.48 KiB)
6
  23.867 ms (3 allocations: 7.64 MiB)
  27.281 ms (3 allocations: 7.64 MiB)
2
  1.794 μs (1 allocation: 896 bytes)
  1.567 μs (1 allocation: 896 bytes)
3
  15.819 μs (2 allocations: 10.12 KiB)
  16.955 μs (2 allocations: 10.12 KiB)
4
  180.848 μs (3 allocations: 86.36 KiB)
  192.534 μs (3 allocations: 86.36 KiB)
5
  1.778 ms (3 allocations: 789.48 KiB)
  2.069 ms (3 allocations: 789.48 KiB)
6
  23.991 ms (3 allocations: 7.64 MiB)
  27.122 ms (3 allocations: 7.64 MiB)
```

Time it takes to `sort!` a vector:

| Input | Before | After |
| --- | --- | --- |
| `rand(10^2)` | 1.779 μs | 1.794 μs |
| `rand(Int, 10^2)` | 1.532 μs | 1.567 μs |
| `rand(10^3)` | 15.251 μs | 15.819 μs |
| `rand(Int, 10^3)` | 15.976 μs | 16.955 μs |
| `rand(10^4)` | 180.667 μs | 180.848 μs |
| `rand(Int, 10^4)` | 199.008 μs | 192.534 μs |
| `rand(10^5)` | 1.866 ms | 1.778 ms |
| `rand(Int, 10^5)` | 2.106 ms | 2.069 ms |
| `rand(10^6)` | 23.867 ms | 23.991 ms |
| `rand(Int, 10^6)` | 27.281 ms | 27.122 ms |

@LilithHafner (Member)

I doubt this will reveal much of note before JuliaCI/BaseBenchmarks.jl#305 merges, but giving it a go anyway:

`@nanosoldier runbenchmarks("sort", vs=":master")`

@LSchwerdt (Contributor, Author)

> #TODO: remove this once the compiler is smart enough to do it automatically

Yeah, I think you are right. It is not so much the amount of code, it is the feeling that it is somehow not necessary, because the compiler should be able to do this.

I did some more benchmarks using my test script on different hardware. It seems the speedup is not really reproducible. Maybe there is something wrong with my system? I think this should not be merged unless someone else can reproduce the performance gain.

n = 10^3, nRepeats = 10^8

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X | 156.316s | 164.073s | 0.95 |
| Intel 6700HQ | 210.218s | 243.852s | 0.86 |
| Intel 7820X | 174.933s | 193.171s | 0.91 |

n = 10^7, nRepeats = 10^3

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X | 31.560s | 20.221s | 1.56 |
| Intel 6700HQ | 67.730s | 67.554s | 1.00 |
| Intel 7820X | 38.566s | 39.933s | 0.97 |

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@vtjnash (Member) commented Jan 30, 2023

> 50% speedup for 20 lines of code is definitely worth it!

The extra icache pressure can make this undesirable, even if it performs well in benchmarks, so it is not always better for the compiler to do this automatically where the user did not ask for it.

@LSchwerdt (Contributor, Author) commented Jan 30, 2023

Thank you everyone. I reran the tests with different RAM configurations to close in on the reason for these performance discrepancies, and memory speed made all the difference:

n = 10^3, nRepeats = 2*10^7

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X, 2 DIMMs DDR4-2333 | 31.051s | 31.999s | 0.97 |
| AMD 3900X, 4 DIMMs DDR4-2333 | 31.173s | 32.346s | 0.96 |
| AMD 3900X, 4 DIMMs DDR4-3600 | 33.012s | 32.536s | 1.01 |

n = 10^7, nRepeats = 10^3

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X, 2 DIMMs DDR4-2333 | 37.940s | 36.017s | 1.05 |
| AMD 3900X, 4 DIMMs DDR4-2333 | 37.342s | 35.032s | 1.07 |
| AMD 3900X, 4 DIMMs DDR4-3600 | 32.111s | 20.108s | 1.60 |

Is there any way to exploit this without hurting the performance on other hardware? I am sad to pass up this significant speedup opportunity, even though it is the right thing to do in light of the performance regressions for more common hardware configurations.
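To make the question concrete, one hypothetical shape for such a compromise would be a size-gated dispatch that falls back to the plain kernel below a length threshold. This is a sketch only: `radix_sort_pass_unrolled!` and the threshold value are invented names for illustration, and the benchmarks in this thread suggest no single threshold would behave well across all hardware:

```julia
# Hypothetical sketch: only use an unrolled kernel for large inputs, where the
# memory-bound regime is reached. The helper name and threshold are invented.
const UNROLL_THRESHOLD = 10^6  # illustrative cutoff, would need per-platform tuning

function radix_sort_pass_dispatch!(t, lo, hi, offset, counts, v, shift, chunk_size)
    if hi - lo + 1 >= UNROLL_THRESHOLD
        # hypothetical unrolled variant, as in this PR's diff
        radix_sort_pass_unrolled!(t, lo, hi, offset, counts, v, shift, chunk_size)
    else
        # existing scalar kernel in Base
        Base.Sort.radix_sort_pass!(t, lo, hi, offset, counts, v, shift, chunk_size)
    end
end
```

Even so, the tables above show the large-n unrolled kernel regressing on some CPUs, so a size gate alone would not remove the hardware dependence.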

@LilithHafner (Member)

Not a plausible near-term solution, but perhaps we could have AI-tuned compiler heuristics that are trained separately on each detectable platform. (They would have to be AI-tuned because there are more platforms than there are humans willing and able to tune compiler heuristics.)

@LSchwerdt (Contributor, Author)

It would have been nice to get some performance numbers using more modern hardware (and more AMD CPUs), to see whether this suboptimal performance without the unrolled loop is more than a fluke. But without that, I'm going to close the PR. Writing an autotuning sort library is not my intention. ;)

If the standard library ever includes multithreaded (sort) functions, this would be resolved automatically, because it is easier to saturate the memory bandwidth using multiple threads. But maybe another algorithm with more cache-friendly memory accesses than LSD radix sort would be faster in that case anyway.

@LSchwerdt LSchwerdt closed this Jan 31, 2023
@Seelengrab (Contributor) commented Jan 31, 2023

> It would have been nice to get some performance numbers using more modern hardware (and more AMD CPUs), to see whether this suboptimal performance without the unrolled loop is more than a fluke.

I think I can help with that a bit:

```
[sukera@tower radixunroll]$ julia unroll.jl
2000/20000
4000/20000
6000/20000
8000/20000
10000/20000
12000/20000
14000/20000
16000/20000
18000/20000
20000/20000

base:     1.892 seconds
unrolled: 2.152 seconds
speedup factor: 0.88

Julia Version 1.10.0-DEV.455
Commit 57f82bce39* (2023-01-30 10:26 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24
```

I've just added a `versioninfo()` call to the end of the `let` block.

@LilithHafner (Member)

@LSchwerdt, thank you for identifying this possible room for improvement, making a PR, and being receptive to feedback from others with different hardware that shed a different light on the performance. This sort of thing doesn't always pan out, but I encourage you to try again if you run across another opportunity for improvement.

@LSchwerdt LSchwerdt mentioned this pull request Feb 2, 2023