
Faster radix sort #48459

Closed
wants to merge 3 commits
Conversation

@LSchwerdt (Contributor) commented Jan 30, 2023

Unrolling the loop in `Base.Sort.radix_sort_pass!` increases performance by about 50% (on my Ryzen 3900X).

This makes the code somewhat less clean, but in my opinion the performance benefit is worth it.
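For readers unfamiliar with `Base.Cartesian.@nexprs`, the macro pastes N copies of its body into the code at compile time, which is the unrolling mechanism used here. Below is a minimal, self-contained sketch of the same 4x-unroll-plus-remainder pattern; it is illustrative only, not the PR's code, and `sum_unrolled` is a made-up name:

```julia
# Sketch of 4x loop unrolling via Base.Cartesian.@nexprs (illustrative only).
function sum_unrolled(v::Vector{Int})
    s = 0
    k = 1
    n = length(v)
    while k <= n - 4                      # main loop, unrolled 4x
        Base.Cartesian.@nexprs 4 _ -> begin
            s += v[k]                     # body duplicated 4 times at compile time
            k += 1
        end
    end
    while k <= n                          # remainder loop handles the tail elements
        s += v[k]
        k += 1
    end
    return s
end
```

The remainder loop is what keeps the transformation correct when the length is not a multiple of the unroll factor, exactly as in the modified `radix_sort_pass!` below.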

Disclaimer: I did not fully build Julia including this PR (yet). The change is very small and I did test the modified function, but I do not want to set up everything to build Julia if the PR might be rejected based on a choice of code clarity vs. performance.

Here is a simple test script to try out the changed function in isolation.
This is the output:

```
2000/20000
...
20000/20000

base:     5.063 seconds
unrolled: 3.052 seconds
speedup factor: 1.66
Test Passed
```

Unrolling loop in `Base.Sort.radix_sort_pass!` increases speed by about 50%.
@giordano giordano added performance Must go faster sorting Put things in order labels Jan 30, 2023
@LilithHafner (Member)

Thanks!

> This makes the code somewhat less clean, but in my opinion the performance benefit is worth it.

50% speedup for 20 lines of code is definitely worth it!

I wish the Julia compiler could do this automatically. I'd like to add a `# TODO: remove this once the compiler is smart enough to do it automatically` comment, but in the interim (which might be a very long time) I support this PR for its immediate performance gains.

I'm having trouble reproducing your performance results. On my computer (2019 MacBook Air), pasting your script into a Julia REPL results in the following output:

```
base:     7.235 seconds
unrolled: 6.829 seconds
speedup factor: 1.06
```

I also tried directly measuring calls to `sort!`, but found no change with this script:
```julia
using BenchmarkTools, Random

function benchmark(k = 8)
    for n in 2:k-2
        println(n)
        v = rand(10^n)
        @btime sort!($v) setup=(rand!($v)) evals=1 gctrial=false gcsample=false samples=10^(k-n);
        v = rand(Int, 10^n)
        @btime sort!($v) setup=(rand!($v)) evals=1 gctrial=false gcsample=false samples=10^(k-n);
    end
end

benchmark()
```

```julia
@eval Base.Sort function radix_sort_pass!(t, lo, hi, offset, counts, v, shift, chunk_size)
    mask = UInt(1) << chunk_size - 1  # mask is defined in pass so that the compiler
    @inbounds begin                   #  ↳ knows its shape
        # counts[2:mask+2] will store the number of elements that fall into each bucket.
        # if chunk_size = 8, counts[2] is bucket 0x00 and counts[257] is bucket 0xff.
        counts .= 0
        for k in lo:hi
            x = v[k]                  # lookup the element
            i = (x >> shift)&mask + 2 # compute its bucket's index for this pass
            counts[i] += 1            # increment that bucket's count
        end
        counts[1] = lo                # set target index for the first bucket
        cumsum!(counts, counts)       # set target indices for subsequent buckets
        # counts[1:mask+1] now stores indices where the first member of each bucket
        # belongs, not the number of elements in each bucket. We will put the first element
        # of bucket 0x00 in t[counts[1]], the next element of bucket 0x00 in t[counts[1]+1],
        # and the last element of bucket 0x00 in t[counts[2]-1].

        # loop unrolled 4x
        k = lo
        while k <= hi - 4
            Base.Cartesian.@nexprs 4 _ -> begin
                x = v[k]                  # lookup the element
                i = (x >> shift)&mask + 1 # compute its bucket's index for this pass
                j = counts[i]             # lookup the target index
                t[j + offset] = x         # put the element where it belongs
                counts[i] = j + 1         # increment the target index for the next
                k += 1                    #  ↳ element in this bucket
            end
        end
        while k <= hi
            x = v[k]                  # lookup the element
            i = (x >> shift)&mask + 1 # compute its bucket's index for this pass
            j = counts[i]             # lookup the target index
            t[j + offset] = x         # put the element where it belongs
            counts[i] = j + 1         # increment the target index for the next
            k += 1                    #  ↳ element in this bucket
        end
    end
end

benchmark()
```

Output:

```
2
  1.779 μs (1 allocation: 896 bytes)
  1.532 μs (1 allocation: 896 bytes)
3
  15.251 μs (2 allocations: 10.12 KiB)
  15.976 μs (2 allocations: 10.12 KiB)
4
  180.667 μs (3 allocations: 86.36 KiB)
  199.008 μs (3 allocations: 86.36 KiB)
5
  1.866 ms (3 allocations: 789.48 KiB)
  2.106 ms (3 allocations: 789.48 KiB)
6
  23.867 ms (3 allocations: 7.64 MiB)
  27.281 ms (3 allocations: 7.64 MiB)
2
  1.794 μs (1 allocation: 896 bytes)
  1.567 μs (1 allocation: 896 bytes)
3
  15.819 μs (2 allocations: 10.12 KiB)
  16.955 μs (2 allocations: 10.12 KiB)
4
  180.848 μs (3 allocations: 86.36 KiB)
  192.534 μs (3 allocations: 86.36 KiB)
5
  1.778 ms (3 allocations: 789.48 KiB)
  2.069 ms (3 allocations: 789.48 KiB)
6
  23.991 ms (3 allocations: 7.64 MiB)
  27.122 ms (3 allocations: 7.64 MiB)
```

Time it takes to `sort!` a vector:

| Input | Before | After |
| --- | --- | --- |
| `rand(10^2)` | 1.779 μs | 1.794 μs |
| `rand(Int, 10^2)` | 1.532 μs | 1.567 μs |
| `rand(10^3)` | 15.251 μs | 15.819 μs |
| `rand(Int, 10^3)` | 15.976 μs | 16.955 μs |
| `rand(10^4)` | 180.667 μs | 180.848 μs |
| `rand(Int, 10^4)` | 199.008 μs | 192.534 μs |
| `rand(10^5)` | 1.866 ms | 1.778 ms |
| `rand(Int, 10^5)` | 2.106 ms | 2.069 ms |
| `rand(10^6)` | 23.867 ms | 23.991 ms |
| `rand(Int, 10^6)` | 27.281 ms | 27.122 ms |

@LilithHafner (Member)

I doubt this will reveal much of note before JuliaCI/BaseBenchmarks.jl#305 merges, but giving it a go anyway:

`@nanosoldier runbenchmarks("sort", vs=":master")`

@LSchwerdt (Contributor, Author)

> #TODO: remove this once the compiler is smart enough to do it automatically

Yeah, I think you are right. It is not so much the amount of code, it is the feeling that it is somehow not necessary, because the compiler should be able to do this.

I did some more benchmarks using my test script on different hardware. It seems the speedup is not really reproducible. Maybe there is something wrong with my system? I think this should not be merged unless someone else can reproduce the performance gain.

n = 10^3, nRepeats = 10^8

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X | 156.316s | 164.073s | 0.95 |
| Intel 6700HQ | 210.218s | 243.852s | 0.86 |
| Intel 7820X | 174.933s | 193.171s | 0.91 |

n = 10^7, nRepeats = 10^3

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X | 31.560s | 20.221s | 1.56 |
| Intel 6700HQ | 67.730s | 67.554s | 1.00 |
| Intel 7820X | 38.566s | 39.933s | 0.97 |

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@vtjnash (Member) commented Jan 30, 2023

> 50% speedup for 20 lines of code is definitely worth it!

The extra icache pressure can make this undesirable, even if it performs well in benchmarks, so it is not always better for the compiler to do this automatically where the user did not ask for it.

@LSchwerdt (Contributor, Author) commented Jan 30, 2023

Thank you everyone. I reran the tests with different RAM configurations to close in on the reason for these performance discrepancies, and memory speed made all the difference:

n = 10^3, nRepeats = 2*10^7

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X, 2 DIMMs DDR4-2333 | 31.051s | 31.999s | 0.97 |
| AMD 3900X, 4 DIMMs DDR4-2333 | 31.173s | 32.346s | 0.96 |
| AMD 3900X, 4 DIMMs DDR4-3600 | 33.012s | 32.536s | 1.01 |

n = 10^7, nRepeats = 10^3

| Hardware | time (base) | time (unrolled) | speedup factor |
| --- | --- | --- | --- |
| AMD 3900X, 2 DIMMs DDR4-2333 | 37.940s | 36.017s | 1.05 |
| AMD 3900X, 4 DIMMs DDR4-2333 | 37.342s | 35.032s | 1.07 |
| AMD 3900X, 4 DIMMs DDR4-3600 | 32.111s | 20.108s | 1.60 |

Is there any way to exploit this without hurting the performance on other hardware? I am sad to pass up this significant speedup opportunity, even though it is the right thing to do in light of the performance regressions for more common hardware configurations.
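To make the question concrete, one hypothetical shape for such a compromise would be a size-gated dispatch that falls back to the plain kernel below a length threshold. This is a sketch only: `radix_sort_pass_unrolled!` and the threshold value are invented names for illustration, and the benchmarks in this thread suggest no single threshold would behave well across all hardware:

```julia
# Hypothetical sketch: only use an unrolled kernel for large inputs, where the
# memory-bound regime is reached. The helper name and threshold are invented.
const UNROLL_THRESHOLD = 10^6  # illustrative cutoff, would need per-platform tuning

function radix_sort_pass_dispatch!(t, lo, hi, offset, counts, v, shift, chunk_size)
    if hi - lo + 1 >= UNROLL_THRESHOLD
        # hypothetical unrolled variant, as in this PR's diff
        radix_sort_pass_unrolled!(t, lo, hi, offset, counts, v, shift, chunk_size)
    else
        # existing scalar kernel in Base
        Base.Sort.radix_sort_pass!(t, lo, hi, offset, counts, v, shift, chunk_size)
    end
end
```

Even so, the tables above show the large-n unrolled kernel regressing on some CPUs, so a size gate alone would not remove the hardware dependence.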

@LilithHafner (Member)

Not a plausible near-term solution, but perhaps we could have AI-tuned compiler heuristics that are trained separately on each detectable platform. (They would have to be AI-tuned because there are more platforms than there are humans willing and able to tune compiler heuristics.)

@LSchwerdt (Contributor, Author)

It would have been nice to get some performance numbers using more modern hardware (and more AMD CPUs), to see whether this suboptimal performance without the unrolled loop is more than a fluke. But without that, I'm going to close the PR. Writing an autotuning sort library is not my intention. ;)

If the standard library ever includes multithreaded (sort) functions, this would be resolved automatically, because it is easier to saturate the memory bandwidth using multiple threads. But maybe another algorithm with more cache-friendly memory accesses than LSD radix sort would be faster in that case anyway.

@LSchwerdt LSchwerdt closed this Jan 31, 2023
@Seelengrab (Contributor) commented Jan 31, 2023

> It would have been nice to get some performance numbers using more modern hardware (and more AMD CPUs), to see whether this suboptimal performance without the unrolled loop is more than a fluke.

I think I can help with that a bit:

```
[sukera@tower radixunroll]$ julia unroll.jl
2000/20000
4000/20000
6000/20000
8000/20000
10000/20000
12000/20000
14000/20000
16000/20000
18000/20000
20000/20000

base:     1.892 seconds
unrolled: 2.152 seconds
speedup factor: 0.88

Julia Version 1.10.0-DEV.455
Commit 57f82bce39* (2023-01-30 10:26 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24
```

I've just added a `versioninfo()` call to the end of the `let` block.

@LilithHafner (Member)

@LSchwerdt, thank you for identifying this possible room for improvement, making a PR, and being receptive to feedback from others with different hardware that shed a different light on the performance. This sort of thing doesn't always pan out, but I encourage you to try again if you run across another opportunity for improvement.

@LSchwerdt LSchwerdt mentioned this pull request Feb 2, 2023