3% faster radix sort #48497

LSchwerdt · 2023-02-02T16:37:36Z

In radix_sort_pass!, the offset used to index into the target array is recomputed for every element. This PR moves the offset calculation out of the hot loop, so it is performed once per pass.
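The idea can be sketched as follows. This is a simplified illustration written for this description, not the actual code in Base or the exact diff: the function names `pass_offset_in_loop!`, `pass_offset_hoisted!`, and the helper `count_buckets!` are hypothetical, and the argument list is only assumed to resemble that of radix_sort_pass!. Folding the offset into the first bucket's start index lets the prefix sum carry it to every bucket, so the scatter loop no longer needs an addition per element.

```julia
# Hypothetical helper: count how many elements of v[lo:hi] fall into each bucket.
# counts needs 2^chunk_size + 1 entries; bucket b is counted in counts[b + 2].
function count_buckets!(counts, v, lo, hi, shift, mask)
    fill!(counts, 0)
    for k in lo:hi
        counts[Int((v[k] >> shift) & mask) + 2] += 1
    end
    return counts
end

# Variant A: the offset is added inside the scatter loop, once per element.
function pass_offset_in_loop!(t, lo, hi, offset, counts, v, shift, chunk_size)
    mask = (UInt(1) << chunk_size) - 1
    count_buckets!(counts, v, lo, hi, shift, mask)
    counts[1] = lo                  # start index of the first bucket
    cumsum!(counts, counts)         # start indices of all later buckets
    for k in lo:hi
        x = v[k]
        i = Int((x >> shift) & mask) + 1
        j = counts[i]
        t[j + offset] = x           # offset applied for every element
        counts[i] = j + 1
    end
    return t
end

# Variant B: the offset is folded into the first start index, once per pass.
function pass_offset_hoisted!(t, lo, hi, offset, counts, v, shift, chunk_size)
    mask = (UInt(1) << chunk_size) - 1
    count_buckets!(counts, v, lo, hi, shift, mask)
    counts[1] = lo + offset         # the prefix sum propagates the offset
    cumsum!(counts, counts)
    for k in lo:hi
        x = v[k]
        i = Int((x >> shift) & mask) + 1
        j = counts[i]
        t[j] = x                    # no per-element offset arithmetic
        counts[i] = j + 1
    end
    return t
end

# Small usage example with a nonzero offset: both variants fill t[3:7].
v = UInt[0x03, 0x01, 0x02, 0x01, 0x00]
counts = Vector{Int}(undef, 2^2 + 1)
t_a = pass_offset_in_loop!(fill(UInt(99), 7), 1, 5, 2, counts, v, 0, 2)
t_b = pass_offset_hoisted!(fill(UInt(99), 7), 1, 5, 2, counts, v, 0, 2)
```

Both variants write the same values to the same positions; only the place where the offset is added differs.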

When the sort is not memory-bound (i.e. for small arrays), this leads to a speedup of roughly 3 to 4.5% in my tests. For larger arrays, which are memory-bound, the performance gain tends toward zero.

Here is a test script, and here are the results on different hardware:

1_000_000 radix sort passes with 1_000 elements:
base: 2.034 seconds
PR:   1.946 seconds
speedup factor: 1.045

100 radix sort passes with 10_000_000 elements:
base: 4.199 seconds
PR:   4.156 seconds
speedup factor: 1.010

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 16 virtual cores
Test Passed

1_000_000 radix sort passes with 1_000 elements:
base: 2.353 seconds
PR:   2.267 seconds
speedup factor: 1.038

100 radix sort passes with 10_000_000 elements:
base: 6.370 seconds
PR:   6.336 seconds
speedup factor: 1.005

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Test Passed

1_000_000 radix sort passes with 1_000 elements:
base: 1.764 seconds
PR:   1.712 seconds
speedup factor: 1.030

100 radix sort passes with 10_000_000 elements:
base: 2.562 seconds
PR:   2.801 seconds
speedup factor: 0.915

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
Test Passed
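For reference, the speedup factors listed in the results are the base time divided by the PR time, so values above 1 mean the PR is faster. A minimal sketch that recomputes them from the timings reported here (the `timings` table and the `speedup` helper are names introduced only for this illustration):

```julia
# Timings in seconds, copied from the results in this description.
timings = [
    ("i7-7820X, 1_000 elements",       2.034, 1.946),
    ("i7-7820X, 10_000_000 elements",  4.199, 4.156),
    ("i7-6700HQ, 1_000 elements",      2.353, 2.267),
    ("i7-6700HQ, 10_000_000 elements", 6.370, 6.336),
    ("3900X, 1_000 elements",          1.764, 1.712),
    ("3900X, 10_000_000 elements",     2.562, 2.801),
]

# Speedup factor: base time divided by PR time, rounded to three digits.
speedup(base, pr) = round(base / pr; digits=3)

for (label, base, pr) in timings
    println(rpad(label, 34), speedup(base, pr))
end
```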

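The performance regression for large arrays on the 3900X (speedup factor 0.915) seems to be a strange outlier rather than a representative result. Note that this was the only run with Julia started with 24 threads; the other two used 1 thread.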

Disclaimer: I did not fully build Julia including this PR, but I did test the modified function using the linked test script, and a modified test script with offset != 0.

Calculate offset once instead of repeatedly for each element.

LilithHafner · 2023-02-03T01:24:41Z

Thanks! In most cases a 3% performance increase is negligible relative to noise and systematic error, but this arithmetic refactoring is correct, adds no code complexity (arguably it simplifies the code in the hot loop), and is clearly a performance improvement, even if a small one.

Faster radix sort

995cc51

LSchwerdt marked this pull request as ready for review February 2, 2023 17:50

giordano added the performance (Must go faster) and sorting (Put things in order) labels Feb 2, 2023

LilithHafner merged commit 9b5f39e into JuliaLang:master Feb 3, 2023
