3% faster radix sort #48497

Merged
merged 1 commit into from
Feb 3, 2023

LSchwerdt
Contributor

In radix_sort_pass! the offset for indexing into the target array is calculated repeatedly for each element. This PR moves the offset calculation out of the hot loop.

When the sort is not memory-bound (i.e. for small arrays), this leads to a performance gain of about 3% to 4.5% in my tests. For larger arrays, the performance gain tends to zero.
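To make the change concrete, here is a minimal sketch of a single radix scatter pass in both forms. It is a simplified illustration, not the code in Base: the function names `pass_offset_in_loop!` and `pass_offset_hoisted!` are hypothetical, and the argument order is only assumed to resemble `radix_sort_pass!`. The idea is that adding `offset` to the first bucket's start index before the prefix sum shifts every target index at once, so the scatter loop no longer needs a per-element addition.

```julia
# Baseline sketch: the offset is added inside the scatter loop, once per element.
function pass_offset_in_loop!(t, lo, hi, offset, counts, v, shift, chunk_size)
    mask = (UInt(1) << chunk_size) - 1
    fill!(counts, 0)
    for k in lo:hi                              # histogram: counts[b + 2] = size of bucket b
        counts[Int((v[k] >> shift) & mask) + 2] += 1
    end
    counts[1] = lo                              # start index of bucket 0
    for b in 2:length(counts)                   # prefix sum: start index of every bucket
        counts[b] += counts[b - 1]
    end
    for k in lo:hi
        x = v[k]
        i = Int((x >> shift) & mask) + 1
        j = counts[i]
        t[j + offset] = x                       # offset applied for every element
        counts[i] = j + 1
    end
    return t
end

# Hoisted sketch: the offset is folded into the first start index, once per pass.
function pass_offset_hoisted!(t, lo, hi, offset, counts, v, shift, chunk_size)
    mask = (UInt(1) << chunk_size) - 1
    fill!(counts, 0)
    for k in lo:hi
        counts[Int((v[k] >> shift) & mask) + 2] += 1
    end
    counts[1] = lo + offset                     # the only place the offset appears
    for b in 2:length(counts)
        counts[b] += counts[b - 1]
    end
    for k in lo:hi
        x = v[k]
        i = Int((x >> shift) & mask) + 1
        j = counts[i]
        t[j] = x                                # no addition in the hot loop
        counts[i] = j + 1
    end
    return t
end

# Both versions should write identical output, including for offset != 0.
v = UInt8[0x13, 0x02, 0x21, 0x10, 0x03, 0x12]
offset = 3
counts = zeros(Int, 2^4 + 1)                    # one slot per 4-bit bucket, plus one
t1 = pass_offset_in_loop!(zeros(UInt8, length(v) + offset), 1, length(v), offset, counts, v, 0, 4)
t2 = pass_offset_hoisted!(zeros(UInt8, length(v) + offset), 1, length(v), offset, counts, v, 0, 4)
@assert t1 == t2
```

The two functions differ only in where `offset` is added, which is why the refactoring cannot change the result as long as the prefix sum is computed after `counts[1]` is set.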

Here is a test script, and here are the results on different hardware:

1_000_000 radix sort passes with 1_000 elements:
base: 2.034 seconds
PR:   1.946 seconds
speedup factor: 1.045

100 radix sort passes with 10_000_000 elements:
base: 4.199 seconds
PR:   4.156 seconds
speedup factor: 1.010

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 16 virtual cores
Test Passed
1_000_000 radix sort passes with 1_000 elements:
base: 2.353 seconds
PR:   2.267 seconds
speedup factor: 1.038

100 radix sort passes with 10_000_000 elements:
base: 6.370 seconds
PR:   6.336 seconds
speedup factor: 1.005

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Test Passed
1_000_000 radix sort passes with 1_000 elements:
base: 1.764 seconds
PR:   1.712 seconds
speedup factor: 1.030

100 radix sort passes with 10_000_000 elements:
base: 2.562 seconds
PR:   2.801 seconds
speedup factor: 0.915

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
Test Passed

The apparent regression on the 3900X in the 10_000_000-element case (speedup factor 0.915) seems to be an outlier rather than representative of typical hardware.
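For reference, the speedup factors above are the base time divided by the PR time. The short sketch below recomputes them from the printed timings; the `timings` table and the `speedup` helper are illustrative names, not part of the linked test script.

```julia
using Printf

# (CPU, elements per pass, base seconds, PR seconds), copied from the results above.
timings = [
    ("i7-7820X",  1_000,      2.034, 1.946),
    ("i7-7820X",  10_000_000, 4.199, 4.156),
    ("i7-6700HQ", 1_000,      2.353, 2.267),
    ("i7-6700HQ", 10_000_000, 6.370, 6.336),
    ("3900X",     1_000,      1.764, 1.712),
    ("3900X",     10_000_000, 2.562, 2.801),
]

# A factor above 1 means the PR is faster than the base version.
speedup(base, pr) = base / pr

for (cpu, n, base, pr) in timings
    s = speedup(base, pr)
    @printf("%-10s n=%-10d speedup %.3f (%+.1f%%)\n", cpu, n, s, 100 * (s - 1))
end
```

Expressed as percentages, the small-array cases come out between roughly 3% and 4.5%, while the large-array cases sit within about 1% of parity except for the 3900X run.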

Disclaimer: I did not fully build Julia including this PR, but I did test the modified function using the linked test script, and a modified test script with offset != 0.

Calculate offset once instead of repeatedly for each element.
@LSchwerdt marked this pull request as ready for review February 2, 2023
@giordano added the performance and sorting labels Feb 2, 2023
@LilithHafner
Member

Thanks! A 3% performance increase is negligible relative to noise and systematic error in most cases, but this arithmetic refactoring is correct, adds no code complexity (arguably it simplifies the code in the hot loop), and is clearly a performance improvement, even if small.

@LilithHafner merged commit 9b5f39e into JuliaLang:master Feb 3, 2023