
Fix rust like c for Levenshtein #351

Merged: 13 commits into bddicken:main on Jan 16, 2025

Conversation

sammysheep
Contributor

Description of changes

Modifies the Rust implementation for Levenshtein to make it more like the C one.

  • Switches from usize (usually 64-bit) to u32
  • Uses iterators to get rid of some bounds checking
  • Cleans up style and makes the code more idiomatic.

It passes the check, compiles, and runs via the test harness.
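
For reference, a minimal sketch of the overall shape this lands on (illustrative only, not the exact PR code):

// Wagner-Fischer with two u32 rows; iterators instead of index arithmetic
// so the optimizer can drop per-element bounds checks.
fn levenshtein(s1: &[u8], s2: &[u8]) -> u32 {
    // Keep the shorter string as the row dimension.
    let (s1, s2) = if s1.len() > s2.len() { (s2, s1) } else { (s1, s2) };
    let m = s1.len();
    let mut prev: Vec<u32> = (0..m as u32 + 1).collect();
    let mut curr: Vec<u32> = vec![0; m + 1];
    for (j, &c2) in s2.iter().enumerate() {
        curr[0] = j as u32 + 1;
        for (i, &c1) in s1.iter().enumerate() {
            let cost = if c1 == c2 { 0 } else { 1 };
            // Minimum of substitution, deletion, and insertion.
            curr[i + 1] = (prev[i] + cost).min(prev[i + 1] + 1).min(curr[i] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[m]
}

The two-row trick keeps space at O(min(m, n)) while time stays O(m*n).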

@sammysheep
Contributor Author

sammysheep commented Jan 13, 2025

Switching to something like 1..m+1, as mentioned in other PRs, has the same performance as using the iterators in this PR. However, on Apple processors I saw a big jump just from changing usize to u32.

@zierf
Contributor

zierf commented Jan 13, 2025

Initializing the prev_row like this:

// Use two rows instead of full matrix for space optimization
let mut curr_row = vec![0; m + 1];
let mut prev_row: Vec<u32> = Vec::with_capacity(m + 1);

// Initialize first row
prev_row.extend(0..=m);

is faster for me than initializing it all with just zeroes:

// Use two rows instead of full matrix for space optimization
let mut prev_row = vec![0u32; m + 1];
let mut curr_row = vec![0u32; m + 1];

But extend with 0..=m doesn't compile for a Vec<u32>, since the range yields usize values. So the u32 code ends up way slower for me than the usize version with this initialization.
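
(For what it's worth, that's exactly why it fails to compile; casting the endpoint makes the extend work:)

let mut prev_row: Vec<u32> = Vec::with_capacity(m + 1);
prev_row.extend(0..=m as u32);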


Using enumerate seems to give a big improvement, though.
If you dereference s1 and s2, the copied() call on the iterator isn't needed.

    // Main computation loop
    for (j, s2) in s2_bytes.iter().enumerate() {
        curr_row[0] = j + 1;

        for (i, s1) in s1_bytes.iter().enumerate() {
            let cost = if *s1 == *s2 { 0 } else { 1 };

For me, enumerate is also faster than 1..n + 1 and 1..m + 1. I guess you were referring to PR #346. The exclusive range runs in approximately 455 ms; the enumeration brings it down to 408 ms.

The reason seems to be described in PR #335: iterating avoids repeated bounds checks when accessing every single vector element.
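
As a toy illustration of the difference (hypothetical functions, not code from either PR):

// Indexed access may carry a bounds check per element unless LLVM can
// prove the index is in range; the iterator version never indexes at all.
fn sum_pairs_indexed(v: &[u32]) -> u32 {
    let mut total = 0;
    for i in 1..v.len() {
        total += v[i - 1] + v[i]; // two potentially checked accesses
    }
    total
}

fn sum_pairs_iter(v: &[u32]) -> u32 {
    v.iter().zip(v.iter().skip(1)).map(|(a, b)| a + b).sum()
}

Whether the indexed version actually keeps its checks depends on what the optimizer can prove, which is exactly why the iterator form is the safer bet.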

@zierf
Contributor

zierf commented Jan 13, 2025

Another observation: not using the original early-termination checks also reduces performance drastically.

This was removed in commit c08ef04:

/// Calculates the Levenshtein distance between two strings using Wagner-Fischer algorithm
/// Space Complexity: O(min(m,n)) - only uses two rows instead of full matrix
/// Time Complexity: O(m*n) where m and n are the lengths of the input strings
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Early termination checks
    if s1 == s2 {
        return 0;
    }
    if s1.is_empty() {
        return s2.len();
    }
    if s2.is_empty() {
        return s1.len();
    }

    // Convert to bytes for faster access
    let s1_bytes = s1.as_bytes();
    let s2_bytes = s2.as_bytes();

    // Make s1 the shorter string for space optimization
    let (s1_bytes, s2_bytes) = if s1_bytes.len() > s2_bytes.len() {
        (s2_bytes, s1_bytes)
    } else {
        (s1_bytes, s2_bytes)
    };

Reinserting this code improves performance by ~10% for me.

Having &str instead of &[u8] as arguments for the levenshtein_distance function also seems a bit more intuitive while still being fast.

@sammysheep
Contributor Author

You were right, that was an initialization bug.

@sammysheep
Contributor Author

Another observation: not using the original early-termination checks also reduces performance drastically.

This was removed in commit c08ef04:

/// Calculates the Levenshtein distance between two strings using Wagner-Fischer algorithm
/// Space Complexity: O(min(m,n)) - only uses two rows instead of full matrix
/// Time Complexity: O(m*n) where m and n are the lengths of the input strings
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Early termination checks
    if s1 == s2 {
        return 0;
    }
    if s1.is_empty() {
        return s2.len();
    }
    if s2.is_empty() {
        return s1.len();
    }

    // Convert to bytes for faster access
    let s1_bytes = s1.as_bytes();
    let s2_bytes = s2.as_bytes();

    // Make s1 the shorter string for space optimization
    let (s1_bytes, s2_bytes) = if s1_bytes.len() > s2_bytes.len() {
        (s2_bytes, s1_bytes)
    } else {
        (s1_bytes, s2_bytes)
    };

Reinserting this code improves performance by ~10% for me.

Having &str instead of &[u8] as arguments for the levenshtein_distance function also seems a bit more intuitive while still being fast.

Does the C version use these checks?

@zierf
Contributor

zierf commented Jan 13, 2025

Looks like the C version doesn't use it atm.
But even with the termination checks it's still the Wagner-Fischer algorithm, and other implementations, like the C# one, make use of them too.

@sammysheep
Contributor Author

Looks like the C version doesn't use it atm. But even with the termination checks it's still the Wagner-Fischer algorithm, and other implementations, like the C# one, make use of them too.

Sure, and I'd use it in my own code too, but I guess this PR was about matching the C version for a better comparison.

What processor are you on? Mine was M4 Max.

@zierf
Contributor

zierf commented Jan 13, 2025

Funny, your Range->map->collect isn't faster than the vector with capacity and extend.

let mut prev_row = Vec::with_capacity(m + 1);
// Initialize first row
prev_row.extend(0..=m);

But using (0..m + 1) instead of (0..=m) makes your code faster than the version with extend.

// Use two rows instead of full matrix for space optimization
let mut curr_row: Vec<usize> = vec![0; m + 1];
let mut prev_row: Vec<usize> = (0..m + 1).collect();

Using an AMD Ryzen 9 5950X 16-Core Processor.

@sammysheep
Contributor Author

sammysheep commented Jan 13, 2025

Funny, your Range->map->collect isn't faster than the vector with capacity and extend.

let mut prev_row = Vec::with_capacity(m + 1);
// Initialize first row
prev_row.extend(0..=m);

But using (0..m + 1) instead of (0..=m) makes your code faster than the version with extend.

// Use two rows instead of full matrix for space optimization
let mut curr_row: Vec<usize> = vec![0; m + 1];
let mut prev_row: Vec<usize> = (0..m + 1).collect();

Using an AMD Ryzen 9 5950X 16-Core Processor.

Yeah, inclusive ranges sometimes have worse codegen. I'll fix that.

Can you do me a favor and do a quick printf("%zu", sizeof(int)); in your C code? Mine is 4, and that's why I'm using u32, but I'm wondering if it could be different on your computer.


@sammysheep
Contributor Author

Apple M4 Max:

Checking C
Check passed
Benchmarking C
Benchmark 1: ./c/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     399.3 ms ±   2.5 ms    [User: 396.8 ms, System: 1.9 ms]
  Range (min … max):   397.2 ms … 402.1 ms    3 runs
 

Checking Rust
Check passed
Benchmarking Rust
Benchmark 1: ./rust/target/release/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     387.0 ms ±   1.1 ms    [User: 384.4 ms, System: 2.0 ms]
  Range (min … max):   386.3 ms … 388.3 ms    3 runs

@zierf
Contributor

zierf commented Jan 13, 2025

Can you do me a favor and do a quick printf("%zu", sizeof(int)); in your C code? Mine is 4, and that's why I'm using u32, but I'm wondering if it could be different on your computer.

Output is 4.

@sammysheep
Contributor Author

Mmmmm, maybe x86 is better optimized for the 64-bit case?

On the M4, unlike with C, if I switch back to usize it's about 488 ms, so much worse on Apple.

Microbenchmarking between architectures is hard.

On the other hand, if you compile Rust on an M4, the default target is already close to native, because the M series is only a few years old. Maybe on x86 you'd see different behavior for native targets.

@zierf
Contributor

zierf commented Jan 13, 2025

Yeah it's hard.

The compile.sh contains the following line:

compile 'rust' 'RUSTFLAGS="-Zlocation-detail=none" cargo +nightly build --manifest-path rust/Cargo.toml --release'

This fails when compiling with the script, though it runs fine if I call it directly.

Is this even relevant? Directly after it there is another Rust --release compilation for the same files with the same Cargo.toml, so the last compilation should win anyway.

@PEZ
Collaborator

PEZ commented Jan 13, 2025

I don't know why there are two Rust compiles in compile.sh. Do you know, @bddicken?

@zierf
Contributor

zierf commented Jan 13, 2025

My numbers are now scattered all over the place, and somehow there are differences between using the benchmark scripts and calling hyperfine manually.

But as far as I can see atm, there is no clear improvement of “u32” over “usize”.


But could you also try to remove the call to .copied()? It should shave off a few more milliseconds.

    // Main computation loop
    for (j, s2) in s2_bytes.iter().enumerate() {
        curr_row[0] = (j + 1) as u32;

        for (i, s1) in s1_bytes.iter().enumerate() {
            let cost = if *s1 == *s2 { 0 } else { 1 };
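
(For reference, these spellings are semantically equivalent ways to get a u8 out of the iterator; which one the optimizer treats best is exactly what's being measured here:)

for b in s1_bytes.iter().copied() { /* b: u8 */ }
for &b in s1_bytes.iter() { /* b: u8, via the pattern */ }
for b in s1_bytes.iter() { let _byte = *b; /* explicit dereference */ }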

@zierf
Contributor

zierf commented Jan 13, 2025

The other question would be whether early termination is permitted, @bddicken?

While the C(pp) examples and README do not currently include it, there is still some code where terminating early saves some operations.

Some quick findings:

If it is not intended, we should leave it out to stay true to the task. Otherwise, leaving out such a simple and well-known optimization would be nonsense.

@BradLewis

I was looking at creating a PR to fix this with very similar changes, including removing the early termination (it didn't really make a difference for me, though). On my end (M1 Pro) the difference from using iterators is very substantial:

Benchmark 1: c
  Time (mean ± σ):     715.2 ms ±   0.3 ms    [User: 711.4 ms, System: 2.8 ms]
  Range (min … max):   714.9 ms … 715.5 ms    3 runs

Benchmark 1: old
  Time (mean ± σ):      1.478 s ±  0.002 s    [User: 1.471 s, System: 0.005 s]
  Range (min … max):    1.476 s …  1.479 s    3 runs

Benchmark 1: new
  Time (mean ± σ):     685.4 ms ±   0.9 ms    [User: 681.5 ms, System: 3.0 ms]
  Range (min … max):   684.5 ms … 686.4 ms    3 runs

I also switched from usize to i32 but on further testing that also didn't make any difference.

@sammysheep
Contributor Author

Interesting. At least the M1 Pro seems to behave similarly to the M4 Max. Here I am messing with the integer width again.

My compiler is: nightly-aarch64-apple-darwin - rustc 1.86.0-nightly (48a426eca 2025-01-12)

u32, M4 Max

Checking Rust
Check passed
Benchmarking Rust
Benchmark 1: ./rust/target/release/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     387.5 ms ±   0.6 ms    [User: 384.8 ms, System: 2.1 ms]
  Range (min … max):   386.9 ms … 388.1 ms    3 runs

usize, M4 Max

Checking Rust
Check passed
Benchmarking Rust
Benchmark 1: ./rust/target/release/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     490.2 ms ±   0.6 ms    [User: 487.3 ms, System: 2.3 ms]
  Range (min … max):   489.7 ms … 490.8 ms    3 runs

I added ba3dee6 for testing usize more easily.

@zierf
Contributor

zierf commented Jan 14, 2025

Could the impact of integer sizes on performance come from Clang?
I think Rust always ends up using the GNU compiler on Linux.


But look at the comparison of the C and Cpp times with the GNU compiler:

compile 'c' 'gcc -march=native -O3 c/code.c -o c/code'
compile 'cpp' 'g++ -std=c++23 -march=native -O3 -Ofast -o cpp/code cpp/code.cpp'
(screenshot: benchmark comparison, GNU compilers)

While C(pp) above still runs faster than Rust, it becomes slower when I switch to Clang.
(It's also slower if I omit -march=native from the gcc commands.)

compile 'c' 'clang -std=c17 -O4 -Ofast -fno-exceptions c/code.c -o c/code'
compile 'cpp' 'clang++ -std=c++23 -stdlib=libstdc++ -O4 -Ofast -fno-exceptions -o cpp/code cpp/code.cpp'
(screenshot: benchmark comparison, Clang compilers)

Basically the same negative performance impact shows up when compiling the C(pp) code with the Zig compiler.
Zig itself also runs way too slow.

compile 'c' 'zig cc -std=c17 -O3 c/code.c -o c/code'
compile 'cpp' 'zig c++ -std=c++23 -stdlib=libstdc++ -O3 -o cpp/code cpp/code.cpp'

On the Mac, Rust probably always uses Clang instead of GCC.
Perhaps you can test the influence of different integer sizes with the C(pp) examples if you compile them with Clang.
There may not be much difference between integer sizes with a GNU compiler on a Linux system.

@sammysheep
Contributor Author

Rust uses LLVM as its backend compiler, similar to clang, for all platforms. The GCC backend is still in development AFAIK. I can look at the C widths tonight.

@sammysheep
Contributor Author

Thanks for asking me to do the experiment; it seems conclusive.

My compiler:

Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

C, M4 Max, long (64-bit)

Checking C
Check passed
Benchmarking C
Benchmark 1: ./c/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     513.2 ms ±   1.7 ms    [User: 510.7 ms, System: 2.0 ms]
  Range (min … max):   511.9 ms … 515.2 ms    3 runs

C, M4 Max, int (32-bit)

Checking C
Check passed
Benchmarking C
Benchmark 1: ./c/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     397.6 ms ±   0.2 ms    [User: 395.4 ms, System: 1.6 ms]
  Range (min … max):   397.4 ms … 397.7 ms    3 runs

C Patch

Patch for the C version if you want to try it yourself:

diff --git a/levenshtein/c/code.c b/levenshtein/c/code.c
index 12e631a..f9a8ba2 100644
--- a/levenshtein/c/code.c
+++ b/levenshtein/c/code.c
@@ -4,14 +4,14 @@
 
 // Can either define your own min function 
 // or use a language / standard library function
-int min(int a, int b, int c) {
-  int min = a;
+long min(long a, long b, long c) {
+  long min = a;
   if (b < min) min = b;
   if (c < min) min = c;
   return min;
 }
 
-int levenshtein_distance(const char *str1t, const char *str2t) {
+long levenshtein_distance(const char *str1t, const char *str2t) {
   // Get lengths of both strings
   int mt = strlen(str1t);
   int nt = strlen(str2t);
@@ -23,8 +23,8 @@ int levenshtein_distance(const char *str1t, const char *str2t) {
   int n = str1 == str1t ? nt : mt;
  
   // Create two rows, previous and current
-  int prev[m+1];
-  int curr[m+1];
+  long prev[m+1];
+  long curr[m+1];
  
   // initialize the previous row
   for (int i = 0; i <= m; i++) {
@@ -72,7 +72,7 @@ int main(int argc, char *argv[]) {
   // and min distance calculated of all comparisons. Two total lines of output, 
   // formatted exactly like this.
   printf("times: %d\n", times);
-  printf("min_distance: %d\n", min_distance);
+  printf("min_distance: %ld\n", min_distance);
   
   return 0;
 }

@omalley

omalley commented Jan 14, 2025

It doesn't matter for performance, but I'd change the main like this:

    let args_str: Vec<String> = env::args().skip(1).collect();
    let args: Vec<&[u8]> = args_str.iter().map(|s| s.as_bytes()).collect();
   ...
    let min_distance = args.iter()
      .flat_map(|s1| args.iter().map(move |s2| (s1,s2)))
      .filter(|(s1, s2)| s1 != s2)
      .map(|(s1, s2)| { times += 1; levenshtein_distance(s1, s2)})
      .min();

@zierf
Contributor

zierf commented Jan 15, 2025

Hmm, very interesting. Just out of curiosity, is the performance of int higher or lower with GCC than with Clang on Apple platforms?

I suspect that Clang/LLVM is better optimized on Macs and does not have as big a performance drop as it does for me on Linux.

Apart from that, I still find it very strange that programs compiled with LLVM from the exact same C code run slower even with 32-bit ints. The Cpp code loses almost 100 ms with Clang instead of g++.

gcc (GCC) 14.2.1 20241116

clang version 19.1.5
Target: x86_64-unknown-linux-gnu
Thread model: posix

Zig in particular is doing really poorly compared to everyone else.


I also compared your C example briefly.
Only the C timings are relevant, the others are just for reference.
My Rust timings already include changes from this PR and also from PR #360.

Clang int (i32)

(screenshot: i32 timings)

Clang long (i64)

(screenshot: i64 timings)

I have tried several runs, and it looks like the program runs about 12-20 ms longer with long (how appropriate) than with int variables.


Despite the longer runtimes of LLVM-compiled programs, the absolutely abysmal performance of Zig is still strange, as it was built with ReleaseFast, which drops safety checks like array index out of bounds (that even caused a bug to be overlooked; see PR #355).

Would be great if someone else could check the differences between GCC and LLVM compilers on Linux, ideally also with i32 and i64.

I actually remember Clang-compiled programs running faster for me than gcc/g++ ones years ago. I wonder if GCC has caught up immensely over time.

@omalley

omalley commented Jan 15, 2025

I get a large speedup using your patch on my M2 Max by changing the swap to:

        // Swap rows
        (prev_row, curr_row) = (curr_row, prev_row);

It takes it from 606.7 ms to 483.4 ms.
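
(Both forms only exchange the two Vec headers (pointer, length, capacity) and never copy the row contents; the destructuring assignment needs Rust 1.59 or newer:)

// Equivalent constant-time row exchanges:
std::mem::swap(&mut prev_row, &mut curr_row); // explicit swap
(prev_row, curr_row) = (curr_row, prev_row);  // destructuring assignment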

@zierf
Contributor

zierf commented Jan 15, 2025

Yeah, PR #360 uses this alternative swap method. The compiler seems to optimize it for us. Interestingly, replacing std::mem::swap doesn't have too much of an impact for me, maybe 2 ms, if it's not measurement noise.

Adding more of their changes on top of this one saved another ~10-15 ms for me too. It also uses enumerate() in the main-function loops, drops the conditions around the min_distance variable, and does even more iterator magic for the hotspot.

However, the early termination shaves off another ~10-13 ms of runtime. Since many other languages use it and it has not been banned anywhere yet, we should use it again as well.

@BradLewis

However, the early termination shaves off another ~10-13 ms of runtime. Since many other languages use it and it has not been banned anywhere yet, we should use it again as well.

Unless I'm missing something, I'm not sure how this could be the case, since the main-function loop already skips the case when i == j (same as with the C code), so we never actually hit any of those early-return checks.
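
A sketch of the loop shape in question (hypothetical, mirroring the C harness):

// Same-index pairs are skipped, so levenshtein_distance never sees a
// string compared against itself; with distinct arguments the early
// returns are dead code.
for (i, s1) in args.iter().enumerate() {
    for (j, s2) in args.iter().enumerate() {
        if i == j {
            continue;
        }
        min_distance = min_distance.min(levenshtein_distance(s1, s2));
    }
}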

Are you sure you haven't removed that check in the main loop?

@sammysheep
Contributor Author

sammysheep commented Jan 15, 2025

@zierf I installed GCC 14 and have summarized the results for M4 Max.

It looks like you were correct. GCC does a better job than LLVM for this particular problem.

Language   Compiler     32-bit integer (u32/int)   64-bit integer (usize/long)
C          GCC 14       268.7 ms                   416.4 ms
Rust       rustc 1.86   387.9 ms                   489.4 ms
C          Clang 16     398.8 ms                   511.9 ms

I would wager this general pattern might hold for Linux. If nothing else, someone more adept than myself could look at the assembly in Godbolt to see why: https://godbolt.org/z/KE51bT4To

@sammysheep
Contributor Author

I get a large speedup using your patch on my M2 Max by changing the swap to:

        // Swap rows
        (prev_row, curr_row) = (curr_row, prev_row);

It takes it from 606.7 ms to 483.4 ms.

Reassignment is very slightly slower on the M4 Max than calling the swap method. No telling why our machines differ. :/

@zierf
Contributor

zierf commented Jan 15, 2025

Unless I'm missing something, I'm not sure how this could be the case, since the main-function loop already skips the case when i == j (same as with the C code), so we never actually hit any of those early-return checks.

Hmm, good catch! Although I think I did a few runs, something must have messed up that timing, or I got really unlucky; I can't verify this anymore. The same goes for swapping via (prev_row, curr_row) = (curr_row, prev_row), although I find it a bit cleaner to let the compiler figure it out. But even if it makes any difference, for me it's too tiny to be sure, and you guys will have to figure out what's best.

Probably rust-analyzer got in the way; sometimes it decides it has work to do, and maybe it ran a bit longer.

Are you sure you haven't removed that check in the main loop?

The check in the main loop was present all the time, just using the guard from PR #360.

if i == j {
    continue;
}

Wow, the Mac numbers jump really high when using 64-bit integers and/or Clang.

However, here is a similar table for my Linux system.

AMD Ryzen 9 5950X 16-Core Processor
NixOS/nixpkgs/130595eba61081acde9001f43de3248d8888ac4a (2025-01-10 15:43:18)

$> nix-info -m
 - system: `"x86_64-linux"`
 - host os: `Linux 6.12.8-xanmod1, NixOS, 25.05 (Warbler), 25.05.20250110.130595e`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.24.11`
 - nixpkgs: `/nix/store/cginla74w7h4gln9b3mva6l4nmj6gj30-source`
compile 'cpp' 'g++ -std=c++23 -march=native -O3 -Ofast -o cpp/code cpp/code.cpp'
compile 'cpp' 'clang++ -std=c++23 -stdlib=libstdc++ -O4 -Ofast -fno-exceptions -o cpp/code cpp/code.cpp'

compile 'c' 'gcc -march=native -O3 c/code.c -o c/code'
compile 'c' 'clang -std=c17 -O4 -Ofast -fno-exceptions c/code.c -o c/code'

compile 'rust' 'cargo build --manifest-path rust/Cargo.toml --release'
  • I didn't use -march=native for Clang, because the C code would then need more than 810 ms. 🤔
  • I took your changes in principle, but replaced long with size_t, analogous to Rust using usize instead of u64.
  • My Rust code contains changes from PR #360, making it the fastest version I could get on my machine so far. (I converted an inclusive range for prev_row to an exclusive one.)
Language   Compiler         int/u32    size_t/usize
Cpp        G++ 14.2.1       470.7 ms   -
Cpp        Clang++ 19.1.5   575.9 ms   -
C          GCC 14.2.1       510.7 ms   532.0 ms
C          Clang 19.1.5     517.0 ms   534.8 ms
Rust       rustc 1.84.0     492.0 ms   491.7 ms

G++ does some real arcane magic for Cpp; it runs so much faster than all the others. With Clang, however, it also drops the most.

While integer sizes have a visible impact in C, in Rust they don't really make a difference on my machine.
Likewise, changing the row swap doesn't really make any difference.

@PEZ
Collaborator

PEZ commented Jan 15, 2025

This is an awesome thread. I don't know how relevant it is, but in #358 it is noted that gcc unrolls one level of recursion for the C fibonacci program; clang on my M1 and M4 Macs didn't do that. For levenshtein there probably aren't any similar smart tricks for the compilers to find, but anyway, my point is that gcc and clang, at least on Apple Silicon, really are different beasts.

The changes so far in the PR look good to me (who doesn't speak Rust). Is it ready to merge as far as you are concerned, @sammysheep?

@Janmm14

Janmm14 commented Jan 15, 2025

Some people here have benchmarked with int/i32 and some with uint/u32, please don't get confused!

@zierf
Contributor

zierf commented Jan 15, 2025

For me, the following integer sizes result in basically the same runtime:
(Clang is slower than GCC, but signed vs. unsigned didn't matter for either of them.)

  • C
    • int / unsigned int
    • size_t / long / unsigned long
  • Rust
    • usize / u32 / i32

Personally, I tend to choose usize for Rust and let the compiler use the preferred architecture size. At the same time, the program would not run slower for me with u32, as it does on some other machines here.


Ideally, there would be a specific architecture on which the tests are carried out. I have also seen other benchmarks in the past that specify an AWS cloud instance including the hardware installed, so that interested parties could use a roughly comparable system or even spawn an identical instance.

I use this more for experimenting with Rust and a bit of C/C++/Zig, and wouldn't rent a cloud instance for this myself, let alone buy a MacBook Pro M1 Max or M4 Max. But there are certainly some people who would spawn a small cloud instance, since it's cheap for them. Perhaps something for future benchmarks.


@sammysheep Can you also remove the unnecessary and broken Rust compile line?

compile 'rust' 'RUSTFLAGS="-Zlocation-detail=none" cargo +nightly build --manifest-path rust/Cargo.toml --release'

Aside from it not working, the next line still compiles and replaces the executable anyway.

@PEZ Afterwards I suggest merging this PR and then visiting PR #360, adopting at least their improved main function. We then also have to check whether the changed row swapping and the (harder to read) window-iterator magic give any further improvements. But for this we need the ranges replaced by enumerate() from this PR, which resulted in a good performance boost and is still very readable.

@sammysheep
Contributor Author

Some people here have benchmarked with int/i32 and some with uint/u32, please don't get confused!

Good point. Using i32 is worse than u32 on my machine, though I'd never reach for an i32 for this problem.

@sammysheep
Contributor Author

@zierf @PEZ To save time, I attempted to incorporate some of @PizzasBear's suggestions in the last commit. Additionally, I got rid of that extra compile statement.

Runtime is the same as before on M4 Max. If it looks the same for @zierf, then please go ahead and merge. LGTM!

@zierf
Contributor

zierf commented Jan 16, 2025

LGTM, no performance regression.

I like the additional argument checks, btw, and the early addition of rust-analyzer. 👍

Good point. Using i32 is worse than u32 on my machine, though I'd never reach for an i32 for this problem.

I would also keep unsigned integers, since signed ones don't make sense for this task.

@zierf
Contributor

zierf commented Jan 16, 2025

I discovered something else interesting that I wanted to briefly document here.
Above I noted that I left out -march=native in the C(pp) comparisons for Clang, because it resulted in an extreme runtime increase on my AMD Ryzen 9 5950X 16-Core Processor.

The equivalent flag for rustc would be -Ctarget-cpu=native.

  • 492.3 ms (100%) without -Ctarget-cpu=native
  • 861.8 ms (175%) with -Ctarget-cpu=native

The command rustc --print=target-cpus shows:

  • x86-64 => current default target CPU for my x86_64-unknown-linux-gnu system
  • native => current host cpu is znver3

The following commands show the enabled CPU features:

  • rustc --print cfg (default target CPU x86-64)
    • -Ctarget_feature=+fxsr,+sse,+sse2
  • rustc -Ctarget-cpu=native --print cfg (compiling for host cpu)
    • -Ctarget_feature=+adx,+aes,+avx,+avx2,+bmi1,+bmi2,+cmpxchg16b,+f16c,+fma,+lzcnt,+movbe,+pclmulqdq,+popcnt,+rdrand,+rdseed,+sse3,+sse4.1,+sse4.2,+ssse3,+xsave,+xsavec,+xsaveopt,+xsaves

The strange thing is, even if I activate all these features manually, they don't have that negative impact on performance.
Only when I explicitly set -Ctarget-cpu=native does everything fall apart. Seems like a strange LLVM issue.


LLVM issue #90985 and its successor #91370 suspect this is due to a problem unrolling the loops for znver3, znver4, and skylake.

Setting the opt-level to "s" or "z", which stops loops from being unrolled, actually counteracts the problem. Even setting the opt-level to 1 has a very small negative impact.

Alternatively, I can explicitly set the target to -Ctarget-cpu=znver2, and then the problem doesn't occur.
Since the GNU compiler does not seem to have the same problem, -march=native remains a valid option for the C(pp) code.
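
For anyone who wants to reproduce this, the relevant knobs are ordinary rustc/Cargo flags (RUSTFLAGS is appended after the profile's own options, so its opt-level should win):

# triggers the slowdown on my znver3 host
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

# both of these counteract it
RUSTFLAGS="-Ctarget-cpu=native -Copt-level=s" cargo build --release
RUSTFLAGS="-Ctarget-cpu=znver2" cargo build --release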

Of course, reducing the opt-level or explicitly setting a specific target-cpu (especially something other than native) makes no sense here. But the problem really irritated me, so I wanted to get to the bottom of it, and I thought it might be helpful to someone at some point, so I'm sharing it here.

PEZ merged commit 7061da3 into bddicken:main on Jan 16, 2025