
Fix rust like c for Levenshtein #351

Merged: 13 commits into bddicken:main on Jan 16, 2025

Conversation

sammysheep
Contributor

Description of changes

Modifies the Rust implementation for Levenshtein to make it more like the C one.

  • Switches from usize (usually 64-bit) to u32
  • Uses iterators to get rid of some bounds checking
  • Cleans up style and makes the code more idiomatic.

It passes the check, compiles, and runs via the test harness.
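
For reference, a minimal sketch of the overall shape this lands on (illustrative only, not the exact PR code):

// Wagner-Fischer with two u32 rows; iterators instead of index arithmetic
// so the optimizer can drop per-element bounds checks.
fn levenshtein(s1: &[u8], s2: &[u8]) -> u32 {
    // Keep the shorter string as the row dimension.
    let (s1, s2) = if s1.len() > s2.len() { (s2, s1) } else { (s1, s2) };
    let m = s1.len();
    let mut prev: Vec<u32> = (0..m as u32 + 1).collect();
    let mut curr: Vec<u32> = vec![0; m + 1];
    for (j, &c2) in s2.iter().enumerate() {
        curr[0] = j as u32 + 1;
        for (i, &c1) in s1.iter().enumerate() {
            let cost = if c1 == c2 { 0 } else { 1 };
            // Minimum of substitution, deletion, and insertion.
            curr[i + 1] = (prev[i] + cost).min(prev[i + 1] + 1).min(curr[i] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[m]
}

The two-row trick keeps space at O(min(m, n)) while time stays O(m*n).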

@sammysheep
Contributor Author

sammysheep commented Jan 13, 2025

Switching to something like 1..m+1, as mentioned in other PRs, has the same performance as using the iterators in this PR. However, on Apple processors I saw a big jump just from changing usize to u32.

@zierf
Contributor

zierf commented Jan 13, 2025

Initializing the prev_row like this:

// Use two rows instead of full matrix for space optimization
let mut curr_row = vec![0; m + 1];
let mut prev_row: Vec<u32> = Vec::with_capacity(m + 1);

// Initialize first row
prev_row.extend(0..=m);

is faster for me than initializing it all with just zeroes:

// Use two rows instead of full matrix for space optimization
let mut prev_row = vec![0u32; m + 1];
let mut curr_row = vec![0u32; m + 1];

But extend with 0..=m doesn't compile for a Vec<u32>, since the range yields usize values. So the u32 code ends up way slower for me than the usize version with this initialization.
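
(For what it's worth, that's exactly why it fails to compile; casting the endpoint makes the extend work:)

let mut prev_row: Vec<u32> = Vec::with_capacity(m + 1);
prev_row.extend(0..=m as u32);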


Using enumerate seems to give a big improvement, though.
If you dereference s1 and s2, the copied() call on the iterator isn't needed.

    // Main computation loop
    for (j, s2) in s2_bytes.iter().enumerate() {
        curr_row[0] = j + 1;

        for (i, s1) in s1_bytes.iter().enumerate() {
            let cost = if *s1 == *s2 { 0 } else { 1 };

For me, enumerate is also faster than 1..n + 1 and 1..m + 1. I guess you were referring to PR #346. The exclusive range runs in approximately 455 ms; the enumeration brings it down to 408 ms.

The reason seems to be described in PR #335: iterating avoids repeated bounds checks when accessing every single vector element.
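
As a toy illustration of the difference (hypothetical functions, not code from either PR):

// Indexed access may carry a bounds check per element unless LLVM can
// prove the index is in range; the iterator version never indexes at all.
fn sum_pairs_indexed(v: &[u32]) -> u32 {
    let mut total = 0;
    for i in 1..v.len() {
        total += v[i - 1] + v[i]; // two potentially checked accesses
    }
    total
}

fn sum_pairs_iter(v: &[u32]) -> u32 {
    v.iter().zip(v.iter().skip(1)).map(|(a, b)| a + b).sum()
}

Whether the indexed version actually keeps its checks depends on what the optimizer can prove, which is exactly why the iterator form is the safer bet.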

@zierf
Contributor

zierf commented Jan 13, 2025

Another observation: not using the original early-termination checks also reduces performance drastically.

This was removed in commit c08ef04:

/// Calculates the Levenshtein distance between two strings using Wagner-Fischer algorithm
/// Space Complexity: O(min(m,n)) - only uses two rows instead of full matrix
/// Time Complexity: O(m*n) where m and n are the lengths of the input strings
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Early termination checks
    if s1 == s2 {
        return 0;
    }
    if s1.is_empty() {
        return s2.len();
    }
    if s2.is_empty() {
        return s1.len();
    }

    // Convert to bytes for faster access
    let s1_bytes = s1.as_bytes();
    let s2_bytes = s2.as_bytes();

    // Make s1 the shorter string for space optimization
    let (s1_bytes, s2_bytes) = if s1_bytes.len() > s2_bytes.len() {
        (s2_bytes, s1_bytes)
    } else {
        (s1_bytes, s2_bytes)
    };

Reinserting this code improves performance by ~10% for me.

Having &str instead of &[u8] as arguments for the levenshtein_distance function also seems a bit more intuitive while still being fast.

@sammysheep
Contributor Author

You were right, that was an initialization bug.

@sammysheep
Contributor Author

Another observation: not using the original early-termination checks also reduces performance drastically.

This was removed in commit c08ef04:

/// Calculates the Levenshtein distance between two strings using Wagner-Fischer algorithm
/// Space Complexity: O(min(m,n)) - only uses two rows instead of full matrix
/// Time Complexity: O(m*n) where m and n are the lengths of the input strings
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Early termination checks
    if s1 == s2 {
        return 0;
    }
    if s1.is_empty() {
        return s2.len();
    }
    if s2.is_empty() {
        return s1.len();
    }

    // Convert to bytes for faster access
    let s1_bytes = s1.as_bytes();
    let s2_bytes = s2.as_bytes();

    // Make s1 the shorter string for space optimization
    let (s1_bytes, s2_bytes) = if s1_bytes.len() > s2_bytes.len() {
        (s2_bytes, s1_bytes)
    } else {
        (s1_bytes, s2_bytes)
    };

Reinserting this code improves performance by ~10% for me.

Having &str instead of &[u8] as arguments for the levenshtein_distance function also seems a bit more intuitive while still being fast.

Does the C version use these checks?

@zierf
Contributor

zierf commented Jan 13, 2025

Looks like the C version doesn't use it atm.
But even with the termination checks it's still the Wagner-Fischer algorithm, and other implementations, like the C# one, make use of them too.

@sammysheep
Contributor Author

Looks like the C version doesn't use it atm. But even with the termination checks it's still the Wagner-Fischer algorithm, and other implementations, like the C# one, make use of them too.

Sure, and I'd use it in my own code too, but I guess this PR was about matching the C version for a better comparison.

What processor are you on? Mine was M4 Max.

@zierf
Contributor

zierf commented Jan 13, 2025

Funny, your Range->map->collect isn't faster than the vector with capacity and extend.

let mut prev_row = Vec::with_capacity(m + 1);
// Initialize first row
prev_row.extend(0..=m);

But using (0..m + 1) instead of (0..=m) makes your code faster than the version with extend.

// Use two rows instead of full matrix for space optimization
let mut curr_row: Vec<usize> = vec![0; m + 1];
let mut prev_row: Vec<usize> = (0..m + 1).collect();

Using an AMD Ryzen 9 5950X 16-Core Processor.

@sammysheep
Contributor Author

sammysheep commented Jan 13, 2025

Funny, your Range->map->collect isn't faster than the vector with capacity and extend.

let mut prev_row = Vec::with_capacity(m + 1);
// Initialize first row
prev_row.extend(0..=m);

But using (0..m + 1) instead of (0..=m) makes your code faster than the version with extend.

// Use two rows instead of full matrix for space optimization
let mut curr_row: Vec<usize> = vec![0; m + 1];
let mut prev_row: Vec<usize> = (0..m + 1).collect();

Using an AMD Ryzen 9 5950X 16-Core Processor.

Yeah, inclusive ranges sometimes have worse codegen. I'll fix that.

Can you do me a favor and do a quick printf("%zu", sizeof(int)); in your C code? Mine is 4, and that's why I'm using u32, but I'm wondering if it could be different on your computer.


@sammysheep
Contributor Author

Apple M4 Max:

Checking C
Check passed
Benchmarking C
Benchmark 1: ./c/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     399.3 ms ±   2.5 ms    [User: 396.8 ms, System: 1.9 ms]
  Range (min … max):   397.2 ms … 402.1 ms    3 runs
 

Checking Rust
Check passed
Benchmarking Rust
Benchmark 1: ./rust/target/release/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     387.0 ms ±   1.1 ms    [User: 384.4 ms, System: 2.0 ms]
  Range (min … max):   386.3 ms … 388.3 ms    3 runs

@zierf
Contributor

zierf commented Jan 13, 2025

Can you do me a favor and do a quick printf("%zu", sizeof(int)); in your C code? Mine is 4, and that's why I'm using u32, but I'm wondering if it could be different on your computer.

Output is 4.

@sammysheep
Contributor Author

Mmmmm, maybe x86 is better optimized for the 64-bit case?

On the M4, unlike with C, if I switch back to usize it's about 488 ms, so much worse on Apple.

Microbenchmarking between architectures is hard.

On the other hand, if you compile Rust on an M4, the default target is already close to native, because the M series is only a few years old. Maybe on x86 you'd see different behavior for native targets.

@zierf
Contributor

zierf commented Jan 13, 2025

Yeah it's hard.

The compile.sh contains the following line:

compile 'rust' 'RUSTFLAGS="-Zlocation-detail=none" cargo +nightly build --manifest-path rust/Cargo.toml --release'

This fails when compiling with the script, though it runs fine if I call it directly.

Is this even relevant? Directly after it there is another Rust --release compilation for the same files with the same Cargo.toml, so the last compilation should win anyway.

@PEZ
Collaborator

PEZ commented Jan 13, 2025

I don't know why there are two Rust compiles in compile.sh. Do you know, @bddicken?

@zierf
Contributor

zierf commented Jan 13, 2025

My numbers are now scattered all over the place, and somehow there are differences between using the benchmark scripts and calling hyperfine manually.

But as far as I can see atm, there is no clear improvement of “u32” over “usize”.


But could you also try to remove the call to .copied()? It should shave off a few more milliseconds.

    // Main computation loop
    for (j, s2) in s2_bytes.iter().enumerate() {
        curr_row[0] = (j + 1) as u32;

        for (i, s1) in s1_bytes.iter().enumerate() {
            let cost = if *s1 == *s2 { 0 } else { 1 };
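
(For reference, these spellings are semantically equivalent ways to get a u8 out of the iterator; which one the optimizer treats best is exactly what's being measured here:)

for b in s1_bytes.iter().copied() { /* b: u8 */ }
for &b in s1_bytes.iter() { /* b: u8, via the pattern */ }
for b in s1_bytes.iter() { let _byte = *b; /* explicit dereference */ }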

@zierf
Contributor

zierf commented Jan 13, 2025

The other question would be whether early termination is permitted, @bddicken?

While the C(pp) examples and README do not currently include it, there is still some code where terminating early saves some operations.

Some quick findings:

If it is not intended, we should leave it out to stay true to the task. Otherwise, leaving out such a simple and well-known optimization would be nonsense.

@BradLewis

I was looking at creating a PR to fix this with very similar changes, including removing the early termination (it didn't really make a difference for me, though). On my end (M1 Pro) the difference from using iterators is very substantial:

Benchmark 1: c
  Time (mean ± σ):     715.2 ms ±   0.3 ms    [User: 711.4 ms, System: 2.8 ms]
  Range (min … max):   714.9 ms … 715.5 ms    3 runs

Benchmark 1: old
  Time (mean ± σ):      1.478 s ±  0.002 s    [User: 1.471 s, System: 0.005 s]
  Range (min … max):    1.476 s …  1.479 s    3 runs

Benchmark 1: new
  Time (mean ± σ):     685.4 ms ±   0.9 ms    [User: 681.5 ms, System: 3.0 ms]
  Range (min … max):   684.5 ms … 686.4 ms    3 runs

I also switched from usize to i32 but on further testing that also didn't make any difference.

@sammysheep
Contributor Author

Interesting. At least the M1 Pro seems to behave similarly to the M4 Max. Here I am messing with the integer width again.

My compiler is: nightly-aarch64-apple-darwin - rustc 1.86.0-nightly (48a426eca 2025-01-12)

u32, M4 Max

Checking Rust
Check passed
Benchmarking Rust
Benchmark 1: ./rust/target/release/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     387.5 ms ±   0.6 ms    [User: 384.8 ms, System: 2.1 ms]
  Range (min … max):   386.9 ms … 388.1 ms    3 runs

usize, M4 Max

Checking Rust
Check passed
Benchmarking Rust
Benchmark 1: ./rust/target/release/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     490.2 ms ±   0.6 ms    [User: 487.3 ms, System: 2.3 ms]
  Range (min … max):   489.7 ms … 490.8 ms    3 runs

I added ba3dee6 for testing usize more easily.

@zierf
Contributor

zierf commented Jan 14, 2025

Could the impact of integer sizes on performance come from Clang?
I think Rust always ends up using the GNU compiler on Linux.


But look at the comparison of the C and Cpp times with the GNU compiler:

compile 'c' 'gcc -march=native -O3 c/code.c -o c/code'
compile 'cpp' 'g++ -std=c++23 -march=native -O3 -Ofast -o cpp/code cpp/code.cpp'
(screenshot: benchmark comparison, GNU compilers)

While C(pp) above still runs faster than Rust, it becomes slower when I switch to Clang.
(It's also slower if I omit -march=native from the gcc commands.)

compile 'c' 'clang -std=c17 -O4 -Ofast -fno-exceptions c/code.c -o c/code'
compile 'cpp' 'clang++ -std=c++23 -stdlib=libstdc++ -O4 -Ofast -fno-exceptions -o cpp/code cpp/code.cpp'
(screenshot: benchmark comparison, Clang compilers)

Basically the same negative performance impact shows up when compiling the C(pp) code with the Zig compiler.
Zig itself also runs way too slow.

compile 'c' 'zig cc -std=c17 -O3 c/code.c -o c/code'
compile 'cpp' 'zig c++ -std=c++23 -stdlib=libstdc++ -O3 -o cpp/code cpp/code.cpp'

On the Mac, Rust probably always uses Clang instead of GCC.
Perhaps you can test the influence of different integer sizes with the C(pp) examples if you compile them with Clang.
There may not be much difference between integer sizes with a GNU compiler on a Linux system.

@sammysheep
Contributor Author

Rust uses LLVM as its backend compiler, similar to clang, for all platforms. The GCC backend is still in development AFAIK. I can look at the C widths tonight.

@sammysheep
Contributor Author

Thanks for asking me to do the experiment; it seems conclusive.

My compiler:

Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

C, M4 Max, long (64-bit)

Checking C
Check passed
Benchmarking C
Benchmark 1: ./c/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     513.2 ms ±   1.7 ms    [User: 510.7 ms, System: 2.0 ms]
  Range (min … max):   511.9 ms … 515.2 ms    3 runs

C, M4 Max, int (32-bit)

Checking C
Check passed
Benchmarking C
Benchmark 1: ./c/code aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...
ciladucljdamhnafrxyabwuihrusmthypjxormffegjwxioyztsdwqemzjdf ...
  Time (mean ± σ):     397.6 ms ±   0.2 ms    [User: 395.4 ms, System: 1.6 ms]
  Range (min … max):   397.4 ms … 397.7 ms    3 runs

C Patch

Patch for the C version if you want to try it yourself:

diff --git a/levenshtein/c/code.c b/levenshtein/c/code.c
index 12e631a..f9a8ba2 100644
--- a/levenshtein/c/code.c
+++ b/levenshtein/c/code.c
@@ -4,14 +4,14 @@
 
 // Can either define your own min function 
 // or use a language / standard library function
-int min(int a, int b, int c) {
-  int min = a;
+long min(long a, long b, long c) {
+  long min = a;
   if (b < min) min = b;
   if (c < min) min = c;
   return min;
 }
 
-int levenshtein_distance(const char *str1t, const char *str2t) {
+long levenshtein_distance(const char *str1t, const char *str2t) {
   // Get lengths of both strings
   int mt = strlen(str1t);
   int nt = strlen(str2t);
@@ -23,8 +23,8 @@ int levenshtein_distance(const char *str1t, const char *str2t) {
   int n = str1 == str1t ? nt : mt;
  
   // Create two rows, previous and current
-  int prev[m+1];
-  int curr[m+1];
+  long prev[m+1];
+  long curr[m+1];
  
   // initialize the previous row
   for (int i = 0; i <= m; i++) {
@@ -72,7 +72,7 @@ int main(int argc, char *argv[]) {
   // and min distance calculated of all comparisons. Two total lines of output, 
   // formatted exactly like this.
   printf("times: %d\n", times);
-  printf("min_distance: %d\n", min_distance);
+  printf("min_distance: %ld\n", min_distance);
   
   return 0;
 }

@omalley

omalley commented Jan 14, 2025

It doesn't matter for performance, but I'd change the main like this:

    let args_str: Vec<String> = env::args().skip(1).collect();
    let args: Vec<&[u8]> = args_str.iter().map(|s| s.as_bytes()).collect();
   ...
    let min_distance = args.iter()
      .flat_map(|s1| args.iter().map(move |s2| (s1,s2)))
      .filter(|(s1, s2)| s1 != s2)
      .map(|(s1, s2)| { times += 1; levenshtein_distance(s1, s2)})
      .min();

@zierf
Contributor

zierf commented Jan 15, 2025

Hmm, very interesting. Just out of curiosity, is the performance of int higher or lower with GCC than with Clang on Apple platforms?

I suspect that Clang/LLVM is better optimized on Macs and does not have as big a performance drop as it does for me on Linux.

Apart from that, I still find it very strange that programs compiled with LLVM from the exact same C code run slower even with 32-bit ints. The Cpp code loses almost 100 ms with Clang instead of g++.

gcc (GCC) 14.2.1 20241116

clang version 19.1.5
Target: x86_64-unknown-linux-gnu
Thread model: posix

Zig in particular is doing really poorly compared to everyone else.


I also compared your C example briefly.
Only the C timings are relevant, the others are just for reference.
My Rust timings already include changes from this PR and also from PR #360.

Clang int (i32)

(screenshot: i32 timings)

Clang long (i64)

(screenshot: i64 timings)

I have tried several runs, and it looks like the program runs about 12-20 ms longer with long (how appropriate) than with int variables.


Despite the longer runtimes of LLVM-compiled programs, the absolutely abysmal performance of Zig is still strange, as it was built with ReleaseFast, which drops safety checks like array index out of bounds (that even caused a bug to be overlooked; see PR #355).

Would be great if someone else could check the differences between GCC and LLVM compilers on Linux, ideally also with i32 and i64.

I actually remember Clang-compiled programs running faster for me than gcc/g++ ones years ago. I wonder if GCC has caught up immensely over time.

@omalley

omalley commented Jan 15, 2025

I get a large speedup using your patch on my M2 Max by changing the swap to:

        // Swap rows
        (prev_row, curr_row) = (curr_row, prev_row);

It takes it from 606.7 ms to 483.4 ms.
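
(Both forms only exchange the two Vec headers (pointer, length, capacity) and never copy the row contents; the destructuring assignment needs Rust 1.59 or newer:)

// Equivalent constant-time row exchanges:
std::mem::swap(&mut prev_row, &mut curr_row); // explicit swap
(prev_row, curr_row) = (curr_row, prev_row);  // destructuring assignment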

@zierf
Contributor

zierf commented Jan 15, 2025

Yeah, PR #360 uses this alternative swap method. The compiler seems to optimize it for us. Interestingly, replacing std::mem::swap doesn't have too much of an impact for me, maybe 2 ms, if it's not measurement noise.

Adding more of their changes on top of this one saved another ~10-15 ms for me too. It also uses enumerate() in the main-function loops, drops the conditions around the min_distance variable, and does even more iterator magic for the hotspot.

However, the early termination shaves off another ~10-13 ms of runtime. Since many other languages use it and it has not been banned anywhere yet, we should use it again as well.

@BradLewis

However, the early termination shaves off another ~10-13 ms of runtime. Since many other languages use it and it has not been banned anywhere yet, we should use it again as well.

Unless I'm missing something, I'm not sure how this could be the case, since the main-function loop already skips the case when i == j (same as with the C code), so we never actually hit any of those early-return checks.
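
A sketch of the loop shape in question (hypothetical, mirroring the C harness):

// Same-index pairs are skipped, so levenshtein_distance never sees a
// string compared against itself; with distinct arguments the early
// returns are dead code.
for (i, s1) in args.iter().enumerate() {
    for (j, s2) in args.iter().enumerate() {
        if i == j {
            continue;
        }
        min_distance = min_distance.min(levenshtein_distance(s1, s2));
    }
}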

Are you sure you haven't removed that check in the main loop?

@sammysheep
Contributor Author

sammysheep commented Jan 15, 2025

@zierf I installed GCC 14 and have summarized the results for M4 Max.

It looks like you were correct. GCC does a better job than LLVM for this particular problem.

Language   Compiler     32-bit integer (u32/int)   64-bit integer (usize/long)
C          GCC 14       268.7 ms                   416.4 ms
Rust       rustc 1.86   387.9 ms                   489.4 ms
C          Clang 16     398.8 ms                   511.9 ms

I would wager this general pattern might hold for Linux. If nothing else, someone more adept than myself could look at the assembly in Godbolt to see why: https://godbolt.org/z/KE51bT4To

@sammysheep
Contributor Author

I get a large speedup using your patch on my M2 Max by changing the swap to:

        // Swap rows
        (prev_row, curr_row) = (curr_row, prev_row);

It takes it from 606.7 ms to 483.4 ms.

Reassignment is very slightly slower on the M4 Max than calling the swap method. No telling why our machines differ. :/

@zierf
Contributor

zierf commented Jan 15, 2025

Unless I'm missing something, I'm not sure how this could be the case, since the main-function loop already skips the case when i == j (same as with the C code), so we never actually hit any of those early-return checks.

Hmm, good catch! Although I think I did a few runs, something must have messed up that timing, or I got really unlucky; I can't verify this anymore. The same goes for swapping via (prev_row, curr_row) = (curr_row, prev_row), although I find it a bit cleaner to let the compiler figure it out. But even if it makes any difference, for me it's too tiny to be sure, and you guys will have to figure out what's best.

Probably rust-analyzer got in the way; sometimes it decides it has work to do, and maybe it ran a bit longer.

Are you sure you haven't removed that check in the main loop?

The check in the main loop was present all the time, just using the guard from PR #360.

if i == j {
    continue;
}

Wow, the Mac numbers jump really high when using 64-bit integers and/or Clang.

However, here is a similar table for my Linux system.

AMD Ryzen 9 5950X 16-Core Processor
NixOS/nixpkgs/130595eba61081acde9001f43de3248d8888ac4a (2025-01-10 15:43:18)

$> nix-info -m
 - system: `"x86_64-linux"`
 - host os: `Linux 6.12.8-xanmod1, NixOS, 25.05 (Warbler), 25.05.20250110.130595e`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.24.11`
 - nixpkgs: `/nix/store/cginla74w7h4gln9b3mva6l4nmj6gj30-source`
compile 'cpp' 'g++ -std=c++23 -march=native -O3 -Ofast -o cpp/code cpp/code.cpp'
compile 'cpp' 'clang++ -std=c++23 -stdlib=libstdc++ -O4 -Ofast -fno-exceptions -o cpp/code cpp/code.cpp'

compile 'c' 'gcc -march=native -O3 c/code.c -o c/code'
compile 'c' 'clang -std=c17 -O4 -Ofast -fno-exceptions c/code.c -o c/code'

compile 'rust' 'cargo build --manifest-path rust/Cargo.toml --release'
  • I didn't use -march=native for Clang, because the C code would then need more than 810 ms. 🤔
  • I took your changes in principle, but replaced long with size_t, analogous to Rust using usize instead of u64.
  • My Rust code contains changes from PR #360, making it the fastest version I could get on my machine so far. (I converted an inclusive range for prev_row to an exclusive one.)
Language   Compiler         int/u32    size_t/usize
Cpp        G++ 14.2.1       470.7 ms   -
Cpp        Clang++ 19.1.5   575.9 ms   -
C          GCC 14.2.1       510.7 ms   532.0 ms
C          Clang 19.1.5     517.0 ms   534.8 ms
Rust       rustc 1.84.0     492.0 ms   491.7 ms

G++ does some real arcane magic for Cpp; it runs so much faster than all the others. With Clang, however, it also drops the most.

While integer sizes have a visible impact in C, in Rust they don't really make a difference on my machine.
Likewise, changing the row swap doesn't really make any difference.

@PEZ
Collaborator

PEZ commented Jan 15, 2025

This is an awesome thread. I don't know how relevant it is, but in #358 it is noted that gcc unrolls one level of recursion for the C fibonacci program; clang on my M1 and M4 Macs didn't do that. For levenshtein there probably aren't any similar smart tricks for the compilers to find, but anyway, my point is that gcc and clang, at least on Apple Silicon, really are different beasts.

The changes so far in the PR look good to me (who doesn't speak Rust). Is it ready to merge as far as you are concerned, @sammysheep?

@Janmm14

Janmm14 commented Jan 15, 2025

Some people here have benchmarked with int/i32 and some with uint/u32, please don't get confused!

@zierf
Contributor

zierf commented Jan 15, 2025

For me, the following integer sizes result in basically the same runtime:
(Clang is slower than GCC, but signed vs. unsigned didn't matter for either of them.)

  • C
    • int / unsigned int
    • size_t / long / unsigned long
  • Rust
    • usize / u32 / i32

Personally, I tend to choose usize for Rust and let the compiler use the preferred architecture size. At the same time, the program would not run slower for me with u32, as it does on some other machines here.


Ideally, there would be a specific architecture on which the tests are carried out. I have also seen other benchmarks in the past that specify an AWS cloud instance including the hardware installed, so that interested parties could use a roughly comparable system or even spawn an identical instance.

I use this more for experimenting with Rust and a bit of C/C++/Zig, and wouldn't rent a cloud instance for this myself, let alone buy a MacBook Pro M1 Max or M4 Max. But there are certainly some people who would spawn a small cloud instance, since it's cheap for them. Perhaps something for future benchmarks.


@sammysheep Can you also remove the unnecessary and broken Rust compile line?

compile 'rust' 'RUSTFLAGS="-Zlocation-detail=none" cargo +nightly build --manifest-path rust/Cargo.toml --release'

Aside from it not working, the next line still compiles and replaces the executable anyway.

@PEZ Afterwards I suggest merging this PR and then visiting PR #360, adopting at least their improved main function. We then also have to check whether the changed row swapping and the (harder to read) window-iterator magic give any further improvements. But for this we need the ranges replaced by enumerate() from this PR, which resulted in a good performance boost and is still very readable.

@sammysheep
Contributor Author

Some people here have benchmarked with int/i32 and some with uint/u32, please don't get confused!

Good point. Using i32 is worse than u32 on my machine, though I'd never reach for an i32 for this problem.

@sammysheep
Contributor Author

@zierf @PEZ To save time, I attempted to incorporate some of @PizzasBear's suggestions in the last commit. Additionally, I got rid of that extra compile statement.

Runtime is the same as before on M4 Max. If it looks the same for @zierf, then please go ahead and merge. LGTM!

@zierf
Contributor

zierf commented Jan 16, 2025

LGTM, no performance regression.

I like the additional argument checks, btw, and the early addition of rust-analyzer. 👍

Good point. Using i32 is worse than u32 on my machine, though I'd never reach for an i32 for this problem.

I would also keep unsigned integers, since signed ones don't make sense for this task.

@zierf
Contributor

zierf commented Jan 16, 2025

I discovered something else interesting that I wanted to briefly document here.
Above I noted that I left out -march=native in the C(pp) comparisons for Clang, because it resulted in an extreme runtime increase on my AMD Ryzen 9 5950X 16-Core Processor.

The equivalent flag for rustc would be -Ctarget-cpu=native.

  • 492.3 ms (100%) without -Ctarget-cpu=native
  • 861.8 ms (175%) with -Ctarget-cpu=native

The command rustc --print=target-cpus shows:

  • x86-64 => current default target CPU for my x86_64-unknown-linux-gnu system
  • native => current host cpu is znver3

The following commands show the enabled CPU features:

  • rustc --print cfg (default target CPU x86-64)
    • -Ctarget_feature=+fxsr,+sse,+sse2
  • rustc -Ctarget-cpu=native --print cfg (compiling for host cpu)
    • -Ctarget_feature=+adx,+aes,+avx,+avx2,+bmi1,+bmi2,+cmpxchg16b,+f16c,+fma,+lzcnt,+movbe,+pclmulqdq,+popcnt,+rdrand,+rdseed,+sse3,+sse4.1,+sse4.2,+ssse3,+xsave,+xsavec,+xsaveopt,+xsaves

The strange thing is, even if I activate all these features manually, they don't have that negative impact on performance.
Only when I explicitly set -Ctarget-cpu=native does everything fall apart. Seems like a strange LLVM issue.


LLVM issue #90985 and its successor #91370 suspect this is due to a problem unrolling the loops for znver3, znver4, and skylake.

Setting the opt-level to "s" or "z", which stops loops from being unrolled, actually counteracts the problem. Even setting the opt-level to 1 has a very small negative impact.

Alternatively, I can explicitly set the target to -Ctarget-cpu=znver2, and then the problem doesn't occur.
Since the GNU compiler does not seem to have the same problem, -march=native remains a valid option for the C(pp) code.
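
For anyone who wants to reproduce this, the relevant knobs are ordinary rustc/Cargo flags (RUSTFLAGS is appended after the profile's own options, so its opt-level should win):

# triggers the slowdown on my znver3 host
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

# both of these counteract it
RUSTFLAGS="-Ctarget-cpu=native -Copt-level=s" cargo build --release
RUSTFLAGS="-Ctarget-cpu=znver2" cargo build --release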

Of course, reducing the opt-level or explicitly setting a specific target-cpu (especially something other than native) makes no sense here. But the problem really irritated me, so I wanted to get to the bottom of it, and I thought it might be helpful to someone at some point, so I'm sharing it here.

PEZ merged commit 7061da3 into bddicken:main on Jan 16, 2025