Fix rust like c for Levenshtein #351
Conversation
Switching to something like …

Initializing the first row with

```rust
// Use two rows instead of full matrix for space optimization
let mut curr_row = vec![0; m + 1];
let mut prev_row: Vec<u32> = Vec::with_capacity(m + 1);
// Initialize first row
prev_row.extend(0..=m);
```

is faster for me than initializing it all with just zeroes:

```rust
// Use two rows instead of full matrix for space optimization
let mut prev_row = vec![0u32; m + 1];
let mut curr_row = vec![0u32; m + 1];
```

But using

```rust
// Main computation loop
for (j, s2) in s2_bytes.iter().enumerate() {
    curr_row[0] = j + 1;
    for (i, s1) in s1_bytes.iter().enumerate() {
        let cost = if *s1 == *s2 { 0 } else { 1 };
```

is faster for me. The reason seems to be described in PR #335: it mentions that iterating avoids the per-element bounds checks that indexing into the vectors would incur.
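For context, a purely index-based variant of the same two-row loop might look like the sketch below (hypothetical code for comparison, not quoted from the PR; each `s1_bytes[i]`, `prev_row[i]`, `curr_row[i]` access is a candidate for a bounds check that the iterator-based version lets the compiler elide):

```rust
// Hypothetical indexed two-row Levenshtein inner loop, for contrast with
// the iterator-based loop quoted above.
fn levenshtein_indexed(s1_bytes: &[u8], s2_bytes: &[u8]) -> u32 {
    let m = s1_bytes.len();
    // prev_row[i] = distance between s1[..i] and the empty prefix of s2
    let mut prev_row: Vec<u32> = (0..=m as u32).collect();
    let mut curr_row = vec![0u32; m + 1];
    for j in 0..s2_bytes.len() {
        curr_row[0] = j as u32 + 1;
        for i in 0..m {
            // Indexed accesses here may each carry a bounds check
            let cost = if s1_bytes[i] == s2_bytes[j] { 0 } else { 1 };
            curr_row[i + 1] = (prev_row[i] + cost)
                .min(prev_row[i + 1] + 1)
                .min(curr_row[i] + 1);
        }
        std::mem::swap(&mut prev_row, &mut curr_row);
    }
    prev_row[m]
}
```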
Another observation: not using the original termination checks also reduces performance drastically. These were removed in commit c08ef04:

```rust
/// Calculates the Levenshtein distance between two strings using the Wagner-Fischer algorithm
/// Space Complexity: O(min(m,n)) - only uses two rows instead of the full matrix
/// Time Complexity: O(m*n) where m and n are the lengths of the input strings
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Early termination checks
    if s1 == s2 {
        return 0;
    }
    if s1.is_empty() {
        return s2.len();
    }
    if s2.is_empty() {
        return s1.len();
    }
    // Convert to bytes for faster access
    let s1_bytes = s1.as_bytes();
    let s2_bytes = s2.as_bytes();
    // Make s1 the shorter string for space optimization
    let (s1_bytes, s2_bytes) = if s1_bytes.len() > s2_bytes.len() {
        (s2_bytes, s1_bytes)
    } else {
        (s1_bytes, s2_bytes)
    };
```

Reinserting this code improves performance by ~10% for me.
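Putting the early-termination checks together with the two-row loop discussed in this thread, a complete version might look like the following (a sketch assembled from the quoted snippets; it is not necessarily identical to the code in the PR):

```rust
/// Two-row Wagner-Fischer Levenshtein distance with early termination.
/// A sketch combining the fragments discussed in this thread.
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Early termination checks
    if s1 == s2 {
        return 0;
    }
    if s1.is_empty() {
        return s2.len();
    }
    if s2.is_empty() {
        return s1.len();
    }
    // Work on bytes and make s1 the shorter string for space optimization
    let (s1_bytes, s2_bytes) = if s1.len() > s2.len() {
        (s2.as_bytes(), s1.as_bytes())
    } else {
        (s1.as_bytes(), s2.as_bytes())
    };
    let m = s1_bytes.len();
    let mut prev_row: Vec<usize> = (0..=m).collect();
    let mut curr_row: Vec<usize> = vec![0; m + 1];
    for (j, c2) in s2_bytes.iter().enumerate() {
        curr_row[0] = j + 1;
        for (i, c1) in s1_bytes.iter().enumerate() {
            let cost = if c1 == c2 { 0 } else { 1 };
            curr_row[i + 1] = (prev_row[i] + cost)
                .min(prev_row[i + 1] + 1)
                .min(curr_row[i] + 1);
        }
        std::mem::swap(&mut prev_row, &mut curr_row);
    }
    prev_row[m]
}
```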
You were right, that was an initialization bug.

Does the C version use these checks?

Looks like the C version doesn't use it atm.

Sure, and I'd use it too in my own code, but I guess this PR was about matching the C version for better comparison. What processor are you on? Mine was M4 Max.
Funny, your Range->map->collect isn't faster than the vector with capacity and extend:

```rust
let mut prev_row = Vec::with_capacity(m + 1);
// Initialize first row
prev_row.extend(0..=m);
```

But it gets faster for me by using

```rust
// Use two rows instead of full matrix for space optimization
let mut curr_row: Vec<usize> = vec![0; m + 1];
let mut prev_row: Vec<usize> = (0..m + 1).collect();
```

Using a half-open range here instead of an inclusive one seems to make the difference.
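For reference, the initialization strategies discussed in this thread all produce the same row; only the generated code differs. A minimal sketch (function name and `m` value are illustrative):

```rust
// Three equivalent ways to build the initial previous row [0, 1, ..., m];
// the thread only found differences in codegen, not in the result.
fn init_rows(m: usize) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    // 1. with_capacity + extend over an inclusive range
    let mut a: Vec<usize> = Vec::with_capacity(m + 1);
    a.extend(0..=m);

    // 2. zero-fill, then overwrite in place
    let mut b: Vec<usize> = vec![0; m + 1];
    for (i, slot) in b.iter_mut().enumerate() {
        *slot = i;
    }

    // 3. collect from a half-open range (avoids RangeInclusive codegen)
    let c: Vec<usize> = (0..m + 1).collect();

    (a, b, c)
}
```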
Yeah, inclusive sometimes has worse code gen. I'll fix that. Can you do me a favor and do a quick …

*inclusive
Apple M4 Max: …

Output is 4.
Mmmmm, maybe x86 is better optimized for the 64-bit case? On M4, different from C, if I switch back to … Microbenchmarking between architectures is hard. On the other hand, if you compile Rust on an M4, because the M series is only a few years old, the default target is more similar to compiling for native. Maybe on x86 you'd see different properties for native targets.
I don't know why there are two Rust compiles in …
My numbers are now scattered all over the place, and there are somehow differences between using the benchmark scripts and calling hyperfine manually. But as far as I can see atm, there is no clear improvement of `u32` over `usize`. But could you also try to remove the call to … here:

```rust
// Main computation loop
for (j, s2) in s2_bytes.iter().enumerate() {
    curr_row[0] = (j + 1) as u32;
    for (i, s1) in s1_bytes.iter().enumerate() {
        let cost = if *s1 == *s2 { 0 } else { 1 };
```
The other question would be whether early termination is permitted, @bddicken? While the C(++) examples and README do not currently include it, there is still some code where terminating early saves some operations. Easy fast findings: … If it is not intended, we might leave it out to stay true to the task. Otherwise, leaving out such a simple and well-known optimization would be nonsense.
I was looking at creating a PR to fix this with very similar changes, including removing the early termination (it didn't really make a difference for me, though). On my end (M1 Pro) the difference from using iterators is very substantial:

```
Benchmark 1: c
  Time (mean ± σ):     715.2 ms ±   0.3 ms    [User: 711.4 ms, System: 2.8 ms]
  Range (min … max):   714.9 ms … 715.5 ms    3 runs

Benchmark 1: old
  Time (mean ± σ):      1.478 s ±  0.002 s    [User: 1.471 s, System: 0.005 s]
  Range (min … max):    1.476 s …  1.479 s    3 runs

Benchmark 1: new
  Time (mean ± σ):     685.4 ms ±   0.9 ms    [User: 681.5 ms, System: 3.0 ms]
  Range (min … max):   684.5 ms … 686.4 ms    3 runs
```

I also switched from …
This reverts commit ba3dee6.
Interesting. At least the M1 Pro seems to behave similarly to the M4 Max. Here I am messing again with the integer width. My compiler is: …
Could the impact of integer sizes on performance come from Clang? But look at the comparison of the C and C++ times with the GNU compilers:

```
compile 'c' 'gcc -march=native -O3 c/code.c -o c/code'
compile 'cpp' 'g++ -std=c++23 -march=native -O3 -Ofast -o cpp/code cpp/code.cpp'
```

While C(++) above still runs faster than Rust, it becomes slower when I switch to Clang:

```
compile 'c' 'clang -std=c17 -O4 -Ofast -fno-exceptions c/code.c -o c/code'
compile 'cpp' 'clang++ -std=c++23 -stdlib=libstdc++ -O4 -Ofast -fno-exceptions -o cpp/code cpp/code.cpp'
```

Basically the same negative performance impact shows up with the Zig compiler for C(++):

```
compile 'c' 'zig cc -std=c17 -O3 c/code.c -o c/code'
compile 'cpp' 'zig c++ -std=c++23 -stdlib=libstdc++ -O3 -o cpp/code cpp/code.cpp'
```

On the Mac, Rust probably always uses Clang instead of GCC.
Rust uses LLVM as its backend compiler, similar to Clang, on all platforms. The GCC backend is still in development AFAIK. I can look at the C widths tonight.
Thanks for asking me to do the experiment, it seems conclusive. My compiler: …

C, M4 Max, long (64-bit): …

C, M4 Max, int (32-bit): …

Patch for the C if you want to try it yourself:

```diff
diff --git a/levenshtein/c/code.c b/levenshtein/c/code.c
index 12e631a..f9a8ba2 100644
--- a/levenshtein/c/code.c
+++ b/levenshtein/c/code.c
@@ -4,14 +4,14 @@
 // Can either define your own min function
 // or use a language / standard library function
-int min(int a, int b, int c) {
-  int min = a;
+long min(long a, long b, long c) {
+  long min = a;
   if (b < min) min = b;
   if (c < min) min = c;
   return min;
 }
-int levenshtein_distance(const char *str1t, const char *str2t) {
+long levenshtein_distance(const char *str1t, const char *str2t) {
   // Get lengths of both strings
   int mt = strlen(str1t);
   int nt = strlen(str2t);
@@ -23,8 +23,8 @@ int levenshtein_distance(const char *str1t, const char *str2t) {
   int n = str1 == str1t ? nt : mt;
   // Create two rows, previous and current
-  int prev[m+1];
-  int curr[m+1];
+  long prev[m+1];
+  long curr[m+1];
   // initialize the previous row
   for (int i = 0; i <= m; i++) {
@@ -72,7 +72,7 @@ int main(int argc, char *argv[]) {
   // and min distance calculated of all comparisons. Two total lines of output,
   // formatted exactly like this.
   printf("times: %d\n", times);
-  printf("min_distance: %d\n", min_distance);
+  printf("min_distance: %ld\n", min_distance);
   return 0;
 }
```
It doesn't matter for performance, but I'd change the main like this: …
Hmm, very interesting. Just out of curiosity, is the performance of …? I suspect that Clang/LLVM is better optimized on Macs and does not have as big a performance drop as it does for me on Linux. Apart from that, I still find it very strange that the same C code compiled with LLVM runs slower even with 32-bit ints. The C++ code loses almost 100 ms with Clang instead of g++.

```
gcc (GCC) 14.2.1 20241116
clang version 19.1.5
Target: x86_64-unknown-linux-gnu
Thread model: posix
```

Zig in particular is doing really poorly compared to everyone else. I also compared your C example briefly. I have tried several runs and it looks like the program runs about 12-20 ms longer with … Despite the longer runtimes of LLVM-compiled programs, the absolutely abysmal performance of Zig is still strange, as it was built with …

Would be great if someone else could check the differences between GCC and LLVM compilers on Linux, ideally also with i32 and i64. I actually had it in mind that Clang-compiled programs ran faster for me than gcc/g++ years ago. I'm wondering if they have caught up immensely over time.
I get a large speedup using your patch on my M2 Max, by changing the swap to: …

It takes it from 606.7 ms to 483.4 ms.
Yeah, PR #360 uses this alternative swap method. The compiler seems to optimize it for us. Interestingly, replacing … Adding more of their changes in addition to this one saved another ~10-15 ms for me too. It also uses … However, the early termination takes away another ~10-13 ms of runtime. Since many other languages use it and it has not been banned anywhere yet, we should also use it again.
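The exact replacement isn't quoted in this thread, so as a sketch, here are two common ways to advance the rows between outer-loop iterations; both are O(1) and only exchange the `Vec` metadata, and the compiler tends to generate equivalent code for them:

```rust
// Two equivalent swap styles for the two-row buffers (illustrative names).
fn swap_styles() -> (Vec<u32>, Vec<u32>) {
    let mut prev_row: Vec<u32> = vec![0, 1, 2, 3];
    let mut curr_row: Vec<u32> = vec![9, 9, 9, 9];

    // Style 1: explicit swap via the standard library
    std::mem::swap(&mut prev_row, &mut curr_row);

    // Style 2: one reassignment-based alternative, moving through a
    // temporary; the compiler optimizes this just as well
    let tmp = prev_row;
    prev_row = curr_row;
    curr_row = tmp;

    // Two swaps in a row restore the original assignment
    (prev_row, curr_row)
}
```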
Unless I'm missing something, I'm not sure how this could be the case, since the main function loop already skips the case when … Are you sure you haven't removed that check in the main loop?
@zierf I installed GCC 14 and have summarized the results for M4 Max. It looks like you were correct. GCC does a better job than LLVM for this particular problem.
I would wager this general pattern might hold for Linux. If nothing else, someone more adept than I could look at the assembly in Godbolt to see why: https://godbolt.org/z/KE51bT4To

It is very slightly slower to use reassignment on M4 Max than calling the swap method. No telling why the difference between our machines. :/
Hmm, good catch! Although I think I did a few runs, something must have messed up that timing or I got really unlucky. Can't verify this anymore. Same with swapping by … Probably rust-analyzer got in the way; sometimes it thinks it has to do stuff and maybe it ran a bit longer.

The check in the main loop was present all the time, just using the guard from PR #360:

```rust
if i == j {
    continue;
}
```

Wow, the Mac numbers jump really high when using 64-bit integers and/or Clang. However, I did a similar table for my Linux system:

```
AMD Ryzen 9 5950X 16-Core Processor
NixOS/nixpkgs/130595eba61081acde9001f43de3248d8888ac4a (2025-01-10 15:43:18)

$> nix-info -m
- system: `"x86_64-linux"`
- host os: `Linux 6.12.8-xanmod1, NixOS, 25.05 (Warbler), 25.05.20250110.130595e`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.24.11`
- nixpkgs: `/nix/store/cginla74w7h4gln9b3mva6l4nmj6gj30-source`
```

```
compile 'cpp' 'g++ -std=c++23 -march=native -O3 -Ofast -o cpp/code cpp/code.cpp'
compile 'cpp' 'clang++ -std=c++23 -stdlib=libstdc++ -O4 -Ofast -fno-exceptions -o cpp/code cpp/code.cpp'
compile 'c' 'gcc -march=native -O3 c/code.c -o c/code'
compile 'c' 'clang -std=c17 -O4 -Ofast -fno-exceptions c/code.c -o c/code'
compile 'rust' 'cargo build --manifest-path rust/Cargo.toml --release'
```

G++ does some real arcane magic for C++; it runs so much faster than all the others. However, with Clang it also falls the most. While integer sizes have a visible impact in C, in Rust it doesn't really make a difference on my machine.
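The `if i == j { continue; }` guard discussed above skips comparing a string against itself in the pairwise driver loop, which is why the `s1 == s2` early-termination check rarely fires there. A minimal sketch of such a driver (the shape and names are assumed, not quoted from the repo):

```rust
// Hypothetical pairwise driver: compute distances between all distinct
// pairs and track the minimum, skipping the trivial i == j case whose
// distance of 0 would otherwise dominate the minimum.
fn min_pairwise_distance(words: &[&str], dist: impl Fn(&str, &str) -> usize) -> usize {
    let mut min_distance = usize::MAX;
    for (i, a) in words.iter().enumerate() {
        for (j, b) in words.iter().enumerate() {
            if i == j {
                continue;
            }
            min_distance = min_distance.min(dist(a, b));
        }
    }
    min_distance
}
```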
This is an awesome thread. I don't know how relevant it is, but in #358 it is noted that gcc unrolls one level of recursion for the C fibonacci program. Clang on my M1 and M4 Macs didn't do that. For Levenshtein there probably aren't any similar smart tricks for the compilers to find, but anyway, my point is that gcc and clang, at least on Apple Silicon, really are different beasts. The changes so far in the PR look good to me (who doesn't speak Rust). Is it ready to merge as far as you are concerned, @sammysheep?

Some people here have benchmarked with int/i32 and some with uint/u32, so please don't get confused!
For me, the following integer sizes result in basically the same runtime: …

Personally, I tend to choose … Ideally, there would be a specific architecture on which the tests are carried out. I have also seen other benchmarks in the past that specify an AWS cloud instance including the installed hardware, so that interested parties could use a roughly comparable system or even spawn an identical instance.

I use this more for experimenting with Rust and a bit of C/C++/Zig and wouldn't rent a cloud instance for this myself, let alone buy a MacBook Pro M1 Max or M4 Max. But there are certainly some people who would spawn a small cloud instance because it is so cheap for them. Perhaps something for future benchmarks.

@sammysheep Can you also remove the unnecessary and broken Rust compiler line?

```
compile 'rust' 'RUSTFLAGS="-Zlocation-detail=none" cargo +nightly build --manifest-path rust/Cargo.toml --release'
```

Aside from not working, the next line would still compile and replace the executable anyway. @PEZ Afterwards I suggest merging this PR and then visiting PR #360, adopting at least their improved …
Good point. Using …
@zierf @PEZ To save time, I attempted to incorporate some of @PizzasBear's suggestions in the last commit. Additionally, I got rid of that extra compile statement. Runtime is the same as before on M4 Max. If it looks the same for @zierf, then please go ahead and merge.

LGTM!
LGTM, no performance regression. Like the additional argument checks, btw., and the early addition of rust-analyzer. 👍

Would also keep unsigned integers, since signed ones don't make sense for this task.
I discovered something else interesting that I wanted to briefly document here. The equivalent flag for …

Command: …

The following commands show me the enabled CPU features: …

The strange thing is, even if I activate all these flags manually, they don't have that negative impact on performance. LLVM issue #90985 and its successor #91370 suspect this is due to a problem unrolling the loops for … Setting the … Alternatively, I can explicitly set the target to … Of course, reducing the …
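As a side note from me, not quoted from the comment above: the usual Rust counterpart of C's `-march=native` is the `target-cpu` codegen option. A hedged sketch of how it is commonly passed (these are standard rustc/cargo mechanisms, but the benchmark harness may wire flags differently):

```shell
# Compile using the host CPU's full feature set, analogous to -march=native
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Inspect which target features a given target-cpu enables
rustc --print cfg -C target-cpu=native | grep target_feature
```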
Description of changes

Modifies the Rust implementation for Levenshtein to make it more like the C one.

- `usize` (usually 64-bit) to `u32`

Passed the check, compiles, and runs via the test harness.