Performance regression since v0.32-beta.16 for debug builds with profile overwrites #1048

Closed
hanna-kruppe opened this issue May 30, 2024 · 7 comments

@hanna-kruppe

First off, thank you for wasmi and congrats on the recent 0.32 release! I recently swapped out wasmtime for wasmi 0.32-beta.16 in a certain project (not yet public) and was quite happy with it. The build got faster and smaller, the project got more portable, and wasmi's balance of startup latency + wasm execution speed worked much better for that project than wasmtime's (even with winch, which was already an improvement over cranelift). Many tests in that project compile a medium-sized wasm module and run it for a short but nontrivial amount of time, and switching to wasmi v0.32.0-beta.16 made those tests run faster.

Unfortunately and surprisingly, when I tried to update to v0.32.0-beta.18 and later to v0.32.0, I found that it got 5x to 6x slower in the configuration I care about the most: building my project in the dev/test profile but enabling optimizations for wasmi and wasmi_core (via profile overrides). I've managed to minimize it down to a 1 KiB wasm module and a fairly trivial embedding: wasmi-slow-repro.tar.gz. In that tarball:

  • The compiled wasm module is included for completeness but ought to be reproducible
  • The two host-* crates do the same thing with different wasmi versions: instantiate the guest and run its sole export
  • compare.sh builds everything and runs it through hyperfine

I would expect the performance to be the same for both wasmi versions, but in the dev profile (as exercised by the script) it differs:

Benchmark 1: host-beta16/target/debug/host
  Time (mean ± σ):      67.8 ms ±   2.9 ms    [User: 66.9 ms, System: 0.9 ms]
  Range (min … max):    65.3 ms …  76.0 ms    40 runs

Benchmark 2: host-newer/target/debug/host
  Time (mean ± σ):     382.9 ms ±  19.7 ms    [User: 381.9 ms, System: 1.0 ms]
  Range (min … max):   369.1 ms … 434.2 ms    10 runs

Summary
  host-beta16/target/debug/host ran
    5.65 ± 0.38 times faster than host-newer/target/debug/host

This is on x86_64-unknown-linux-gnu, Rust 1.77.1, Intel i7-6700K CPU. Again, note that wasmi and wasmi_core are compiled with optimizations in the debug profile. Removing the opt-level = 2 lines from the respective Cargo.toml files makes both programs much slower (both take ca. 1.7s on my machine). Building them with --release instead makes them perform the same, but that's of little use to me if I can't figure out how to get the same performance without building my entire project in release mode. I've tried various tweaks to the profile overrides, without success. I've also tried profiling, but all I can see is that 99% of the time is spent in wasmi's interpreter loop.
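For readers who have not used profile overrides, the configuration described above looks roughly like the fragment below. This is a sketch reconstructed from the description (opt-level = 2 for wasmi and wasmi_core in the dev profile), not a copy of the files in the tarball. The test profile inherits from dev, so the overrides apply to test builds as well.

```toml
# Cargo.toml of the embedding project (sketch, not taken from the repro tarball)

# Optimize only the interpreter crates; everything else stays at the
# dev profile's default opt-level of 0.
[profile.dev.package.wasmi]
opt-level = 2

[profile.dev.package.wasmi_core]
opt-level = 2
```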

@Robbepop Robbepop changed the title Curious performance regression since v0.32-beta.16 Curious performance regression since v0.32-beta.16 for optimized debug builds May 30, 2024
@Robbepop Robbepop changed the title Curious performance regression since v0.32-beta.16 for optimized debug builds Performance regression since v0.32-beta.16 for optimized debug builds May 30, 2024
@Robbepop
Member

Robbepop commented May 30, 2024

Hi @hanna-kruppe,

thank you for reporting the issue.

The only significant change since v0.32.0-beta.16 was #1041.
This change was important because, among other things, it significantly improved the execution performance of host function calls.
The downside is that the executor's implementation is now generic over a T, which we had avoided before because the executor's inner loop is the hot path when executing instructions via Wasmi and thus the main target for optimizations. That loop is very sensitive to the optimization settings, as you found, and the recent change may have made it more fragile despite the overall performance wins.

My suspicion is that it comes down to codegen-units=1, because with this setting the optimizer sees the entire inner loop and can apply optimizations it otherwise could not. You could check whether things improve significantly with settings like codegen-units=2 or 4. The default of 16 (256 for incremental builds) is quite high.
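A sketch of how these settings could be tried per package is shown below. It assumes the embedder already overrides opt-level for wasmi; the exact values are only a starting point for experimentation. Cargo allows codegen-units and debug-assertions in package overrides, but not lto, panic, or rpath.

```toml
# Cargo.toml of the embedding project (sketch)
[profile.dev.package.wasmi]
opt-level = 2
codegen-units = 1         # let the optimizer see the whole crate at once
debug-assertions = false  # drop debug_assert! checks in this package only
```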

Also you could try to disable debug assertions if you only care about fast compilation.

Unfortunately massive performance regressions are common with projects that highly depend on the Rust/LLVM optimizer doing its job. Explicit tail calls in Rust are probably the feature request hill I am going to die on ...

You could also try to solve your performance issues by only compiling the Wasmi dependency with the required optimization flags and everything else on default debug mode: https://doc.rust-lang.org/cargo/reference/profiles.html#overrides

If you can carve out the Wasmi usage into its own sub-crate, you could profit a lot from incremental builds: Wasmi would only get recompiled when its wrapper crate changes, not when the main crate that you work on most changes.

@hanna-kruppe
Author

Thanks for the swift response. Applying codegen-units and other common settings to wasmi and wasmi_core doesn't seem to make any difference. This may be the root cause:

The downside of this change though was that the executor's implementation is now generic over a T

If substantial parts of the executor are not codegen'd in wasmi but in my code, then that would explain why the overrides that I've tried don't help. Increasing the optimization level on the host crate fixes the regression for the minimized example. I'll check how this shakes out in the actual code base.
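The mechanism suspected here can be illustrated with a small, self-contained Rust sketch. The `execute` function is a hypothetical stand-in for a generic hot loop and is not Wasmi's API. The point is that a generic function is generally compiled where a concrete `T` is chosen, so it picks up the optimization settings of the calling crate and not those of the crate that defines it.

```rust
use std::any::type_name;

// Stand-in for the library crate: the loop is generic over the host state `T`.
// Machine code for it is generally emitted where a concrete `T` is chosen.
fn execute<T>(state: &mut T, steps: u32, mut step: impl FnMut(&mut T)) -> &'static str {
    for _ in 0..steps {
        step(state);
    }
    // Name of the concrete `T` this copy of the loop was compiled for.
    type_name::<T>()
}

// Stand-in for the embedding crate: each distinct `T` used here produces its
// own copy of `execute`, built with this crate's profile settings.
fn main() {
    let mut counter = 0u64;
    let for_counter = execute(&mut counter, 1_000, |c| *c += 1);

    let mut log: Vec<u8> = Vec::new();
    let for_log = execute(&mut log, 3, |l| l.push(b'x'));

    println!("instantiated for {for_counter} and {for_log}");
}
```

In a real project the two halves live in different crates, which is why an override on the library package alone may not reach the instantiated loop.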

@Robbepop
Member

Robbepop commented May 30, 2024

Yes, this indeed is the issue that you are facing due to the mentioned change.

One way to fix this is to carve out the Wasmi usage into its own crate. However, this crate needs to expose a non-generic API over the Wasmi implementation for your main crate to consume. This way your main crate would profit from incremental builds and your problem is likely gone. The effect is that a much smaller and more isolated part of your codebase is built with those high optimization levels.
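A minimal sketch of such a wrapper is shown below. It uses only the standard library, and the `interp` module is a hypothetical stand-in for a generic interpreter API, not Wasmi's actual types. `Guest` and `HostState` are likewise made-up names. What matters is that the wrapper's public surface contains no type parameters, so the generic loop is instantiated inside the wrapper crate and not in its consumers.

```rust
// Hypothetical stand-in for a generic interpreter API (not Wasmi's real types).
mod interp {
    pub struct Store<T> {
        pub data: T,
    }

    // Generic hot loop: compiled wherever a concrete `T` is chosen.
    pub fn run<T>(store: &mut Store<T>, iterations: u64, mut host_call: impl FnMut(&mut T)) -> u64 {
        let mut acc = 0u64;
        for i in 0..iterations {
            acc = acc.wrapping_add(i);
            host_call(&mut store.data); // simulated host function call
        }
        acc
    }
}

// Public surface of the wrapper crate: concrete types only.
pub struct HostState {
    calls: u64,
}

pub struct Guest {
    store: interp::Store<HostState>,
}

impl Guest {
    pub fn new() -> Self {
        Guest { store: interp::Store { data: HostState { calls: 0 } } }
    }

    // Non-generic entry point: `interp::run::<HostState>` is instantiated here,
    // so it is built with the wrapper crate's optimization settings.
    pub fn run(&mut self, iterations: u64) -> u64 {
        interp::run(&mut self.store, iterations, |state| state.calls += 1)
    }

    pub fn host_calls(&self) -> u64 {
        self.store.data.calls
    }
}

fn main() {
    let mut guest = Guest::new();
    let sum = guest.run(10);
    println!("sum = {sum}, host calls = {}", guest.host_calls());
}
```

The main crate would then depend only on `Guest`, and only the wrapper crate would need the higher optimization level.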

@Robbepop Robbepop changed the title Performance regression since v0.32-beta.16 for optimized debug builds Performance regression since v0.32-beta.16 for debug builds with profile overwrites May 30, 2024
@Robbepop
Member

I'll check how this shakes out in the actual code base.

Please keep me updated about it. :)

@hanna-kruppe
Author

hanna-kruppe commented May 30, 2024

Yeah, setting opt-level = 2 for dev builds of the real crate that uses wasmi largely fixes my performance problems. I'll still have to shuffle some code around to avoid repeated rebuilds of that crate, and fine-tune the trade-off between build time and edit-compile-test cycles, but neither will be a problem. So I consider this issue solved. Thanks again for the swift and helpful responses!
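One possible shape of that configuration is sketched below, assuming the code that touches wasmi lives in a workspace member. The name `my-wasm-host` is a hypothetical placeholder. Profile settings are read from the workspace root manifest, and package overrides can name workspace members as well as dependencies.

```toml
# Workspace root Cargo.toml (sketch; `my-wasm-host` is a placeholder name)

# The crate that instantiates wasmi's generic executor gets optimized too.
[profile.dev.package.my-wasm-host]
opt-level = 2

[profile.dev.package.wasmi]
opt-level = 2

[profile.dev.package.wasmi_core]
opt-level = 2
```

Keeping that crate small limits how much code is rebuilt with optimizations on each edit.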

You might want to consider adding a note about this somewhere in the documentation (e.g., next to the "production builds" section in README). If I hadn't already tried beta.16 and noticed the performance difference when upgrading, I might never have figured out that the performance I get from the profile overrides I tried was very far from optimal.

@Robbepop
Member

Robbepop commented May 30, 2024

Thank you for the update. Glad your problem is kinda fixed. Yes, a note is probably required here.

@Robbepop
Member

@hanna-kruppe Thank you for the recommendation to mention this footgun somewhere. This is now documented in the new Wasmi usage guide: https://github.com/wasmi-labs/wasmi/blob/master/docs/usage.md
