-
Notifications
You must be signed in to change notification settings - Fork 691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
128-bit arithmetic #1522
Comments
From the get-go, I know of two non-cryptographic uses that might benefit from this proposal:
For both cases above, having a pair of |
A benchmark of (1) shows that baseline wasm today is 360% slower and with the instructions in this proposal it's only 10% slower than native. For (2) I wasn't able to understand at-a-glance how to get a 128-bit multiply, but given the results seen here if it uses 128-bit multiplies I suspect it'll see good improvements too though. |
Would need to adapt it to reduce a 64-bit value instead. Something along the lines of: uint64_t reduce64(uint64_t x, uint64_t N) {
return ((uint128_t) x * (uint128_t) N) >> 64;
} |
Ah I see makes sense! In that case I think it's a similar kernel as to (1) so sounds good 👍 |
It is a known deficiency of Binaryen that it does not handle multivalue well. I would like to get this fixed, but anything more than incremental improvements would require a fundamental change to Binaryen IR, so it's not a quick fix. Are there other frameworks you have in mind that similarly trip over multivalue results? Also, the proposed instructions also have multiple results; were you observing that they weren't as problematic for some reason? You mentioned the possibility of adding an i128 type, but I agree that this would be more heavyweight than we want. A similar alternative would be to use the existing v128 type to represent the 128-bit integers. Having the type of the instructions be |
The reason I didn't spend much time thinking about Our mid-end optimization framework in Cranelift currently does not optimize multi-result instructions at all; Alex observed that LLVM has a nice table-driven code generator that only works for single-result instructions and requires writing a bunch of C++ by hand if you want to define optimization rules for multi-result instructions; and I think Alex had mentioned other compilers with similar restrictions too. Our benchmarks showed that we could get nearly native performance with our proposed form of these instructions despite not implementing any optimizations, which is why we think they're a good choice even though they still have multiple results. |
(Jamey and I raced so consider this more background, but not adding anything new, over what he already mentioned)
I was additionally considering Cranelift and LLVM here. I can clarify and say that supporting multivalue isn't necessarily the hard part, most frameworks I've seen support it one way or another. The trickiness is specifically around optimizing instructions that produce multiple values. For LLVM I'm not deeply familiar with backends and I have no doubt that LLVM optimizes very well, but my impression is that multi-value instructions are somewhat non-standard in the sense that writing lowerings for multi-value instructions requires writing C++ instead of using
For Cranelift it has s-expression-based lowering backends such as this rule on x64 which avoids materializing flags into a register and then testing them for a conditional jump (contrasted with the fully general rule below which does just that). This instruction matching though only works for single-result instructions, so I was finding it difficult to figure out how to pattern-match this in rules to figure out how to optimize well. Cranelift, like LLVM, supports multi-value well enough but unlike LLVM doesn't have much support for optimizing the results.
The main issue was optimizing instructions with multiple results. Initial findings have shown that the Contrasted with the proposed 128-bit instructions optimizations around multiple results weren't necessary. The "naive" lowerings of 128-bit operations were sufficient to get drastically better performance over baseline wasm today. While optimizations are still possible and compilers may also wish to implement, it doesn't seem to be as critical for good performance. |
This would be a useful addition to the spec, and one we've had requests for but haven't prioritized in the past. Do you have a sense for what the performance would look like for a naive implementation of these operations on a 32-bit platform? |
Not at this time, the current runtime used to benchmark, Wasmtime, doesn't have support for any 32-bit platform so at this point it's just hoped that implementing 128-bit ops in the host wouldn't be slower than in-wasm. I think that's definitely something to track and follow-up on if this proposal goes through the phases. |
As Alex said that's more difficult for us to prototype since Wasmtime/Cranelift does not currently support any 32-bit platforms, but we can look at what native code a toolchain like Rust/LLVM produces for equivalent operations today. I expect even unoptimized implementations of this proposal to be able to emit the same instruction sequences. With Multiplication is more complicated, naturally: on x86-64 that toolchain is producing 3 multiplications and 2 additions, while on i686 it's 10 multiplications and 14 additions. I believe both cases are optimal for general long multiplication. What about optimizations for special cases? For example, multiplying two operands where the upper 64-bits of both are known to be zero, as would be handled by the more specialized I think the naive implementation that we benchmarked for This answer is hand-wavy but I hope it helps with building some intuition for how the performance of this proposal might turn out on 32-bit targets. |
Thanks that helps. The thing I don't have an intuition for is in the context of the benchmark shared, on platforms where the 64-bit values will have to be lowered to 32-bit registers, and for multiple values - how much of a factor is register pressure and how does it contribute to the performance? I do agree with you though that having the abstraction is better than not. |
Thanks for working on this, I am happy to see this addition as well! It seems this use case might actually be a better match for the more direct "overflow flag" approach, since in most cases you really do just immediately want the value of the flag to branch on rather than to feed it into another arithmetic operation. |
@dtig, I think it'll be difficult to compare register pressure between 32-bit and 64-bit targets. If the source program needs a 128-bit multiply, that's going to require a certain minimum number of registers on a given machine, no matter which instructions we offer and regardless of what optimizations a runtime provides. Register pressure is only relevant if we choose an alternative that makes it harder to reach that minimum number of registers, such as if we offer an instruction that is a poor fit for what programmers are trying to do. Currently, for 128-bit multiplies LLVM produces a function call into compiler-rt (so caller-saved registers have to be spilled to memory) that does several 32-bit multiplies and several 64-bit multiplies, and passes results back through linear memory on the shadow stack. Compared to that, any alternative that keeps all values in the minimum number of registers is going to perform significantly better, on every platform whether it's 32-bit or not. Ideally we'd compare the code from a reasonable wasm runtime against what the native toolchain produces for the same architecture. Alex and I aren't prepared to do that for 32-bit targets right now, but we'd love help with that evaluation and we'll dig into this more as the proposal develops. @dschuff, I've been fascinated by what code LLVM's wasm backend currently generates for getting an overflow flag for both i32 and i64 addition. It doesn't widen, and it doesn't branch either, assuming you can implement If we add the "overflow flag" alternative then we can reduce both patterns to a single instruction each, which might be good for wasm code density if these operations are used frequently. That said, I currently suspect that optimizing the native code generated for these patterns is already not too difficult without adding new instructions, since there's data-flow connecting all the relevant parts. So I'm currently uncertain whether there's a strong justification for extending the instruction set for those particular operations. But we haven't yet found a benchmark where this pattern is performance-critical and we'd appreciate any suggestions! |
Hm, good point about the comparing the result to the operands; I was thinking of what checked-math libraries typically do in C, where that's exactly what you can't do because of UB. I would think that basically any code compiled with UBSan would exercise this pretty heavily; or if you want to limit to just integer overflow, compile with
On wasm today we have something like
The 64-bit version has a similar sequence for add. It uses a compiler-rt call for multiply, and I think it could use the proposed 128-bit multiply in its current form similarly. On x86, those sequences reduce to an add or mul, followed by a I wonder if it would be worth looking through various requests we've had for these features in the past, and see if there were other concrete use cases mentioned. |
There has been a proposal to standardize cryptographic operations, but it didn't progress as browsers have native crypto which doesn't need wasm support, I really suspect it would be the same here.
How often are these particular algorithms used? NumPy is using SFMT if I am not mistaken. |
@dschuff for some comparisons (hah!) these are some godbolt links showing some various steps of overflowing arithmetic and comparisons (using Rust as I'm more comfortable with that than C but it's the same LLVM IR) This is also just comparing Wasmtime's native output to LLVM's, not other wasm compilers'. For the unsigned case the code is "pretty good" but uses Given this I suspect that the best way to evaluate the performance of this would be to find a benchmark in wasm bottlenecked on signed integer overflow. The two related issues I found in this repo were #1021 and #1472. I suspect it's correct that overflow flags and saturating flags and such are much easier with Thanks for the tip about @ppenzin while I agree that the motivation for browsers to assist with improving wasm-based crypotography is low due to the existence of JS APIs both @jameysharp and I work on Wasmtime which is an out-of-browser WebAssembly runtime which doesn't have the ability to use such APIs as a fallback. |
I did a small benchmark today. I took the wasm-tools repository which has a few benchmarks for parsing/validating wasm files. I compiled the benchmarks to native and WebAssembly with and without Overall on this particular benchmark, with Wasmtime, wasm sees about a 50-60% slowdown relative to native. With That's perhaps a gut-check that overflow checks and |
The Lwan web server uses (2) in a perfect hash table generator, and (1) in its PRNG, FWIW. |
I've looked into Lwan in the past, it is pretty cool. What is the relationship to wasm here though - is it expected to run in wasm? One more comment in general here - wasmtime is just one standalone engine, @alexcrichton @jameysharp do you know if the same perf gap would persist in other engines? For example engines that use LLVM (WAMR & Wasmer with that backend turned on), or even something like wasm2c+clang? |
A new repository for this now-phase-1 proposal has been created at https://github.com/WebAssembly/128-bit-arithmetic (thanks Thomas!) I'm in the process of moving information over to there. I'm going to try to capture open questions and routes-to-investigate in GitHub issues and the current state of the proposal in Interested parties from this issue are encouraged to browse the repository, open issues, etc. I'll come back to this issue once all the various content has been moved over and this can be closed. |
Ok the repository has been filled out with an overview in WebAssembly/wide-arithmetic#8, issues are opened at https://github.com/WebAssembly/128-bit-arithmetic, and there's some more discussion of overflow flags at WebAssembly/wide-arithmetic#10. @jameysharp if you're ok with it I think it'd be ok to close this issue now in favor of the proposal repository. |
Just for your information, being consistent with your experiments, existing compilers usually do badly in keeping carry flags in different iterations of a loop. And that's why GNU MP Bignum Library still maintains a lot of assembly code in their repo(amd64, arm64) for simple big number addition. Actually, the costs are avoidable because it's possible to use instructions that don't destroy the flags to continue the loop. It would be a great victory if someday we could add the missing optimizations for keeping the carry flags between loops, especially in a compiler backend for WebAssembly. The 128-bit arithmetic instructions are good alternatives to these low-level primitives ( |
I agree with @waterlens. Bignums are a feature of various languages these days, and libraries like GMP or LibTomMath are used in many other cases as well. It would be a bummer to invent overly specialised (non-CPU) primitives that happen to work for 128 bit but are useless for implementing larger number types. |
Thanks for supporting my opinion! But I want to clarify that i128 operations are certainly helpful in improving performance of big number operations. However, it's not that good from a semantics perspective. A high-performance compiler must identify that "Oh, this program used i128 operations, but because some of the inputs are constant 0 or 1, I now know it actually wants some other operations for 64-bit integers.", which sounds weird. |
FWIW I've opened WebAssembly/wide-arithmetic#6 for further discussion on flags-vs-not and the overview now has a section explaining in more detail the rationale for not choosing overflowing operations at this time. One perhaps interesting point is that if GMP, even today, maintains handwritten assembly despite high-quality compilers like LLVM/GCC that would imply to me that the space for representing the desired operations in low-level primitives has been relatively well explored and the conclusion was that handwritten assembly is still the best option. Given that while GMP could certainly continue to write handwritten instructions it would be extra important that wasm engines compile to the exact desired native instructions which, given my investigation so far, will likely be significantly difficult. |
GCC is not a high-quality compiler for big integer / cryptography and has consistently been lagging behind for years behind Clang on carry flag handling (see mratsim/constantine#357). Big integer code compiled with GCC can easily be 25% slower than Clang. This may change with C23 and the internal refactoring needed to efficiently support _BitInt for builtin C wide-integer. |
Summary
Provide limited instructions for 128-bit integer arithmetic.
Motivation
Two use cases of 128-bit arithmetic known are:
Some benchmarks are below but currently as-compiled to wasm today these use cases are 2-7x slower than native code today. In native code 128-bit operations use tight control over the native flags register to achieve much better performance than wasm is able to express today.
The major goal of this proposal is to close the peformance gap between native and wasm in these use cases where 128-bit arithmetic is in a hot loop.
Proposal
Add the following new instructions with a type signature of
[i64 i64 i64 i64] -> [i64 i64]
. Each instruction represents 128-bit integers as a pair ofi64
, but otherwise is identical to the corresponding narrower operation.i64.add128
, extended fromi64.add
i64.sub128
, extended fromi64.sub
i64.mul128
, extended fromi64.mul
Example
This function does a 64-bit integer add and returns both the 64-bit result and the unsigned carry, roughly equivalent to Rust's
u64::overflowing_add
:Rationale and Alternatives
We evaluated several alternatives and so far this one appears to be the simplest to specify and implement, while also giving the best real-world performance in our initial benchmarks.
We first considered adding an "overflow flag" output to a 64-bit arithmetic instructions, such as
i64.{add,sub,mul}_overflow_{u,s}
of type[i64 i64] -> [i64 i32]
, where thei32
result is a 0/1 boolean indicating whether the addition overflowed. This style of instruction is what native platforms often have for implementing these 128-bit operations so this seemed like a possibility of exposing directly to wasm.We also considered a "carry flag" on both input and output, such as
i64.{add,sub}_with_carry_{u,s}
of type[i64 i64 i32] -> [i64 i32]
. This is, for example, theadc
instruction on x64 and how 128-bit addition is implemented.We found that it was nontrivial-to-difficult to generate good code from these instructions, as shown in the benchmark results below.
In contrast, with 128-bit operations the need to fuse with other instructions is much lower and it seems that "simply having support" is a pretty good sweet-spot both performance and complexity-wise.
In addition, we considered a "wide multiply"
i64.mul_wide_{u,s}
with type[i64 i64] -> [i64 i64]
that takes twoi64
inputs and produces both the low and high halves of the 128-bit result. This is the minimal operation needed for arbitrary-precision multiplication. It is straightforward to build a full 128-bit multiply from such an instruction, and the naive implementation of this instruction performed better than thei64.mul128
instruction benchmarks below. This maps closely to the x64mul
instruction as well as equivalent instructions on AArch64 for example. While this "primitive" worked for multiplication the corresponding primitive for addition/subtraction did not work out (the*_overflow_*
instruction above), so we opted to addi64.mul128
instead of this instruction.Another possible alternative is to add a native
i128
type to WebAssembly. Given the relatively niche nature of 128-bit integers, however, coupled with the pretty good code compilers already emit for 128-bit operations in source languages, we see this alternative as far too heavyweight for the benefits that it could bring. By just adding a few specialized instructions for the slow use cases it's possible to get much closer to native performance than wasm is at today.Benchmark Results
This proposal has been prototyped in LLVM, Wasmtime/Cranelift, and Rust (all just living in forks for now) with the intention of providing guidance of the direction this proposal should go in. To that effect many of the above alternatives, as well as this proposal itself, were all implemented and evaluated.
Each benchmark below frobs various custom flags in LLVM to ensure that only a certain alternative of this proposal is implemented. Output was hand-verified to contain the desired instructions and profiling confirmed that Cranelift was generating the expected instruction sequences.
Benchmarks were collected on
Intel(R) Core(TM) i9-14900K
on a Linux machine. Other platforms will be tested later if this proposal advances through the CG stages.Benchmark: addition/subtraction
For this benchmark the Rust
num-bigint
crate was used. In the Rust source for this benchmark, each pair of limbs is first cast fromu64
tou128
, then added, and finally split into the low and highu64
halves. In my experience this is a common pattern for arbitrary-precision arithmetic implementations in both C and Rust.We compiled the benchmark to native x86-64 (to use as a baseline), to current WebAssembly, and to each of three variations on the proposed extension. We compared runtime for each variant of WebAssembly against the native code baseline and report how much slower than native each variant is (lower is better):
i64.add_overflow_u
- 88%i64.add_with_carry_u
- 155%i64.add128
(this proposal) - 18%Here current WebAssembly starts at 125% slower than native and the 128-bit ops in this proposal were able to close that gap to 18% slower than native. Other shapes of this proposal, as naively implemented in LLVM/Cranelift, would require more advanced optimizations to get closer to native performance specifically around management of the flags register.
Benchmark: multiplication
For this benchmark the
blind-sig
benchmark in the Sightglass repository was selected to test. This benchmarks is bottlenecked on the__multi3
helper in compiler-rt in LLVM which is a library implementation of 128-bit multiplication compiled to WebAssembly.Benchmarking is performed as above where numbers here are reported as slowdowns relative to native code where lower is better:
i64.mul_wide_u
- 100%i64.mul128
(this proposal) - 125%Here the 637% slower-than-native gap was closed to ~100%. The alternative instruction
i64.mul_wide_u
was the best performing but this is because Wasmtime is missing some easy-to-implement optimizations for full 128-bit multiplication (e.g. here the upper halves of the 128-bit numbers are always zero). We're confident with a minor amount of elbow-grease the proposedi64.mul128
instruction can be as good asi64.mul_wide_u
.Open Questions
Are there other use cases where instructions like these would be useful?
Should we introduce an
i128
type and require that all integer operations be implemented for it, or stick with the most useful subset and represent values as pairs ofi64
, as proposed above?Are there other motivating use cases for
*_overflow_*
or*_with_carry_*
instructions? Ones that perhaps won't require "advanced flag optimizations" to showcase their benefit?Meta
This proposal is a join effort of @alexcrichton (Fermyon) and @jameysharp (Fastly). We intend to start discussion with this issue and then bring a more complete proposal to a future CG meeting.
This issue is also related to historical discussions such as:
The text was updated successfully, but these errors were encountered: