zero length bulk memory behavior incompatible with LLVM intrinsics #1482
Comments
Speaking for the Zig project, which by the way uses WebAssembly to build from source, the specification status quo is problematic for our code generation backend as well, for the exact same reasons.
Aside from just the LLVM langref, this would also seem to require an implementation of memcpy/memmove to have similar checks. The C standard doesn't directly specify the case where the length is 0, but the obvious interpretation of "copy 0 characters" would be a no-op. I think this would be worthwhile, as it would just require engines/the spec to reverse steps 12 and 13 of the execution semantics for memory.copy and memory.fill, and it would be backward compatible (i.e. allowing a program that we previously trapped on/rejected).
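To make the ordering concrete, here is a minimal sketch in C of an interpreter-style engine handling memory.fill (memory.copy is analogous). This is illustrative pseudocode only, not spec text or any real engine's code; wasm_trap and the flat mem pointer are stand-ins. Swapping the two checks is the relaxation being proposed.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the engine's trap mechanism. */
static void wasm_trap(const char *msg) {
    fprintf(stderr, "trap: %s\n", msg);
    exit(1);
}

/* Current ordering: the bounds check precedes the zero-length early return,
 * so a zero-length fill past the end of memory still traps. */
static void exec_memory_fill(uint8_t *mem, uint64_t mem_size,
                             uint64_t d, uint8_t val, uint64_t n) {
    if (d + n > mem_size)
        wasm_trap("out of bounds memory access");
    if (n == 0)          /* the proposal is to perform this step first */
        return;
    memset(&mem[d], val, n);
}

int main(void) {
    uint8_t mem[8] = {0};
    /* Zero-length fill just past the end: traps under the current ordering,
     * but would be a no-op if the two checks above were swapped. */
    exec_memory_fill(mem, sizeof mem, /*d=*/16, /*val=*/0xAA, /*n=*/0);
    return 0;
}
```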
Do we happen to know the rationale for why this route was chosen? We had a discussion about this internally recently for unrelated reasons, and I couldn't remember why we did it this way. The usual suspect here is threading, but I don't think that was the case for this one.
Previous discussion here: WebAssembly/bulk-memory-operations#124 (nothing to do with threads, we just thought at the time that this would be a minor simplification for codegen).
Looking back at those conversations, I believe that if we had realized at the time that the simplifications we were considering would have required us to emit a user space length check, we would not have made those simplifications. @rossberg, you were heavily involved in those discussions. Do you agree? What do you think of relaxing the semantics now?
My recollection is that the final discussion and decision happened at a face-to-face meeting. If I recall correctly, I initially was in favour of always allowing the zero case. But then we decided to require upfront OOB checks (IIRC, to avoid partial writes, and for well-behaved interaction with shared memory), and under that semantics, making a special case for zero actually requires more instructions to codegen and to execute in most cases, not fewer.

I believe we discussed LLVM at the meeting, but I don't recall what the arguments were. Can the LLVM intrinsic guarantee a behaviour that is somehow equivalent to OOB being checked upfront? The linked LLVM specification says nothing, so I doubt that's the case with the usual virtual memory techniques for trapping on OOB. Which means you need to generate an explicit OOB check anyway, and skipping it for length zero is not a simplification. Or am I missing something?

PS: I would also argue that an impedance mismatch with LLVM alone is not sufficient reason to make a change. Since the current semantics ultimately requires less code, there would have to be more evidence of consistent problems across various consumers to go with something that's worse in principle.
Yes, I guess having the semantics do a bounds check except in the case of zero length would mean the engine has to emit the same length check we are trying to avoid having to emit in user space. So the only benefit of making this change to the semantics would be a tiny code size saving, assuming the engine can't make the length check any faster than it can be in user space.
I believe this is a good point. It would mean that, regardless of a change to the WebAssembly specification, LLVM would still have to emit a check for zero length (a check it is currently missing) to comply with its specified semantics.
While I agree that LLVM alone is not a sufficient reason to make a change, this affects any compiler or code-generation tool wanting to target WebAssembly (for languages where a zero-length memcpy with an out-of-bounds pointer is legal). For example, Zig also has its own WebAssembly backend besides its LLVM backend. It currently has to implement this same check there, just as LLVM does. I believe Odin has the same semantics. So while it's true that changing the semantics wouldn't be a simplification with regard to LLVM, it would be for any other toolchain wishing to target WebAssembly that holds to the non-trapping semantics for zero-length OOB operations.
My understanding is that it would only be a simplification for backends that happen to exactly match the Wasm semantics, both in terms of the zero case and the upfront OOB check. For all others, making a special case for zero length would in fact not only not be a simplification, it would be an extra complication.
I don't really understand why there is a need to match semantics to the upfront OOB check. To clarify, we're talking about lowering llvm.memcpy to memory.copy, not about lowering memory.copy to llvm.memcpy. The latter obviously can't work without additional checks. The former works apart from the zero case -- the non-zero OOB case would be UB, so lowering that to a trap is fine. (llvm.memcpy is the LLVM-specific intrinsic here, but as noted above, other compilers also specify this operation as accepting invalid pointers for zero-length copies.)
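To make the lowering direction concrete, here is a sketch of the guard a producer currently has to emit around the lowered operation. wasm_memory_copy is a hypothetical stand-in for emitting the raw memory.copy instruction, not a real intrinsic in any particular toolchain.

```c
#include <stdint.h>

/* Hypothetical stand-in for emitting a raw memory.copy; may trap on OOB. */
extern void wasm_memory_copy(uint32_t dst, uint32_t src, uint32_t len);

/* llvm.memcpy (and similar intrinsics) are no-ops when len == 0, even with
 * out-of-range pointers, so the trapping instruction must be skipped then. */
void lowered_llvm_memcpy(uint32_t dst, uint32_t src, uint32_t len) {
    if (len != 0)
        wasm_memory_copy(dst, src, len);
    /* Under the relaxed semantics discussed here, the guard could be dropped
     * and the instruction emitted unconditionally. */
}
```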
This is a load-bearing assumption. There could be many reasons that a runtime does not actually need to do a length-zero check, or even a trap check on the pointer. Here is one, for example: Zig's wasm2c (discussed further below). Having the check in userspace prevents the possibility of such optimizations by runtimes, because even if they were able to omit it, the userspace code would pay the cost.

Semantically, a trap happens to prevent a read or write from an invalid memory address. A zero-length operation does not actually read or write from an invalid memory address. We all understand this intuitively. The only reason not to codify it into the spec would be some indisputable, pragmatic reason, for which we are willing to compromise the null hypothesis in order to satisfy some real-world use case. However, in this thread you have 3 examples, including LLVM, where pragmatism actually points the other direction.
I think the original reasoning for the current semantics is clear from our previous discussions - the Wasm-level check requires fewer machine instructions. So if most toolchains were happily able to target memory.copy directly, the current semantics would be the cheaper choice overall.

It's worth noting that, as discussed in WebAssembly/bulk-memory-operations#145, the C standard says that a zero-length memcpy with invalid pointers is still undefined behaviour.
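As a concrete illustration of that C-standard point (an illustrative example, not part of the original comment): because memcpy's pointer operands must be valid even when the byte count is zero, an optimizer may assume they are non-null and optimize accordingly.

```c
#include <stddef.h>
#include <string.h>

int is_null_after_copy(int *dst, const int *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
    /* memcpy's operands must be valid even when the count is zero (C11
     * 7.24.1p2), so a compiler is entitled to assume dst != NULL here and
     * fold this comparison to 0. */
    return dst == NULL;
}
```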
Conversely, having the check in userspace allows it to be optimized out by frontends or Binaryen without engines having to do anything.
Can you point to any real-world runtimes where the check would be needed? I've already pointed to one that would not need it. It's also easy to imagine the opposite: the requirement to trap requires an extra check on pointer validity where an early return on len==0 would be preferred by the runtime.
Huh? No, it requires more, not fewer. Example: ziglang/zig#16345
When I say "fewer machine instructions", I'm talking about compiling Wasm
The point is that the current semantics falls out naturally from doing the single check offset + length <= memory size. To be clear, changing the current semantics would involve adding a check in the machine code generated for Wasm memory.copy and memory.fill.
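A rough illustration of that point (names and types here are illustrative, not any particular engine's codegen): the validation an engine performs for memory.copy under the two semantics being compared.

```c
#include <stdbool.h>
#include <stdint.h>

/* Current semantics: one range comparison per operand is the whole check. */
static bool copy_in_bounds_current(uint64_t dst, uint64_t src, uint64_t len,
                                   uint64_t mem_size) {
    return dst + len <= mem_size && src + len <= mem_size;
}

/* Proposed semantics: the same comparisons, plus an extra len == 0 exemption
 * that the engine's generated code would now have to include. */
static bool copy_in_bounds_proposed(uint64_t dst, uint64_t src, uint64_t len,
                                    uint64_t mem_size) {
    return len == 0 || (dst + len <= mem_size && src + len <= mem_size);
}
```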
Here is a real world example: wasmtime. I agree, this specification change would require adding a check to this particular runtime.

Let's tally up the real world examples provided so far:

Producers that would save an extra check: 3 (LLVM, Zig, Odin)
Runtimes that would need the extra check: 1 (wasmtime)

My other point is that this optimization...

```diff
--- a/crates/runtime/src/instance.rs
+++ b/crates/runtime/src/instance.rs
@@ -844,6 +844,7 @@ impl Instance {
         val: u8,
         len: u64,
     ) -> Result<(), Trap> {
+        if len == 0 { return Ok(()); }
         let memory = self.get_memory(memory_index);
         let dst = self.validate_inbounds(memory.current_length(), dst, len)?;
```

...is not allowed by the spec since it is required to trap.
@andrewrk, IIUC, Zig's wasm2c implementation does not match the current spec because it can do a partial write before going OOB, and that partial write is observable after the trap. Spec compliance would require it to emit an OOB check before the memmove/memset/etc., and if we made the change under discussion here, that OOB check would also have to add a special case for length 0, just like wasmtime's implementation.
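For concreteness, a sketch of a spec-compliant C lowering of memory.copy under the current semantics (an illustration only, not Zig's actual wasm2c output): the whole range is validated before any byte moves, so a trap can never leave a partial write behind.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical trap helper, as in the earlier sketch. */
static void wasm_trap(const char *msg) {
    fprintf(stderr, "trap: %s\n", msg);
    exit(1);
}

static void lowered_memory_copy(uint8_t *mem, uint64_t mem_size,
                                uint64_t dst, uint64_t src, uint64_t len) {
    /* Bounds are checked up front, before copying anything. */
    if (dst + len > mem_size || src + len > mem_size)
        wasm_trap("out of bounds memory access");
    memmove(&mem[dst], &mem[src], len);  /* len == 0 is already a no-op here */
    /* Under the proposed relaxation, the check above would additionally need
     * a len != 0 condition so that zero-length OOB ranges do not trap. */
}
```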
I'm not really happy with the above argumentation - every reasonable machine-code-producing runtime would need an extra check with the proposed semantics (e.g. V8, SpiderMonkey, WebKit, etc.), and every producer using LLVM needs an extra check with the current semantics (e.g. Wasm backends for C/C++, Zig, Odin, etc.).

To reiterate, I think we can reasonably consider changing the semantics, if the trade-offs make sense. I don't think a raw tally of "machine-code-producing runtimes" vs "LLVM-based Wasm producers" is the right signal, however.

EDIT: a much stronger signal for me would be "do programmatic zero-length checks (inserted by Wasm producers instead of being part of the runtime) give rise to meaningfully different Wasm binary sizes, runtime performance, or optimisation opportunities?" This is potentially a pessimisation for small non-zero-length copies, although I'd expect that any difference between the two semantics discussed here would be "in the noise" for most end-to-end benchmarks.
I can provide one such data point: our 2.4 MiB wasm executable for bootstrapping the Zig compiler gained 3.6 KB of file size from adding this check. I would consider this negligible. The wasm file is constructed via LLVM with -Os optimizations and then further optimized. As for performance, I also compared how long it takes to compile the output of wasm2c with gcc for the two versions of this wasm file.

Edit: I still think this check belongs in runtimes, not producers.
Given a choice between two semantics that are both safe, but one of which implies an additional check for a Wasm engine to perform (on essentially all hardware), we've generally opted to save the check at the Wasm level and push it up to producers. The reasoning is that a check at the Wasm level costs all producers, while a check at the producer level only affects some - in particular, a producer may have a more powerful static analysis that can remove said check (in this case, for 0-byte memcopies).

I am against making a change to the semantics here because it's a breaking change (something no longer traps [1]) that would only benefit Wasm binary size (for non-optimizing producers). If we want a zero-length checking memory.copy [...]

[1] Applications can and do use traps as a safety check mechanism - by changing behavior, this breaking change could cause applications to execute in unanticipated ways.
I think the current semantics are somewhat unfortunate, because we are turning something which could be a complete no-op into an address bounds check without a corresponding memory access. Even if it can be eliminated, it still carries some (small?) cost, without an obvious benefit. I guess for me the issue is not when/where to emit the check, but why the check is necessary at all.

On the other hand, I think the LLVM mismatch isn't that big of a deal - LLVM would remove the intrinsic call if it can prove the length parameter is zero (and possibly in some other cases), which is valid in C; the invocations that can't be removed are just passed along to the runtime library as-is. While those might trap in some cases in Wasm, that doesn't make LLVM's transformations wrong, at least in the sense of not turning an invalid program into a valid one.
@andrewrk, you seem to be ignoring the main observation made several times in this thread. Let me repeat @tlively's question from above: does the Zig implementation perform the upfront OOB check that is also required by the Wasm spec?

@penzn, AFAICS, the assumption that the current semantics leads to more checks is just false. When actually taking the OOB check into consideration, it is in fact the opposite, which is why we originally went with the current semantics. The only case that could benefit from the change would be a mem.copy whose length is statically known to be zero, because that could be turned into a nop. All other cases (i.e., the useful ones) would pay with an extra conditional.
IIUC, Zig -> LLVM -> Wasm should be performing this check already, and @tlively's observation only applies to the stripped-down wasm2c that's (only?) used in Zig's bootstrapping process.

EDIT: In the case that the length really is statically known to be zero, in principle analogous optimisations could be carried out whether the zero-length check is producer-level or engine-level. I think the two ways of distinguishing the two approaches are: [...]
I'm not sure which comment you're referring to, since I don't see a question directed at me, but I already answered your question here: #1482 (comment)

It's important to remember that producers and consumers have different relationships with the specification. Zig's wasm2c is not intended to be a specification-compliant, general-purpose tool. Its only purpose is to take a specification-compliant WebAssembly executable as input and convert it into C code that can then be used to bootstrap our compiler. You might be tempted to dismiss this use case because it's not a general-purpose wasm runtime, but it's important to remember that one of the main benefits of a specification is that it allows for tools exactly like this - ones that rely on the other party adhering to the specification and then do whatever happens to be useful. This is one reason I think that if there needs to be an extra branch on one side of the fence, it should be on the runtime side.
Now I'm really confused. Are you asking that the Wasm standard should be changed in order to enable an intentionally non-compliant implementation to do a certain optimisation, while imposing extra costs on all compliant implementations? Moreover, if an implementation is non-compliant anyway, why can't it relax this rule as it sees fit?
@andrewrk, he was referring to my comment #1482 (comment), pointing out that Zig's wasm2c implementation, which you provided as an example runtime that "does not actually need to do a length zero check, or even a trap check on the pointer", does not appear to be spec-compliant, and that making it spec-compliant would require adding a trap check but not a zero check under the current semantics, or both a trap check and a zero check (to make the trap check not apply in that special case) under the semantics you are arguing for.

Edit: But I guess you already figured out what he was referring to, so now I'm confused as well, for the same reason as @rossberg.
I think y'all are confusing yourselves.
This framing of the problem is undisputed. The only thing being discussed in this thread is what pragmatism looks like. So far I'm the only one who has provided any real-world examples and any real-world performance measurements. I don't work on a web browser or a general-purpose wasm runtime, so that's not really what I bring to the table. I invite those representatives to do their part.
One dynamic that isn't surfaced in the above framing - there's a bias towards the incumbent semantics because it's relatively expensive to coordinate a semantic change across Wasm runtimes (consider - users can choose the version of toolchain they use to produce Wasm, but by and large not the version of browser that the Wasm runs in). So it's not totally accurate to frame maintaining our current semantics as "compromis[ing] the null hypothesis" - if anything the burden of proof is the other way round (edit: hence my comments on "signals" above).
I think we've made it very clear that spec-compliant runtimes would have to emit an extra zero check if we exempted zero-length copies from the normal up-front bounds checks, so this change is performance-neutral at best where toolchains would previously have had to emit the zero check themselves, and a pessimization where toolchains already would have been able to avoid a zero check. @andrewrk, do you agree with this so far? You have not acknowledged these points yet.

The only way to make exempting zero-length copies from bounds checks an optimization would be if we stopped requiring up-front bounds checks and allowed the bounds checking to be done via trap handlers, possibly after a partial write. But that's a much bigger change and one we explicitly decided against when bulk memory operations were standardized, so I would not want to open that back up for discussion without a great deal of new information.
I think I've contributed as much as I can to this issue. I'll leave you all with a summary of my position: [...]
Thanks, @andrewrk. FWIW, I think your position that runtimes should do more so toolchains can do less is entirely reasonable, but it is not the position WebAssembly has historically taken when presented with similar trade-offs. I'll go ahead and close this issue, but if anyone does have new information to contribute, feel free to reopen it.
Original issue description:

It was raised in llvm/llvm-project#63755 that LLVM's memcpy and memset intrinsics are specified such that a zero-length access is a no-op, whereas memory.copy and memory.fill are currently specified such that they will trap if offset + length > memory size, i.e. they can trap for zero-length accesses beyond the end of the memory. This means that to correctly lower LLVM's intrinsics under the current semantics, we would in general have to emit a length check and only execute memory.copy or memory.fill if the length is not zero.

It would be nice to be able to emit memory.copy and memory.fill without the surrounding length check, as we (incorrectly) do today. One way to achieve that would be to relax the WebAssembly semantics so that these instructions never trap when the length is zero, just like the LLVM intrinsics. Would there be any appetite for making that change?