Trapping on zero-length writes? #145
Comments
This was previously discussed in #124, and was actually changed to the current behavior in order to simplify the spec. Can you explain how this hurts performance? The spec mandates a bounds check before any writes are performed (#111), so performing a copy without a temporary and relying only on signal handling is not feasible. And so if you're performing an explicit bounds check of `offset + len < memory_size` …
That's not true. If you do a backwards copy (i.e. starting from the end rather than the front), you can avoid a bounds check and avoid partially writing any memory in the case of a partially invalid copy.
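A minimal sketch of the backwards-copy idea (my illustration, not code from any particular engine):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: copy len bytes from src to dst, last byte first.
 * If dst + len - 1 (or src + len - 1) is out of bounds, the very first
 * access faults, so an engine relying on signal handling observes
 * "no partial write" without an explicit up-front bounds check. */
static void copy_backwards(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t i = len; i > 0; i--)
        dst[i - 1] = src[i - 1];
}
```

Because the copy starts at the highest address, an out-of-bounds tail faults before any in-bounds byte has been modified.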
Yes, that is true. In implementations that always use an explicit bounds check, the check would be slightly more complex, but only slightly so. It's a check of …
You cannot always do a backwards copy, as memory.copy must handle overlapping ranges correctly. A backwards copy of an overlapping range where the source is after the destination will not copy correctly.
Yes I agree, it's a minor thing. The larger issue is the feasibility of signal handling without a temporary copy.
You can unconditionally do a backwards copy for …
Okay, could you maybe elaborate on what exact code you want to generate, and how the spec is limiting you? Focusing on memory.copy would be useful as it's the trickiest case. There are a lot of possible implementations here depending on whether you treat constant lengths specially, branch on dst/src, emit inline loads/stores, emit inline loops, make a VM call, or use signal handling.
Hmm, I thought I was pretty clear already. In WASM implementations that detect memory overflow traps using CPU interrupts on bad memory accesses (via signal handling or otherwise), requiring a trap on zero-length writes forces those implementations to add an additional bounds check that they wouldn't otherwise require. This applies not only for …

Addressing your initial question: the requirement to avoid partial writes has no bearing on the feasibility of using an interrupt-based trapping mechanism, since you can perform the write in reverse.
With the notable exception of eliding bounds checks in the instances where the runtime can detect ahead of time that the parameters don't cause an overflow (e.g. constant values, loop invariants), what I'm suggesting always applies regardless of those different configurations.
Yes, I understand that; I'm just trying to understand the impact on performance of not special-casing …

SpiderMonkey has two different paths for memory.copy/fill currently. If the length is a small constant, …

It's possible that dropping the bounds check and using signal handling in the VM call (therefore allowing …

However, the VM call path has appeared to be heavily memory-bound so far. Our inline path handles the small copies where overhead really kills us, and knows the constant value of the length in order to handle …

If you have performance data, that would really help here. I'm sure our implementation could be improved, but I haven't seen …
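To make the shape of that concrete, here is a rough illustration of such a two-path strategy (my sketch, not SpiderMonkey's actual code; the threshold, names, and layout are invented):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: a small, compile-time-constant length is handled
 * inline, everything else goes through an out-of-line VM call. */
#define INLINE_COPY_LIMIT 16

static uint8_t *mem_base;   /* start of linear memory */
static uint32_t mem_size;   /* current size in bytes  */

static void trap(void) { abort(); }

/* Out-of-line path: bounds check plus bulk copy. */
static void vm_memory_copy(uint32_t dst, uint32_t src, uint32_t len) {
    if ((uint64_t)dst + len > mem_size || (uint64_t)src + len > mem_size)
        trap();
    memmove(mem_base + dst, mem_base + src, len);
}

static void memory_copy(uint32_t dst, uint32_t src, uint32_t len,
                        bool len_is_small_constant) {
    if (len_is_small_constant && len <= INLINE_COPY_LIMIT) {
        /* Inline path: cheap check, then a handful of loads/stores. */
        if ((uint64_t)dst + len > mem_size || (uint64_t)src + len > mem_size)
            trap();
        memmove(mem_base + dst, mem_base + src, len);
    } else {
        vm_memory_copy(dst, src, len);
    }
}
```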
I'm opening the issue with the understanding that the performance improvements are potential and abstract. I could produce a microbenchmark where the performance improvement would be tangible, e.g. calling …

The major point is that the current spec unavoidably prevents an optimization from which some use cases may benefit. We can at least agree on that. Whether those use cases are worth considering is not really my concern. I raised this issue with the understanding that the bulk operations have been specced for the purpose of maximizing performance, and if that is true, this particular detail hinders that goal.
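For what it's worth, a sketch of the kind of microbenchmark that could make the difference tangible (my own construction; the specific workload is an assumption):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* A hot loop of zero-length copies.  An engine that must honour the
 * zero-length trap cannot rely purely on signal handling and has to
 * emit an explicit check for each call; under the relaxed semantics
 * the zero-length case needs no check at all. */
int main(void) {
    static char buf[4096];
    volatile size_t len = 0;          /* keep the length opaque to the optimizer */
    clock_t start = clock();
    for (long i = 0; i < 100000000L; i++)
        memmove(buf + 1, buf, len);
    printf("%.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}
```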
Okay, I may be screaming into the void here, but after reading some of the discussions related to this issue I think there are deeper semantic issues with the current spec beyond performance. I humbly offer my recommendations: …
Some rationale: allowing partial writes is the only spec that is consistent across single-threaded and multi-threaded contexts. This would also bring it semantically in line with …

Never trapping on zero-length writes makes …

LLVM should not use …

Since …

I realize these recommendations would result in some churn on the spec, so I give them lightly and with modest expectations. Just offering these suggestions in case there was any doubt about the current spec.
There's a long discussion on #111 about this, perhaps it would be useful to state how you disagree with the conclusions there? In particular, it sounds like the primary design choice here was to reduce non-determinism, at a small up-front cost. (That said, I was away during most of this discussion, so I'm less familiar with how things progressed.)
My understanding from @tlively is that this is already the case.
I see the benefit of trying to align with C's …
Yes, small constant-sized memcpy/memmove are lowered to a sequence of loads and stores. When bulk memory is enabled, larger constant-sized and non-constant-sized memcpy/memmove are lowered to memory.copy instructions, on the assumption that the engine can optimize those better than it can optimize calls to a compiled memcpy function.
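A rough illustration of that split (the exact size threshold and the resulting instruction sequences are toolchain details, shown here only as an assumption):

```c
#include <stdint.h>
#include <string.h>

/* A small, constant-size copy can become a couple of plain wasm
 * loads/stores, while a variable-size copy becomes a single
 * memory.copy instruction when bulk memory is enabled. */
void small_fixed(uint8_t *dst, const uint8_t *src) {
    memcpy(dst, src, 8);        /* e.g. one 64-bit load + store */
}

void variable(uint8_t *dst, const uint8_t *src, size_t n) {
    memmove(dst, src, n);       /* lowered to memory.copy */
}
```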
Yes, from what I gather there were two main drivers of disallowing partial writes: …
I respectfully see the first issue as an implementation issue, so less important than making sure the semantics are right w.r.t. existing semantics. Additionally, I don't expect users of …

I am sympathetic to the goal of maximizing determinism, but I think in this case it increases complexity and decreases potential performance gains for little tangible benefit. Determinism is also not a promise that can be efficiently kept in the presence of a potential …

Edit: Another reason why I don't think maximizing determinism is strictly necessary in this case is that most users of WASM will never directly interact with the bulk memory instructions (unlike users of JS). Most users will use high-level languages that defer to memory.copy, and those high-level functions already have partial-write semantics, e.g. …
That's a good point. I guess it depends on what is less surprising to most implementers (standard library, compiler, runtime). For what it's worth, I was surprised that …
Just to make sure this information doesn't get lost: my understanding was that at the end of the discussion on #111, we decided that in the presence of a hypothetical concurrent …

On a similar theme, when all the scenarios are played out, making partial writes always visible (instead of just an edge case of a hypothetical racing …) …

The zero-length write trap issue is somewhat separate; neither choice seems to violate any major design principles. It is somewhat of a red herring to align the semantics here with …
@rianhunter I think we've iterated on the memory.copy semantics enough at this point that we would want a very compelling reason to consider changing them again. A performance issue could potentially be bad enough to make changing the semantics an attractive option, but without benchmark data of some sort you're going to have a very hard time making a strong enough case to get most people on board. @eqrion gave some evidence that there is not much of a performance issue in #145 (comment), so I think a reasonable next step would be for you (or anyone else who is interested) to put together some sort of measurement of how much of a performance win your suggested semantics would be. That way we could continue this discussion with concrete data.
That's not necessarily true. If so desired, you can emulate the deterministic behavior of the bulk memory instructions in the single-threaded case using the non-deterministic instructions like so:
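A minimal sketch of such an emulation, assuming the underlying copy has partial-write semantics (`linear_memory`, `wasm_trap`, and the use of `memmove` as a stand-in for `memory.copy` are my own illustrative choices):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the wasm linear memory and trap mechanism. */
static uint8_t linear_memory[65536];
#define MEMORY_SIZE_BYTES ((uint64_t)sizeof(linear_memory))

static void wasm_trap(void) { abort(); }

/* Deterministic copy: check both ranges up front and trap before any
 * byte is written, then defer to a partial-write-permitted copy. */
static void checked_copy(uint32_t dst, uint32_t src, uint32_t len) {
    if ((uint64_t)dst + len > MEMORY_SIZE_BYTES ||
        (uint64_t)src + len > MEMORY_SIZE_BYTES)
        wasm_trap();                       /* no bytes written on failure */
    memmove(linear_memory + dst, linear_memory + src, len);
}
```

The point is simply that an explicit up-front check restores trap-before-any-write behavior when a program actually wants it.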
Note that you cannot implement the partial-copy version in terms of the non-partial-copy version. As I mentioned, I don't think there are many use-cases for the non-partial-copy version but it can be achieved if so desired.
That's not necessarily true either. Consider this code:

…

If you call this function with … Which is unfortunate, though this is what I expect most codebases to do if they are porting their existing code to wasm.
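As a hypothetical stand-in for the kind of code being discussed (my illustration, not the snippet from the comment): idiomatic C that ends up calling memmove with a zero length and a one-past-the-end pointer, for example when erasing the last element of a dynamic array.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    int *data;    /* heap-allocated array of len elements */
    size_t len;
} int_vec;

/* Remove element i by shifting the tail down one slot. */
void vec_erase(int_vec *v, size_t i) {
    /* When i == v->len - 1 the source pointer is one past the end of
     * the allocation and the length is 0.  If that allocation happens
     * to end exactly at the end of wasm linear memory, a memory.copy
     * that traps on zero-length out-of-bounds accesses would trap
     * here, even though the C-level call writes nothing. */
    memmove(&v->data[i], &v->data[i + 1], (v->len - i - 1) * sizeof(int));
    v->len--;
}
```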
@tlively That's fair. For what it's worth, I don't think this is simply a performance issue anymore but an issue of consistent and extensible semantics.
I think if either src or dest are invalid pointers, a zero-length memmove is still undefined behaviour. A Wasm-compiled version trapping is permitted by the spec.
I see that it says that on https://en.cppreference.com/w/c/string/byte/memmove, but I do not find that in the C11 (or earlier) standard: http://port70.net/~nsz/c/c11/n1570.html#7.24.2.2, nor do I see that in the POSIX standard https://pubs.opengroup.org/onlinepubs/9699919799/functions/memmove.html. The code I posted is idiomatic C and no implementation of …

With a sufficient number of users of an API, …
In the standard at least, it seems to be here: http://port70.net/~nsz/c/c11/n1570.html#7.24.1p2 (7.24.1/2), as a blanket rule covering all functions under the string.h header. I can't comment on implementation expectations, but it wouldn't be the first time real-world code has relied on undefined behaviour.
For the sake of being thorough, I don't think that applies here because the meaning of "valid pointer" is defined in 7.1.4: "If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined. If a function argument is described as being an array, the pointer actually passed to the function shall have a value such that all address computations and accesses to objects (that would be valid if the pointer did point to the first element of such an array) are in fact valid."
I agree there's a slight ambiguity here, but malloc(0) is specified as returning either null or a pointer which can never successfully be dereferenced (http://port70.net/~nsz/c/c11/n1570.html#7.22.3p1). I read the quoted language as specifying that the passed pointer must have "access-validity-parity" with an array of the function argument type, even if the library function doesn't actually carry out any accesses. As an extra piece of intuition, values of an array type must have at least one element (at least in this context, as per http://port70.net/~nsz/c/c11/n1570.html#6.2.5p20), hence the language "if the pointer did point to the first element of such an array".
So long as the C/C++-level code invokes undefined behaviour, it's hard to talk about "consistency" at the Wasm level in a concrete way. Does this code snippet correspond to reasonable C/C++?

EDIT: I think this is getting into the weeds a bit. I'm happy to take this to email if you'd find it interesting to have a longer-form discussion; it's tangential to my research area.
That is true
Reasonable by some definition of reasonable! E.g. imagine an allocator that had exhausted all of its space and was requested an array of 0 objects. Instead of calling …
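A small sketch of one reading of that scenario (illustrative only; the allocator and its names are made up): a bump allocator whose arena is already full, handing back its end pointer for a zero-byte request.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t arena[1024];
static size_t used = sizeof(arena);   /* arena already exhausted */

static void *bump_alloc(size_t n) {
    if (n > sizeof(arena) - used)
        return NULL;
    void *p = &arena[used];            /* for n == 0, one past the end */
    used += n;
    return p;
}
```

Code that later passes such a pointer (with length 0) to memmove is exactly the case under discussion.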
According to changes in the spec (WebAssembly/bulk-memory-operations#124, WebAssembly/bulk-memory-operations#145), we unfortunately can't fold to a nop even for memory.copy(x, y, 0). So this PR reverts all reductions to nop, but does so only under the ignoreImplicitTraps flag.
According to the Overview.md, memory.fill and memory.copy trap if "the destination offset plus size is greater than the length of the target memory"
According to the test at bulk-memory-operations/test/core/bulk.wast (line 52 in ffdbb6e), this trap is expected even when the number of bytes to write is zero.
This forces an implementation to insert an explicit check against the destination offset to see if it will overflow, even ones that use interrupts to detect memory traps. This hurts performance.
Can this requirement be relaxed to "the destination offset plus size is greater than the length of the target memory and the number of bytes to write is non-zero"? Since this is a degenerate case, I don't think this will meaningfully change the spec and it allows for more efficient implementations.
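For concreteness, a sketch of the current condition versus the relaxed condition being proposed (illustrative C pseudocode, not spec text):

```c
#include <stdbool.h>
#include <stdint.h>

/* Current spec: trap whenever dst + len runs past the end of memory,
 * even when len == 0. */
static bool must_trap_current(uint64_t dst, uint64_t len, uint64_t mem_size) {
    return dst + len > mem_size;
}

/* Proposed relaxation: a zero-length write never traps, so engines that
 * catch out-of-bounds stores via signal handling need no explicit check
 * for that case. */
static bool must_trap_proposed(uint64_t dst, uint64_t len, uint64_t mem_size) {
    return len != 0 && dst + len > mem_size;
}
```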