This repository has been archived by the owner on Nov 3, 2021. It is now read-only.

Trapping on zero-length writes? #145

Closed
rianhunter opened this issue Mar 22, 2020 · 21 comments

Comments

@rianhunter
Contributor

rianhunter commented Mar 22, 2020

According to the Overview.md, memory.fill and memory.copy trap if "the destination offset plus size is greater than the length of the target memory"

According to the test

(assert_trap (invoke "fill" (i32.const 0x10001) (i32.const 0) (i32.const 0)))

this means that it will trap even if the size of the write is 0.

This forces an implementation to insert an explicit check against the destination offset to see if it will overflow, even in implementations that use interrupts to detect memory traps. This hurts performance.

Can this requirement be relaxed to "the destination offset plus size is greater than the length of the target memory and the number of bytes to write is non-zero"? Since this is a degenerate case, I don't think this will meaningfully change the spec and it allows for more efficient implementations.
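The difference between the two trap conditions can be sketched in C (a hypothetical illustration; the function names are mine, not from the spec or any engine):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the two trap conditions for memory.fill/memory.copy.
 * 64-bit arithmetic is used so dst + len cannot wrap around. */

/* Current spec: trap whenever dst + len exceeds memory size, even if len == 0. */
static bool traps_current(uint64_t dst, uint64_t len, uint64_t mem_size) {
    return dst + len > mem_size;
}

/* Proposed relaxation: zero-length writes never trap. */
static bool traps_relaxed(uint64_t dst, uint64_t len, uint64_t mem_size) {
    return len != 0 && dst + len > mem_size;
}
```

With one 64 KiB page (mem_size = 0x10000), a zero-length fill at offset 0x10001 traps under the current rule but not under the relaxed one.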

@eqrion
Contributor

eqrion commented Mar 22, 2020

Can this requirement be relaxed to "the destination offset plus size is greater than the length of the target memory and the number of bytes to write is non-zero"? Since this is a degenerate case, I don't think this will meaningfully change the spec and it allows for more efficient implementations.

This was previously discussed in #124, and was actually changed to the current behavior in order to simplify the spec.

Can you explain how this hurts performance? The spec mandates a bounds check before any writes are performed (#111), so performing a copy without a temporary and relying only on signal handling is not feasible.

And so if you're performing an explicit bounds check of `offset + len < memory_size`, it is actually simpler if you don't need to special-case `len == 0`.

@rianhunter
Contributor Author

Can you explain how this hurts performance? The spec mandates a bounds check before any writes are performed (#111), so performing a copy without a temporary and relying only on signal handling is not feasible.

That's not true. If you do a backwards-copy (i.e. starting from the end rather than the front) you can avoid a bounds check and not partially write any memory in the case of a partially invalid copy.
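A minimal sketch of the backwards write being described (my own illustration, not engine code): by storing the highest address first, an out-of-bounds range faults on the very first store, before any in-bounds byte has been modified.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill len bytes at dst with val, writing from the end toward the front.
 * If the tail of the range is unmapped, the first store faults and no
 * in-bounds byte is ever written, so a signal-handling implementation
 * needs no explicit bounds check for len > 0. */
static void fill_backwards(uint8_t *dst, uint8_t val, size_t len) {
    for (size_t i = len; i > 0; i--)
        dst[i - 1] = val;
}
```

Note that for len == 0 the loop body never executes and no address is touched, which is exactly why the current always-trap rule needs an extra explicit check.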

And so if you're performing an explicit bounds check of `offset + len < memory_size`, it is actually simpler if you don't need to special-case `len == 0`.

Yes, that is true: in implementations that always use an explicit bounds check, the check would be slightly more complex, but only slightly so. It's a check that len is non-zero, which is relatively inexpensive.

@eqrion
Contributor

eqrion commented Mar 22, 2020

That's not true. If you do a backwards-copy (i.e. starting from the end rather than the front) you can avoid a bounds check and not partially write any memory in the case of a partially invalid copy.

You cannot always do a backwards-copy, as memory.copy must handle overlapping ranges correctly. A backwards-copy of an overlapping range where the source is after the destination will not copy correctly.

Yes, that is true: in implementations that always use an explicit bounds check, the check would be slightly more complex, but only slightly so. It's a check that len is non-zero, which is relatively inexpensive.

Yes I agree, it's a minor thing. The larger issue is the feasibility of signal handling without a temporary copy.

@rianhunter
Contributor Author

rianhunter commented Mar 22, 2020

That's not true. If you do a backwards-copy (i.e. starting from the end rather than the front) you can avoid a bounds check and not partially write any memory in the case of a partially invalid copy.

You cannot always do a backwards-copy, as memory.copy must handle overlapping ranges correctly. A backwards-copy of an overlapping range where the source is after the destination will not copy correctly.

You can unconditionally do a backwards copy for memory.init and memory.fill and conditionally in memory.copy in cases where dst >= src
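A sketch of that direction rule (a hypothetical helper written here for illustration): copy backwards when dst >= src and forwards otherwise, which handles overlap correctly in both cases.

```c
#include <stddef.h>
#include <stdint.h>

/* memmove-style copy that picks its direction from the pointer order.
 * dst >= src: copy from the end, so overlapping source bytes are read
 * before they are overwritten. dst < src: copy from the front, for the
 * symmetric reason. */
static void copy_overlapping(uint8_t *dst, const uint8_t *src, size_t len) {
    if (dst >= src) {
        for (size_t i = len; i > 0; i--)
            dst[i - 1] = src[i - 1];
    } else {
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
    }
}
```

Only the dst >= src branch gets the fault-before-partial-write property of a backwards copy; the forward branch still needs another strategy (or an explicit check).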

@eqrion
Contributor

eqrion commented Mar 22, 2020

You can unconditionally do a backwards copy for memory.init and memory.fill and conditionally in memory.copy in cases where dst >= src

Okay, could you maybe elaborate on what exact code you want to generate, and how the spec is limiting you? Focusing on memory.copy would be useful as it's the trickiest case.

There are a lot of possible implementations here depending on whether you treat constant lengths special, branch on dst/src, emit inline loads/store, emit inline loops, make a VM call, use signal handling.

@rianhunter
Contributor Author

rianhunter commented Mar 22, 2020

Okay, could you maybe elaborate on what exact code you want to generate, and how the spec is limiting you? Focusing on memory.copy would be useful as it's the trickiest case.

Hmm, I thought I was pretty clear already. In WASM implementations that detect out-of-bounds memory traps using CPU interrupts on bad memory accesses (via signal handling or otherwise), requiring a trap on zero-length writes forces those implementations to add an additional bounds check that they wouldn't otherwise need. This applies not only to memory.copy but also to memory.init and memory.fill.

Addressing your initial question: the requirement to avoid partial writes has no bearing on the feasibility of using an interrupt-based trapping mechanism since you can perform the write in reverse.

There are a lot of possible implementations here depending on whether you treat constant lengths special, branch on dst/src, emit inline loads/store, emit inline loops, make a VM call, use signal handling.

With the notable exception of eliding bounds checks in the instances where the runtime can detect ahead-of-time that the parameters don't cause an overflow (e.g. constant values, loop invariants), what I'm suggesting always applies regardless of those different configurations.

@eqrion
Contributor

eqrion commented Mar 23, 2020

Okay, could you maybe elaborate on what exact code you want to generate, and how the spec is limiting you? Focusing on memory.copy would be useful as it's the trickiest case.

Hmm, I thought I was pretty clear already. In WASM implementations that detect out-of-bounds memory traps using CPU interrupts on bad memory accesses (via signal handling or otherwise), requiring a trap on zero-length writes forces those implementations to add an additional bounds check that they wouldn't otherwise need. This applies not only to memory.copy but also to memory.init and memory.fill.

Yes I understand that, I'm just trying to understand the impact on performance that not special casing length=0 has to see if it's worth reverting the spec change.

SpiderMonkey has two different paths for memory.copy/fill currently. If the length is a small constant != 0, we load all the source bytes to registers/stack, then write to the destination in reverse order. Else, we emit a VM call which performs a bounds check and then uses the system memmove/memset.

It's possible that dropping the bounds check and using signal handling in the VM call (therefore allowing length=0 OOBs) could improve performance.

However, the VM call path has appeared to be heavily memory-bound so far. Our inline path handles the small copies where overhead really kills us, and it knows the constant value of length, so it handles length=0 efficiently.

If you have performance data, that would really help here. I'm sure our implementation could be improved, but I haven't seen length=0 behavior blocking performance improvements.

@rianhunter
Contributor Author

If you have performance data, that would really help here. I'm sure our implementation could be improved, but I haven't seen length=0 behavior blocking performance improvements.

I'm opening the issue with the understanding that the performance improvements are potential and abstract. I could produce a microbenchmark where the performance improvement would be tangible, e.g. calling memory.fill with a relatively short len in a tight loop, but it's less clear how significant an impact dropping the conditional branch would be in general. It really depends on the use case.

The major point is that the current spec unavoidably prevents an optimization from which some use cases may benefit. We can at least agree to that. Whether those use cases are worth considering, that's not really my concern. I raised this issue with the understanding that the bulk operations have been specced for the purposes of maximizing performance and if that is true, this particular detail hinders that goal.

@rianhunter
Contributor Author

rianhunter commented Mar 23, 2020

Okay I may be screaming into the void here but after reading some of the discussions related to this issue I think there are deeper semantic issues with the current spec beyond performance issues. I humbly offer my recommendations:

  • Allow partial writes
  • Never trap on zero-length writes
  • LLVM/clang should not materialize small constant-sized memcpy/memmove into memory.copy
  • Rename memory.copy to memory.move

Some rationale:

Allowing partial writes is the only spec that is consistent across single-threaded and multi-threaded contexts. This would also bring it semantically in line with memmove().

Never trapping on zero-length writes makes memory.copy semantically equivalent to memmove() which seems highly desirable considering memory.map may exist in the future.

LLVM should not use memory.copy since it has more information than memory.copy does and can do a better job of optimizing it. I think memory.copy is best suited for large and/or unknown-sized memory copies.

Since memory.copy allows for overlapping memory it should have a consistent naming convention with the C functions used for the same purpose.


I realize these recommendations would result in some churn on the spec, so I offer them lightly, with modest expectations. I'm just offering these suggestions in case there was any doubt about the current spec.

@binji
Member

binji commented Mar 23, 2020

Allowing partial writes is the only spec that is consistent across single-threaded and multi-threaded contexts. This would also bring it semantically in line with memmove().

There's a long discussion on #111 about this, perhaps it would be useful to state how you disagree with the conclusions there? In particular, it sounds like the primary design choice here was to reduce non-determinism, at a small up-front cost. (That said, I was away during most of this discussion, so I'm less familiar with how things progressed.)

LLVM/clang should not materialize small constant-sized memcpy/memmove into memory.copy

My understanding from @tlively is that this is already the case.

Rename memory.copy to memory.move

I see the benefit of trying to align with C's memmove. But I find that "move" doesn't quite convey the right behavior here, since it implies that the source is destroyed in some way. Maybe less so for memory.move, but if we renamed table.copy to table.move I might assume that the source is nulled.

@tlively
Member

tlively commented Mar 23, 2020

LLVM/clang should not materialize small constant-sized memcpy/memmove into memory.copy

My understanding from @tlively is that this is already the case.

Yes, small constant-sized memcpy/memmove are lowered to a sequence of loads and stores. When bulk memory is enabled, larger constant-sized and non-constant sized memcpy/memmove are lowered to memory.copy instructions on the assumption that the engine can optimize those better than it can optimize calls to a compiled memcopy function.

@rianhunter
Contributor Author

rianhunter commented Mar 23, 2020

Allowing partial writes is the only spec that is consistent across single-threaded and multi-threaded contexts. This would also bring it semantically in line with memmove().

There's a long discussion on #111 about this, perhaps it would be useful to state how you disagree with the conclusions there? In particular, it sounds like the primary design choice here was to reduce non-determinism, at a small up-front cost. (That said, I was away during most of this discussion, so I'm less familiar with how things progressed.)

Yes from what I gather there were two main drivers of disallowing partial writes:

  • Being able to easily generate inline code for short-writes from @eqrion
  • Maximizing determinism from @rossberg

I respectfully see the first issue as an implementation concern, and so less important than making sure the semantics are right w.r.t. existing semantics. Additionally, I don't expect users of memory.copy to have the expectation that small writes are the targeted optimized case, excluding high-level users of memmove()/memcpy(), who expect their compiler to generate optimal code in those cases (i.e. not memory.copy).

I am sympathetic to the goal of maximizing determinism but I think in this case it increases complexity and decreases potential performance gains for little tangible benefit. Determinism is also not a promise that can be efficiently kept in the presence of a potential memory.shrink or memory.map coupled with threads. If those instructions are ever specified, then memory.copy as it exists now will likely fall out of favor for a newer instruction that allows non-determinism so it can be used efficiently in multi-threaded contexts. The alternative would be to require interrupt-based trapping implementations to use the same mutex around all of memory.copy, memory.fill, memory.init, memory.map, and memory.shrink which doesn't seem necessary at the user-level.

Edit: Another reason why I don't think maximizing determinism is strictly necessary in this case is that most users of WASM will never directly interact with the bulk memory instructions (unlike users of JS). Most users will use high level languages that defer to memory.copy, and those high level functions already have partial-write semantics, e.g. memcpy()/memmove(). Only system implementers will casually have exposure to the bulk memory instructions and they are usually held to a different standard in terms of being cognizant of low-level semantics.

Rename memory.copy to memory.move

I see the benefit of trying to align with C's memmove. But I find that "move" doesn't quite convey the right behavior here, since it implies that the source is destroyed in some way. Maybe less so for memory.move, but if we renamed table.copy to table.move I might assume that the source is nulled.

That's a good point. I guess it depends on what is less surprising to most implementers (standard library, compiler, runtime). For what it's worth, I was surprised that memory.copy wasn't semantically equivalent to memcpy() but that's anecdotal and personal. I guess you convinced me otherwise. It would be nice if there were a memcpy() equivalent but probably not necessary. Linus convinced me memcpy() was unnecessary years ago https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132

@conrad-watt
Contributor

conrad-watt commented Mar 23, 2020

I am sympathetic to the goal of maximizing determinism but I think in this case it increases complexity and decreases potential performance gains for little tangible benefit. Determinism is also not a promise that can be efficiently kept in the presence of a potential memory.shrink or memory.map coupled with threads. If those instructions are ever specified, then memory.copy as it exists now will likely fall out of favor for a newer instruction that allows non-determinism so it can be used efficiently in multi-threaded contexts. The alternative would be to require interrupt-based trapping implementations to use the same mutex around all of memory.copy, memory.fill, memory.init, memory.map, and memory.shrink which doesn't seem necessary at the user-level.

Just to make sure this information doesn't get lost, my understanding was that at the end of the discussion on #111, we decided that in the presence of a hypothetical concurrent memory.shrink or memory.map, memory.copy would expose a non-deterministic partial write/copying order. This is partly because, as I think you're getting at, some implementations have a path for memory.copy that involves a bounds check followed by a call to the system memmove.

On a similar theme, when all the scenarios are played out, making partial writes always visible (instead of just an edge-case of a hypothetical racing memory.shrink or memory.map) seems to push us towards even single-threaded programs having non-deterministic results in the trap case, which I think is highly desirable to avoid.

The zero-length write trap issue is somewhat separate; neither choice seems to violate any major design principles. It is somewhat of a red herring to align the semantics here with memmove though, because it's only safe to delegate to memmove when you're certain that the copy is in-bounds, requiring an ahead-of-time bounds check anyway. As @eqrion mentioned, this check is simpler with the current semantics.
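The VM-call shape being described might look like this (a hedged sketch with invented names; real engines differ): an up-front bounds check, then delegation to the system memmove, which is only legal because the whole range is already known to be in bounds.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical VM helper for memory.copy: check both ranges, then call
 * memmove. Returns false to signal a trap. Note the check fires even for
 * len == 0 when an offset is past the end, matching the current spec. */
static bool vm_memory_copy(uint8_t *mem, uint64_t mem_size,
                           uint64_t dst, uint64_t src, uint64_t len) {
    if (dst + len > mem_size || src + len > mem_size)
        return false; /* trap: out of bounds */
    memmove(mem + dst, mem + src, len);
    return true;
}
```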

@tlively
Member

tlively commented Mar 23, 2020

@rianhunter I think we've iterated on the memory.copy semantics enough at this point that we would want a very compelling reason to consider changing them again. A performance issue could potentially be bad enough to make changing the semantics an attractive option, but without benchmark data of some sort you're going to have a very hard time making a strong enough case to get most people on board. @eqrion gave some evidence that there is not much of a performance issue in #145 (comment), so I think a reasonable next step would be for you (or anyone else who is interested) to put together some sort of measurement of how much of a performance win your suggested semantics would be. That way we could continue this discussion with concrete data.

@rianhunter
Contributor Author

rianhunter commented Mar 23, 2020

On a similar theme, when all the scenarios are played out, making partial writes always visible (instead of just an edge-case of a hypothetical racing memory.shrink or memory.map) seems to push us towards even single-threaded programs having non-deterministic results in the trap case, which I think is highly desirable to avoid.

That's not necessarily true. If so desired, you can emulate the deterministic bulk memory instructions in the single-threaded case using the non-deterministic instructions like so:

(func $nonpartial.memory.fill
  (param $dst i32) (param $val i32) (param $cnt i32)
  (if (i32.gt_u (i32.add (local.get $dst) (local.get $cnt))
                (i32.mul (i32.const 65536) (memory.size)))
    (then (unreachable))
    (else (memory.fill (local.get $dst) (local.get $val) (local.get $cnt))))
)

Note that you cannot implement the partial-copy version in terms of the non-partial-copy version. As I mentioned, I don't think there are many use-cases for the non-partial-copy version but it can be achieved if so desired.

The zero-length write trap issue is somewhat separate; neither choice seems to violate any major design principles. It is somewhat of a red herring to align the semantics here with memmove though, because it's only safe to delegate to memmove when you're certain that the copy is in-bounds, requiring an ahead-of-time bounds check anyway. As @eqrion mentioned, this check is simpler with the current semantics.

That's not necessarily true either. Consider this code:

void *malloc(size_t n) {
  if (!n) return (void *)-1;
  /* ... */
}

void *memcpy(void *dest, void *src, size_t n) {
  __wasm_builtin_memory_copy(dest, src, n);
  return dest;
}

struct foo *copy_foos(struct foo *input, size_t n_foos) {
  struct foo *result;
  size_t buf_size = n_foos * sizeof(struct foo); // assume doesn't overflow
  result = malloc(buf_size);
  if (!result && n_foos)
    return NULL;
  memcpy(result, input, buf_size);
  return result;
}

If you call this function with n_foos == 0, the C standard allows malloc(0) to return any pointer value. In this case it returns (void *)-1 (which is not an uncommon construct). Ordinarily that would be fine since memcpy() doesn't touch the pointer if n == 0, but in this case it would trap. To fix this you'd have to implement memcpy() as:

void *memcpy(void *dest, void *src, size_t n) {
  if (!n) return dest;
  __wasm_builtin_memory_copy(dest, src, n);
  return dest;
}

Which is unfortunate, though this is what I expect most codebases to do if they are porting their existing code to wasm.

I think we've iterated on the memory.copy semantics enough at this point that we would want a very compelling reason to consider changing them again. A performance issue could potentially be bad enough to make changing the semantics an attractive option, but without benchmark data of some sort you're going to have a very hard time making a strong enough case to get most people on board.

@tlively That's fair. For what it's worth I don't think this is simply a performance issue anymore but an issue of consistent and extensible semantics.

@conrad-watt
Contributor

If you call this function with n == 0, the C standard allows malloc(0) to return any pointer value. In this case it returns (void *)-1 (which is not an uncommon construct). Ordinarily that would be fine since memcpy() doesn't touch the pointer if n == 0, but in this case it would trap.

I think if either src or dest are invalid pointers, a zero-length memmove is still undefined behaviour. A Wasm-compiled version trapping is permitted by the spec.

@rianhunter
Contributor Author

If you call this function with n == 0, the C standard allows malloc(0) to return any pointer value. In this case it returns (void *)-1 (which is not an uncommon construct). Ordinarily that would be fine since memcpy() doesn't touch the pointer if n == 0, but in this case it would trap.

I think if either src or dest are invalid pointers, a zero-length memmove is still undefined behaviour. A Wasm-compiled version trapping is permitted by the spec.

I see that it says that on https://en.cppreference.com/w/c/string/byte/memmove but I do not find that in the C11 (or earlier) standard: http://port70.net/~nsz/c/c11/n1570.html#7.24.2.2, nor do I see that in the POSIX standard https://pubs.opengroup.org/onlinepubs/9699919799/functions/memmove.html.

The code I posted is idiomatic C, and no implementation of memmove() of which I'm aware traps when n == 0 and the pointer wasn't dereferenceable. Trapping in situations where decades of previous implementations have not trapped seems like a violation of "Hyrum's Law":

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

@conrad-watt
Contributor

In the standard at least, it seems to be here http://port70.net/~nsz/c/c11/n1570.html#7.24.1p2 (7.24.1/2), as a blanket rule covering all functions under the string.h header.

I can't comment on implementation expectations, but it wouldn't be the first time real world code has relied on undefined behaviour.

@rianhunter
Contributor Author

rianhunter commented Mar 23, 2020

In the standard at least, it seems to be here http://port70.net/~nsz/c/c11/n1570.html#7.24.1p2 (7.24.1/2), as a blanket rule covering all functions under the string.h header.

For the sake of being thorough, I don't think that applies here because the meaning of "valid pointer" is defined in 7.1.4:

"If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined. If a function argument is described as being an array, the pointer actually passed to the function shall have a value such that all address computations and accesses to objects (that would be valid if the pointer did point to the first element of such an array) are in fact valid."

malloc(0) always returns a valid pointer by this definition, since the pointer returned is an array and calling memmove() with n == 0 doesn't request an invalid access of that array. If you didn't want to consider it an array, and instead a pointed-to object (which I think is the wrong interpretation), you could consider the pointer outside of the address space of the program in WASM's case, but then you have an inconsistency where this doesn't trap:

(memory.fill (i32.mul (i32.const 65536) (memory.size)) (i32.const 0) (i32.const 0))

@conrad-watt
Contributor

conrad-watt commented Mar 23, 2020

In the standard at least, it seems to be here http://port70.net/~nsz/c/c11/n1570.html#7.24.1p2 (7.24.1/2), as a blanket rule covering all functions under the string.h header.

For the sake of being thorough, I don't think that applies here because the meaning of "valid pointer" is defined in 7.1.4:

"If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined. If a function argument is described as being an array, the pointer actually passed to the function shall have a value such that all address computations and accesses to objects (that would be valid if the pointer did point to the first element of such an array) are in fact valid."

malloc(0) always returns a valid pointer by this definition, since the pointer returned is an array and n==0 doesn't request an invalid access of that array.

I agree there's a slight ambiguity here, but malloc(0) is specified as returning either null or a pointer which can never successfully be dereferenced (http://port70.net/~nsz/c/c11/n1570.html#7.22.3p1). I read the quoted language as specifying that the passed pointer must have "access-validity-parity" with an array of the function argument type, even if the library function doesn't actually carry out any accesses. As an extra piece of intuition, values of an array type must have at least one element (at least in this context, as per http://port70.net/~nsz/c/c11/n1570.html#6.2.5p20), hence the language "if the pointer did point to the first element of such an array".

you could consider the pointer outside of the address space of the program in WASM's case but then you have an inconsistency where this doesn't trap:

(memory.fill (i32.mul (i32.const 65536) (memory.size)) (i32.const 0) (i32.const 0))

So long as the C/C++-level code invokes undefined behaviour, it's hard to talk about "consistency" at the Wasm level in a concrete way. Does this code snippet correspond to reasonable C/C++?

EDIT: I think this is getting into the weeds a bit. I'm happy to take this to email if you'd find it interesting to have a longer-form discussion; it's tangential to my research area.

@rianhunter
Contributor Author

As an extra piece of intuition, values of an array type must have at least one element (at least in this context, as per http://port70.net/~nsz/c/c11/n1570.html#6.2.5p20), hence the language "if the pointer did point to the first element of such an array".

That is true

So long as the C/C++-level code invokes undefined behaviour, it's hard to talk about "consistency" at the Wasm level in a concrete way. Does this code snippet correspond to reasonable C/C++?

Reasonable by some definition of reasonable! E.g. imagine an allocator that had exhausted all of its space and was asked for an array of 0 objects. Instead of calling memory.grow, it could just return a pointer to the top of the heap under the assumption that it would never be dereferenced. That pointer would be invalid in the sense that it's outside the address space of the running WASM process, but it wouldn't trap if memory.fill were called on it with n == 0. For what it's worth, it's little quirks like that which are activating my spidey sense here.

kripken pushed a commit to WebAssembly/binaryen that referenced this issue Aug 24, 2020
According to changes in the spec:
WebAssembly/bulk-memory-operations#124
WebAssembly/bulk-memory-operations#145

we unfortunately can't fold memory.copy(x, y, 0) to a nop.

So this PR reverts all reductions to nop, but does so only under the ignoreImplicitTraps flag.