Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic and thread operators #1019

Closed
wants to merge 23 commits into from
Closed

Atomic and thread operators #1019

wants to merge 23 commits into from

Conversation

binji
Copy link
Member

@binji binji commented Mar 24, 2017

This is an initial pass at adding some atomic and thread operators, based on some work that @flagxor did earlier.

There's still a lot that needs to be fleshed out, but I wanted to get something going.

Copy link
Member

@jfbastien jfbastien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a way to identify post-MVP features. ⚛️ is an easy on here! We then need a page that identifies all these symbols (MVP, each post-MVP, and 🦄 as "future").

This is also missing the specification of how non-atomic accesses behave. We could just wait for the JS work to finish thought?

How do we intend to move such proposals forward?

Semantics.md Outdated
@@ -42,7 +42,7 @@ environment such as a browser, a trap results in throwing a JavaScript exception
If developer tools are active, attaching a debugger before the
termination would be sensible.

## Stack Overflow
## Stack overflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalization was inconsistent in the file, but seemed to be "Upper lower lower." Probably should be a different change though.

Semantics.md Outdated
@@ -81,7 +81,7 @@ Note that the value types `i32` and `i64` are not inherently signed or
unsigned. The interpretation of these types is determined by individual
operators.

## Linear Memory
## Linear memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Semantics.md Outdated
memory if the module's memory import doesn't specify that it allows shared
memory.

### Linear memory accesses
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalization.

Semantics.md Outdated
| `i64.atomic_xchg32_u` | 4 bytes, zero-extended i32 to i64 | nop | 4 bytes, wrapped from i64 to i32 |
| `i64.atomic_xchg` | 8 bytes | nop | 8 bytes |

The atomic compare exchange RMW operators take three operands: a linear memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda weird to classify cmpxchg as RMW. It's a 3rd category IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Semantics.md Outdated
@@ -211,7 +356,7 @@ Thus, it is recommend that WebAssembly producers align frequently-used data to
permit the use of natural alignment access, and use loads and stores with the
greatest alignment values practical, while always avoiding misaligned accesses.

### Out of Bounds
### Out of bounds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Semantics.md Outdated
@@ -461,7 +606,7 @@ supported (including NaN values of all possible bit patterns).
* `f32.const`: produce the value of an f32 immediate
* `f64.const`: produce the value of an f64 immediate

## 32-bit Integer operators
## 32-bit integer operators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Semantics.md Outdated
## Thread operators

* `is_lock_free`: return 1 if an atomic memory access of a given size can be
completed without the execution engine acquiring a lock. If not, return `0`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you import the wording from https://tc39.github.io/ecma262/ ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Semantics.md Outdated
@@ -678,6 +823,48 @@ outside the range which rounds to an integer in range) traps.
This trap is intended to be impossible for user code to catch or handle, even in the future when it may be possible to
handle some other kinds of traps or exceptions.

## Thread operators

* `is_lock_free`: return 1 if an atomic memory access of a given size can be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-quote 1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Semantics.md Outdated
memory.

The wait operators take three operands: a linear memory index, an
expected value, and a relative timeout in milliseconds. The wait operation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be negative?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Described below, but I moved it up.

@binji
Copy link
Member Author

binji commented Mar 25, 2017

We need a way to identify post-MVP features. ⚛️ is an easy on here! We then need a page that identifies all these symbols (MVP, each post-MVP, and 🦄 as "future").

Good point, perhaps we should open a separate issue to discuss?

This is also missing the specification of how non-atomic accesses behave. We could just wait for the JS work to finish thought?

Yeah, I was considering just referencing the ecma262 memory model, but it's not immediately obvious to a reader how it would map.

How do we intend to move such proposals forward?

I seem to remember there being a plan about this. Little help, anyone? :-)

@lukewagner
Copy link
Member

I think we're still figuring out what, post-MVP, is the process for proposing extensions. Also, with the concurrent emergence of a proper spec in the spec repo, the shape of a PR/proposal might be quite different.

@jfbastien
Copy link
Member

I'm not sure we want to block everything on the spec repo at this point in time.

@lukewagner
Copy link
Member

I guess it depends on how long each of these things (process for evolving post-MVP, spec repo, threads) take.

@flagxor
Copy link
Member

flagxor commented Mar 27, 2017

Agree we don't want to block if possible.
@lukewagner would pulling out the sections this adds into a separate Threads.md doc be easier in terms of weathering the shifting repos?


## Atomic compare exchange operators ([described here](Semantics.md#atomic-memory-accesses))

| `i32.atomic_cmpxchg` | `0xe61c` | `memory immediate` | atomic compare exchange with memory |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatting on this table is broken, possible because of the lack of a heading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thanks!

| `i32.atomic_xor8_s` | `0xe612` | `4` `memory immediate` | atomic xor with memory |
| `i32.atomic_xor8_u` | `0xe613` | `4` `memory immediate` | atomic xor with memory |
| `i32.atomic_xor16_s` | `0xe614` | `4` `memory immediate` | atomic xor with memory |
| `i32.atomic_xor16_u` | `0xe615` | `4` `memory immediate` | atomic xor with memory |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Names like i32.atomic_xor16_s are a little awkward with _s/_u apparently associated with otherwise sign-agnostic operators. What would you think of any of the following conventions?

  • i32.atomic16_s.xor
  • i32.atomic_rmw16_s.xor
  • i32.atomic.rmw16_s.xor (in this case, following atomic with . could be a broader convention)

These would make it clear that the xor isn't signed itself; it's just the loaded value from the rmw that's being sign-extended. This isn't super important with the current opcodes, but could be in the future if wasm ever wants atomic rotate, popcnt, or other operators which care about the number of bits in the actual arithmetic part.

Such a clarification would also align with an idea anticipated here concerning the possibility of adding a real add_s and similar operators (which could trap or do other things on overflow) in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like i32.atomic.rmw16_s.xor, it nicely matches the proposed opcode layout too. Though I wonder about xchg and cmpxchg:
i32.atomic.rmw16_s.xchg? And as @jfbastien mentioned, cmpxchg isn't really RMW, so i32.atomic.cmpxchg16_s? And I suppose for the non-extending forms:
i32.atomic.rmw.xor?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i32.atomic.cmpxchg16_s and i32.atomic.rmw.xor seem to fit the pattern nicely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done

@lukewagner
Copy link
Member

@flagxor I'm not sure what would be better without knowing what the intended process is for proposing and making changes. Since threads are a high priority, we could just cheat and carry on like we did pre-MVP.

| Name | Opcode | Immediate | Description |
| ---- | ---- | ---- | ---- |
| `i32.atomic_load` | `0xe628` | `memory immediate` | atomic load from memory |
| `i64.atomic_load` | `0xe629` | `memory immediate` | atomic load from memory |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all these load/store ops, could we reuse the existing load/store ops, adding a "consistency" annotation to the memory_immediate.flags?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I believe that's what was discussed before. But I had a thought it might be nice to have all atomic/threading operators use the same prefix, which led me down this route. One additional nicety is that the opcodes for atomic load/store operators match the non-atomic ones, e.g. i32.atomic_load is 0xe628 and i32.load is 0x28, etc.

Since the opcode specifies that it is atomic, this also means that we don't need a special "non-atomic" value for the memory immediate for RMW ops too.

But I'm not married to the idea, just thought it was kinda neat. :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, those are fair points. I guess we are already adding a bunch of operators, so it's not a huge difference, and there would be a natural extension for using the atomic prefix on global ops (when we eventually allow shared globals).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm partial to calling the "atomic" prefix the "lock" prefix, and numbering it 0xF0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfbastien can't tell if serious...

Semantics.md Outdated

| Name | Read | Modify | Write |
| ---- | ---- | ---- | ---- |
| `i32.atomic_cmpxchg8_s` | 1 byte, sign-extended i8 to i32 | nop | (conditional) 1 byte, wrapped from i32 to i8 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One detail that's not clear from this presentation is whether the comparison is the sign-extended value with the full 32 bits of the expected value, or the expected value wrapped to 8 bits with the 8-bit loaded value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify, ptal

@AndrewScheidecker
Copy link

What about atomic operations on globals?

@lukewagner
Copy link
Member

@AndrewScheidecker I think first we need to define a new WebAssembly.Global object for mutable globals that can be imported, exported and marked shared (symmetric to Memory); but after that it makes sense to have atomic global operations.

Semantics.md Outdated

| Name | Read | Modify | Write |
| ---- | ---- | ---- | ---- |
| `i32.atomic.rmw8_s.add` | 1 byte, sign-extended i8 to i32 | sign-agnostic addition | 1 byte, wrapped from i32 to i8 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cmpxchg clarification looks good. So good, that I'm now wondering about a Return column for the rmw table too :-). This is admittedly a really minor nit at this point because none of the current rmw operators care whether their inputs are extended or wrapped, but if we add other rmw operators in the future, they could potentially care.

Semantics.md Outdated
## Thread operators

* `is_lock_free`: If the atomic step of an atomic primitive (see
[Atomic Memory Accesses](#atomic-memory-accesses)) on a datum of size N
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is N coming from, is that an i32 operand?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Semantics.md Outdated
* `f32.atomic.store`: (no conversion) atomically store 4 bytes
* `f64.atomic.store`: (no conversion) atomically store 8 bytes

Atomic read-modify-write (RMW) operators atomically read a value from a memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you specify in which order the operands are expected on the stack for these instructions?

Copy link
Member Author

@binji binji Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find another place in the doc to crib from, so I just specified it by describing the operand in order. Should I be more explicit?

Semantics.md Outdated
| `i64.atomic.rmw32_u.xchg` | 4 bytes | nop | 4 bytes | zero-extended i32 to i64 |
| `i64.atomic.rmw.xchg` | 8 bytes | nop | 8 bytes | as i64 |

The atomic compare exchange operators take three operands: a linear memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the linear memory is an immediate, not an operand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I meant address. :-)

Semantics.md Outdated
change. It is a validation error to use these operators on non-shared linear
memory.

The wait operators take three operands: a linear memory index, an expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: as above, I suppose the memory is an immediate.


| Name | Opcode | Immediate | Description |
| ---- | ---- | ---- | ---- |
| `is_lock_free` | `0xe600` | | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we allocate prefix opcodes at the end of the opcode space instead of the middle? So 0xff for threads being the first such proposal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Semantics.md Outdated

### Wake

The wake operators take two operands: a linear memory index, and a wake count.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: same here.

| `i32.wait` | `0xe601` | `offset` | |
| `i64.wait` | `0xe602` | `offset` | |
| `i32.wake` | `0xe603` | `offset` | |
| `i64.wake` | `0xe604` | `offest` | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't these all have memory immediates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

| `i64.atomic.rmw16_u.xor` | `0xe619` | `4` `memory immediate` | atomic xor with memory |
| `i64.atomic.rmw32_s.xor` | `0xe61a` | `4` `memory immediate` | atomic xor with memory |
| `i64.atomic.rmw32_u.xor` | `0xe61b` | `4` `memory immediate` | atomic xor with memory |
| `i32.atomic.rmw.xchg` | `0xe610` | `5` `memory immediate` | atomic exchange with memory |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is xchg actually a RMW operation? Seems like it could be its own group, with names like i32.atomic.xchg16_s, in line with cmpxchg.

Semantics.md Outdated
| `i64.atomic.rmw32_s.xor` | 4 bytes | 32-bit sign-agnostic bitwise exclusive or | 4 bytes | sign-extended i32 to i64 |
| `i64.atomic.rmw32_u.xor` | 4 bytes | 32-bit sign-agnostic bitwise exclusive or | 4 bytes | zero-extended i32 to i64 |
| `i64.atomic.rmw.xor` | 8 bytes | 64-bit sign-agnostic bitwise exclusive or | 8 bytes | as i64 |
| `i32.atomic.rmw8_s.xchg` | 1 byte | nop | 1 byte | sign-extended i8 to i32 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is xchg actually a RMW operation? It seems like it would fit better with the cmpxchg group -- it's an unconditional version of it. With analogous names like i32.atomic.xchg16_s. Or as its own group.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xchg is usually considered an RMW operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with it either way, but it probably shouldn't be grouped with cmpxchg since those have 3 operands.

Copy link
Member

@rossberg rossberg Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm primarily wondering by what criterion xchg is RMW but cmpxchg is not. Some consistency would be nice.

Bikeshedding more generally, why the complicated names in the first place? Wouldn't i32.atomic.add8_s etc work just fine? Sure it's RMW, but that's kind of implied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmpxchg (compare-and-swap) is fundamentally more powerful than other RMW operations. Between that, and having a distinct type signature, it commonly gets its own category, even if it is also an RMW operation.

cmpxchg isn't just a conditional version of xchg. For example, an abstract formal semantics wouldn't be able to reduce cmpxchg to xchg when the compare succeeds; that would imply the comparison is a separate step from the swap, breaking atomicity.

add8_s under wasm's existing naming scheme would seem to imply that the add, or, and, xor, etc., are signed/unsigned, which is inconsistent with the existing operators.

Copy link
Member

@rossberg rossberg Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunfishcode, interesting, thanks for the pointer!

Regarding the naming, I see your argument. My only remaining quibble is that, even being in different expressive categories, it would still be nice to have a somewhat more symmetric naming scheme between cmpxchg and the other RMW operators.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i32.atomic.rmw8_s.cmpxchg would be the maximally symmetric convention, it seems. I don't think it's critical, but I wouldn't be opposed to it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Semantics.md Outdated
The atomic compare exchange operators take three operands: a linear memory
index, an `expected` value, and a `replacement` value. The operators
conditionally store the `replacement` value, but only if the `loaded` value is
equal to the `expected` value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of this sentence seems a little strange to me. "X conditionally does Y, but only when Z".. I think what you are trying to say is that its always conditional, and then describe said condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jfbastien
Copy link
Member

I think we've got a great opportunity to call the "atomic" prefix the "lock" prefix, and number it 0xF0.

| `i64.wake` | `0xc3` | `offest` | SAB like wake, takes an index + `offset` |
| `i32.thread_create` | `0xc4` | `table` | Creates a thread, stores thread in `table` at an i32 index. Returns 0 on success or non-zero otherwise. |
| `i32.thread_join` | `0xc5` | `table` | Waits to join to thread stored at an i32 index in `table`. Returns 0 on success or non-zero otherwise. |
| `thread_exit` | `0xc6` | | Halts current thread. |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three thread ops were dropped in 4b94ac6#diff-7c076c52f23f4a770974fb209259f34bL962 seemingly without comment (that I can find). Is there somewhere else they went?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to document this! Is the intent to start off with workers, mimic JS' API, and then add better capabilities later?

Threads.md Outdated
threads of execution. The shared memory can be imported or defined in the
module. It is a validation error to attempt to import shared linear
memory if the module's memory import doesn't specify that it allows shared
memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to flesh this out a bit more, including these sorts of details.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some of the details, PTAL (still no JS.md stuff yet)

@jfbastien
Copy link
Member

We need to specify:

  • What happens with grow_memory and WebAssembly.Memory.prototype.grow.
  • What happens to instance.export.memory.buffer that are in JS.
  • Whether current_memory is sequentially-consistent.

@binji
Copy link
Member Author

binji commented Apr 5, 2017

OK, I've added a little bit here about agents/agent clusters, mostly copied from https://tc39.github.io/ecma262/#sec-agents. Not sure how much detail we want to go into here. Gonna write up some JS.md stuff next.

...

If linear memory is marked as shared, the
[maximum memory size](Modules.md#linear-memory-section) must be specified.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same restriction for JS.md WebAssembly.Memory.

`grow_memory` of a shared linear memory is allowed, but may fail if the new
size is greater than the specified maximum size, or if reserving the additional
memory fails. All agents in the executing agent's cluster will have access to
the additional linear memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to discuss neutering?

@hemobo
Copy link

hemobo commented Apr 6, 2017

@jfbastien, can you motivate why you insist on 0xf0? It seems strictly worse than starting from the end.

Yay, bikeshedding! Count me in!

Reserving the last row of 16 bytes for extension prefixes sounds like a good idea to me. In particular because the potential cost of some future version running out of possibilities for extension seems far higher than the cost of putting some of the more rare prospective one byte opcodes behind a prefix due to lack of space in the lower bytes.

Putting the first extension byte at 0xf0 would be a nice way to mark off that reserved range. Opcode space fragmentation would then be a feature here, not a bug, keeping room for (possibly unanticipated) future extensions.

The only freedom gained by going backwards from 0xff (i.e. defragmentation) seems to be to allow for the possibility of one byte opcodes 'bleeding' into that reserved range, which I think is exactly what one would like to prevent.

Also, extension opcodes could still grow from 0xf0 backwards if at some point in the future it becomes clear that more than 16 prefixes are needed and one byte opcodes haven't filled the lower bytes yet. So one doesn't give up any growth potential for extensions here compared to starting from 0xff and going backwards, the only difference is the explicit reservation of the last row of bytes.

It also feels kind of awkward to spec all the extension opcodes backwards, in a list of operations one would either end up with features running from most obscure to most common, or otherwise with opcodes being listed backwards for extension prefixes and the other way around for the rest... just feels icky :)

@rossberg
Copy link
Member

rossberg commented Apr 7, 2017

One question that also needs answering is how shared memories interact with instantiation. In particular, what is the amount of atomicity, if any, that is provided for writes performed for data sections?

@jfbastien
Copy link
Member

jfbastien commented Apr 7, 2017

One question that also needs answering is how shared memories interact with instantiation. In particular, what is the amount of atomicity, if any, that is provided for writes performed for data sections?

Good point. Ordering of these as well (i.e. are they nonatomic writes, from start to end, or are they sequentially consistent?).

I assume we don't intend to allow shared Tables for now?

@lukewagner
Copy link
Member

w.r.t. misaligned atomic accesses trapping:

  • Does this match the behavior of any of the major ISAs? Or do they "succeed" but fail to provide their atomicity guarantees?
  • Will trapping atomics end up leading to some of the same double-branching that we've seen with the trapping float operations (where user code branches on alignment to avoid the trap, and then the engine branches again to trap)?
  • Each trapping instruction has some small amount of overhead (metadata/code) to record the (semi-)precise trapping bytecode offset (for error reporting). For codes with millions of heap accesses, this adds up to several MB for the current out-of-bounds heap access checks. Since separate instructions would check for misalignment vs. out-of-bounds, I expect trapping semantics will incur some amount of size overhead.

@jfbastien
Copy link
Member

w.r.t. misaligned atomic accesses trapping:

  • Does this match the behavior of any of the major ISAs? Or do they "succeed" but fail to provide their atomicity guarantees?

Some ISAs "just work", others tear (sometimes just when crossing cachelines), others silently align, and others trap. It's unclear which of these ISAs we care about.

  • Will trapping atomics end up leading to some of the same double-branching that we've seen with the trapping float operations (where user code branches on alignment to avoid the trap, and then the engine branches again to trap)?

No: there's no "polyfill" possible. having a side-lock for misaligned accesses won't do the right thing.

  • Each trapping instruction has some small amount of overhead (metadata/code) to record the (semi-)precise trapping bytecode offset (for error reporting). For codes with millions of heap accesses, this adds up to several MB for the current out-of-bounds heap access checks. Since separate instructions would check for misalignment vs. out-of-bounds, I expect trapping semantics will incur some amount of size overhead.

We'd need to quantify how many atomic instructions actually occur. I expect very few, so low cost. Would be good to confirm.

@lukewagner
Copy link
Member

Some ISAs "just work", others tear (sometimes just when crossing cachelines), others silently align, and others trap. It's unclear which of these ISAs we care about.

What about just x86/x64/ARM32/ARM64/MIPS?

No: there's no "polyfill" possible. having a side-lock for misaligned accesses won't do the right thing.

Well, not a polyfill that provides atomicity, just a path that does the load non-atomically w/o trapping. Auto-aligning via mask (like asm.js+SAB would do) would be another option.

@jfbastien
Copy link
Member

jfbastien commented Apr 7, 2017

Some ISAs "just work", others tear (sometimes just when crossing cachelines), others silently align, and others trap. It's unclear which of these ISAs we care about.

What about just x86/x64/ARM32/ARM64/MIPS?

The behavior I described is covered by just x86 and ARM. Which ARM is another question. The auto-align and trapping ones aren't the mainstream ones AFAIK (or rather, the ones which run browsers, definitely true for "embedded" types ARMs).

No: there's no "polyfill" possible. having a side-lock for misaligned accesses won't do the right thing.

Well, not a polyfill that provides atomicity, just a path that does the load non-atomically w/o trapping. Auto-aligning via mask (like asm.js+SAB would do) would be another option.

If the intent is to have atomicity, except the code is OK without atomicity, then... Why even bother with atomicity? The polyfill you describe makes no sense.

We could auto-align, but I don't see how that's useful compared to trapping. Trapping is much more consistent with what WebAssembly otherwise does, and as I mentioned above I think the cost is minimal given how few of these instructions I expect to see.

@lukewagner
Copy link
Member

Again, I'm not talking about a "polyfill"; I'm wondering whether we'd get into a situation (as we've seen with the trapping conversion ops) where a compiler is forced to keep code "working" (i.e, not trapping) by adding its own branches/masks so that unaligned atomic accesses don't trap.

We could auto-align, but I don't see how that's useful compared to trapping.

I'm totally not suggesting that; I'm anti-suggesting that, even. I'm trying to determine whether it's better for us to not trap since, it sounds like, on all our tier 1 platforms, that's what the ISA gives us and trapping just adds several minor sources of overhead.

@jfbastien
Copy link
Member

Again, I'm not talking about a "polyfill"; I'm wondering whether we'd get into a situation (as we've seen with the trapping conversion ops) where a compiler is forced to keep code "working" (i.e, not trapping) by adding its own branches/masks so that unaligned atomic accesses don't trap.

That's what I was calling a polyfill. I don't think it can be made to work.

We could auto-align, but I don't see how that's useful compared to trapping.

I'm totally not suggesting that; I'm anti-suggesting that, even.

Sure, I'm listing possibilities.

I'm trying to determine whether it's better for us to not trap since, it sounds like, on all our tier 1 platforms, that's what the ISA gives us and trapping just adds several minor sources of overhead.

I understand what you're asking. I just don't understand which alternative you prefer out of the ones I listed. All our tier 1 platforms don't have the same behavior.

@lukewagner
Copy link
Member

That's what I was calling a polyfill. I don't think it can be made to work.

It depends what you mean by "work". If "work" means "loads the bytes as if by a non-atomic load", then sure it can.

I just don't understand which alternative you prefer out of the ones I listed. All our tier 1 platforms don't have the same behavior.

Ok, maybe I've misunderstood the cases here. I assumed there are three: trap, auto-align (boo!), do the unaligned load (without atomicity guarantees). Is that right? If so, then when you said above "The auto-align and trapping ones aren't the mainstream ones AFAIK", I assumed that meant all the tier 1 platforms are in the last case.

@jfbastien
Copy link
Member

jfbastien commented Apr 10, 2017

That's what I was calling a polyfill. I don't think it can be made to work.

It depends what you mean by "work". If "work" means "loads the bytes as if by a non-atomic load", then sure it can.

But then it's non-atomic. Why would a user bother with atomics if their code is OK not being atomic? I'm totally happy making that type of code slower.

I just don't understand which alternative you prefer out of the ones I listed. All our tier 1 platforms don't have the same behavior.

Ok, maybe I've misunderstood the cases here. I assumed there are three: trap, auto-align (boo!), do the unaligned load (without atomicity guarantees). Is that right? If so, then when you said above "The auto-align and trapping ones aren't the mainstream ones AFAIK", I assumed that meant all the tier 1 platforms are in the last case.

AFAIK x86 "just works" and will be atomic even across cache lines, whereas "mainstream" mobile ARMs will tear. These are "tier 1". Other ARMs (e.g. for embedded) will trap or auto-align.

I agree we should tune for tier 1. We shouldn't ignore tier 2 if it's easy. In this case tier 1 differs, and trapping is cheap enough that I'd go with that because it prevents "I tested on Chrome on x86 and it worked, ship it!".

Silently tearing is really bad! It's hard to reliably observe.

@lukewagner
Copy link
Member

But then it's non-atomic. Why would a user bother with atomics if their code is OK not being atomic?
I'm totally happy making that type of code slower.

If the compiler has to blindly work around it (like it had to do with float conversion traps), it would penalize the whole app, not just the dumb code. Then again, maybe explicit atomics + unaligned bugs are a rare enough combination in practice that this situation wouldn't arise in practice.

and trapping is cheap enough that I'd go with that because it prevents "I tested on Chrome on x86
and it worked, ship it!".

Yeah, I think that is the central tradeoff: making all tier-1 fast paths slower for some amount of bug-catching. It'd certainly help to have some measurements here.

If the loaded value is not equal to the expected value, the operator returns 1
("not-equal"). If the values are equal, the agent is suspended. If the agent
is woken, the wait operator returns 0 ("ok"). If the timeout expires before
another agent wakes this one, this operator returns 2 ("timed-out").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you know, in the HTML embedding of ES2017 we do not allow Atomics.wait to wait on the main thread of a window (or, in Firefox at present, on the main thread of a SharedWorker). In ES, a call to Atomics.wait on such a thread results in a TypeError being thrown. Wasm can be used on the main thread, so is there going to be the same restriction there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never really liked this AgentCanSuspend compromise since it seems like it would just lead to spin-locking instead. That being said, it seems like wasm should be symmetric with JS here, so either we make the restriction too or we try to remove the restriction from JS. If we make the restriction, it seems like we'd throw a RuntimeError.

Copy link
Contributor

@lars-t-hansen lars-t-hansen Apr 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will lead to some spin-locking, but those that dabble in that will quickly discover that the browser locks up, as the main thread is not able to service requests made on behalf of other parts of the system. (Though I sort of doubt that will be apparent to many of the people bitten by this problem.)

(Edited to add:)

IMO we should expect to live with this problem and to ease the creation of content that lives with it, too, by providing structures for easily running Wasm off-main-thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I think we need to be symmetric with JS here.


The embedder is also permitted to suspend or wake an agent. A suspended agent
can be woken by the embedder or the wake operator, regardless of how the agent
was suspended (e.g. via the embedder or a wait operator).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this attempting to formalize the push-into-history / restore-from-history mechanism of browsers, or to allow spurious wakeups (which are not in the ES version of Atomics.wait)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it was allowing spurious wakeups, but of course you're right, that's not allowed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry -- this is meant to allow the thread to be suspended/woken by JavaScript as well as WebAssembly. But it's a good point that it is similar to a spurious wakeup. I wonder if there's a good way to make a distinction.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reference the abstract WakeWaiter operation? If the wasm wake operator uses that, that'll unify wasm with how the embedder can explicitly wake up agents.

@jfbastien
Copy link
Member

I created a new repo for this: https://github.com/WebAssembly/threads
@binji can you move things there? It'll be easier to iterate.

@binji
Copy link
Member Author

binji commented Apr 18, 2017

Moved here: WebAssembly/threads#1, I'll close this for now. Please move relevant comments there!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.