-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Atomicity: lock-free guarantees #300
Comments
I would also like to propose that 8-byte accesses also be guaranteed to be lock-free in 64-bit mode. This would mean that some 32-bit platforms (like ppc32, mips32) wouldn't be able to run 64-bit wasm, though some (like x86-32) can, but as detailed above, all known relevant 64-bit platforms would be fine. |
Is lock-freedom observable? E.g. would it be possible for an implementation to "violate" lock-freedom in a way that is observable to the program? Or is this more of a performance advisory? |
It's mostly a performance advisory (albeit a significant one), but it is potentially possible to observe. A few examples:
|
How does this interact with unaligned access? Also, does 4-byte lock-free status guarantee that 1-byte and 2-byte IOs are lock-free? I can imagine an architecture where a 1-byte write turns into read-4-into-temp, write-1-into-temp, write-4-from-temp, in which case it would be very much not lock-free. |
Since PostMVP.md#threads already mentions the goal of providing atomic ops and |
@kg your two questions:
@lukewagner I basically want to figure out what LLVM should do (both for |
I like the idea of shooting for the best semantics we can get, but I'm not sure mandating |
@lukewagner lying isn't an option IMO :-)
The most conservative is to say "sometimes" in all cases, which is what I asked @sunfishcode to do in clang for now. |
Are you proposing that as a part of the '1/2/4-byte accesses are atomic' guarantee, we require conforming implementations to turn all sub-4-byte accesses into cmpxchg? That seems catastrophically non-performant. |
@kg I indeed am, and it's lock-free so cheaper than the alternative :-) |
@kg to clarify, I'm talking about atomic accesses only, not regular load/store. |
@jfbastien Ok, then it sounds like we agree (to not require |
@luke I am positing that no future arch will ever drop support for lock-free 1/2/4-byte atomic accesses. The arch survey mentioned above includes all major architectures, and beyond that, the C++ memory model encourages architectures to do so, and is one of the pillars of our work here. |
I sure hope architectures support 1-byte atomics forever :-) So I tend to agree with @sunfishcode and count on 4 being 4eva. |
Yeah, I was mostly wondering about 2. If we want to make this assumption, then, we should add it to our list of assumptions. |
2 is an assumption on efficient execution, but the make-or-break assumption we're potentially making is about 4. Being wrong on 4 is more than just performance breaking! |
From people working on MIPS I have heard that the implementation advice for 1-byte and 2-byte atomic RMW is to use a separate spinlock, not the 4-byte atomic LL/SC pair. I don't know where that advice came from originally, and I remain skeptical that it's good advice. Just sayin'. I have not yet investigated what the C++ compilers do on that platform. I'm not sure how to interpret @jfbastien's comment earlier about 1-byte atomics, surely all modern platforms are single-copy atomic for a single byte and new platforms will be so, but MIPS and (last I looked) PPC have no instructions specifically for atomic 1-byte RMW. (Nor 2-byte.) |
OK, we don't currently have atomic accesses in AstSemantics so this was confusing. We should make sure to be clear that this only applies to atomic access opcodes (and introduce them, naturally) |
Related discussion happening in the Shared memory repo. I've noted there that some microcontroller systems that can and do run JS are probably not lock-free in our sense for four-byte data. For example, the Model 1 Raspberry PI is an ARM11 chip with only a SWP instruction suitable mainly for building a spinlock. I don't want to overemphasize the microcontroller use case, and microcontrollers are getting better fast, but JS is everywhere :) |
@lars-t-hansen My survey above was based on what's in LLVM, and LLVM seems to interpret "lock-free" for types to just mean that it can do plain atomic loads and stores of those types without locks. For example, it considers 1-byte, 2-byte, and 4-byte integer types to all be "lock-free" for all armv6 platforms. Unfortunately, the C++ standard itself is vague about what a "lock-free type" means. |
Specifically, for this C code for example:
clang --target=armv6 b.c -S -O2 -o - gives no error and emits this:
Which to my understanding is not "lock-free" code. I will seek clarification. |
@sunfishcode, we probably need to figure out what we need by the terminology then. On ARM and MIPS, that property is called "single-copy atomicity", which just boils down to a relaxed atomic load or store being implementable with a normal load or store instruction. I agree that all platforms I've surveyed, post-dating the first Alpha, are single-copy atomic for integer data (only) up to the native pointer size, a prerequisite for implementing Java. The code above is what I refer to as lock-free: no external lock needed for a RMW operation. (Incidentally, though not central to this issue, ldrexh is only available in the ARMv6K and later and I'm not sure that would work on the Raspberry PI.) |
@kg Atomics aren't in AstSemantics.md b/c AstSemantics.md is what's in MVP; atomics are specifically called out in FutureFeatures.md#threads where I think they belong. We coudl add a link from AstSemantics.md#linear-memory-operations to this, though. |
@lars-t-hansen I was reading the current C++ draft which incorporates the definition of lock-free from n3927 but I'm not confident I interpreted it correctly. FWIW, I looked up the original Raspberry Pi and it has an ARM1176JZF-S which according to ARM does have ldrexh, strexh, ldrexb, and strexb. |
@lars-t-hansen what I meant is: if you have a 4-byte lock-free access then you can synthesize 1-byte and 2-byte accesses using shifts and masks. That's true for load/store as well as RMW. See this LLVM code: |
Going point by point with the C++ definition:
The C++ spec appears to mumble and fumble on 2c for LL/SC, which is probably as good as it gets. The spirit of lock freedom is all about ensuring global progress rather than the progress of any one thread. Even ignoring 2c, lock freedom allows for some thread to get stuck indefinitely. Even if all of the threads executing this LL/SC loop on some short are stuck, it can only be because some some other thread is succeeding in modifying some word on this cache line. Arguably that thread is making progress in that case.
|
Thanks. I'm prepared to call a ldrexh/strexh loop lock-free then. I looked up some reference points: the original Raspberry Pi and Cortex-M3 both have ldrexh/ldrexb/strexh/strexb. Cortex-M0 does not. So:
|
@pizlonator, doesn't that definition of lock-freedom mean that any size access implemented with a spinlock (eg using a test-and-set on a separate one-bit location, even one that is shared among all atomic accesses) is lock-free? I'm not necessarily opposed to that, but it is broader than I would have expected. In the gcc documentation we have this: "bool __atomic_always_lock_free (size_t size, void *ptr). This built-in function returns true if objects of size bytes always generate lock-free atomic instructions for the target architecture. size must resolve to a compile-time constant and the result also resolves to a compile-time constant. " That's more limited than your definition but it also seems more useful - it says something about the specific performance characteristics and implementation of the access: atomic access, not (sharded) (spin)lock. @sunfishcode, re ldrexh/strexh, serves me right for trusting Wikipedia on such details :) @jfbastien, clearly a 4-byte op can be used to synthesize a 1-byte op and a 2-byte op, no discussion about that. |
You're right that spin locks would be lock free if the OS scheduler had some fairness guarantee and the "spinning" part of the spin lock caused the OS to fairly round-robin through all runnable threads before coming back to the spinning thread.
|
@lars-t-hansen As one data point, LLVM's MIPS backend uses LL/SC for i16 and i8. |
Using LL/SC or |
@jfbastien Can you help me reconcile your interpretation with the fact that clang and libc++ seem to have no code for making the size or alignment of |
I'm more familiar with ARM, where non-atomic stores do reset the global monitor set by x86 is obviously not a problem. I'm not familiar enough with MIPS to know how their LL/SC is spec'd, and whether that's consistent across all of their SMP cores. I'm told NVIDIA processors "just work" in these circumstances. I do think that it's sensible for weak memory systems with multiple sockets to do what I've detailed (effectively TBAA based on atomic access types) but I don't have a public ISA document to point at to back it up. Let me poke a few people. |
Jikes RVM's thin/biased locks, GC header, and hash code logic relied on doing a mix of 32-bit LL/SC and 8-bit or 16-bit non-atomic accesses. That code was known to work on PPC. WebKit on ARM uses 32-bit LL/SC to implement 8-bit weak CAS, and those bytes also get accesses by nonatomic 8-bit memory access instructions. WebKit is known to work on a variety of ARMs. I'm not entirely opposed to specifying the behavior that you speak of if it is necessary to support some major architecture, but it strikes me as a very risky move based on my own experience with lock free algorithms.
|
Size doesn't matter in this context. ARM's global monitor protects wide address ranges, so non-atomics that are "too close" to atomics can cause the
I agree with you that it would lead to a bizarre programming model, and that it would break some lock-free code. That's what developers get for running with sharp scissors ;-) More seriously I'm hoping this isn't something we need to spec, couldn't find hardware that we care about where it's an issue. I'll will post if I hear that this is an issue. I'm simply cautious because C++ is pretty specific but obscure about this (the requirement is on |
I poked C++ WG21 SG1 folks who were involved in the current standard's wording. The worry which led to the current restrictions in C++ is mostly around "embedded processors that we don't know about". Paul McKenney confirms that Alpha (pre-EV5) is of course the chief offender, but I think it's safe to ignore. Lawrence Crowl points out that one interesting other case is the 80386:
@jyasskin and Hans Boehm point out that processors which exhibit this issue with atomic / non-atomic mixing would also exhibit the same issue with pure non-atomic accesses: they'd also have to do e.g. 16-bit accesses using 32-bit cmpxchg or LL/SC. Hans also says:
TL;DR: we're probably good if we ignore this issue. |
@jfbastien does "TL;DR: we're probably good if we ignore this issue" mean that you support guaranteeing that i8, i16, and i32 on WebAssembly are all "lock free types" in the C/C++ sense? |
Yes, but I'd phrase it as a platform assumption the same way we expect little-endian. I can create a PR to do this if it sounds good to you. |
That sounds good to me. |
FWIW, I'm not yet convinced that the JS shared memory spec will make the assumption that any particular size is lock free. So this may be another point of infidelity between the wasm spec and its JS polyfill. That's probably not a big deal, just noting it. |
I'll mark this as closed for now, though @lars-t-hansen's point means we should probably revisit the exact guarantees in the future. |
TL;DR: make all accesses of 1, 2 and 4 bytes "always" lock-free, wider access "maybe".
C++11 has a concept of "atomicity" where an operations is either entirely observable, or not at all. This guarantee is on a per object basis, during the object's lifetime—for our purpose this is equivalent to per memory location. This is irrespective of the object's size, but does require proper natural alignment (UB if not naturally aligned).
Not all hardware have load/store of wide sizes (indeed, atomics can have any size!), the language therefore offers an
is_lock_free
method on thestd::atomic
class. This is a per-type property, but it's effectively based on size. This method is determined at runtime (i.e. it is notconstexpr
), but it's effectively known at load time because it can't change during program execution.Using an
atomic
of non-atomic size relies on the runtime using a global lock (or a shard of global locks) to seamlessly prevent partial reads or writes. Normal loads/stores are unaffected, only atomics of unusual sizes go through this lock.C has
ATOMIC_*_LOCK_FREE
macros forBOOL
,CHAR
,CHAR16_T
,CHAR32_T
,WCHAR_T
,SHORT
,INT
,LONG
,LLONG
,POINTER
, which can be used at compile time. These macros are a tribool: never lock-free, sometimes lock free, always lock free.There is a proposal to add
constexpr atomic::is_always_lock_free
to the languages.Note: things are a bit more complicated because C11 decided to standardize
is_lock_free
as taking a pointer to the object as well. Implementations ignore that pointer, I suggest we do the same.In all 3 cases we (WebAssembly implementors) need to decide what we want to guarantee. The runtime checks are easy (look at current machine, return that), the shard of global locks is easy (it's an implementation detail), but the compile-time guarantee of "yes this will always be lock free on all platforms WebAssembly will ever support" isn't that easy to make, and it's binding forever.
Having such a guarantee is pretty important: it's silly to write a lock-free algorithm if the datastructure is atomic through a lock (i.e. isn't actually lock free!). A good example for 64-bit lock-free being useful is a doubly-linked list where 2 pointers use up 64 bits.
@sunfishcode points out:
I think we can safely say 1, 2 and 4 byte accesses are always lock-free (even if you can't do a byte load you can always
cmpxchg
a wider size).The text was updated successfully, but these errors were encountered: