Fix: Atomics and Locks (ARM, AArch64, X86) #14293
Conversation
Allows specifying how memory accesses, including non-atomic ones, may be reordered around atomics. Follows the C/C++ semantics: <https://en.cppreference.com/w/c/atomic/memory_order>. Keeps sequentially consistent as the default ordering. There are a couple of difficulties with this patch:
1. LLVM 12 and below only accepted some specific ordering combinations in cmpxchg, and all LLVM versions refuse an acquire ordering on stores and a release ordering on loads.
2. Codegen needs the ordering to be a constant (to generate the LLVM IR), but it is now a variable in the stdlib. We thus generate all possible cases every time, which the LLVM optimization passes optimize away once they realize only one branch is possible (unless `ordering` really is a variable, which should be very rare).
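To illustrate, here is a minimal sketch of what this looks like from user code (illustrative only: the orderings and the sequentially consistent default come from the description above, but the exact `Atomic(T)` overload shapes shown here are assumptions):

```crystal
flag = Atomic(Int32).new(0)

flag.set(1)                # default: sequentially consistent store
flag.set(1, :release)      # release store: publishes every write made before it
value = flag.get(:acquire) # acquire load: pairs with a release store

# Because :release and :acquire are literals here, the multi-way branch the
# codegen emits over all possible orderings collapses to a single case after
# LLVM's optimization passes.
```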
Fix: detect LLVM version
macOS CI is still failing with:
Thanks Sija, I missed that assertion, it felt like nixpkgs was crashing. I have no idea what's happening there. The assertion is pointless and doesn't hint at any issue. Not having the assertion and having the actual LLVM IR error would actually be much more helpful.
@Sija are you sure? CI reports a different message:
I tried to reproduce with Nix on Linux but the assertion won't reproduce 🤷
It's odd, the failing assertion seems to be this one: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/Instructions.h#L623-L624, yet the CI message mentions line 612... Also, llvm-config is at version 11.1.0.
I think I'm starting to understand: LLVM with assertions turned on still checks that the failure ordering can't have release semantics, but without assertions it will accept them. Sigh, so which is which? 😕
There are cases where the detection doesn't work properly, and we might want to avoid having two Crystal compilers behaving differently depending on the LLVM version they were compiled with. We might want to enable more cases when we ditch support for LLVM < 13.0.
As stated in the latest commit: I removed the check for the LLVM version and only enable the strictest of the choices. That seems to have fixed the issue with LLVM assertions, and I think it's better not to have different semantics based on the LLVM version Crystal happens to have been compiled with.
The assertion failure comes from LLVM 15.0.2, since that is the version the compiler itself is built with.
Add comment about not enforcing the "no stronger" rule for CAS in the compiler (codegen has to be valid) vs the stdlib (consistent public API).
Reading again https://llvm.org/docs/Atomics.html I realize the notes for code generation mention using explicit fences on weak architectures... and I'm back to wondering whether we ever needed to put explicit fences at all. If we didn't, the problem was merely the lazy set that was breaking the acquire/release requirement (an acquire is only guaranteed when paired with a release, and vice versa). Let's cross compile & disassemble to check that.
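As a concrete illustration of that pairing requirement, here is a hedged sketch (not the actual stdlib lock code; `lazy_set` is the stdlib's plain non-atomic store, and the ordering argument assumes the explicit-orderings overloads described earlier):

```crystal
held = Atomic(Int32).new(0)
# ... lock previously taken with an acquire operation on `held` ...

# held.lazy_set(0)    # broken: a plain store doesn't pair with the acquire,
#                     # so writes from the critical section may leak past it
held.set(0, :release) # correct: the release store pairs with the acquire
```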
I got mixed results. With a mere … If I enable a recent CPU (e.g. …) …
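For reference, this kind of check can be done with something like `crystal build --cross-compile --target arm-unknown-linux-gnueabihf --emit asm lock.cr` (hypothetical file name) and looking for `dmb` barrier instructions in the emitted assembly, then repeating with `--mcpu`/`--mattr` to target a more recent CPU.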
Confirmed: the explicit fences are not needed for armv6 and armv7, while armv5 and older don't support fences. See #14567.
This patch should fix the use of atomics in locks once and for all (until we add new architectures, each of which will need to be investigated). All usages should now be correct as per the textbook instead of relying on hacks, without much performance impact.
Explanation
CPU architectures with a weak memory model (such as ARM) can reorder the CPU instructions at runtime (unlike X86), not following the exact order of the assembly. We must respect the memory model guarantees and ensure that we have proper barriers or specific memory orderings at the specific places where they are required (e.g. locks).

A lock must ensure that anything after the lock acquisition isn't executed before the lock is actually acquired, and the other way around: anything before the lock release must be executed before we actually release it (see the sketch after the list below). Each architecture has different quirks:

- AArch64: atomics guarantee the ordering, but we must use explicit atomics everywhere (at least load/store with acquire/release ordering);
- ARM v6/v7: atomics don't guarantee any ordering (only the atomicity of the operation), so we must add explicit memory barriers around locks;
- X86: it seems we could get away with a non-atomic release, yet using acquire/release ordering isn't any slower than a non-atomic store.
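A minimal sketch of that rule (illustrative only, not the stdlib's `Crystal::SpinLock`; the ordering arguments assume the explicit-orderings overloads on `Atomic(T)`):

```crystal
class ToySpinLock
  def initialize
    @held = Atomic(Int32).new(0)
  end

  def lock : Nil
    # Acquire: nothing that follows may be reordered before the lock is taken.
    while @held.swap(1, :acquire) == 1
      # busy-wait until the lock looks free again
    end
  end

  def unlock : Nil
    # Release: everything before this store completes before the lock is freed.
    @held.set(0, :release)
  end
end
```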
Explicit orderings
X86 is the main reason I added specific orderings. I couldn't notice any difference on AArch64 between seq-cst and acquire/release, but on X86 it had a great impact: acquire/release is no slower than the lazy store while still using correct atomics.
Note: I'm following the algorithms used by the Linux v4.4 kernel overall.
Performance impact
I ran the "mutex" example from #14272 by @jgaskins ten times at different MT levels (ranging from 1 to 128 threads) and charted the average value. See run.cr.
On X86 (Intel Haswell i7-4712HQ, 4 cores, 8 threads) we see that this patch isn't any slower than current master (incorrect but appears to work), whereas just using load + store with the default sequentially consistent ordering has a huge impact.
On AArch64 (Neoverse N1 server, 160 cores) we can see that this patch is slower than master with the traditional LL/SC atomics (inherited from ARMv6), but the difference becomes unnoticeable when the new LSE atomics are enabled (`--mattr=+lse`), which significantly outperform LL/SC on this hardware.

Future evolution: we might want to enable LSE atomics by default for AArch64 when available: we can detect the `lse` feature with LLVM 15 and this snippet by @HertzDevil, and maybe add a flag to disable it when needed.

Help wanted: I'd be very interested to see results of LL/SC vs LSE atomics for other AArch64 CPUs: Apple, Raspberry Pi, AWS, ... 🙇
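For anyone willing to help: one way to reproduce is to build the benchmark with and without LSE, e.g. `crystal build --release -Dpreview_mt --mattr=+lse run.cr` (assuming run.cr is the benchmark linked above), then run it with different `CRYSTAL_WORKERS` values.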
Related issues:
- `Mutex#unlock` on aarch64 #14272