From 8dbfd1635a85b7b4ff687e84243d9b891df03819 Mon Sep 17 00:00:00 2001 From: Ryan Houdek Date: Fri, 31 May 2024 10:32:54 -0700 Subject: [PATCH] FEXCore/docs: Adds programmer documentation about memory model emulation I keep needing to look these up to remember the limitations. Add a doc file so I can more easily point to the information. --- FEXCore/docs/MemoryModelEmulation.md | 152 +++++++++++++++++++++++++++ 1 file changed, 152 insertions(+) create mode 100644 FEXCore/docs/MemoryModelEmulation.md diff --git a/FEXCore/docs/MemoryModelEmulation.md b/FEXCore/docs/MemoryModelEmulation.md new file mode 100644 index 0000000000..8452083002 --- /dev/null +++ b/FEXCore/docs/MemoryModelEmulation.md @@ -0,0 +1,152 @@ +# What is x86-TSO and what is different compared to ARM's weak memory model? +x86's memory model is a very strictly coherent memory model that effectively mandates that all memory accesses are "atomic". While atomicity is +actually a bit more strict, we actually need to emulate it in ARM using atomic instructions. We are also required to emulate this strictness with +unaligned accesses, which is due to x86 CPUs allowing unaligned atomics for "free" within a cacheline. Intel also takes this a step more and allowing +full atomics with a feature called "split-locks", AMD gains this same feature in Zen 5. + +# Emulating loads +Due to x86 SIB addressing, this can happen on most instructions. FEX emulates these in a variety of ways depending on features. +Most instructions are emulated with an atomic instruction but we also implement a feature called "half-barrier" atomics for unaligned atomics. + +## Base ARMv8.0 +- Addressing limitations + - Register only +This is emulated using an atomic load instruction plus a nop. +- On unaligned access the code gets backpatched to a non-atomic load plus a memory barrier + +## FEAT_LRCPC +- Addressing limitations + - Register only +This matches the base ARMv8.0 implementation but adds new instructions that match x86-TSO behaviour, making the emulation slightly quicker. +- On unaligned access it still gets backpatched to non-atomic load plus a memory barrier. + +## FEAT_LRCPC2 +- Addressing limitations + - Register plus 9-bit signed immediate (-256, 255) +Adds some new instructions that allow immediate encoding inside of the previous LRCPC instructions + +## FEAT_LRCPC3 +Adds a handful of GPR instructions that aren't super interesting + +FEX doesn't currently implement these since no hardware supports it. + +- ldapr - Post-index load for stack +- ldiapr - Post-index load pair for stack +- stilp - pre-index store pair for stack +- stlr - pre-index store for stack + +# Emulating stores +Again due to x86 SIB addressing, this can also happen on most instructions. There are less options for FEX with this extension, so in most cases this +just turns in to an atomic store with half-barrier backpatching for unaligned accesses + +## FEAT_LRCPC, FEAT_LRCPC2 +Adds nothing for emulating stores + +## FEAT_LRCPC3 + +# Emulating atomic instructions +x86 has atomic memory operations that can do a variety of operations. For unaligned atomic operations FEX will emulate the operation inside the signal +handler if it happens to be unaligned. + +## CASPair - cmpxchg + +## Base ARMv8.0 +- Addressing limitations + - Register only +This is emulated with a ldaxp+stlxp pair of instructions. + +## FEAT_LSE +- Addressing limitations + - Register only +Adds a new caspal instruction that does the operation almost exactly like x86. + +## CAS - cmpxchg8b/cmpxchg16b + +## Base ARMv8.0 +- Addressing limitations + - Register only +Similar to CASPair but now only uses a ldaxr+stlxr pair + +## FEAT_LSE +- Addressing limitations + - Register only +Similar to CASPair adds a new casal instruction that operates basically like x86 + +# AtomicFetch +## Op from the following list +- Add +- Sub +- And +- CLR +- Or +- Xor +- Neg +- Swap + +## Base ARMv8.0 +- Addressing limitations + - Register only +All operations get emulated with an ldaxr+stlxr+ instruction + +## FEAT_LSE +- Addressing limitations + - Register only +Almost all operations now have a native atomic memory operation instruction. The only outlier is atomicNeg which doesn't have an LSE equivalent and +uses the ARMv8.0 implementation. + +# Vector loads +Since almost all memory accesses on x86 are TSO, this includes vector operations. + +## Base ARMv8.0 +- Addressing limitations + - Register plus 9-bit signed immediate (-256, 255) + - Register plus 12-bit unsigned scaled immediate (Scaled by access size) +Emulated using half-barriers, which means a load+dmb + +## FEAT_LRCPC3 +- LDAP1 added for element loads. Register only address encoding +- LDAPUR added for vector register loads, supports 9-bit simm offset + +# Vector stores +Just like loads, these are emulated using half-barriers + +## Base ARMv8.0 +- Addressing limitations + - Register plus 9-bit signed immediate (-256, 255) + - Register plus 12-bit unsigned scaled immediate (Scaled by access size) +Emulated using half-barriers, which means a dmb+str + + +## FEAT_LRCPC3 +- STL1 added for element stores. Register only address encoding +- STLUR added for vector register stores, supports 9-bit simm offset + +# Addressing limitations depending on operating mode +## GPR loadstores +### TSO Emulation disabled +- Register only (ldr/str) +- Register + Register + scale (ldr/str) +- Register + 9-bit simm (ldur/stru) +- Register + 12-bit unsigned scaled imm (ldr/str) + +### TSO Emulation enabled +- Register only (ldar/stlr) +- Register only (ldapr/stlr) - FEAT_LRCPC +- Register + 9-bit simm (ldapr/stlur) - FEAT_LRCPC2 + +## Vector loadstores +### TSO Emulation disabled +- Register only (ldr/str) +- Register + Register + scale (ldr/str) +- Register + 9-bit simm (ldur/stru) +- Register + 12-bit unsigned scaled imm (ldr/str) + +### TSO Emulation enabled +- Same as TSO emulation disabled due to half-barrier implementation + +### TSO Emulation enabled (FEAT_LRCPC3) +- Register only (ldap1/stl1) - Element loadstore +- Register + 9-bit simm (ldapur/stlur) + +## Atomic memory operations +Always TSO emulation enabled, always register only.