
<atomic>: Optimize atomic_thread_fence #739

Closed
AlexGuteniev opened this issue Apr 23, 2020 · 16 comments · Fixed by #740

Labels: fixed (Something works now, yay!), performance (Must go faster)

AlexGuteniev commented Apr 23, 2020

I'm referring to this part again:

STL/stl/inc/atomic, lines 1894 to 1909 in 3141965:

extern "C" inline void atomic_thread_fence(const memory_order _Order) noexcept {
if (_Order == memory_order_relaxed) {
return;
}
#if defined(_M_ARM) || defined(_M_ARM64)
_Memory_barrier();
#else // ^^^ ARM32/ARM64 hardware / x86/x64 hardware vvv
_Compiler_barrier();
if (_Order == memory_order_seq_cst) {
static long _Guard;
(void) _InterlockedCompareExchange(&_Guard, 0, 0);
_Compiler_barrier();
}
#endif // hardware
}

A Stack Overflow answer mentions that the use of one variable for all threads is sub-optimal, as it creates unnecessary contention.

Also, there's no real point in using lock cmpxchg; the same effect can be achieved by a simpler operation. My thinking is that it should be implemented as:

    volatile long i; // TODO: suppress the warning about uninitialized variable if needed
    _InterlockedIncrement(&i);

This compiles to:

lock inc    dword ptr [rsp]  
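
Put together, a sketch of the whole function (the quoted code with the static guard replaced by a local; the barrier macros are the ones already used in the <atomic> code above):

    extern "C" inline void atomic_thread_fence(const memory_order _Order) noexcept {
        if (_Order == memory_order_relaxed) {
            return;
        }
    #if defined(_M_ARM) || defined(_M_ARM64)
        _Memory_barrier();
    #else // x86/x64
        _Compiler_barrier();
        if (_Order == memory_order_seq_cst) {
            volatile long _Guard; // uninitialized on purpose; only the locked RMW matters
            (void) _InterlockedIncrement(&_Guard); // lock inc dword ptr [rsp]
            _Compiler_barrier();
        }
    #endif
    }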
pcordes commented Apr 23, 2020

Most other compilers use mfence for atomic_thread_fence(mo_seq_cst), even clang, which like MSVC uses xchg for seq_cst stores instead of mov+mfence.

A lock add dword ptr [rsp], 0 is a no-op other than clobbering flags, and would be a fine choice, perhaps even better than mfence. But having every thread hammering on the same cache line for a static variable is clearly terrible. Most other compilers use lock add [esp], 0 as a barrier in 32-bit mode if mfence isn't available. It might even be faster than mfence in some cases, especially on Skylake-derived CPUs, where mfence blocks OoO exec of even non-memory instructions (like lfence does). https://stackoverflow.com/questions/50494658/are-loads-and-stores-the-only-instructions-that-gets-reordered/50496379#50496379 / https://stackoverflow.com/questions/40409297/does-lock-xchg-have-the-same-behavior-as-mfence

Or, almost as good: reserve some stack space in the current function, in a cache line that's normally already hot in L1d cache in Modified state, to use as a normal atomic variable. This can be implemented without any new intrinsics; just make this var non-static. An extra 4 (or 1) bytes of stack space would typically be worth it to avoid a likely cache miss (if other threads are running the same code, the static var probably gets bounced around).

AlexGuteniev commented Apr 23, 2020

So is the main advantage of lock add dword ptr [rsp] in avoiding a stack variable allocation?

I think avoiding a stack variable is impractical for normal Windows user-mode programs (which this STL targets), if it is hard to achieve with the current intrinsics. Though I may look into using the _AddressOfReturnAddress intrinsic.

mfence definitely performs worse on my i7-8750H.

Probably, if we're not going to avoid the stack variable, xchg is the best choice for not clobbering flags.
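
For illustration, a minimal sketch of that variant (the variable name is illustrative; xchg itself doesn't touch FLAGS, though it does need a register for the returned old value):

    volatile long _Guard;
    (void) _InterlockedExchange(&_Guard, 0); // xchg: implicitly locked, leaves FLAGS untouched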

AlexGuteniev commented Apr 23, 2020

After I made a pull request that passed the tests, I have an idea why the static may have been there in the first place.
There is warning C28113: "Accessing a local variable via an Interlocked function". static was probably a convenient way to suppress it. I've added an explicit suppression for this statement instead.
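
A sketch of the suppressed form (using MSVC's analysis-warning pragma; the variable name is illustrative):

    volatile long _Guard; // deliberately uninitialized; only the locked RMW side effect matters
    #pragma warning(suppress : 28113) // accessing a local variable via an Interlocked function
    (void) _InterlockedIncrement(&_Guard);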

pcordes commented Apr 24, 2020

So is the main advantage of lock add dword ptr [rsp] in avoiding a stack variable allocation?

That, and not needing any free registers: just a 1-byte immediate. This is especially likely to be a problem in 32-bit code, where we have fewer registers. Note that a spill/reload around a memory barrier can't have its latency hidden as effectively by OoO exec.

mfence definitely performs worse on my i7-8750H.

Yeah, that's expected on Skylake-derived CPUs with up-to-date microcode.

You'd only find mfence faster if you were using it in a loop that did some real work, and for some reason the stack variable you were doing an atomic RMW on missed in L2 cache. I assume your test was a pretty tight loop, without a lot of in-flight work that might evict stack cache lines.

IDK if some other microarchitectures (without such a brutally slow mfence) could have better throughput for mfence than the potential latency bottleneck of repeated atomic RMW on the same location. It has to drain the store buffer anyway, so it's not like store-forwarding latency is part of it.

But since Skylake-derived uarches are very widespread, especially in servers these days, it's probably best to always use an atomic RMW. I don't think any downside of an RMW on a stack local on other uarches is going to be significant at all, and certainly not compared to the amount of performance we'd lose on Skylake-based CPUs.

Probably, if we're not going to avoid the stack variable, xchg is the best choice for not clobbering flags.

That's reasonable, although like I said, += 0 has the advantage of not needing any free registers, if the compiler can do it with lock add (not lock xadd) for a fetch_add.
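
To illustrate the distinction in standard C++ (whether the result-unused case actually compiles to lock add rather than lock xadd is up to the compiler):

    #include <atomic>

    void fence_like(std::atomic<long>& a) {
        a.fetch_add(0, std::memory_order_seq_cst); // result discarded: can become lock add [mem], 0
    }

    long observe(std::atomic<long>& a) {
        return a.fetch_add(0); // result used: needs lock xadd
    }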


static was probably a convenient way to suppress it

Yeah, plausible. It's easy to lose sight of the big picture sometimes and make performance mistakes.

jpmorgan-atMS commented Apr 24, 2020

Note that LOCK NOT also does not update the flags and doesn't modify any registers.

pcordes commented Apr 24, 2020

Note that lock not also does not update the flags and doesn't modify any registers

Not sure what you're trying to say. lock is a prefix for other instructions. lock add dword [rsp], 0 does modify FLAGS, just like regular add - https://www.felixcloutier.com/x86/add#flags-affected

jpmorgan-atMS commented Apr 24, 2020

The NOT instruction does not update flags, so you can issue LOCK NOT [mem] without altering registers or flags. Unfortunately you can't (currently) generate this from intrinsic functions in MSVC.

StephanTLavavej added the performance (Must go faster) label Apr 24, 2020
pcordes commented Apr 24, 2020

The NOT instruction does not update flags, so you can issue LOCK NOT [mem] without altering registers or flags. Unfortunately you can't (currently) generate this from intrinsic functions in MSVC.

Oh, good idea, that would work. lock is usable with not: https://www.felixcloutier.com/x86/lock.
An atomic val ^= -1; could optimize to lock not, but if that still uses lock xor, then a lock inc on a dummy variable would be totally fine, too. It's probably rare that keeping FLAGS across a memory barrier would be a win.

According to https://www.uops.info/table.html, lock not and lock inc are the same number of uops as lock add on recent Intel and AMD, so the front-end cost should be identical.

AlexGuteniev commented Apr 24, 2020

The compiler does not have _InterlockedNot or _InterlockedAdd.
_InterlockedXor and _InterlockedExchangeAdd do not optimize to the above.

Apparently, saving the destination register can be done with _interlockedbittestandset. Though lock xor also works for saving registers, and may be better.

https://godbolt.org/z/Vbgt_A - I don't see the compiler saving a register with _interlockedbittestandset, since it uses the source address in eax. Is this impossible, or is the compiler just not looking into that?
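
For reference, the kind of test behind the Godbolt link (the function name is made up; the codegen observation is the one described above):

    #include <intrin.h>

    void fence_via_bts() {
        long _Guard = 0;
        _interlockedbittestandset(&_Guard, 0); // observed: lock bts with the address in a register (eax)
    }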

AlexGuteniev commented Apr 24, 2020

Ok, let's go with lock inc, as it is the smallest of the available options that saves registers.

AlexGuteniev commented:
_AddressOfReturnAddress works for getting something that is already on the stack, but as it is the return address, it is not at the very top; it can be at something like [ebp+200], so the chance it is not in L1 cache is higher.

pcordes commented Apr 24, 2020

_AddressOfReturnAddress works for getting something that is already on the stack, but as it is the return address, it is not at the very top; it can be at something like [ebp+200], so the chance it is not in L1 cache is higher.

That's not necessarily true, and certainly not obvious.

If you're using EBP as a traditional frame pointer, the return address is definitely at [ebp+8], right below any stack args and right above the saved-EBP value you stored on function entry.

If you actually meant [esp + 200] for a function that omits a frame pointer and reserves a bunch of space, the cache line containing the return address is just as likely to be hot as any other line in the current stack frame.

Probably more so, because the caller was probably using it, and your return address was stored by the calling function right before entering this function (unless it was reached by a jmp tailcall).

If you're in a loop in a long-running function that doesn't access locals on the stack very frequently, there's just as much chance for [esp] to have been evicted as [esp + 200]. If small scalar locals tend to be above arrays or large structs, closer to the return address, a dummy var is likely in the same line as the return address anyway, and probably keeping each other hot in cache if any are getting accessed in a loop in a big function.

OTOH if MSVC puts arrays at higher addresses inside a stack frame, and scalar locals at the bottom, then yeah maybe a higher chance for the bottom of the stack frame to be hot. I guess also if you were making function calls to child functions, that would be keeping the bottom of this function's stack frame hot. (A ret back into this function would have popped a return address from [esp-4].)

(One reason for compilers to put arrays near the top of a stack frame is so buffer overflows hit a stack canary right away, instead of overwriting locals first. This is good if you're compiling with an option like gcc -fstack-protector-strong to actually check a canary in the stack frame of functions with arrays. Otherwise it's probably bad, making it easier to overwrite the return address and reach a ret without crashing.)

AlexGuteniev commented Apr 25, 2020

I think I observed exactly [ebp+200], but I cannot recreate it, so maybe I noted it mistakenly.
Anyway, traditional stack frames are not seen very often, I suppose: x64 does not seem to use them, and on x86, /O2 implies /Oy.

I meant a case where a long-running function allocates a lot of stack, but yes, I now agree that it is not clear which part of this stack is less likely to be hot.

If saving a stack variable makes sense, then _InterlockedXor(reinterpret_cast<volatile long*>(_AddressOfReturnAddress()), 0);, which usually expands to lock xor dword ptr [rsp], 0 on x64, would probably be an option, as lock add [mem] cannot be generated.

A good part is that _AddressOfReturnAddress does not prevent inlining; instead, it uses the return address of the outer non-inlined function.
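
A sketch of that option (the helper name is made up; XOR with 0 leaves the return address unchanged, so the RMW is a harmless no-op on memory that's already on the stack):

    #include <intrin.h>

    inline void _Seq_cst_fence_via_return_address() noexcept {
        _InterlockedXor(reinterpret_cast<volatile long*>(_AddressOfReturnAddress()), 0);
        // usually compiles to: lock xor dword ptr [rsp], 0 on x64
    }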

AlexGuteniev commented:
Tricky use of _AddressOfReturnAddress might be a problem for security analysis of the control path. Normal runtime stack buffer overrun checks and Control Flow Guard should not be able to notice this, but an external runtime debugging tool might be confused.

AlexGuteniev commented Apr 26, 2020

I've used _InterlockedXor(mem, 0) in #651 to provide a seq_cst fence around a relaxed store using __iso_volatile_store64 (which uses SSE2 or floating-point registers). In this case, it is known that the target operand is hot:

https://github.com/microsoft/STL/pull/694/files#diff-69d94ca27cdd6e1c1420ba4fc9e60b86L650-R648

(Actually, the first memory fence is not needed; I will replace it with a compiler barrier.)
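
The shape of the #651 approach, as a sketch (the helper name is made up; _Compiler_barrier is the internal macro from the code quoted at the top):

    #include <intrin.h>

    inline void _Store64_seq_cst(volatile long long* _Mem, long long _Value) noexcept {
        _Compiler_barrier(); // leading fence: a compiler barrier suffices, per the note above
        __iso_volatile_store64(_Mem, _Value);
        _InterlockedXor(reinterpret_cast<volatile long*>(_Mem), 0); // trailing fence: RMW on the just-written, hot line
    }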

For the general case, probably the best option would be to have an intrinsic for a seq_cst fence. The compiler could implement it with lock add [mem], 0 / lock xor [mem], 0 on anything that was recently modified or will be modified soon and belongs to the current thread; this could be something relative to [ebp] or [esp] (but probably not [fs:]/[gs:]).

StephanTLavavej pushed a commit that referenced this issue Apr 30, 2020
StephanTLavavej added the fixed (Something works now, yay!) label Apr 30, 2020
AlexGuteniev commented May 12, 2020

I found the intrinsic that was supposed to be used here: __faststorefence.
It compiles to lock or dword ptr [rsp], 0 without allocating a stack variable.

Unfortunately, it is available only in x64 mode, not in x86 mode.
If it were available for both x86 and x64, things could have been simplified by avoiding the suppression and explanations. I've created DevCom-1027191, but I don't think it will be prioritized highly.
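
A sketch of how it could be used where available (the helper name is made up; the x86 fallback is the fix from this issue):

    #include <intrin.h>

    inline void _Seq_cst_fence() noexcept {
    #if defined(_M_X64)
        __faststorefence(); // lock or dword ptr [rsp], 0; no stack variable needed
    #else
        volatile long _Guard;
    #pragma warning(suppress : 28113)
        (void) _InterlockedIncrement(&_Guard);
    #endif
    }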
