
<atomic>: Optimize atomic_thread_fence #739

Closed
AlexGuteniev opened this issue Apr 23, 2020 · 16 comments · Fixed by #740

Labels: fixed (Something works now, yay!), performance (Must go faster)

AlexGuteniev commented Apr 23, 2020

I'm referring to this part again:

STL/stl/inc/atomic, lines 1894 to 1909 in 3141965:

extern "C" inline void atomic_thread_fence(const memory_order _Order) noexcept {
if (_Order == memory_order_relaxed) {
return;
}
#if defined(_M_ARM) || defined(_M_ARM64)
_Memory_barrier();
#else // ^^^ ARM32/ARM64 hardware / x86/x64 hardware vvv
_Compiler_barrier();
if (_Order == memory_order_seq_cst) {
static long _Guard;
(void) _InterlockedCompareExchange(&_Guard, 0, 0);
_Compiler_barrier();
}
#endif // hardware
}

A Stack Overflow answer mentions that the use of one variable for all threads is sub-optimal, as it creates unnecessary contention.

Also, there's no real point in using lock cmpxchg; the same effect can be achieved by a simpler operation. My thinking is that it should be implemented as:

    volatile long i; // TODO: suppress the warning about uninitialized variable if needed
    _InterlockedIncrement(&i);

This compiles to:

lock inc    dword ptr [rsp]  
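
Put together, a sketch of the whole function (the quoted code with the static guard replaced by a local; the barrier macros are the ones already used in the <atomic> code above):

    extern "C" inline void atomic_thread_fence(const memory_order _Order) noexcept {
        if (_Order == memory_order_relaxed) {
            return;
        }
    #if defined(_M_ARM) || defined(_M_ARM64)
        _Memory_barrier();
    #else // x86/x64
        _Compiler_barrier();
        if (_Order == memory_order_seq_cst) {
            volatile long _Guard; // uninitialized on purpose; only the locked RMW matters
            (void) _InterlockedIncrement(&_Guard); // lock inc dword ptr [rsp]
            _Compiler_barrier();
        }
    #endif
    }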
pcordes commented Apr 23, 2020

Most other compilers use mfence for atomic_thread_fence(mo_seq_cst), even clang, which like MSVC uses xchg for seq_cst stores instead of mov+mfence.

A lock add dword ptr [rsp], 0 is a no-op other than clobbering flags, and would be a fine choice, perhaps even better than mfence. But having every thread hammering on the same cache line for a static variable is clearly terrible. Most other compilers use lock add [esp], 0 as a barrier in 32-bit mode if mfence isn't available. It might even be faster than mfence in some cases, especially on Skylake-derived CPUs, where mfence blocks OoO exec of even non-memory instructions (like lfence does). https://stackoverflow.com/questions/50494658/are-loads-and-stores-the-only-instructions-that-gets-reordered/50496379#50496379 / https://stackoverflow.com/questions/40409297/does-lock-xchg-have-the-same-behavior-as-mfence

Or, almost as good: reserve some stack space in the current function, in a cache line that's normally already hot in L1d cache in Modified state, to use as a normal atomic variable. This can be implemented without any new intrinsics; just make this var non-static. An extra 4 (or 1) bytes of stack space would typically be worth it to avoid a likely cache miss (if other threads are running the same code, the static var probably gets bounced around).

AlexGuteniev commented Apr 23, 2020

So is the main advantage of lock add dword ptr [rsp] in avoiding a stack variable allocation?

I think avoiding a stack variable is impractical for normal Windows user-mode programs (which this STL targets), if it is hard to achieve with the current intrinsics. Though I may look into using the _AddressOfReturnAddress intrinsic.

mfence definitely performs worse on my i7-8750H.

Probably, if we're not going to avoid the stack variable, xchg is the best choice for not clobbering flags.
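
For illustration, a minimal sketch of that variant (the variable name is illustrative; xchg itself doesn't touch FLAGS, though it does need a register for the returned old value):

    volatile long _Guard;
    (void) _InterlockedExchange(&_Guard, 0); // xchg: implicitly locked, leaves FLAGS untouched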

AlexGuteniev commented Apr 23, 2020

After I made a pull request that passed the tests, I have an idea why the static may have been there in the first place.
There is warning C28113: "Accessing a local variable via an Interlocked function". static was probably a convenient way to suppress it. I've added an explicit suppression for this statement instead.
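
A sketch of the suppressed form (using MSVC's analysis-warning pragma; the variable name is illustrative):

    volatile long _Guard; // deliberately uninitialized; only the locked RMW side effect matters
    #pragma warning(suppress : 28113) // accessing a local variable via an Interlocked function
    (void) _InterlockedIncrement(&_Guard);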

pcordes commented Apr 24, 2020

So is the main advantage of lock add dword ptr [rsp] in avoiding a stack variable allocation?

That, and not needing any free registers: just a 1-byte immediate. This is especially likely to be a problem in 32-bit code, where we have fewer registers. Note that a spill/reload around a memory barrier can't have its latency hidden as effectively by OoO exec.

mfence definitely performs worse on my i7-8750H.

Yeah, that's expected on Skylake-derived CPUs with up-to-date microcode.

You'd only find mfence faster if you were using it in a loop that did some real work, and for some reason the stack variable you were doing an atomic RMW on missed in L2 cache. I assume your test was a pretty tight loop, without a lot of in-flight work that might evict stack cache lines.

IDK if some other microarchitectures (without such a brutally slow mfence) could have better throughput for mfence than the potential latency bottleneck of repeated atomic RMW on the same location. It has to drain the store buffer anyway, so it's not like store-forwarding latency is part of it.

But since Skylake-derived uarches are very widespread, especially in servers these days, it's probably best to always use an atomic RMW. I don't think any downside of an RMW on a stack local on other uarches is going to be significant at all, and certainly not compared to the amount of performance we'd lose on Skylake-based CPUs.

Probably, if we're not going to avoid the stack variable, xchg is the best choice for not clobbering flags.

That's reasonable, although like I said, += 0 has the advantage of not needing any free registers, if the compiler can do it with lock add (not lock xadd) for a fetch_add.
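
To illustrate the distinction in standard C++ (whether the result-unused case actually compiles to lock add rather than lock xadd is up to the compiler):

    #include <atomic>

    void fence_like(std::atomic<long>& a) {
        a.fetch_add(0, std::memory_order_seq_cst); // result discarded: can become lock add [mem], 0
    }

    long observe(std::atomic<long>& a) {
        return a.fetch_add(0); // result used: needs lock xadd
    }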


static was probably a convenient way to suppress it

Yeah, plausible. It's easy to lose sight of the big picture sometimes and make performance mistakes.

jpmorgan-atMS commented Apr 24, 2020

Note that LOCK NOT also does not update the flags and doesn't modify any registers.

pcordes commented Apr 24, 2020

Note that lock not also does not update the flags and doesn't modify any registers

Not sure what you're trying to say. lock is a prefix for other instructions. lock add dword [rsp], 0 does modify FLAGS, just like regular add - https://www.felixcloutier.com/x86/add#flags-affected

jpmorgan-atMS commented Apr 24, 2020

The NOT instruction does not update flags, so you can issue LOCK NOT [mem] without altering registers or flags. Unfortunately you can't (currently) generate this from intrinsic functions in MSVC.

StephanTLavavej added the performance (Must go faster) label Apr 24, 2020
pcordes commented Apr 24, 2020

The NOT instruction does not update flags, so you can issue LOCK NOT [mem] without altering registers or flags. Unfortunately you can't (currently) generate this from intrinsic functions in MSVC.

Oh, good idea, that would work. lock is usable with not: https://www.felixcloutier.com/x86/lock.
An atomic val ^= -1; could optimize to lock not, but if that still uses lock xor, then a lock inc on a dummy variable would be totally fine, too. It's probably rare that keeping FLAGS across a memory barrier would be a win.

According to https://www.uops.info/table.html, lock not and lock inc are the same number of uops as lock add on recent Intel and AMD, so the front-end cost should be identical.

AlexGuteniev commented Apr 24, 2020

The compiler does not have _InterlockedNot or _InterlockedAdd.
_InterlockedXor and _InterlockedExchangeAdd do not optimize to the above.

Apparently, saving the destination register can be done with _interlockedbittestandset. Though lock xor also works for saving registers, and may be better.

https://godbolt.org/z/Vbgt_A - I don't see the compiler saving a register with _interlockedbittestandset, since it uses the source address in eax. Is this impossible, or is the compiler just not looking into that?
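
For reference, the kind of test behind the Godbolt link (the function name is made up; the codegen observation is the one described above):

    #include <intrin.h>

    void fence_via_bts() {
        long _Guard = 0;
        _interlockedbittestandset(&_Guard, 0); // observed: lock bts with the address in a register (eax)
    }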

AlexGuteniev commented Apr 24, 2020

Ok, let's go with lock inc, as it is the smallest of the available options that saves registers.

AlexGuteniev commented:
_AddressOfReturnAddress works for getting something that is already on the stack, but as it is the return address, it is not at the very top; it can be at something like [ebp+200], so the chance it is not in L1 cache is higher.

pcordes commented Apr 24, 2020

_AddressOfReturnAddress works for getting something that is already on the stack, but as it is the return address, it is not at the very top; it can be at something like [ebp+200], so the chance it is not in L1 cache is higher.

That's not necessarily true, and certainly not obvious.

If you're using EBP as a traditional frame pointer, the return address is definitely at [ebp+8], right below any stack args and right above the saved-EBP value you stored on function entry.

If you actually meant [esp + 200] for a function that omits a frame pointer and reserves a bunch of space, the cache line containing the return address is just as likely to be hot as any other line in the current stack frame.

Probably more so, because the caller was probably using it, and your return address was stored by the calling function right before entering this function (unless it was reached by a jmp tailcall).

If you're in a loop in a long-running function that doesn't access locals on the stack very frequently, there's just as much chance for [esp] to have been evicted as [esp + 200]. If small scalar locals tend to be above arrays or large structs, closer to the return address, a dummy var is likely in the same line as the return address anyway, and probably keeping each other hot in cache if any are getting accessed in a loop in a big function.

OTOH if MSVC puts arrays at higher addresses inside a stack frame, and scalar locals at the bottom, then yeah maybe a higher chance for the bottom of the stack frame to be hot. I guess also if you were making function calls to child functions, that would be keeping the bottom of this function's stack frame hot. (A ret back into this function would have popped a return address from [esp-4].)

(One reason for compilers to put arrays near the top of a stack frame is so buffer overflows hit a stack canary right away, instead of overwriting locals first. This is good if you're compiling with an option like gcc -fstack-protector-strong to actually check a canary in the stack frame of functions with arrays. Otherwise it's probably bad, making it easier to overwrite the return address and reach a ret without crashing.)

AlexGuteniev commented Apr 25, 2020

I think I observed exactly [ebp+200], but I cannot recreate it, so maybe I noted it mistakenly.
Anyway, traditional stack frames are not seen very often, I suppose: x64 does not seem to use them, and on x86, /O2 implies /Oy.

I meant a case where a long-running function allocates a lot of stack, but yes, I now agree that it is not clear which part of this stack is less likely to be hot.

If saving a stack variable makes sense, then _InterlockedXor(reinterpret_cast<volatile long*>(_AddressOfReturnAddress()), 0);, which usually expands to lock xor dword ptr [rsp], 0 on x64, would probably be an option, as lock add [mem] cannot be generated.

A good part is that _AddressOfReturnAddress does not prevent inlining; instead, it uses the return address of the outer non-inlined function.
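
A sketch of that option (the helper name is made up; XOR with 0 leaves the return address unchanged, so the RMW is a harmless no-op on memory that's already on the stack):

    #include <intrin.h>

    inline void _Seq_cst_fence_via_return_address() noexcept {
        _InterlockedXor(reinterpret_cast<volatile long*>(_AddressOfReturnAddress()), 0);
        // usually compiles to: lock xor dword ptr [rsp], 0 on x64
    }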

AlexGuteniev commented:
Tricky use of _AddressOfReturnAddress might be a problem for security analysis of the control path. Normal runtime stack buffer overrun checks and Control Flow Guard should not be able to notice this, but an external runtime debugging tool might be confused.

AlexGuteniev commented Apr 26, 2020

I've used _InterlockedXor(mem, 0) in #651 to provide a seq_cst fence around a relaxed store using __iso_volatile_store64 (which uses SSE2 or floating-point registers). In this case, it is known that the target operand is hot:

https://github.com/microsoft/STL/pull/694/files#diff-69d94ca27cdd6e1c1420ba4fc9e60b86L650-R648

(Actually, the first memory fence is not needed; I will replace it with a compiler barrier.)
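
The shape of the #651 approach, as a sketch (the helper name is made up; _Compiler_barrier is the internal macro from the code quoted at the top):

    #include <intrin.h>

    inline void _Store64_seq_cst(volatile long long* _Mem, long long _Value) noexcept {
        _Compiler_barrier(); // leading fence: a compiler barrier suffices, per the note above
        __iso_volatile_store64(_Mem, _Value);
        _InterlockedXor(reinterpret_cast<volatile long*>(_Mem), 0); // trailing fence: RMW on the just-written, hot line
    }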

For the general case, probably the best option would be to have an intrinsic for a seq_cst fence. The compiler could implement it with lock add [mem], 0 / lock xor [mem], 0 on anything that was recently modified or will be modified soon and belongs to the current thread; this could be something relative to [ebp] or [esp] (but probably not [fs:]/[gs:]).

StephanTLavavej pushed a commit that referenced this issue Apr 30, 2020
StephanTLavavej added the fixed (Something works now, yay!) label Apr 30, 2020
AlexGuteniev commented May 12, 2020

I found the intrinsic that was supposed to be used here: __faststorefence.
It compiles to lock or dword ptr [rsp], 0 without allocating a stack variable.

Unfortunately, it is available only in x64 mode, not in x86 mode.
If it were available for both x86 and x64, things could have been simplified by avoiding the suppression and explanations. I've created DevCom-1027191, but I don't think it will be prioritized highly.
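
A sketch of how it could be used where available (the helper name is made up; the x86 fallback is the fix from this issue):

    #include <intrin.h>

    inline void _Seq_cst_fence() noexcept {
    #if defined(_M_X64)
        __faststorefence(); // lock or dword ptr [rsp], 0; no stack variable needed
    #else
        volatile long _Guard;
    #pragma warning(suppress : 28113)
        (void) _InterlockedIncrement(&_Guard);
    #endif
    }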
