<atomic>: Optimize atomic_thread_fence #739
Comments
Most other compilers use `mfence`. Or, almost as good: reserve some stack space in the current function, in a cache line that's normally already hot in L1d cache in Modified state, to use as a normal atomic variable. That can be implemented without any new intrinsics; just make this var non-[…]
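For illustration, a minimal sketch of that stack-dummy approach, assuming `_InterlockedOr` as the locked RMW (the function name and the choice of intrinsic are mine, not from the comment):

```cpp
#include <intrin.h>

// Dummy variable in the current stack frame: normally already hot in L1d in
// Modified state, so the locked RMW stays core-local.
inline void _Seq_cst_fence_stack_dummy() noexcept {
    long _Guard = 0;
    (void) _InterlockedOr(&_Guard, 0); // lock or with 0: full barrier, value unchanged
}
```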
So is the main advantage of […]?

I think avoiding a stack variable is impractical for normal Windows user-mode programs (for which this STL is designed), if it is hard to achieve with the current intrinsics. Though I may look into using the `_AddressOfReturnAddress` intrinsic.
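A sketch of what the `_AddressOfReturnAddress` route could look like (my construction, not code from this thread; OR with 0 leaves the return address intact while still being a locked RMW):

```cpp
#include <intrin.h>

inline void _Fence_via_return_address() noexcept {
    // The return-address slot was just written by the caller's call
    // instruction, so it should already be hot in L1d.
    (void) _InterlockedOr(static_cast<volatile long*>(_AddressOfReturnAddress()), 0);
}
```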
Probably, if not going to save a stack variable, […]
After I made a pull request that passed the tests, I have an idea why there may be […]
That, and not needing any free registers, just a 1-byte immediate. Especially likely to be a problem in 32-bit code where we have fewer registers. Note that a spill/reload around a memory barrier can't have its latency hidden as effectively by OoO exec.
Yeah, that's expected on Skylake-derived CPUs with up-to-date microcode. You'd only find […]. IDK if some other microarchitectures (without such a brutally slow mfence) could maybe have better throughput for mfence than the potential latency bottleneck of repeated atomic RMWs on the same location. It has to drain the store buffer anyway, so it's not like store-forwarding latency is part of it. But since Skylake-derived uarches are very widespread (especially in servers) these days, it's probably best to keep using an atomic RMW always. I don't think any downside of an RMW to a local on the stack on another uarch is going to be significant at all, and certainly not compared to the amount of performance we'd lose on Skylake-based CPUs.
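To make the throughput comparison concrete, a rough harness along these lines (my own sketch, not from the thread; absolute numbers vary a lot by uarch) pits back-to-back `mfence` against repeated locked RMWs on one local:

```cpp
#include <intrin.h>
#include <chrono>
#include <cstdio>

int main() {
    constexpr int reps = 100'000'000;
    volatile long guard = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        _mm_mfence(); // the fence instruction
    }
    const auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        // Locked RMW on the same hot local: latency-bound chain, as noted above.
        (void) _InterlockedIncrement(&guard);
    }
    const auto t2 = std::chrono::steady_clock::now();
    using ms = std::chrono::milliseconds;
    std::printf("mfence: %lld ms, lock inc: %lld ms\n",
        static_cast<long long>(std::chrono::duration_cast<ms>(t1 - t0).count()),
        static_cast<long long>(std::chrono::duration_cast<ms>(t2 - t1).count()));
}
```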
That's reasonable, although like I said, […]
Yeah, plausible. It's easy to lose sight of the big picture sometimes and make performance mistakes.
Note that […]
Not sure what you're trying to say. […]
The NOT instruction does not update flags, so you can issue `lock not` without having to preserve them.
Oh, good idea, that would work. According to https://www.uops.info/table.html, […]
The compiler does not have an intrinsic for that, though. Apparently saving the destination register can be done with […]: https://godbolt.org/z/Vbgt_A - I don't see the compiler saving a register with […]
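Since there is no interlocked-NOT intrinsic, the only way I can see to spell it is inline asm (a sketch; MSVC only supports `__asm` blocks in x86 mode):

```cpp
#if defined(_M_IX86) // MSVC inline asm is 32-bit only
inline void _Fence_lock_not() noexcept {
    long _Guard = 0;
    // NOT writes no flags and needs neither a free register nor an immediate.
    __asm { lock not _Guard }
}
#endif
```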
Ok, let's go with […]
That's not necessarily true, and certainly not obvious. If you're using EBP as a traditional frame pointer, the return address is definitely at `[ebp+4]`. If you actually meant […], probably more so because the caller was probably using it, and your return address was stored by the calling function right before entering this function (unless it was reached by a `jmp` tailcall). If you're in a loop in a long-running function that doesn't access locals on the stack very frequently, there's just as much chance for […].

OTOH, if MSVC puts arrays at higher addresses inside a stack frame and scalar locals at the bottom, then yeah, maybe a higher chance for the bottom of the stack frame to be hot. I guess also, if you were making function calls to child functions, that would be keeping the bottom of this function's stack frame hot.

(One reason for compilers to put arrays near the top of a stack frame is so buffer overflows hit a stack canary right away, instead of overwriting locals first. This is good if you're compiling with an option like […].)
I think I observed exactly […]. I meant a case where a long-running function allocates a lot of stack, but yes, I now agree that it is not clear which part of this stack is less likely to be hot.

If saving a stack variable makes sense, then […].

A good part is that […]
Tricky use of […]
I've used […].

(Actually, the first memory fence is not needed; I will replace it with a compiler barrier.)

For the general case, probably the best would be to have an intrinsic for a seq_cst fence. The compiler could implement it by […]
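Concretely, a sketch of that compiler-barrier-plus-locked-RMW shape (my naming; `_ReadWriteBarrier` is MSVC's deprecated-but-available compiler-only barrier):

```cpp
#include <intrin.h>

inline void _Seq_cst_fence_sketch() noexcept {
    _ReadWriteBarrier(); // compiler barrier replaces the first memory fence
    volatile long _Guard = 0;
    (void) _InterlockedIncrement(&_Guard); // any locked RMW is a full barrier on x86/x64
    _ReadWriteBarrier(); // keep the compiler from reordering across the fence
}
```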
I found the intrinsic that was supposed to be used here. Unfortunately, it is available only in x64 mode, not in x86 mode.
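If the intrinsic in question is `__faststorefence` — my guess, since that is the MSVC fence intrinsic documented as x64-only — a combined sketch could look like:

```cpp
#include <intrin.h>

inline void _Seq_cst_fence_portable_sketch() noexcept {
#if defined(_M_X64)
    __faststorefence(); // x64-only intrinsic; assumption: the one meant above
#else
    volatile long _Guard = 0;
    (void) _InterlockedIncrement(&_Guard); // x86 fallback: plain locked RMW
#endif
}
```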
I'm referring to this part again:
STL/stl/inc/atomic, lines 1894 to 1909 at commit 3141965:
The Stack Overflow answer mentions that using one variable for all threads is sub-optimal, as it creates unnecessary contention.
Also, there's no real point in having `lock cmpxchg`; the same effect can be achieved by a simpler operation. My thinking is it should be implemented as: […] This compiles to: […]
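The snippet and its disassembly aren't shown above; as a hypothetical reconstruction consistent with the surrounding points (the names and the particular locked RMW are my choice), it could be something like:

```cpp
#include <atomic>
#include <intrin.h>

inline void _Atomic_thread_fence_sketch(const std::memory_order _Order) noexcept {
    if (_Order == std::memory_order_relaxed) {
        return; // a relaxed fence is a no-op
    }
    // Per-call local: different threads never contend on the same cache line.
    volatile long _Guard = 0;
    // Any locked RMW is a full barrier on x86/x64; lock cmpxchg buys nothing extra.
    (void) _InterlockedExchange(&_Guard, 0); // compiles to an implicitly locked xchg
}
```

`xchg` with a memory operand is implicitly locked, so this gets the full-barrier effect without the compare semantics or the shared variable.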