[libc++] Incorrect memory order in atomic wait #109290
Comments
Fantastic analysis! I've only skimmed over it, but I believe you are correct. Here are some additional thoughts: I don't think that modelling
Together with the following quote:
...I think it is most appropriate to model

That being said, Hyrum's Law is still in effect. Internally, one of the first things

The proposed fix (upgrading

As noted by cbloom in this blog post, what we actually need here is not

So all the

If we are going with the
Thank you for the thoughtful response. I really appreciate the insights.
Agreed. That was my concern with the StackOverflow discussion. The description in the futex man pages looks very similar to the C++ requirement that all threads agree on the modification order of a single atomic variable, which implies that
It's interesting to see that the futex design docs make a big deal out of using that
This seems like sensible reasoning to me, though the idea of using a full memory barrier when it isn't strictly necessary makes me a bit nervous. Not as nervous as the inverse problem, mind you! Given what you've suggested, is the following an acceptable way to model the futex wait in Herd?

P1 (atomic_int* plat, atomic_int* waiters, int* expected) {
    ...
    // Platform wait
    atomic_thread_fence(memory_order_seq_cst);
    int do_wait = atomic_compare_exchange_strong_explicit(plat,
        expected, 0, memory_order_relaxed, memory_order_relaxed);
    ...
}

where

I think it's worth looking at the pros and cons of the proposed solutions. You can see the code generated for each of them in Compiler Explorer. They all pass cleanly in Herd using the

Option 1: Sequential consistency

This option upgrades (1) to mo_seq_cst:

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_seq_cst); // (1) <---- UPDATED
if (0 != contention_state.load(mo_seq_cst)) { // (2)
platform_notify_all(platform_state);
}
-------------------------------
// Waiting thread ("slow" path)
contention_state.fetch_add(1, mo_seq_cst); // (3)
platform_wait(platform_state, old_value); // (4)
contention_state.fetch_sub(1, mo_release); // (5)

Pros
Cons
Option 1a: Sequential consistency +
This sounds reasonable indeed. Internally, the futex machinery has to do some kind of synchronization for sure. But as the user of that machinery one also has to take into account that there may be spurious wakeups after a The way
This sounds good! Personally I would prefer Option 1a, because (in my intuitive understanding at least) the

Keeping (3)

FWIW, I could verify all proposed solutions in my toy implementations of futex and eventcount with GenMC, a tool which I can wholly recommend!
Thanks for the detailed analysis. I think all of this makes sense. Please raise a PR if that is convenient for you. Regarding performance, we do have some basic benchmarks w.r.t. atomic wait, and it would be interesting to see whether the change affects the results on those platforms. I wonder if you have also looked at the macOS ulock implementation on the ARM architecture. If I understand correctly, it is potentially broken too.
I think I've found an issue in __libcpp_atomic_wait, where an atomic operation with an overly relaxed memory order argument could theoretically result in a lost wakeup. For what it's worth, the equivalent part of libstdc++'s atomic wait implementation uses stronger memory ordering, which I believe to be correct.

To be clear, I don't think this is a problem in practice on x86 or ARM, but according to the C++ standard it looks like a bug.
Background
The atomic wait implementation in libc++ uses platform wait primitives such as futexes (Linux) and __ulock_wait/__ulock_wake (macOS) to perform wait and notify operations. Because the platform wait primitives usually only work for values of a particular width (32-bit on Linux), you can't always wait on the user's value directly. The library instead does a laundering trick where internally it waits on a proxy value of the correct size. To do this, it uses the address of the user's atomic value to look up an entry in a 'contention table'. Each entry in the table contains two atomic values of an appropriate size for the platform: one is a monotonically increasing counter which is used as the proxy for the user's value, and the other holds a count of the number of threads currently waiting on the variable. The latter is used in a nifty optimisation which allows the platform notify call to be elided when there are no threads waiting on the value.
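For orientation, here is a minimal sketch of what such a table entry conceptually looks like (the field names approximate the ones in libc++ but are illustrative rather than authoritative):

struct __contention_table_entry {
    // Number of threads currently waiting on this entry; lets the notifying
    // side skip the platform notify call when nobody is waiting.
    __cxx_atomic_contention_t __contention_state;
    // Monotonically increasing counter, used as the proxy value that the
    // platform wait primitive actually waits on.
    __cxx_atomic_contention_t __platform_state;
};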
The key parts of the algorithm

The algorithm that does the contention tracking is split across a few functions: __libcpp_contention_wait, __libcpp_atomic_notify and __libcpp_contention_notify. But after inlining and removal of non-relevant parts, it essentially boils down to this pseudocode:
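(The original listing is not reproduced here; the following reconstruction is pieced together from the memory orders discussed in this report, with the "Here is the problem" marker on operation (1).)

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_release);        // (1) <---- Here is the problem
if (0 != contention_state.load(mo_seq_cst)) {   // (2)
    platform_notify_all(platform_state);
}
-------------------------------
// Waiting thread ("slow" path)
contention_state.fetch_add(1, mo_seq_cst);      // (3)
platform_wait(platform_state, old_value);       // (4)
contention_state.fetch_sub(1, mo_release);      // (5)
-------------------------------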
The line marked "Here is the problem" corresponds to this line in atomic.cpp.

The problem
According to the C++ memory model, we should reason about code involving atomics in terms of "synchronizes-with" and "happens-before" relations, rather than potential reorderings. The contention tracking algorithm cannot be reasoned about solely in terms of synchronization between release-acquire pairs. It also requires the waiting thread and the notifying thread to agree on a single total order for operations (1), (2), (3) and (4). The way to achieve that (according to the standard) is to make all four operations seq-cst.

Based on that, the mo_release on (1) looks insufficient. Informally, under the C++ memory model, I do not believe there is anything stopping (1) from 'reordering' with (2). We need some sort of StoreLoad barrier to get the single total order property.

A Herd7 simulation shows that a lost wakeup could occur under the C++ memory model
Inspired by @jiixyj's Herd7 simulation of a similar contention tracking mechanism in the semaphore implementation, I thought I'd have a go here too. I'm certainly no expert with Herd, so treat this with a pinch of salt.
I've modelled the existing algorithm as prog.litmus (a sketch of the model is given below).
It's not possible to model read-compare-wait operations in Herd. Instead I've modelled the platform wait as a sequentially consistent load; I'm genuinely unsure if that's justified. I've found some discussion on stack overflow but I don't find it fully convincing. In D114119 it's modelled as a relaxed compare-exchange-strong. That looks like it might be overkill for this situation, but I can't say for sure. Guidance on this would be welcome.
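Since prog.litmus itself is not reproduced here, the following is a rough sketch of what such a litmus test could look like under the modelling choices above (notify-side increment weaker than seq_cst, platform wait approximated by a seq_cst load); treat it as illustrative rather than the exact file:

C lost_wakeup

{}

P0 (atomic_int* plat, atomic_int* waiters) {
    // Notifying thread: bump the proxy value, then check for waiters.
    atomic_fetch_add_explicit(plat, 1, memory_order_relaxed);      // (1)
    int r0 = atomic_load_explicit(waiters, memory_order_seq_cst);  // (2)
}

P1 (atomic_int* plat, atomic_int* waiters) {
    // Waiting thread: register as a waiter, re-check the proxy value
    // (stand-in for the platform wait), then deregister.
    atomic_fetch_add_explicit(waiters, 1, memory_order_seq_cst);   // (3)
    int r1 = atomic_load_explicit(plat, memory_order_seq_cst);     // (4)
    atomic_fetch_sub_explicit(waiters, 1, memory_order_release);   // (5)
}

exists (0:r0=0 /\ 1:r1=0)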
The proposition at the bottom looks for the case where thread P1 enters a waiting state, but thread P0 decides not to call notify, which could result in deadlock. The /\ token represents logical conjunction.

Executing this in Herd shows that there is one possible execution exhibiting the lost wake-up behaviour. It corresponds to thread P0 running to the end without calling notify, while P1 decides that a 'wait' is necessary. Crucially, even though P0 has incremented plat 'before' checking waiters, P1 observes an older value in the modification order of plat and therefore decides to enter a waiting state.

In the Herd simulation, the issue can be solved by enforcing a total order on the operations involving plat. Upgrading the relaxed increment of plat in thread P0 to seq_cst memory order results in Herd reporting no remaining executions satisfying the proposition.

Would a compiler ever make such a reordering?
I've not found good sources on this. My suspicion is that it's unlikely to happen in practice. But I believe the standard allows it.
Could it happen on x86?
On x86, even a fully relaxed RMW operation has sequentially consistent behaviour: it compiles to a lock-prefixed instruction, which acts as a full barrier. So no, it can't happen on x86.
Could it happen on ARM architectures?
On AArch64, the 'notify' side of the algorithm compiles down to a load-linked/store-conditional (LL/SC) loop for incrementing the platform counter, followed by a load-acquire (LDAR) for reading the number of waiters. The STLXR in the LL/SC loop and the following LDAR instruction both have acquire-release semantics, so they will not reorder.

For what it's worth, if the LDAR is relaxed to a plain LDR (i.e. std::memory_order_relaxed), then Herd shows that it can reorder with the STLXR and result in a lost wakeup. The relevant parts of the algorithm are modelled here:

You can try this in the online Herd sim. The proposition at the bottom corresponds to the notify side seeing no waiters (and therefore not calling the platform notify operation), and the waiter side not seeing the increment to the platform variable.
Could it happen on more relaxed architectures?
It looks like it is possible on POWER, but I haven't done any further investigation.
Potential solutions
I think the best solution is probably to upgrade (1) to mo_seq_cst. This is what libstdc++ does. On x86, even a fully relaxed RMW operation has sequentially consistent behaviour, so the change will only affect potential compiler reorderings. On ARM, upgrading an RMW op from release to seq_cst turns an ldxr into an ldaxr in the LL/SC loop. This ought to be far cheaper than any sort of fence.

Other options would be to turn (2) into a read-don't-modify-write operation (fetch_add(0, mo_acq_rel)), or to insert seq_cst fences between (1) & (2) and between (3) & (4) (along with fully relaxing the other operations), as sketched below. Both of these look more expensive to me.
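For illustration, a rough sketch of the fence-based alternative (not proposed code, just the shape of the idea; whether (5) could also be relaxed is left aside here):

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_relaxed);         // (1)
atomic_thread_fence(mo_seq_cst);                 // StoreLoad barrier between (1) and (2)
if (0 != contention_state.load(mo_relaxed)) {    // (2)
    platform_notify_all(platform_state);
}
-------------------------------
// Waiting thread ("slow" path)
contention_state.fetch_add(1, mo_relaxed);       // (3)
atomic_thread_fence(mo_seq_cst);                 // StoreLoad barrier between (3) and (4)
platform_wait(platform_state, old_value);        // (4)
contention_state.fetch_sub(1, mo_release);       // (5)
-------------------------------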
Concrete proposal

Update this line to be memory_order_seq_cst:

llvm-project/libcxx/src/atomic.cpp, line 166 at 7e56a09