[libc++] Incorrect memory order in atomic wait #109290

Open · aharrison24 opened this issue Sep 19, 2024 · 4 comments
Labels: libc++ (C++ Standard Library. Not GNU libstdc++. Not libc++abi.)

aharrison24 commented Sep 19, 2024

I think I've found an issue in __libcpp_atomic_wait, where an atomic operation with an overly-relaxed memory order argument could theoretically result in a lost wakeup. For what it's worth, the equivalent part of libstdc++'s atomic wait implementation uses stronger memory ordering, which I believe to be correct.

To be clear, I don't think this is a problem in practice on x86 or ARM, but according to the C++ standard it looks like a bug.

Background

The atomic wait implementation in libc++ uses platform wait primitives such as futexes (Linux) and __ulock_wait/__ulock_wake (macOS) to perform wait and notify operations. Because the platform wait primitives usually only work for values of a particular width (32-bit on Linux), you can't always wait on the user's value directly. The library instead does a laundering trick where internally it waits on a proxy value of the correct size. To do this, it uses the address of the user's atomic value to look up an entry in a 'contention table'. Each entry in the table contains two atomic values of an appropriate size for the platform: one holding a monotonically increasing counter which is used as the proxy for the user's value, and one holding a count of the number of threads currently waiting on the variable. The latter is used in a nifty optimisation which allows the platform notify call to be elided when there are no threads waiting on the value.
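To make that concrete, here is a minimal sketch of the shape of such a table. All names, sizes and layout details below are illustrative guesses, not the exact ones used in libc++'s atomic.cpp:

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>

// Illustrative only; the real entry types and table size are platform-dependent.
struct alignas(64) ContentionTableEntry {
  std::atomic<std::uint32_t> platform_state{0};   // monotonic counter; proxy for the user's value
  std::atomic<std::uint32_t> contention_state{0}; // number of threads currently waiting
};

constexpr std::size_t kTableSize = 256; // power of two, so the hash can be masked

inline ContentionTableEntry& entry_for(const volatile void* user_address) {
  static ContentionTableEntry table[kTableSize];
  // Hash the *address* of the user's atomic to pick a slot. Unrelated atomics
  // may collide on a slot; that can cause spurious wakeups, but never lost ones.
  const std::size_t h = std::hash<const volatile void*>{}(user_address);
  return table[h & (kTableSize - 1)];
}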

The key parts of the algorithm

The algorithm that does the contention tracking is split across a few functions:
__libcpp_contention_wait, __libcpp_atomic_notify, __libcpp_contention_notify

But after inlining and removal of non-relevant parts, it essentially boils down to this pseudocode:

std::atomic<uint32_t> platform_state;   // Monotonically increasing count
std::atomic<uint32_t> contention_state; // Waiter count

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_release);      // (1) <---- Here is the problem
if (0 != contention_state.load(mo_seq_cst)) { // (2) 
  platform_notify_all(platform_state);
}

-------------------------------
// Waiting thread
contention_state.fetch_add(1, mo_seq_cst);    // (3)
platform_wait(platform_state, old_value);     // (4)
contention_state.fetch_sub(1, mo_release);    // (5)

The line marked "Here is the problem" corresponds to this line in atomic.cpp.

The problem

According to the C++ memory model, we should reason about code involving atomics in terms of "synchronizes-with" and "happens-before" relations, rather than potential reorderings. The contention tracking algorithm cannot be reasoned about solely in terms of synchronization between release-acquire pairs. It also requires the waiting thread and the notifying thread to agree on a single total order for operations (1), (2), (3) and (4). The way to achieve that (according to the standard) is to make all four operations seq-cst.

Based on that, the mo_release on (1) looks insufficient. Informally, under the C++ memory model, I do not believe there is anything stopping (1) from 'reordering' with (2). We need some sort of StoreLoad barrier to get the single total order property.

A Herd7 simulation shows that a lost wakeup could occur under the C++ memory model

Inspired by @jiixyj's Herd7 simulation of a similar contention tracking mechanism in the semaphore implementation, I thought I'd have a go here too. I'm certainly no expert with Herd, so treat this with a pinch of salt.

I've modelled the existing algorithm as:

prog.litmus

C prog

{
}

P0 (atomic_int* plat, atomic_int* waiters) {
  atomic_fetch_add_explicit(plat, 1, memory_order_release);
  int num_waiters = atomic_load_explicit(waiters, memory_order_seq_cst);
  int do_notify = (num_waiters != 0);
}

P1 (atomic_int* plat, atomic_int* waiters) {
  atomic_fetch_add_explicit(waiters, 1, memory_order_seq_cst);
  int do_wait = (atomic_load_explicit(plat, memory_order_seq_cst) == 0);
  
  // Deliberately left out - it's not relevant to the analysis
  //atomic_fetch_sub_explicit(waiters, 1, memory_order_release);
}

exists
(0:do_notify=0 /\ 1:do_wait=1)

It's not possible to model read-compare-wait operations in Herd. Instead I've modelled the platform wait as a sequentially consistent load; I'm genuinely unsure if that's justified. I've found some discussion on Stack Overflow, but I don't find it fully convincing. In D114119 it's modelled as a relaxed compare-exchange-strong. That looks like it might be overkill for this situation, but I can't say for sure. Guidance on this would be welcome.

The proposition at the bottom looks for the case where thread P1 enters a waiting state, but thread P0 decides not to call notify, which could result in deadlock. The /\ token represents logical conjunction.

Executing this in Herd with:

mkdir -p output
rm -f output/prog.dot
herd7 -c11 -show prop -graph columns -o output -showinitwrites false -oneinit false prog.litmus
dot -Tpdf output/prog.dot > prog.pdf

shows that there is one possible execution exhibiting the lost wake-up behaviour. It corresponds to thread P0 running to the end without calling notify, while P1 decides that a 'wait' is necessary. Crucially, even though P0 has incremented plat 'before' checking waiters, P1 observes an older value in the modification order of plat and therefore decides to enter a waiting state.

In the Herd simulation, the issue can be solved by enforcing a total order on the operations involving plat. Upgrading the increment of plat in thread P0 from release to seq_cst memory order results in Herd reporting no remaining executions satisfying the proposition.
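In concrete terms, that means changing the first line of P0 in the litmus test above to:

atomic_fetch_add_explicit(plat, 1, memory_order_seq_cst);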

Would a compiler ever make such reordering?

I've not found good sources on this. My suspicion is that it's unlikely to happen in practice. But I believe the standard allows it.

Could it happen on x86?

On x86, even a fully relaxed RMW operation has sequentially consistent behaviour. So no, it can't happen on x86.

Could it happen on ARM architectures?

On AArch64, the 'notify' side of the algorithm compiles down to a Load-linked/Store-conditional (LL/SC) loop for incrementing the platform counter, followed by a load-acquire (LDAR) for reading the number of waiters. The STLXR in the LL/SC loop has release semantics and the following LDAR has acquire semantics; the architecture forbids reordering a load-acquire before an earlier store-release, so they will not reorder.

For what it's worth, if the LDAR is relaxed to a plain LDR (i.e. std::memory_order_relaxed) then Herd shows that it can reorder with the STLXR and result in a lost wakeup. The relevant parts of the algorithm are modelled here:

AArch64 Contention_Tracking
{
0:X0=plat; 0:X1=count;
1:X0=plat; 1:X1=count;
}
 P0                 | P1;
 label1:            | label2:;
 LDXR W8, [X0]      | LDAXR W8, [X1];
 ADD W8, W8, #1     | ADD W8, W8, #1;
 STLXR W9, W8, [X0] | STLXR W9, W8, [X1];
 CBNZ W9, label1    | CBNZ W9, label2;
 LDR W2, [X1]       | LDAR W3, [X0];
exists
(0:X2=0 /\ 1:X3=0)

You can try this in the online Herd sim. The proposition at the bottom corresponds to the notify side seeing no waiters (and therefore not calling the platform notify operation), and the waiter side not seeing the increment to the platform variable.

Could it happen on more relaxed architectures?

It looks like it is possible on POWER, but I haven't done any further investigation.

Potential solutions

I think the best solution is probably to upgrade (1) to mo_seq_cst. This is what libstdc++ does. On x86, even a fully relaxed RMW operation has sequentially consistent behaviour, so the change would only affect potential compiler reorderings. On ARM, upgrading an RMW op from release to seq_cst turns an LDXR into an LDAXR in the LL/SC loop. This ought to be far cheaper than any sort of fence.

Other options would be to turn (2) into a read-don't-modify-write operation (fetch_add(0, mo_acq_rel)), or to insert seq_cst fences between (1) and (2) and between (3) and (4) (along with fully relaxing the other operations), as sketched below. Both look more expensive to me.
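For concreteness, the fence-based variant would look something like this (same pseudocode style as above; this is my sketch of the option, not tested code):

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_relaxed);      // (1)
std::atomic_thread_fence(mo_seq_cst);         // StoreLoad barrier
if (0 != contention_state.load(mo_relaxed)) { // (2)
  platform_notify_all(platform_state);
}

-------------------------------
// Waiting thread
contention_state.fetch_add(1, mo_relaxed);    // (3)
std::atomic_thread_fence(mo_seq_cst);         // StoreLoad barrier
platform_wait(platform_state, old_value);     // (4)
contention_state.fetch_sub(1, mo_release);    // (5)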

Concrete proposal

Update this line to be memory_order_seq_cst:

__cxx_atomic_fetch_add(&__entry->__platform_state, __cxx_contention_t(1), memory_order_release);
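That is, the fixed line would read (only the memory order argument changes):

__cxx_atomic_fetch_add(&__entry->__platform_state, __cxx_contention_t(1), memory_order_seq_cst);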

jiixyj (Contributor) commented Sep 19, 2024

Fantastic analysis! I've only skimmed over it, but I believe you are correct.

Here are some additional thoughts:

I don't think that modelling futex_wait with anything involving seq_cst is 100% correct. The linked StackOverflow post models it with a.compare_exchange_strong(..., std::memory_order_seq_cst, std::memory_order_seq_cst);. But I believe this is a wrong reading of the man page. Involving seq_cst would mean that futex_wait would join the single total modification order of all atomic operations that are tagged with seq_cst (including seq_cst operations on other atomic variables!). However, the futex man page states (emphasis mine):

The loading of the futex word's value, the comparison of that value
with the expected value, and the actual blocking will happen
atomically and will be totally ordered with respect to concurrent
operations performed by other threads on the same futex word.

Together with the following quote:

Analogously to an atomic compare-and-exchange operation that
potentially changes shared memory, blocking via a futex is an
atomic compare-and-block operation.

...I think it is most appropriate to model futex_wait as a.compare_exchange_strong(..., std::memory_order_relaxed, std::memory_order_relaxed);. My mental model of this is that it's analogous to condition variables -- cv_wait/cv_notify don't provide any memory ordering on their own, but their associated mutex does! Similarly, futex_wait/futex_notify don't provide any memory ordering on their own -- it has to be ensured by the user in some other way.

That being said, Hyrum's Law is still in effect. Internally, one of the first things futex_wait does is call smp_mb(), which I believe is a full memory barrier and more or less equivalent to a std::atomic_thread_fence(std::memory_order_seq_cst) fence. So a more "real life" model of futex_wait that depends on implementation details of the Linux kernel might be std::atomic_thread_fence(std::memory_order_seq_cst); a.compare_exchange_strong(..., std::memory_order_relaxed, std::memory_order_relaxed);.

The proposed fix (upgrading (1) to mo_seq_cst) relies on futex_wait providing a std::atomic_thread_fence(std::memory_order_seq_cst). This is OK if we assume that futex_wait provides this. If we don't, then we have to write that fence ourselves before calling into the platform wait function. I think that (3) being seq_cst is not enough. The seq_cst fence has to be there, after (3).

As noted by cbloom in this blog post, what we actually need here is not seq_cst, but a store/load barrier between (1) and (2). One indication that seq_cst is not needed is that there are only two threads in our example; seq_cst should only truly be needed when three or more threads are involved.

So all the seq_cst operations are only there to work around C++'s missing store/load barrier. Because of this, I prefer the other solution of turning (2) into fetch_add(0, mo_rel) (I believe rel is enough here), and weakening (3) to acq. With this, we work around C++'s lack of a store/load barrier by turning the load into a store. On x86_64 this should optimize nicely with Clang (it will emit a memory fence followed by a load instead of an RMW operation, IIRC). On other platforms/compilers it might not be so well optimized, though, which might make that solution a non-starter in practice... One additional upside, though, is that we can drop the assumption that futex_wait provides a std::atomic_thread_fence(std::memory_order_seq_cst) fence. It all works out, even if we assume futex_wait is purely a relaxed compare-and-block operation.

If we are going with the seq_cst solution, I think it would be a good idea to put a std::atomic_thread_fence(std::memory_order_seq_cst) fence before the platform wait function to avoid depending on implementation details of futex_wait. We are entering the kernel anyway, and this is on the "slow" path, so the overhead of this in practice should be minimal.

aharrison24 (Author) commented:
Thank you for the thoughtful response. I really appreciate the insights.

I don't think that modelling futex_wait with anything involving seq_cst is 100% correct. The linked StackOverflow post models it with a.compare_exchange_strong(..., std::memory_order_seq_cst, std::memory_order_seq_cst);. But I believe this is a wrong reading of the man page. Involving seq_cst would mean that futex_wait would join the single total modification order of all atomic operations that are tagged with seq_cst (including seq_cst operations on other atomic variables!).

Agreed. That was my concern with the Stack Overflow discussion. The description in the futex man pages looks very similar to the C++ requirement that all threads agree on the modification order of a single atomic variable, which implies that seq_cst is not needed. But I was also not convinced that it was right to model futex_wait as a fully relaxed operation. My reasoning was that an x.store(1, mo_release) in one thread can surely 'synchronise' with a futex_wait that happens later in another thread, resulting in something that looks like a happens-before relationship. So I thought it must have at least load-acquire semantics. I struggle to justify that further, though.

Internally, one of the first things futex_wait does is call smp_mb()...

It's interesting to see that the futex design docs make a big deal out of using that smp_mb() to ensure correctness. Ironically, it's in a contention tracking mechanism very similar to the one we're talking about. It also looks like they got it wrong initially due to the lack of a #StoreLoad barrier on some paths!

If we are going with the seq_cst solution, I think it would be a good idea to put a std::atomic_thread_fence(std::memory_order_seq_cst) fence before the platform wait function to avoid depending on implementation details of futex_wait. We are entering the kernel anyway, and this is on the "slow" path, so the overhead of this in practice should be minimal.

This seems like sensible reasoning to me, though the idea of using a full memory barrier when it isn't strictly necessary makes me a bit nervous. Not as nervous as the inverse problem, mind you!

Given what you've suggested, is the following an acceptable way to model the futex wait in Herd?

P1 (atomic_int* plat, atomic_int* waiters, int* expected) {
  ...

  // Platform wait
  atomic_thread_fence(memory_order_seq_cst);
  int do_wait = atomic_compare_exchange_strong_explicit(plat,
        expected, 0, memory_order_relaxed, memory_order_relaxed);
  
  ...
}

where *expected is initialised to zero elsewhere. The seq_cst fence is there on the basis that either the platform wait would have a full memory barrier in practice, or we'd have to provide one ourselves.

I think it's worth looking at the pros and cons of the proposed solutions. You can see the code generated for each of them in Compiler Explorer. They all pass cleanly in Herd using the platform_wait model proposed above.

Option 1: Sequential consistency

This option upgrades (1) to be sequentially consistent, making the algorithm equivalent to what libstdc++ currently does.

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_seq_cst);      // (1) <---- UPDATED
if (0 != contention_state.load(mo_seq_cst)) { // (2) 
  platform_notify_all(platform_state);
}

-------------------------------
// Waiting thread ("slow" path)
contention_state.fetch_add(1, mo_seq_cst);    // (3)            
platform_wait(platform_state, old_value);     // (4)
contention_state.fetch_sub(1, mo_release);    // (5)

Pros

  • No change to assembly on x86.
  • Minimal change to assembly on AArch64 (LDXR is upgraded to LDAXR).
  • Easy to reason about, because everything is seq_cst.

Cons

  • We don't really need full sequential consistency (with its global modification order) for this algorithm. All we wanted was a #StoreLoad barrier, but that's not directly expressible in the C++ memory model.
  • If for some reason the platform wait does not behave as if it contains a full memory barrier then this solution is not correct according to the C++ memory model.

Option 1a: Sequential consistency + seq_cst fence before platform wait

If we insert a prophylactic seq_cst fence between (3) and (4) (in case the platform wait doesn't have one) then we can relax (3).

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_seq_cst);      // (1)  <--- UPDATED
if (0 != contention_state.load(mo_seq_cst)) { // (2) 
  platform_notify_all(platform_state);
}

-------------------------------
// Waiting thread ("slow" path)
contention_state.fetch_add(1, mo_relaxed);    // (3)  <--- UPDATED
std::atomic_thread_fence(mo_seq_cst);         // (3a) <--- ADDED      
platform_wait(platform_state, old_value);     // (4)
contention_state.fetch_sub(1, mo_release);    // (5)

In practice, we might wish to make (3) mo_release to prevent it from reordering with preceding instructions, which would widen the window during which contention_state is incremented. Keeping that window small reduces the likelihood that another thread will decide to call platform_notify_all.

Pros

  • Same as for Option 1 above.
  • Its correctness is robust to future changes in the way the platform wait is implemented. Hyrum would be happy.
  • Since the extra fence (3a) is on the slow path anyway, the cost is negligible compared to the cost of the system call to platform_wait.

Cons

  • Same as for Option 1 above.
  • It's a lot harder to reason about if we make (3) weaker than seq_cst. It would be wise to add some explanatory comments!

Option 2: Read-don't-modify-write (a.k.a. idempotent RMW)

This option removes all seq_cst operations, and uses an RMW operation to get the ordering we need.

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_release);       // (1)
if (0 !=
  contention_state.fetch_add(0, mo_release)) { // (2) <--- UPDATED
  platform_notify_all(platform_state);
}

-------------------------------
// Waiting thread ("slow" path)
contention_state.fetch_add(1, mo_acquire);     // (3) <--- UPDATED
platform_wait(platform_state, old_value);      // (4)
contention_state.fetch_sub(1, mo_release);     // (5)

As before, we might want to upgrade (3) to mo_acq_rel, to prevent it reordering with an arbitrary number of previous operations.

On x86, the fetch_add(0, ...) will compile to either a lock xadd (gcc) or mfence;mov (clang). The former has the disadvantage of possibly introducing contention on a shared variable (thus extra cache coherence traffic). The latter would put a full memory barrier on the hot path.

There seem to be varying opinions about how expensive an mfence is in practice, though the consensus seems to have shifted in the last few years. Received wisdom from John McCalpin and Agner Fog in 2015 was that it was fairly cheap. Clang introduced the optimisation from fetch_add(0, ...) to mfence;mov in 2014, specifically to address the cache-line bouncing problem in sequence lock implementations.

There's been a change of heart on mfence for use in seq_cst atomic thread fences in particular. This might have come about as a result of some extensive benchmarking on various Intel chips in 2014 by Aleksey Shipilëv. GCC 11 switched to a lock-prefixed instruction (2020) for seq_cst atomic thread fences. The Boost.Atomic library cites that as its reason for using a custom seq_cst atomic thread fence when compiling on Clang. In 2023, the author of libfork benchmarked Clang vs GCC and claimed that the mfence version of the seq_cst fence was 5x slower for their workload (a claim later revised to 2x!). That led to them writing a custom seq_cst atomic thread fence too. There's currently an open PR on LLVM to switch to using a lock-prefixed instruction for seq_cst thread fences. The key thing about the newer techniques is that they perform the lock-prefixed instruction on a dummy local variable, so that they do not introduce extra contention on shared variables. The MSVC STL does the same thing.
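For illustration, the dummy-variable technique amounts to something like the following sketch (a hypothetical helper, not code taken from any of the libraries mentioned):

#include <atomic>

// On x86-64, a seq_cst exchange typically compiles to an (implicitly locked)
// xchg instruction. Performing it on a stack-local atomic gives a full
// StoreLoad barrier without contending on any shared cache line.
inline void seq_cst_fence_via_dummy_rmw() {
  std::atomic<int> dummy{0};
  dummy.exchange(1, std::memory_order_seq_cst); // value discarded; only the fence effect matters
}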

There appears to have been some work in 2019 to stop Clang from using mfence for idempotent RMW operations, but I don't think it ever landed.

Pros

  • We no longer need full seq_cst to express the ordering that we want.
  • The waiter thread now looks much more like a typical critical section, since it is book-ended by acquire and release operations.

Cons

  • Adds potentially expensive extra instructions into the hot path. This may be unpalatable given that the contention tracking algorithm is explicitly trying to optimize the 'notify' side.
  • May increase contention (therefore cache-line bouncing) on the contention_state variable.

Summary

I think my money's still on Option 1 or Option 1a. Impact on the hot path is minimal, and even though they use 'unnecessary' seq_cst ordering, it looks like we can't actually do anything more efficient in practice.

jiixyj (Contributor) commented Sep 28, 2024

But I was also not convinced that it was right to model futex_wait as a fully relaxed operation. My reasoning was that a x.store(1, mo_release) in one thread can surely 'synchronise' with a futex_wait that happens later in another thread, resulting in something that looks like a happens-before relationship. So I thought it must have at least load-acquire semantics. I struggle to justify that further, though.

This sounds reasonable indeed. Internally, the futex machinery has to do some kind of synchronization for sure. But as the user of that machinery one also has to take into account that there may be spurious wakeups after a futex_wait. So if you wake up, you never know if it was spurious or the result of a futex_notify (or even a completely unrelated futex_notify from a different part of the program!).

The way futex_wait was standardized in C++ avoids those issues by looping around the platform wait call and establishing memory synchronization through reads of the atomic variable.
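A minimal sketch of that loop follows (platform_wait here is a hypothetical stand-in for the futex/__ulock call; the real std::atomic::wait compares value representations, but == serves for a sketch):

#include <atomic>

// Hypothetical stand-in for the platform primitive (futex_wait/__ulock_wait);
// it may block, or it may return spuriously at any time.
template <class T>
void platform_wait(const std::atomic<T>& a, T old);

template <class T>
void atomic_wait_sketch(const std::atomic<T>& a, T old) {
  // All memory synchronization is established by this acquire load of the
  // atomic itself; the platform call is treated purely as compare-and-block.
  while (a.load(std::memory_order_acquire) == old)
    platform_wait(a, old); // spurious wakeups simply re-enter the loop
}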

I think my money's still on Option 1 or Option 1a. Impact on the hot path is minimal, and even though they use 'unnecessary' seq_cst ordering, it looks like we can't actually do anything more efficient in practice.

This sounds good! Personally I would prefer Option 1a, because (in my intuitive understanding at least) the std::atomic_thread_fence(mo_seq_cst); at (3a) "belongs" more to the whole synchronization scheme between the waiting thread and the two mo_seq_cst operations at (1) and (2) than to the platform_wait.

Keeping (3) mo_relaxed would follow the philosophy of always using the weakest possible memory orderings if one actually goes to the trouble of using anything weaker than seq_cst in the first place. But putting some comments on this would be a good idea in any case!

FWIW, I could verify all proposed solutions in my toy implementations of futex and eventcount with GenMC, a tool I can wholeheartedly recommend!

huixie90 (Contributor) commented Nov 3, 2024

Thanks for the detailed analysis. I think all of this makes sense. Please raise a PR if that is convenient for you. Regarding performance, we do have some basic benchmarks w.r.t. atomic wait, and it would be interesting to see whether the change impacts the results on those platforms. I wonder if you have also looked at the macOS ulock implementation on the ARM architecture. If I understand correctly, it is potentially broken too.
