-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up atomic access to sleep_check_state #36785
Conversation
Fixes #36699 I belive. I think I know why this fixes it, but I need to validate that I got it correct before writing it up. Nevertheless, since I think this fixes the issue and the issue is release blocking, here's the fix ahead of the writeup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This fixes the problem for me. (I tried 10x longer version of #36699 a few times. No deadlock so far.)
So the issue is just that may_sleep is getting hoisted (by the compiler, I assume)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time staring at this in confusion :-)
At a surface level I can see is that there's various worrying-looking patterns going on here in terms of threads loading from sleep_check_state
without using atomic operations (not even relaxed loads) and also using multiple atomics to act on that state non atomically.
I noticed that there's some loads of sleep_check_state
in this file that this patch doesn't cover: line 436 and 441 — they should be relaxed loads at least? Also if sleep_check_state
is used to protect other state, do we need to be using acquire loads?
For the state transition not_sleeping <--> sleeping
, this is done in various places, and as a pair of atomics rather than a CAS which might be fine but seems non-obvious. I think it may be clearer to factor the transition into a function in the spirit of may_sleep
.
As to the exact cause of the deadlock I sure didn't see it by just eyeballing this diff! I look forward to the explanation.
@@ -495,7 +494,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q) | |||
if (!_threadedregion && active && ptls->tid == 0) { | |||
// thread 0 is the only thread permitted to run the event loop | |||
// so it needs to stay alive | |||
if (ptls->sleep_check_state != not_sleeping) | |||
if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping) | |
if (jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping) |
@@ -476,8 +476,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q) | |||
JL_UV_UNLOCK(); | |||
// optimization: check again first if we may have work to do | |||
if (!may_sleep(ptls)) { | |||
if (ptls->sleep_check_state != not_sleeping) | |||
jl_atomic_store(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us | |||
assert(ptls->sleep_check_state == not_sleeping); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just make every access of sleep_check_state
atomic as a matter of course?
assert(ptls->sleep_check_state == not_sleeping); | |
assert(jl_atomic_load_relaxed(&ptls->sleep_check_state) == not_sleeping); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The change to use assert
also makes sense (since that's currently the implementation of may_sleep
), but assumes that we don't alter the implementation of may_sleep
to possibly have disjoint conditions that we want to check (i.e. the original partr code had a global sleep_check_state
, so that bulk operations were also attempted).
// sleep_check_state is only transitioned from not_sleeping to sleeping | ||
// by the thread itself. As a result, if this returns false, it will | ||
// continue returning false. If it returns true, there are no guarantees. | ||
return jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need something stronger than relaxed ordering here, because sleep_check_state
is meant to ensure consistency of other loads/stores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual condition variable is protected by a lock, so we don't depend on any memory access dependencies here, I don't think. That's why I picked relaxed. Do you disagree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you disagree?
I don't disagree, I'm just unsure.
The condition variable seems fine. But I noticed that may_sleep
is also used in a branch related to the event loop which is protected with jl_uv_mutex
. So may_sleep
doesn't always go along with the same lock. Which seems potentially confusing but perhaps harmless.
@@ -358,7 +355,7 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid) | |||
JULIA_DEBUG_SLEEPWAKE( wakeup_enter = cycleclock() ); | |||
if (tid == self || tid == -1) { | |||
// we're already awake, but make sure we'll exit uv_run | |||
if (ptls->sleep_check_state != not_sleeping) | |||
if (jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping) | |||
jl_atomic_store(&ptls->sleep_check_state, not_sleeping); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this pair of operations looks like a CAS, but done non-atomically. Can that be a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a data race, because the only other writer is monotonic (makes the same change). In fact, we can probably relax the atomic store here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I think that makes sense. I wondered whether the store could be less strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further reflection the atomic does guard the memory contents of the work queue, so we might want something stronger. I need to think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this now reads like an CAS, but it's not. I'd suggest writing this as != sleeping
, so it's clearer that it's just a regular (relaxed) store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the version which avoids the double negative. To be more explanatory it would be helpful to factor the uses of sleep_check_state
into a minimum of locations and expand the comments describing which state/invariant it's meant to protect and how this implies the atomic ops which are used.
Comparing to 1.4, the test case in the issue (if you put it inside a function) goes from 3.45 seconds to 3.0 seconds, so that's also excellent 🎉 (4 threads in my case) |
Yes, I think this code could use a bit of a refactor. However, I would like to do that more systematically and also make use of tsan support to check these things under the test suite. This PR is just here to get the bug fixed. |
Fine by me, this is my first look into this code. |
We should add the test case from the issue, just to check that it doesn't hang. |
I've updated the description with my root cause analysis of this issue. |
I think you're inviting hangs on non-x86 platforms that have an even weaker memory model. I would certainly recommend against backporting such a test. |
Fixes #36699 I belive. Here's my best understanding of what's happening.
The relevant operations are:
Thread 1:
Thread 2:
Now, x86 has a very strong memory model, but one thing in particular that is allowed is reordering loads before previous stores to another memory location. In particular, it is legal for the
atomic_load()
of sleep state to be speculated before the addition oftask
toW
. As a result, it is possible to have the following interleaving:Thus
T1
never wakes upT2
the observed deadlock results.This commit fixes that by dropping the
atomic_load
and using and atomic exchange instead, which may not be re-ordered in the same way. A different fix would be to place an explicit ReadWrite barrier before attempting to wake up the threads.That said, I think we need to take a close look at this code in general with respect to atomic correctness. I suspect there are more demons.