Clean up atomic access to sleep_check_state #36785

Keno · 2020-07-24T00:08:11Z

Fixes #36699, I believe. Here's my best understanding of what's happening.
The relevant operations are:

Thread 1:

push!(W, task) # assign the task to the waitqueue, implicitly a non-atomic assign to .head of the queue protected by a lock
# Some time later
if (atomic_load(sleep_state) == sleeping) {
   if (atomic_exchange(...)) {
     wake
   }
}

Thread 2:

if isempty(W) {
   atomic_store(sleep_state)
   !isempty(W) && continue()
   sleep() 
}

Now, x86 has a very strong memory model, but one thing in particular that is allowed is reordering loads before previous stores to another memory location. In particular, it is legal for the atomic_load() of sleep state to be speculated before the addition of task to W. As a result, it is possible to have the following interleaving:

T1:
r = atomic_load(sleep_state) # r == not_sleeping

T2: 
   atomic_store(sleep_state)
   isempty(W) # true
   sleep()

T1:
push!(W, task) # assign the task to the waitqueue, implicitly a non-atomic assign to .head of the queue protected by a lock
# Some time later
if (r == sleeping #= false =#) {

}

Thus T1 never wakes up T2, and the observed deadlock results.

This commit fixes that by dropping the atomic_load and using an atomic exchange instead, which may not be reordered in the same way. A different fix would be to place an explicit ReadWrite barrier before attempting to wake up the threads.
That said, I think we need to take a close look at this code in general with respect to atomic correctness. I suspect there are more demons.
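The exchange-based handshake described above can be sketched in portable C11. This is a minimal illustration, not the code in src/partr.c: `worker_t`, `worker_init`, `push_and_check_wake` and `try_enter_sleep` are hypothetical names, an atomic counter stands in for the lock-protected wait queue W, and every atomic operation uses the default sequentially consistent ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { not_sleeping = 0, sleeping = 1 };

/* Hypothetical stand-in for one worker thread. queue_len plays the role
 * of the wait queue W; the real queue is a lock-protected structure. */
typedef struct {
    atomic_int queue_len;
    atomic_int sleep_state;
} worker_t;

static void worker_init(worker_t *w)
{
    atomic_init(&w->queue_len, 0);
    atomic_init(&w->sleep_state, not_sleeping);
}

/* Thread 1: publish work, then find out whether a wakeup is needed.
 * The exchange is a read-modify-write, so it cannot be satisfied early
 * the way a plain load can. Returns true if the caller must wake the
 * worker. */
static bool push_and_check_wake(worker_t *w)
{
    atomic_fetch_add(&w->queue_len, 1);
    int prev = atomic_exchange(&w->sleep_state, not_sleeping);
    return prev == sleeping;
}

/* Thread 2: announce the intent to sleep, then re-check the queue.
 * Returns true only if it is safe to block. */
static bool try_enter_sleep(worker_t *w)
{
    if (atomic_load(&w->queue_len) != 0)
        return false;
    atomic_store(&w->sleep_state, sleeping);
    if (atomic_load(&w->queue_len) != 0) {
        /* Work arrived in between: undo the announcement. */
        atomic_store(&w->sleep_state, not_sleeping);
        return false;
    }
    return true;
}
```

With all four operations sequentially consistent, either the exchange observes `sleeping` and the producer wakes the worker, or the worker's re-check observes the new queue entry. The lost-wakeup interleaving requires both sides to miss each other, which this ordering rules out.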

Fixes #36699, I believe. I think I know why this fixes it, but I need to validate that I got it correct before writing it up. Nevertheless, since I think this fixes the issue and the issue is release blocking, here's the fix ahead of the writeup.

tkf

Thanks! This fixes the problem for me. (I tried 10x longer version of #36699 a few times. No deadlock so far.)

vtjnash · 2020-07-24T00:58:51Z

So the issue is just that may_sleep is getting hoisted (by the compiler, I assume)?

c42f

I spent some time staring at this in confusion :-)

At a surface level I can see that there are various worrying-looking patterns going on here: threads load from sleep_check_state without using atomic operations (not even relaxed loads), and multiple atomics are used to act on that state non-atomically.

I noticed that there's some loads of sleep_check_state in this file that this patch doesn't cover: line 436 and 441 — they should be relaxed loads at least? Also if sleep_check_state is used to protect other state, do we need to be using acquire loads?

For the state transition not_sleeping <--> sleeping, this is done in various places, and as a pair of atomics rather than a CAS, which might be fine but seems non-obvious. I think it may be clearer to factor the transition into a function in the spirit of may_sleep.

As to the exact cause of the deadlock, I sure didn't see it by just eyeballing this diff! I look forward to the explanation.

c42f · 2020-07-24T00:48:35Z

src/partr.c

@@ -495,7 +494,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
                if (!_threadedregion && active && ptls->tid == 0) {
                    // thread 0 is the only thread permitted to run the event loop
                    // so it needs to stay alive
-                    if (ptls->sleep_check_state != not_sleeping)
+                    if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping)


Suggested change

-                    if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping)
+                    if (jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping)

c42f · 2020-07-24T01:06:16Z

src/partr.c

@@ -476,8 +476,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
                    JL_UV_UNLOCK();
                    // optimization: check again first if we may have work to do
                    if (!may_sleep(ptls)) {
-                        if (ptls->sleep_check_state != not_sleeping)
-                            jl_atomic_store(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us
+                        assert(ptls->sleep_check_state == not_sleeping);


Should we just make every access of sleep_check_state atomic as a matter of course?

Suggested change

-                        assert(ptls->sleep_check_state == not_sleeping);
+                        assert(jl_atomic_load_relaxed(&ptls->sleep_check_state) == not_sleeping);

I agree. The change to use assert also makes sense (since that's currently the implementation of may_sleep), but assumes that we don't alter the implementation of may_sleep to possibly have disjoint conditions that we want to check (i.e. the original partr code had a global sleep_check_state, so that bulk operations were also attempted).

c42f · 2020-07-24T01:20:02Z

src/partr.c

+    // sleep_check_state is only transitioned from not_sleeping to sleeping
+    // by the thread itself. As a result, if this returns false, it will
+    // continue returning false. If it returns true, there are no guarantees.
+    return jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping;


Do we need something stronger than relaxed ordering here, because sleep_check_state is meant to ensure consistency of other loads/stores?

The actual condition variable is protected by a lock, so we don't depend on any memory access dependencies here, I don't think. That's why I picked relaxed. Do you disagree?

Do you disagree?

I don't disagree, I'm just unsure.

The condition variable seems fine. But I noticed that may_sleep is also used in a branch related to the event loop which is protected with jl_uv_mutex. So may_sleep doesn't always go along with the same lock. Which seems potentially confusing but perhaps harmless.
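To make the argument in this thread concrete, here is a rough sketch of a relaxed `may_sleep` check paired with a lock-protected condition variable. It uses pthreads rather than the libuv primitives in the runtime. `thread_state_t`, `thread_state_init`, `sleep_until_woken` and `wake` are hypothetical names, and the sketch covers only the condition-variable path, not the event-loop branch guarded by jl_uv_mutex.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { not_sleeping = 0, sleeping = 1 };

/* Hypothetical per-thread state, loosely modelled on the fields discussed. */
typedef struct {
    atomic_int sleep_check_state;
    pthread_mutex_t sleep_lock;
    pthread_cond_t wake_signal;
} thread_state_t;

static void thread_state_init(thread_state_t *t)
{
    atomic_init(&t->sleep_check_state, not_sleeping);
    pthread_mutex_init(&t->sleep_lock, NULL);
    pthread_cond_init(&t->wake_signal, NULL);
}

/* Relaxed load: only the value of the flag itself is inspected here. */
static bool may_sleep(thread_state_t *t)
{
    return atomic_load_explicit(&t->sleep_check_state,
                                memory_order_relaxed) == sleeping;
}

/* Sleeper: the flag is re-checked under the lock on every wakeup, so a
 * signal sent while the lock is held cannot be missed. */
static void sleep_until_woken(thread_state_t *t)
{
    pthread_mutex_lock(&t->sleep_lock);
    while (may_sleep(t))
        pthread_cond_wait(&t->wake_signal, &t->sleep_lock);
    pthread_mutex_unlock(&t->sleep_lock);
}

/* Waker: clear the flag first, then signal while holding the lock.
 * Returns true if the target had announced that it was sleeping. */
static bool wake(thread_state_t *t)
{
    if (atomic_exchange(&t->sleep_check_state, not_sleeping) != sleeping)
        return false;
    pthread_mutex_lock(&t->sleep_lock);
    pthread_cond_signal(&t->wake_signal);
    pthread_mutex_unlock(&t->sleep_lock);
    return true;
}
```

In this sketch the mutex supplies the ordering between the waker's exchange and the sleeper's re-check: if the waker takes the lock first, its exchange happens before the sleeper's relaxed load, and if the sleeper is already waiting, the signal arrives after it has released the lock. That is the sense in which the relaxed load does not need to order other memory accesses on this path.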

c42f · 2020-07-24T01:22:13Z

src/partr.c

@@ -358,7 +355,7 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
    JULIA_DEBUG_SLEEPWAKE( wakeup_enter = cycleclock() );
    if (tid == self || tid == -1) {
        // we're already awake, but make sure we'll exit uv_run
-        if (ptls->sleep_check_state != not_sleeping)
+        if (jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping)
            jl_atomic_store(&ptls->sleep_check_state, not_sleeping);


So this pair of operations looks like a CAS, but done non-atomically. Can that be a problem?

It's not a data race, because the only other writer is monotonic (makes the same change). In fact, we can probably relax the atomic store here.

Right I think that makes sense. I wondered whether the store could be less strict.

Upon further reflection the atomic does guard the memory contents of the work queue, so we might want something stronger. I need to think about it.

Yeah, this now reads like a CAS, but it's not. I'd suggest writing this as != sleeping, so it's clearer that it's just a regular (relaxed) store.

I prefer the version which avoids the double negative. To be more explanatory it would be helpful to factor the uses of sleep_check_state into a minimum of locations and expand the comments describing which state/invariant it's meant to protect and how this implies the atomic ops which are used.
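For comparison, the two shapes discussed in this thread can be written side by side in C11. This is only an illustrative sketch with hypothetical helper names (`wake_check_then_store`, `wake_cas`). The patch keeps the load-then-store form.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { not_sleeping = 0, sleeping = 1 };

/* Load-then-store: two separate atomic operations. Another thread may
 * act between them, which is harmless here only because every other
 * writer that can interleave also stores not_sleeping. */
static bool wake_check_then_store(atomic_int *state)
{
    if (atomic_load_explicit(state, memory_order_relaxed) == sleeping) {
        atomic_store(state, not_sleeping);
        return true;
    }
    return false;
}

/* Compare-and-swap: the test and the update form one indivisible step,
 * so exactly one caller can observe the sleeping -> not_sleeping
 * transition. */
static bool wake_cas(atomic_int *state)
{
    int expected = sleeping;
    return atomic_compare_exchange_strong(state, &expected, not_sleeping);
}
```

If the transition were factored into one helper as suggested, callers would not need to know which of the two forms is in use, and the comment describing the invariant could live in a single place.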

JeffBezanson · 2020-07-24T02:28:15Z

Comparing to 1.4, the test case in the issue (if you put it inside a function) goes from 3.45 seconds to 3.0 seconds, so that's also excellent 🎉 (4 threads in my case)

Keno · 2020-07-24T02:55:14Z

At a surface level I can see that there are various worrying-looking patterns going on here: threads load from sleep_check_state without using atomic operations

Yes, I think this code could use a bit of a refactor. However, I would like to do that more systematically and also make use of tsan support to check these things under the test suite. This PR is just here to get the bug fixed.

c42f · 2020-07-24T03:59:18Z

This PR is just here to get the bug fixed.

Fine by me, this is my first look into this code.

JeffBezanson · 2020-07-24T04:31:31Z

We should add the test case from the issue, just to check that it doesn't hang.

Keno · 2020-07-24T07:08:14Z

I've updated the description with my root cause analysis of this issue.

Keno · 2020-07-24T07:08:44Z

We should add the test case from the issue, just to check that it doesn't hang.

I think you're inviting hangs on non-x86 platforms that have an even weaker memory model. I would certainly recommend against backporting such a test.

Fixes #36699 (cherry picked from commit d81f044)

Fixes JuliaLang#36699

Keno requested review from tkf, vtjnash and JeffBezanson July 24, 2020 00:08

Keno mentioned this pull request Jul 24, 2020

Deadlock from simple repeated @spawn and wait #36699

Closed

Keno force-pushed the kf/36699 branch from fe1db30 to 68af23d Compare July 24, 2020 00:12

Keno added backport 1.5 bugfix This change fixes an existing bug multithreading Base.Threads and related functionality labels Jul 24, 2020

tkf approved these changes Jul 24, 2020

c42f reviewed Jul 24, 2020

JeffBezanson merged commit d81f044 into master Jul 24, 2020

JeffBezanson deleted the kf/36699 branch July 24, 2020 18:14

JeffBezanson pushed a commit that referenced this pull request Jul 24, 2020

Clean up atomic access to sleep_check_state (#36785)

9b37634

Fixes #36699 (cherry picked from commit d81f044)

JeffBezanson mentioned this pull request Jul 24, 2020

[release 1.5] more backports for 1.5-rc2 #36755

Merged

KristofferC removed the backport 1.5 label Aug 3, 2020

simeonschaub pushed a commit to simeonschaub/julia that referenced this pull request Aug 11, 2020

Clean up atomic access to sleep_check_state (JuliaLang#36785)

309e23f

Fixes JuliaLang#36699

vtjnash mentioned this pull request Aug 28, 2020

PR #32599 introduced deadlocks #35441

Closed
