coroutine semantics are unsound. proposal to fix them #1363

Closed
andrewrk opened this issue Aug 10, 2018 · 14 comments
Labels
accepted: This proposal is planned.
breaking: Implementing this issue could cause existing code to no longer compile or have different behavior.
bug: Observed behavior contradicts documented or intended behavior.
proposal: This issue suggests modifications. If it also has the "accepted" label then it is planned.

@andrewrk (Member) commented Aug 10, 2018

Currently there is a race condition when using multithreaded coroutines. Consider the following scenario:

  1. Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
  2. Thread 2 is executing the coroutine, which reaches a suspend point, and then reads its atomic state and learns that it is scheduled to be canceled (destroyed). It runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.
  3. Thread 1 does resume coroutine_handle which now points to invalid memory. Boom.

In order to fix this, we need to introduce new syntax and semantics. Current semantics are:

  • async creates a promise which must be consumed with cancel or await.
  • suspend must be consumed with a resume.

The problem described above is resume and cancel racing with each other. When a suspended coroutine is canceled, the memory must not be destroyed until the suspend is canceled or resumed. Proposal for new semantics:

  • async creates a promise which must be consumed with cancelasync or await.
  • suspend must be consumed with a cancelsuspend or resume.

With these semantics, each coroutine essentially has 0, 1, or 2 references, and when the reference count reaches 0 it is destroyed. There is the "async reference", which is the main one, and the "suspend reference", which might never exist.

Therefore it is crucial that when a coroutine uses suspend (a low-level feature intended mainly for low-level library implementations of concurrency primitives), it ensures the suspend will be consumed with either resume or cancelsuspend.

defer and errdefer will both run when a coroutine's cancellation process begins. This means that the first cancelasync or cancelsuspend will run the defers of the coroutine. When the other reference drops, the memory will be destroyed.
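
A minimal C sketch of that counting scheme (the proposed keywords don't exist yet, so everything here, including run_defers and frame_drop_ref, is a hypothetical illustration, not compiler internals):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical frame bookkeeping for the proposal's two-reference
 * model: one "async" reference plus an optional "suspend" reference. */
struct frame {
    atomic_int refs;          /* 0, 1, or 2 live references */
    atomic_bool cancel_begun; /* set by whichever cancel runs first */
};

void run_defers(struct frame *f); /* hypothetical: runs defer/errdefer blocks */

void frame_drop_ref(struct frame *f) {
    /* atomic_fetch_sub returns the old value: 1 means we were last */
    if (atomic_fetch_sub(&f->refs, 1) == 1)
        free(f); /* last reference dropped: destroy the memory */
}

/* What cancelasync/cancelsuspend would boil down to: the first
 * cancel runs the defers; the frame outlives it until the other
 * reference is also dropped. */
void frame_cancel(struct frame *f) {
    if (!atomic_exchange(&f->cancel_begun, true))
        run_defers(f);
    frame_drop_ref(f);
}
```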

andrewrk added the bug, breaking, proposal, and accepted labels on Aug 10, 2018
andrewrk added this to the 0.3.0 milestone on Aug 10, 2018
andrewrk mentioned this issue on Aug 10, 2018
@shawnl (Contributor) commented Sep 1, 2018

What you describe has the same race condition, but between cancelsuspend and resume.

The way I see to fix this is to wrap epoll_wait() with a read-write mutex + semaphore. The epoll_wait() thread does:

  1. acquire the mutex
  2. wait till the semaphore is 0
  3. epoll_wait()
  4. resume co-routine
  5. release mutex
  6. goto 1

While the cancel thread does:

  1. increment the semaphore. If the mutex can be acquired, goto 4; else goto 2.
  2. deliver an event that will wake up the epoll_wait() thread <= prevents starvation
  3. wait for the epoll thread to release the mutex
  4. cancel the co-routine
  5. decrement the semaphore
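
A rough C/pthreads sketch of this handshake (resume_coroutine, cancel_coroutine, and the eventfd wiring are assumptions of mine; an atomic counter plays the semaphore's role, and error handling is omitted). Note it still has the self-cancellation deadlock described next:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

void resume_coroutine(void *h); /* hypothetical */
void cancel_coroutine(void *h); /* hypothetical */

static pthread_mutex_t loop_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static atomic_int pending_cancels;
static int wakeup_fd; /* eventfd registered in the epoll set with data.ptr == NULL */

void epoll_loop(int epfd) {
    struct epoll_event ev;
    uint64_t buf;
    for (;;) {
        pthread_mutex_lock(&loop_lock);            /* 1. acquire the mutex */
        while (atomic_load(&pending_cancels) > 0)  /* 2. wait till the semaphore is 0 */
            pthread_cond_wait(&drained, &loop_lock);
        if (epoll_wait(epfd, &ev, 1, -1) == 1) {   /* 3. epoll_wait() */
            if (ev.data.ptr != NULL)
                resume_coroutine(ev.data.ptr);     /* 4. resume co-routine */
            else
                read(wakeup_fd, &buf, sizeof buf); /* drain a cancel wakeup */
        }
        pthread_mutex_unlock(&loop_lock);          /* 5. release mutex; 6. goto 1 */
    }
}

void cancel(void *handle) {
    uint64_t one = 1;
    atomic_fetch_add(&pending_cancels, 1); /* 1. increment the semaphore */
    write(wakeup_fd, &one, sizeof one);    /* 2. wake up epoll_wait() */
    pthread_mutex_lock(&loop_lock);        /* 3. wait for the loop to back off */
    cancel_coroutine(handle);              /* 4. cancel the co-routine */
    atomic_fetch_sub(&pending_cancels, 1); /* 5. decrement the semaphore */
    pthread_cond_broadcast(&drained);
    pthread_mutex_unlock(&loop_lock);
}
```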

But this introduces a deadlock where the epoll_wait() thread is trying to resume the thread that is trying to cancel itself. It can't drop this event (because it is using EPOLLET), so after it detects a deadlock it has to let itself be resumed (while holding the lock preventing epoll_wait() from being called) and cancel the co-routine on the next suspend.

But that introduces a stall in the main loop while the resume thread is running, so a second lock has to be introduced, individual to the promise. The cancelling thread:

  1. detect deadlock; if there is none, destroy the thread and return
  2. lock promise
  3. remove fd from epoll
  4. release epoll lock
  5. suspend
  6. on resume (from epoll thread) unlock promise
  7. destroy thread

Deadlocks are detected by taking the two locks in opposite orders.
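
Presumably (this is my reading, not code from the thread) that detection is a trylock on the second lock, something like:

```c
#include <pthread.h>

/* Sketch: the canceller takes promise_lock then loop_lock; the loop
 * thread takes them in the opposite order. A failed trylock is the
 * deadlock signal. All names are illustrative. */
int deadlock_detected(pthread_mutex_t *promise_lock,
                      pthread_mutex_t *loop_lock) {
    pthread_mutex_lock(promise_lock);
    if (pthread_mutex_trylock(loop_lock) != 0) {
        /* the loop thread holds loop_lock and wants promise_lock:
         * the epoll thread is trying to resume us */
        pthread_mutex_unlock(promise_lock);
        return 1;
    }
    pthread_mutex_unlock(loop_lock);
    pthread_mutex_unlock(promise_lock);
    return 0;
}
```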

@shawnl (Contributor) commented Sep 2, 2018

My proposal can be done entirely in user-space, except that, to keep cancel thread-safe, it needs a new suspendnotfinal keyword.

Speaking of which, I don't really like cancel. Why can't it just be a method on the promise, .scheduleCancellation()?

@kristate (Contributor) commented Sep 2, 2018

@shawnl check out #1307

@kristate (Contributor) commented Sep 2, 2018

So, I'm guessing that one of the better things to do here is to keep the current keyword semantics, except to hold off on defer & delete until a resume is called instead of doing it after a suspend.

The epoll loop will always call resume on its promises, so if we are in a cancel state we run the defers, delete, and then return control flow back to the loop. If not in a cancel state, we simply resume. (See the sketch after the chart.)

And hey, I made some ASCII art:

        +-[PROMISE / INSIDE COROUTINE]--+                                      
        |'''''''''''''''''''''''''''''''|                                      
        |''''''+------------------------+--+--------------+                    
        |''''''v''''''''''''''''''''''''|  |              |                    
+-----+ |''+-------+''''+-----------+'''|  |              |                    
|async|-+->|suspend|-+->|  return   |'''|  |              |                    
+-----+ |''+-------+'|''+-----------+'''|  |              |                    
        |''''''''''''|''+-----------+'''|  |              |                    
        |''''''''''''+->|cancel(noop|'''|  |              |                    
        |''''''''''''|''+-----------+'''|  |              |                    
        +------------+------------------+  |              |                    
                     |  +---------------+  |              |                    
                     +->|     await     |--+              |                    
                     |  +---------------+   +------+      |                    
                     |  +---------------+   | set  |      |                    
                     +->|    cancel     |-->|cancel|      |NO                  
                     |  +---------------+   | bit  |  .-------.      +--------+
                     |  +---------------+   +------+ /  TO BE  \ YES | defer& |
                     +->|    resume     |---------->( CANCELED? )--->| delete |
                        +---------------+            `.       ,'     +--------+
                                                       `-----'                 
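
The resume branch of that chart, as a small C sketch (loop_resume, run_defers, and jump_into_coroutine are made-up names for illustration):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct promise {
    atomic_bool cancel_bit; /* the "set cancel bit" box above */
};

void run_defers(struct promise *p);          /* hypothetical */
void jump_into_coroutine(struct promise *p); /* hypothetical */

/* Cancellation is only acted on at resume time, so a frame can
 * never be freed out from under a pending resume. */
void loop_resume(struct promise *p) {
    if (atomic_load(&p->cancel_bit)) {
        run_defers(p); /* the "defer & delete" box */
        free(p);
        return;        /* control flows back to the loop */
    }
    jump_into_coroutine(p);
}
```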

@shawnl (Contributor) commented Sep 2, 2018

> The epoll loop will always call resume on its promises,

No. If no events come in, they never get resumed. The cancel part can be taken out of the promise type and become part of the loop: .scheduleCancellation() (which works according to the above locking semantics). Then, before suspending, a co-routine has to check whether it was scheduled to be cancelled while it was running, and if so destroy itself.
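
A sketch of that shape (assumed names; the loop only flags the promise, and the coroutine polls the flag before each suspend):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct promise {
    atomic_bool cancel_scheduled;
};

/* Loop-side: .scheduleCancellation() just sets a flag; it never
 * touches the frame's memory directly. */
void schedule_cancellation(struct promise *p) {
    atomic_store(&p->cancel_scheduled, true);
}

/* Coroutine-side: polled right before each suspend point; if a
 * cancellation was scheduled while we were running, we destroy
 * ourselves instead of suspending. */
bool should_self_destruct(struct promise *p) {
    return atomic_load(&p->cancel_scheduled);
}
```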

I opened an LLVM bug on whether co-routines are allowed to destroy themselves (as zig currently does in suspend):

https://bugs.llvm.org/show_bug.cgi?id=38805

shawnl mentioned this issue on Sep 3, 2018
@andrewrk (Member, Author) commented Sep 3, 2018

I think the scenario I outlined above is not actually a problem, because a coroutine couldn't be in the epoll set and executing at the same time. But here's a problem scenario:

  1. Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
  2. Thread 2 does a cancel on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.
  3. Thread 1 does resume coroutine_handle which now points to invalid memory. Boom.

With the proposed semantics:

  1. Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
  2. Thread 2 does a cancelasync on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. However (2) does not destroy the coroutine frame because the suspend reference is present. The memory is still there.
  3. Thread 1 does resume coroutine_handle, which finds out that the coroutine has been canceled. So instead of resuming, it frees the memory.
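
Step 3 as a C sketch (the same hypothetical two-reference frame as in the earlier sketch): resume consumes the suspend reference instead of jumping into freed memory.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct frame {
    atomic_int refs;      /* async reference + suspend reference */
    atomic_bool canceled; /* set by cancelasync */
};

void jump_into_coroutine(struct frame *f); /* hypothetical */

/* What "resume coroutine_handle" would do under the proposal. */
void resume(struct frame *f) {
    if (atomic_load(&f->canceled)) {
        /* canceled while suspended: drop the suspend reference,
         * freeing the frame if the async reference is already gone */
        if (atomic_fetch_sub(&f->refs, 1) == 1)
            free(f);
        return;
    }
    jump_into_coroutine(f);
}
```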

@ghost commented Sep 5, 2018

I think it would be best if Andrew were to implement his proposal and we could test it / fuzz it. I tried to think through this discussion, and while Andrew's explanation makes sense to me, I've made just enough concurrency bugs myself to know that this does not prove correctness.

Testing the implementation does not give a proof either, of course, but I think it's the best we can currently do until maybe zig attracts some (other) domain expert who thinks it through.

I'd certainly be willing to run some test suite for an extended time; compared to people's time, hardware is cheap 😆

@kristate (Contributor) commented Sep 5, 2018

When I get some time later tonight, I will try to make another ASCII chart to clarify things. It's an area we should be documenting.

@ghost commented Sep 5, 2018

Rather than an ASCII chart, we'd need a formal state diagram/automaton with a formal proof, but I've seen a professor attempt the same and still fail (a false positive, actually), so... brute force is the best we have, although it's still no good.

@kristate (Contributor) commented Sep 5, 2018

@monouser7dig yes, I agree. An ASCII chart is no replacement for a formal proof. Inch by inch.

andrewrk modified the milestones: 0.3.0, 0.4.0 on Sep 11, 2018
@ghost commented Oct 1, 2018

So after 0.3.0, is this the first major thing on the list? Like getting async working, because the self-hosted compiler needs it, networking needs it (sort of?), and thus:

  • async (rewrite coroutines)
  • self-hosted + comptime allocator / bugs + tooling
  • docs
  • ??? lots of things

@andrewrk is this roughly accurate, or what is to be expected?

@andrewrk (Member, Author) commented Oct 1, 2018

Yes, this is accurate. Reworking coroutines is the first major thing on my list. That, plus something like one bug fix + one documentation improvement per day. (See the roadmap.)

@Matthias247 commented

Maybe related to this: I've written a few lines in Rust's futures repository on why synchronous cancellation might not work out well for all use cases: rust-lang/futures-rs#1278

I guess similar things would apply to zig's coroutines as well, and it might in some cases be favorable to force them to run to completion (or at least to a safe exit point).

andrewrk removed this from the 0.4.0 milestone on Jan 31, 2019
@andrewrk (Member, Author) commented

This issue is obsoleted by the merge of #3033.
