coroutine semantics are unsound. proposal to fix them #1363
Comments
What you describe has the same race condition but between The way I see to fix this is to wrap the `epoll_wait()` with a read-write mutex+semaphore. The `epoll_wait()` thread does:
While the cancel thread does:
But this introduces a deadlock where the But that introduces a stall in the main loop while the resume thread is running, so a second lock has to be introduced, individual to promise [1]. The cancelling thread:
Deadlocks are detected with two locks taken in opposite orders. |
My proposal can be done entirely in user-space, except to keep Speaking of, I don't really like |
So, I'm guessing that one of the better things to do here is to keep the current keyword semantics, except to hold off on defer & delete until a resume is called instead of doing it after a suspend. The epoll loop will always call resume on its promises, so if we are in And hey, I've made some ASCII art:
|
No. If no events come in, they never get resumed. The I opened an LLVM bug on whether coroutines are allowed to destroy themselves (as zig currently does in |
I think the scenario I outlined above is not actually a problem, because a coroutine couldn't be in the epoll set and executing at the same time. But here's a problem scenario:
With the proposed semantics:
|
I think it would be best if Andrew were to implement his proposal so we could test and fuzz it. Testing the implementation does not give a proof either, of course, but I think it's the best we can currently do until zig maybe attracts some (other) domain expert who thinks it through. I'd certainly be willing to run a test suite for an extended time; compared to people's time, hardware is cheap 😆 |
When I get some time later tonight, I will try to make another ASCII chart to clarify things. It's an area we should be documenting. |
Rather than an ASCII chart, we'd need a formal state diagram/automaton with a formal proof, but I've seen a professor attempt the same and still fail (a false positive, actually), so... brute force is the best we have, although it's still no good. |
@monouser7dig yes, I agree. An ASCII chart is no replacement for a formal proof. Inch by inch. |
so after 0.3.0 is this the first major thing on the list?
@andrewrk is this roughly accurate or what is to be expected? |
Yes, this is accurate. Reworking coroutines is the first major thing on my list. That, plus something like one bug fix + one documentation improvement per day. roadmap |
Maybe related to this: I've written a few lines in Rust's I guess similar things would apply to zig's coroutines as well, and it might in some cases be favorable to force them to run to completion (or at least to a safe exit point). |
This issue is obsoleted with the merge of #3033. |
Currently there is a race condition when using multithreaded coroutines. Consider the following scenario:

1. `epoll_wait`/`kevent`/`GetQueuedCompletionStatus`, blocking until some work can be done. The thread wakes up, and it is about to execute `resume coroutine_handle`.
2. `resume` the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.
3. `resume coroutine_handle` which now points to invalid memory. Boom.

In order to fix this, we need to introduce new syntax and semantics. Current semantics are:
- `async` creates a `promise` which must be consumed with `cancel` or `await`.
- `suspend` must be consumed with a `resume`.

The problem here is the one described above: `resume` and `cancel` racing with each other. When a suspended coroutine is canceled, the memory must not be destroyed until the `suspend` is canceled or resumed. Proposal for new semantics:

- `async` creates a `promise` which must be consumed with `cancelasync` or `await`.
- `suspend` must be consumed with a `cancelsuspend` or `resume`.

With these semantics, each coroutine essentially has 0, 1, or 2 references, and when the reference count reaches 0 it is destroyed. There is the "async reference", which is the main one, and the "suspend reference", which might never exist.
Therefore it is crucial that when a coroutine uses `suspend` (a low level feature intended mainly for low level library implementations of concurrency primitives), it ensures that the suspend will be consumed with either `resume` or `cancelsuspend`.

`defer` and `errdefer` will both run when a coroutine's cancellation process begins. This means that the first `cancelasync` or `cancelsuspend` will run the defers of the coroutine. When the other reference drops, the memory will be destroyed.