
RFC: yield after require #43339

Closed · wants to merge 1 commit

Conversation

@timholy (Member) commented Dec 5, 2021

Under specific circumstances (package callbacks that run under `@async`), this can change the ordering of operations: some may miss a `yield` window due to locking. This adds a `yield` when the lock is freed.

It's worth showing how this new `@testset` behaves under Julia 1.6 and current nightly (see the diff for explanation). The comparison is done by adding a `yield()` after the `@eval using Outer41602`. For Julia 1.6, `@show cti cto` yields

cti = Any[1.638712255575944e9, (Inner41602, 0.4309260845184326), (Outer41602, 0.4384500980377197)]
cto = Any[(Outer41602, 0.447174072265625)]

implying that Inner's callback runs first for both Inner and Outer (in that order). On current nightly, one gets

cti = Any[1.638712268409791e9, (Outer41602, 0.11697912216186523), (Inner41602, 0.40651607513427734)]
cto = Any[(Outer41602, 0.12539410591125488)]

implying that Inner's callback runs first for Outer, followed by Outer's callback, and only much later by Inner's callback for Inner itself. Presumably this occurs because, when Inner's own callback initially tries to run, `Base.require_lock` is still held for Outer, which blocks the callback until the next yield. This PR restores the original ordering.
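
To make the mechanism concrete, here is a minimal, hedged sketch (not the actual test from the diff) of how timings like `cti` above could be collected; the callback body is a stand-in for what the Inner41602/Outer41602 test packages register, and the deferred `@async` work is what makes the ordering sensitive to yield points:

```julia
# Minimal sketch (not the PR's test): a package callback that defers its work
# under @async and logs (package name, elapsed time) when it finally runs.
const t0  = time()
const cti = Any[t0]                 # log in the style of the output shown above

push!(Base.package_callbacks, function (pkgid)
    # The push below is queued as a task and only runs at the next yield point,
    # which is why the ordering shown above depends on where yield() happens.
    @async push!(cti, (pkgid.name, time() - t0))
end)

@eval using Outer41602              # the test package referenced above
yield()                             # the extra yield this comparison adds
@show cti
```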

As mentioned in #41602 (comment), #41602 triggered failures in Revise. There's an alternative way to fix this in Revise itself, by getting rid of one or two forms of asynchronous processing. The one that must go is the asynchronous processing of the @require blocks; this is the first commit in timholy/Revise.jl#657. The one that's optional is the asynchronous handling of watch_package: it would be nice to leave that one asynchronous because it defers Revise's work until after Julia returns to the REPL. If we drop that commit, I can manually add a yield() in a couple of places in Revise's tests.

In other words, we don't have to change anything in Base (in which case this can be closed) if this is viewed as better fixed in Revise: perhaps the ideal fix is to not make this change, take only the first commit of timholy/Revise.jl#657, and add a couple of yields to Revise's own tests. But, having finally understood where the breakage came from, I thought it was worth explaining the consequences of the recent changes, in case this proves to be the kind of breakage we want to avoid.
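
For reference, the manual `yield()` in Revise's tests mentioned above would look roughly like the following hedged sketch; the package name and the flag being asserted are placeholders, not Revise's actual test code:

```julia
using Test

# Hypothetical test-side workaround: after loading a package whose callback
# defers work under @async, yield explicitly so that work completes before the
# test asserts on its effects.
@eval using SomeWatchedPackage     # placeholder for a package the test loads
yield()                            # let the queued @async callback run
@test callback_ran[]               # placeholder assertion on the callback's effect
```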

timholy requested a review from JeffBezanson on December 5, 2021
@timholy (Member, Author) commented Dec 6, 2021

I can fix the test error by changing `<` to `<=`. I'll do that before merging if we decide to proceed with this.

@DilumAluthge (Member) commented

Just to clarify, this isn't technically unique to Revise, right? Any package that adds package callbacks will be susceptible to this bug if the callbacks are run under @async, right?

In my opinion, it would make the most sense to fix this in a single place (i.e. in Base Julia), instead of having to fix it in multiple places (i.e. in every package that uses package callbacks).

@vtjnash (Member) commented Dec 6, 2021

Calling `yield` to fix observed issues is just always a terrible idea. The current idea on master (after #41602) is that `package_callbacks` must release this lock if they plan to do any "real work", and then must re-acquire it before returning. It might be simpler for this code to release the lock around the callbacks instead, but we were not sure whether anyone did "real work" that would be affected.
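
To illustrate the contract described here, a callback doing "real work" would look roughly like this hedged sketch; `do_real_work` is a placeholder, and whether this is the final contract is exactly what is being discussed:

```julia
# Sketch of the "release, work, re-acquire" rule for a package callback: the
# callback is invoked while Base.require_lock is held, so it drops the lock
# before doing anything that might block or yield, and takes it back before
# returning to the loading code.
push!(Base.package_callbacks, function (pkgid)
    unlock(Base.require_lock)      # let other tasks load/require in the meantime
    try
        do_real_work(pkgid)        # placeholder for work that may yield or block
    finally
        lock(Base.require_lock)    # must be re-acquired before returning
    end
end)
```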

@timholy (Member, Author) commented Dec 6, 2021

I basically agree, and mostly wanted to document the issue in case anyone else is affected. I'll fix it in Revise.

timholy closed this on Dec 6, 2021
timholy deleted the teh/yield_after_require branch on December 6, 2021
@vtjnash (Member) commented Dec 6, 2021

We could definitely unlock, though, since it is currently undefined whether you are allowed to do anything.

@timholy (Member, Author) commented Dec 6, 2021

I'll play with that tomorrow-ish and see what comes of it.

@timholy (Member, Author) commented Dec 6, 2021

Actually, unlocking won't really make a difference for Revise, since Revise would rather do as much of the work as possible asynchronously. So by the time Revise gets a crack at it, the lock may have been taken again.

@vtjnash (Member) commented Dec 9, 2021

It seems that this callback is already causing deadlocks in the presence of `Distributed._require_callback`, so we need to fix it. In particular (see the sketch after this list):

  1. `flush_gc_msgs` acquires the lock on the stream, then needs to look up the `root_module_key` for those messages.

  2. Meanwhile, the `require_callback` holds the lock for `root_module_key` while it tries to lock the stream to send blocking messages there.
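
A hedged sketch of that lock-ordering inversion; the lock names and task bodies are placeholders rather than the actual Distributed internals:

```julia
# Two tasks taking the same pair of locks in opposite orders can deadlock.
stream_lock = ReentrantLock()      # stands in for the lock on the stream
module_lock = ReentrantLock()      # stands in for the lock guarding root_module_key

t1 = @async begin                  # flush_gc_msgs-like task
    lock(stream_lock) do
        sleep(0.1)                 # let t2 grab module_lock in the meantime
        lock(module_lock) do       # blocks: t2 holds module_lock ...
            # ... look up root_module_key for the queued messages
        end
    end
end

t2 = @async begin                  # require_callback-like task
    lock(module_lock) do
        sleep(0.1)
        lock(stream_lock) do       # ... while t2 waits here for stream_lock: deadlock
            # ... send blocking messages on the stream
        end
    end
end
```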
