fix deadlock in plan_apply #13407
Should this be on L157, right after we receive from the chan?

AFAICT the only case where this move would make a difference is if `applyPlan` on L168 returns a non-nil error. It would then `continue` with a non-nil `planIndexCh` that will never receive another result.

I think this change (immediately nil'ing the chan) would match the behavior of the only other place we receive on `planIndexCh`: L105 in the `select`. The case where we receive on the chan immediately and unconditionally sets it to nil there as well.

To put it another way: I think `planIndexCh` is always a one-shot chan. It will always receive exactly one index value and never get reused. If you think that's accurate, I wonder if we should wrap `planIndexCh` in a little struct with a method like:
We also use `planIndexCh == nil` to signal "no outstanding plan application", so our struct would probably need a helper method for that too. Unsure if the extra struct would simplify or complicate reading this code... just an idea.
Yup, good catch. I was focused on the error branch because that's where the bug showed up, but there's no reason to keep it once we've received a value.

Yes, agreed!

Hm... I like this idea in principle. But the logic here leans heavily on the specific semantics of channels (e.g. a closed channel isn't nil), and wrapping it in a struct just seems like it'll obscure that by moving the logic away from where we're using it. The fallthrough `select` on lines 104-113 makes this especially gross because it relies on the fact that we get 0 on error, so we can't even call `recv()` and check for 0 there; it needs its own `index, ok := recvOrDefault()` function.
Oof yeah, this is a pattern to watch out for in the future. Not always a problem but clearly made this code more fragile than if a little effort had been put into encapsulating the logic in a little purpose-built struct.
LGTM - The following are observations/thoughts I just wanted to record somewhere in case they're useful (or in case my "LGTM" is based on flawed assumptions).
This does make this code differ from the worker's use of SnapshotMinIndex via `worker.snapshotMinIndex`. I think this is probably ideal:

- The plan applier blocking here puts a lot of backpressure on all schedulers and should dramatically help in cases where Raft/FSM is really struggling.
- The shorter 5s min-index timeout still on the worker seems fine because it fails up to the main eval retry loop, which introduces its own additional retries.

...although looking at the code around wait-for-index failures in the worker, it appears we lose/drop the RefreshIndex sent by the server if we hit that timeout-and-retry case. This would cause the retry to use its existing old snapshot. Probably not intentional, but probably optimal, as it lets workers keep creating plans and the plan applier will toss out the invalid bits with its stricter view of state.
Would this alone have fixed the deadlock? I think so, and since we always use this index inside a `max(...)` it seems like an always-safe operation.

Oof, good lesson about writing defensive code the first time round.
You never know if you will have time to come back and make it better in a second pass 🤣
Yes! That's how we did the initial test and fix. Nil'ing the channel gets us a few additional things: it lines up with the `asyncPlanWait` code further down (and doing so is super cheap).