fix deadlock in plan_apply #13407
Changes from all commits: 2f7f862, 6fab937, 5e0964e, 247a03f, 6fbd2a8, 2a358ac
New changelog entry:

````diff
@@ -0,0 +1,3 @@
+```release-note:bug
+core: Fixed a bug where the plan applier could deadlock if leader's state lagged behind plan's creation index for more than 5 seconds.
+```
````
Changes to the plan applier (`planApply`, `snapshotMinIndex`, `asyncPlanWait`):
```diff
@@ -154,6 +154,7 @@ func (p *planner) planApply() {
         // This also limits how out of date our snapshot can be.
         if planIndexCh != nil {
             idx := <-planIndexCh
+            planIndexCh = nil
             prevPlanResultIndex = max(prevPlanResultIndex, idx)
             snap, err = p.snapshotMinIndex(prevPlanResultIndex, pending.plan.SnapshotIndex)
             if err != nil {
```
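The one-line `planIndexCh = nil` reset above guards the next loop iteration from re-reading a channel whose single value has already been consumed. As a rough, standalone illustration of that hazard (hypothetical names like `asyncApply` and `laggedSnapshot`, and a deliberately simplified loop, not Nomad's actual plan applier), the sketch below shows a loop that wedges precisely because a timed-out iteration leaves a stale, drained result channel behind:

```go
// Standalone sketch: a drained, never-reset result channel wedges the loop.
package main

import (
	"errors"
	"fmt"
)

// asyncApply stands in for the goroutine that reports an applied index: it
// sends exactly one value and exits without closing the channel.
func asyncApply(resultCh chan uint64, index uint64) {
	resultCh <- index
}

// laggedSnapshot stands in for a snapshot fetch that times out because local
// state is behind.
func laggedSnapshot() error {
	return errors.New("timed out waiting for state to catch up")
}

func main() {
	var resultCh chan uint64

	for plan := uint64(1); plan <= 3; plan++ {
		if resultCh != nil {
			// Iteration 3 blocks here forever: the channel was drained on
			// iteration 2 and no sender is left. Resetting resultCh to nil
			// right after the receive (as the fix does with planIndexCh)
			// makes the guard skip this stale channel instead.
			idx := <-resultCh
			fmt.Println("previous apply finished at index", idx)

			if err := laggedSnapshot(); err != nil {
				fmt.Println("snapshot error:", err)
				continue // bug: resultCh stays non-nil with nothing coming
			}
		}

		// Kick off the next async apply.
		resultCh = make(chan uint64, 1)
		go asyncApply(resultCh, plan)
	}
}
```

Running this trips Go's runtime deadlock detector ("all goroutines are asleep") on the third iteration; with the nil reset added after the receive, the loop simply moves on.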
```diff
@@ -177,7 +178,7 @@ func (p *planner) planApply() {
     }
 }

-// snapshotMinIndex wraps SnapshotAfter with a 5s timeout and converts timeout
+// snapshotMinIndex wraps SnapshotAfter with a 10s timeout and converts timeout
 // errors to a more descriptive error message. The snapshot is guaranteed to
 // include both the previous plan and all objects referenced by the plan or
 // return an error.
```
```diff
@@ -188,7 +189,11 @@ func (p *planner) snapshotMinIndex(prevPlanResultIndex, planSnapshotIndex uint64
     // plan result's and current plan's snapshot index.
     minIndex := max(prevPlanResultIndex, planSnapshotIndex)

-    const timeout = 5 * time.Second
+    // This timeout creates backpressure where any concurrent
+    // Plan.Submit RPCs will block waiting for results. This sheds
+    // load across all servers and gives raft some CPU to catch up,
+    // because schedulers won't dequeue more work while waiting.
+    const timeout = 10 * time.Second
     ctx, cancel := context.WithTimeout(context.Background(), timeout)
     snap, err := p.fsm.State().SnapshotMinIndex(ctx, minIndex)
     cancel()
```
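For reference, the timeout handling in `snapshotMinIndex` is the standard `context.WithTimeout` pattern. Below is a minimal sketch of that shape under assumed names (`minIndexWaiter`, `Snapshot`, and the error text are hypothetical, not Nomad's state store API):

```go
// Sketch of bounding a min-index snapshot wait with a timeout.
package sketch

import (
	"context"
	"fmt"
	"time"
)

// Snapshot is a placeholder for whatever the state store returns.
type Snapshot struct{ Index uint64 }

// minIndexWaiter is a hypothetical stand-in for a state store that blocks
// until its index reaches at least minIndex or the context expires.
type minIndexWaiter interface {
	SnapshotMinIndex(ctx context.Context, minIndex uint64) (*Snapshot, error)
}

// snapshotMinIndex waits for the newer of the previous result's index and the
// plan's snapshot index, bounds the wait with a timeout, and converts a bare
// context error into a message that says what was being waited for.
func snapshotMinIndex(store minIndexWaiter, prevPlanResultIndex, planSnapshotIndex uint64, timeout time.Duration) (*Snapshot, error) {
	minIndex := prevPlanResultIndex
	if planSnapshotIndex > minIndex {
		minIndex = planSnapshotIndex
	}

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	snap, err := store.SnapshotMinIndex(ctx, minIndex)
	if err != nil {
		return nil, fmt.Errorf("timed out after %s waiting for index >= %d: %w", timeout, minIndex, err)
	}
	return snap, nil
}
```

The timeout value is the tuning knob the new comment describes: while the plan applier sits in this wait, concurrent Plan.Submit RPCs block behind it and schedulers stop dequeuing more work, which is the intended backpressure.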
```diff
@@ -368,14 +373,12 @@ func updateAllocTimestamps(allocations []*structs.Allocation, timestamp int64) {
 func (p *planner) asyncPlanWait(indexCh chan<- uint64, future raft.ApplyFuture,
     result *structs.PlanResult, pending *pendingPlan) {
     defer metrics.MeasureSince([]string{"nomad", "plan", "apply"}, time.Now())
-    defer close(indexCh)

     // Wait for the plan to apply
     if err := future.Error(); err != nil {
         p.logger.Error("failed to apply plan", "error", err)
         pending.respond(nil, err)
+
+        // Close indexCh on error
+        close(indexCh)
         return
     }
```

Review discussion on this hunk:

> Would this alone have fixed the deadlock? I think so, and since we always use this index inside a …

> Oof, good lesson about writing defensive code the first time round.

> You never know if you will have time to come back and make it better in a second pass 🤣

> Yes! That's how we did the initial test and fix. Nil'ing the channel gets us a few additional things: …
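The reply above is truncated, but the Go channel semantics behind nil'ing are easy to show in isolation (a standalone sketch, not the PR's code): a receive from a closed, already-drained channel returns a zero value immediately, a receive from a drained-but-open channel with no remaining sender blocks forever, and a nil channel behind an `!= nil` guard makes "no outstanding apply" explicit.

```go
// Standalone demo of the channel behaviors the nil reset guards against.
package main

import "fmt"

func main() {
	// A drained-and-closed channel yields zero values immediately; a loop
	// that re-read a stale channel like this would silently treat 0 as the
	// previous result index.
	done := make(chan uint64, 1)
	done <- 42
	close(done)
	fmt.Println(<-done) // 42: the real value
	fmt.Println(<-done) // 0: zero value, no signal that anything is wrong

	// A drained-but-open channel blocks forever instead (left commented out,
	// as it would trip Go's deadlock detector here):
	//
	//	open := make(chan uint64, 1)
	//	open <- 42
	//	<-open
	//	<-open // never returns

	// Nil'ing the variable after the one expected receive, and guarding with
	// != nil, sidesteps both cases: the guard simply skips the receive when
	// nothing is outstanding.
	var indexCh chan uint64 // nil: no outstanding async apply
	if indexCh != nil {
		fmt.Println(<-indexCh) // never reached
	}
	fmt.Println("no outstanding apply; carry on")
}
```

In the patched code, the `if planIndexCh != nil` guard plus the new `planIndexCh = nil` gives exactly this consume-at-most-once behavior, and the explicit `close(indexCh)` on the error path means a consumer already waiting on the channel is unblocked when the apply itself fails.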
LGTM - The following are observations/thoughts I just wanted to record somewhere in case they're useful (or in case my "LGTM" is based on flawed assumptions).

- This does make this code differ from the worker's use of SnapshotMinIndex via `worker.snapshotMinIndex`. I think this is probably ideal: the plan applier blocking here puts a lot of backpressure on all schedulers and should dramatically help in cases where Raft/FSM is really struggling.
- The shorter 5s min index timeout still on the worker seems fine because it fails up to the main eval retry loop, which introduces its own additional retries.
- ...although, looking at the code around wait-for-index failures in the worker, it appears we lose/drop the RefreshIndex sent by the server if we hit that timeout-and-retry case. This would cause the retry to use its existing old snapshot. Probably not intentional, but probably optimal, as it lets workers keep creating plans and the plan applier will toss out the invalid bits with its stricter view of state.