Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection #8099
Conversation
Don't forget the changelog entry.
scheduler/reconcile: `lostLater` allocs are different from `delayedReschedules` (just slightly). `lostLater` allocs should be used to create batched evaluations, but `handleDelayedReschedules` assumes that the allocations are in the untainted set. When it creates the in-place updates to those allocations at the end, the allocation is treated as still running in the planner, which causes the initial `stop_after_client_disconnect` evaluation to be retried by the worker.
scheduler/generic_sched: use `delayInstead` to avoid a loop. Without protecting the loop that creates followUpEvals, a delayed eval is allowed to create an immediate subsequent delayed eval. For both `stop_after_client_disconnect` and the `reschedule` block, a delayed eval should always produce some immediate result (running or blocked), and only after the outcome of that eval produce a second delayed eval.
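Below is a minimal, self-contained sketch of that guard; the `eval` type and `maybeDelay` helper are invented for illustration and are not Nomad's scheduler code, but they show the intended shape: an eval that was itself a delayed follow-up (non-zero `WaitUntil`) must not emit another delayed eval and instead falls through to an immediate outcome.

```go
package main

import (
	"fmt"
	"time"
)

// eval is a stand-in for a scheduler evaluation; only the fields needed for
// this sketch are included. A non-zero WaitUntil marks a delayed follow-up.
type eval struct {
	ID        string
	WaitUntil time.Time
}

// maybeDelay decides whether processing the current eval may emit a delayed
// follow-up eval. If the current eval was already delayed (the delayInstead-style
// guard), it returns nothing and the caller produces an immediate result
// (placements or a blocked eval) instead.
func maybeDelay(current eval, wantDelay bool, now time.Time) []eval {
	alreadyDelayed := !current.WaitUntil.IsZero()
	if wantDelay && !alreadyDelayed {
		// First pass: defer the work to a single delayed follow-up eval.
		return []eval{{ID: current.ID + "-followup", WaitUntil: now.Add(30 * time.Second)}}
	}
	// Already delayed, or no delay requested: no further delayed evals.
	return nil
}

func main() {
	now := time.Now()
	first := eval{ID: "eval-1"}

	followUps := maybeDelay(first, true, now)
	fmt.Println(len(followUps)) // 1: the original eval defers to one delayed follow-up

	// Processing that follow-up must not create yet another delayed eval.
	fmt.Println(len(maybeDelay(followUps[0], true, now))) // 0
}
```

Without the `alreadyDelayed` check, each delayed eval would immediately schedule another one, which is the loop this change prevents.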
Force-pushed from e02c326 to eaec537:

…nted extra followup evaluations around job garbage collection (#8099)

* client/heartbeatstop: reversed time condition for startup grace
* scheduler/generic_sched: use `delayInstead` to avoid a loop
* scheduler/reconcile: lostLater are different than delayedReschedules
The change makes sense - I have a couple of questions here. I'll dig into it a bit more and take it for a test drive later today to understand the implications.
```go
// a new eval to the planner in createBlockedEval. If rescheduling should
// be delayed, do that instead.
```
I assume the delay clause is only relevant for new evals? If the current evaluation is reused, its delay value will not change. Is that correct?
```go
// Allocs that are lost and delayed have an attributeUpdate that correctly links to
// the eval, but incorrectly has the current (running) status
for _, d := range lostLater {
	a.result.attributeUpdates[d.allocID].SetStop(structs.AllocClientStatusLost, structs.AllocClientStatusLost)
}
```
Context for why this is needed? Should this logic be moved to `handleDelayedLost`?
Yes, status is handled sort of twice in the planner: the alloc needs to be marked with the correct status but also sent to the planner in the `NodeUpdate` collection, not the `NodeAllocation` collection. `attributeUpdates` all get added to the `NodeAllocation` part of the planner, which is wrong for our purposes. There's some followup to this behavior in #8105, which clarifies how the status gets applied.
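To make the `NodeUpdate` versus `NodeAllocation` distinction concrete, here is a heavily simplified sketch; the `alloc` and `plan` types and the `appendLost` helper are stand-ins invented for this example rather than the real planner structures.

```go
package main

import "fmt"

// alloc is a simplified stand-in for an allocation.
type alloc struct {
	ID           string
	NodeID       string
	ClientStatus string
}

// plan mirrors the two collections mentioned above: NodeUpdate holds allocs
// being stopped or marked lost, while NodeAllocation holds allocs that should
// be placed or kept running on a node.
type plan struct {
	NodeUpdate     map[string][]alloc
	NodeAllocation map[string][]alloc
}

func newPlan() *plan {
	return &plan{
		NodeUpdate:     map[string][]alloc{},
		NodeAllocation: map[string][]alloc{},
	}
}

// appendLost records an alloc as lost in the NodeUpdate collection. Routing it
// through NodeAllocation instead would make the planner treat it as still
// running, which is the problem described in this thread.
func (p *plan) appendLost(a alloc) {
	a.ClientStatus = "lost"
	p.NodeUpdate[a.NodeID] = append(p.NodeUpdate[a.NodeID], a)
}

func main() {
	p := newPlan()
	p.appendLost(alloc{ID: "alloc-1", NodeID: "node-1", ClientStatus: "running"})
	fmt.Printf("NodeUpdate: %+v\n", p.NodeUpdate["node-1"])
	fmt.Printf("NodeAllocation: %+v\n", p.NodeAllocation["node-1"])
}
```

The point is that an alloc recorded only through `attributeUpdates` ends up on the `NodeAllocation` side and therefore still looks like a live placement to the planner.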
I've been using this graph to follow the code: https://github.com/langmartin/nomad-dev/blob/master/doc/delayed-reschedules.svg. It's a hand-drawn graph, so it may have errors; red is control flow and green is data flow.
```go
if len(s.followUpEvals) > 0 {
	// Create follow up evals for any delayed reschedule eligible allocations, except in
	// the case that this evaluation was already delayed.
	if delayInstead {
```
I'm rusty here - do we ever have a case where delayed-reschedule-eligible evals result in more follow-ups? For example: a delayed reschedule eval is created, but then on its processing attempt the cluster is full and one more blocking eval is created. In such a case, would we factor in whether `eval.WaitUntil` has passed, rather than just whether it's zero?
```diff
@@ -87,6 +87,8 @@ type GenericScheduler struct {
 	ctx   *EvalContext
 	stack *GenericStack
 
+	// followUpEvals are evals with WaitUntil set, which are delayed until that time
+	// before being rescheduled
```
`rescheduled` sounds unclear to me - I believe it means the scheduler re-processes the eval in this context, not necessarily that these evals are for allocations rescheduled due to client loss/drain/etc.
They're submitted to the worker via RPC, which goes through the `eval_endpoint`, raft, the fsm, the state_store, and then `evalbroker.processEnqueue`, where it gets `delayHeap.Push`ed. `evalbroker.runDelayedEvalsWatcher` checks the head of the delay heap and waits until the first eval is due to add it to the regular eval queue. `worker.run` gets it from the channel and creates a new scheduler to process it then.

The `followupEvals` are only used for these delays, which hold up all of the reschedule processing. After they're due, they may become blocked or otherwise stopped if the job is changed.

Does that make sense? The context isn't saved; they go all the way around the eval system. On client loss or drain, the node drain eval creates the plan that changes all the affected allocs to `lost`. If the `reschedule` rules don't prevent it, replacement allocs will also be in that plan request. If `reschedule` or `stop_after_client_disconnect` prevent creating an immediate replacement alloc, it's only in that case that you get a `followupEval`.
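The delay pipeline described above can be sketched with a small, self-contained example. Everything here (`delayedEval`, `delayHeap`, `runDelayedWatcher`) is invented for illustration, not the `evalbroker` implementation, but it follows the same shape: delayed evals sit on a min-heap ordered by `WaitUntil`, a watcher waits for the head to become due, and due evals are handed to the regular eval queue.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// delayedEval is a stand-in for an eval whose WaitUntil is in the future.
type delayedEval struct {
	ID        string
	WaitUntil time.Time
}

// delayHeap orders delayed evals by WaitUntil, soonest first.
type delayHeap []delayedEval

func (h delayHeap) Len() int            { return len(h) }
func (h delayHeap) Less(i, j int) bool  { return h[i].WaitUntil.Before(h[j].WaitUntil) }
func (h delayHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *delayHeap) Push(x interface{}) { *h = append(*h, x.(delayedEval)) }
func (h *delayHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// runDelayedWatcher waits until the head of the heap is due, then hands the
// eval to the regular queue (modeled as a channel here).
func runDelayedWatcher(h *delayHeap, ready chan<- delayedEval, done <-chan struct{}) {
	for h.Len() > 0 {
		head := (*h)[0]
		select {
		case <-time.After(time.Until(head.WaitUntil)):
			heap.Pop(h)
			ready <- head
		case <-done:
			return
		}
	}
	close(ready)
}

func main() {
	h := &delayHeap{}
	heap.Init(h)
	now := time.Now()
	heap.Push(h, delayedEval{ID: "eval-b", WaitUntil: now.Add(200 * time.Millisecond)})
	heap.Push(h, delayedEval{ID: "eval-a", WaitUntil: now.Add(100 * time.Millisecond)})

	ready := make(chan delayedEval)
	done := make(chan struct{})
	go runDelayedWatcher(h, ready, done)

	// Evals arrive in WaitUntil order once they become due.
	for e := range ready {
		fmt.Println("due:", e.ID)
	}
}
```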
A couple of very small changes, and then the changes in `generic_sched` and `reconciler`. The risk should be limited to the extra `delayInstead` condition check, which is also applied to evals that are delayed for the `reschedule` block.

Manual testing (e2e version coming): https://github.com/langmartin/nomad-dev/tree/master/cfg/heartyeet

Fixes #8098