Back-pressure on Nacks and ensure scheduling progress on failures #2555
Conversation
This PR adds two things to increase robustness under high contention:

- Add a delay when an evaluation is Nacked, starting small but compounding into a larger delay for subsequent Nacks. This creates some back-pressure.
- Create a follow-up evaluation when reaping failed evaluations. This ensures that a job will still make eventual progress.
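A rough sketch of the back-pressure half of this, assuming illustrative constant names and values (it mirrors the shape of the diff discussed below, not the final code; the review later suggests making these delays configurable):

```go
package broker

import "time"

// Illustrative values; initialNackReenqueueDelay appears in the diff below,
// while subsequentNackReenqueueDelay and its value are assumptions.
const (
	initialNackReenqueueDelay    = 1 * time.Second
	subsequentNackReenqueueDelay = 20 * time.Second
)

// nackReenqueueDelay returns how long to wait before re-enqueuing a Nacked
// evaluation: nothing before the first dequeue, a small delay after the
// first Nack, and a delay that compounds with every further Nack.
func nackReenqueueDelay(prevDequeues int) time.Duration {
	var delay time.Duration
	switch {
	case prevDequeues <= 0:
		// Never dequeued yet: no back-pressure.
	case prevDequeues == 1:
		delay = initialNackReenqueueDelay
	default:
		delay = time.Duration(prevDequeues-1) * subsequentNackReenqueueDelay
	}
	return delay
}
```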
nomad/eval_broker.go
var delay time.Duration

switch {
case prevDequeues <= 0:
Can we make each case just return an explicit value? That makes it easier to follow, especially since we don't post-process the value.
Not sure what you mean here. The number of retries is a config option, so I'm not sure how you'd enumerate all the possibilities.
I mean instead of setting `delay` in the outer block with an empty case clause, do an explicit `return` in each case and remove the outer variable.
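A minimal sketch of the refactor being suggested, reusing the illustrative constants from the sketch above: each case returns an explicit value and the outer `delay` variable disappears.

```go
// nackReenqueueDelay, refactored so each case returns directly; assumes the
// same illustrative constants as the earlier sketch.
func nackReenqueueDelay(prevDequeues int) time.Duration {
	switch {
	case prevDequeues <= 0:
		return 0
	case prevDequeues == 1:
		return initialNackReenqueueDelay
	default:
		// Compound the delay for every Nack after the first.
		return time.Duration(prevDequeues-1) * subsequentNackReenqueueDelay
	}
}
```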
nomad/eval_broker.go
// initialNackReenqueueDelay is the delay applied before re-enqueuing a
// Nacked evaluation for the first time
initialNackReenqueueDelay = time.Second
Let's thread these through a config like `EvalNackTimeout`; that way there isn't a weird hack for testing either.
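A hedged sketch of what threading these through config might look like; the `Config` fields, the `EvalBroker` fields, and the `NewEvalBroker` signature here are assumptions for illustration, not the real Nomad API.

```go
package broker

import "time"

// Config carries the broker tunables so tests can shrink them instead of
// patching package-level constants. Field names are illustrative.
type Config struct {
	// NackInitialReenqueueDelay is applied before re-enqueuing an
	// evaluation after its first Nack.
	NackInitialReenqueueDelay time.Duration

	// NackSubsequentReenqueueDelay is compounded for every Nack after
	// the first.
	NackSubsequentReenqueueDelay time.Duration
}

// EvalBroker is a pared-down stand-in for the real broker struct.
type EvalBroker struct {
	initialNackDelay    time.Duration
	subsequentNackDelay time.Duration
}

// NewEvalBroker copies the configured delays onto the broker, mirroring how
// EvalNackTimeout is already threaded through.
func NewEvalBroker(c *Config) *EvalBroker {
	return &EvalBroker{
		initialNackDelay:    c.NackInitialReenqueueDelay,
		subsequentNackDelay: c.NackSubsequentReenqueueDelay,
	}
}
```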
nomad/leader.go
// failedEvalFollowUpWaitRange defines the range of additional time from
// the minimum in which to wait before retrying a failed evaluation. A value
// from this range should be selected using a uniform distribution.
failedEvalFollowUpWaitRange = 9 * time.Minute
This seems like a really wide window. I would start more conservatively, say a 1 minute baseline with a 5 minute max.
We might consider threading these through config as well. I can see potentially wanting to tune this.
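A small sketch of picking the follow-up wait uniformly from the suggested window; the names are illustrative, and the 1 minute baseline / 5 minute max come from the comment above, so the extra range is 4 minutes.

```go
package leader

import (
	"math/rand"
	"time"
)

const (
	// Reviewer-suggested window: 1 minute baseline, 5 minute max, so the
	// additional range is 4 minutes. Names are illustrative.
	failedEvalFollowUpBaselineWait = 1 * time.Minute
	failedEvalFollowUpWaitRange    = 4 * time.Minute
)

// followUpEvalWait picks the wait before retrying a failed evaluation
// uniformly from [baseline, baseline+range].
func followUpEvalWait() time.Duration {
	return failedEvalFollowUpBaselineWait +
		time.Duration(rand.Int63n(int64(failedEvalFollowUpWaitRange)))
}
```

Spreading the retries uniformly over the window keeps a burst of failed evaluations from all waking the scheduler again at the same instant.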
nomad/structs/structs.go
EvalTriggerNodeUpdate     = "node-update"
EvalTriggerScheduled      = "scheduled"
EvalTriggerRollingUpdate  = "rolling-update"
EvalTriggerFailedFollowUp = "failed-eval-follow-up"
I think you can omit "eval" and use "failed-follow-up"; since it's on an eval, that's implied.
nomad/leader.go
// Update via Raft
req := structs.EvalUpdateRequest{
-	Evals: []*structs.Evaluation{newEval},
+	Evals: []*structs.Evaluation{newEval, followupEval},
Can you rename `newEval` to `updateEval`? It took me a while to realize it's not a "new" eval, just an update of the existing one.
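To make the diff above concrete, here is a simplified sketch of reaping a failed evaluation: the existing eval is updated (the `updateEval` rename suggested above) and a follow-up eval is created in the same Raft request. The `Evaluation` and `EvalUpdateRequest` types are pared-down stand-ins, not the real nomad/structs types; `followUpEvalWait` is the helper from the earlier sketch, and the trigger string uses the shorter name suggested above.

```go
package leader

import "time"

// Pared-down stand-ins for the nomad/structs types referenced in the diff.
type Evaluation struct {
	ID          string
	JobID       string
	Status      string
	TriggeredBy string
	Wait        time.Duration
}

type EvalUpdateRequest struct {
	Evals []*Evaluation
}

// reapFailedEval marks the exhausted eval as failed and creates a follow-up
// eval with a randomized wait so the job still makes eventual progress.
func reapFailedEval(eval *Evaluation, newID func() string) *EvalUpdateRequest {
	// updateEval is the existing eval with its status updated, not a new one.
	updateEval := *eval
	updateEval.Status = "failed"

	followupEval := &Evaluation{
		ID:          newID(),
		JobID:       eval.JobID,
		Status:      "pending",
		TriggeredBy: "failed-follow-up",
		Wait:        followUpEvalWait(),
	}

	// Submit both evals in one Raft update, mirroring the diff above.
	return &EvalUpdateRequest{
		Evals: []*Evaluation{&updateEval, followupEval},
	}
}
```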
Left some comments, but LGTM!