Back-pressure on Nacks and ensure scheduling progress on failures #2555
Conversation
This PR adds two things to increase robustness under high contention:

- Add a delay when an evaluation is Nacked, starting small but compounding into a larger delay for subsequent Nacks. This creates some back-pressure.
- Create a follow-up evaluation when reaping failed evaluations. This ensures that a job will still make eventual progress.
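A rough sketch of the back-pressure half of this, assuming illustrative constant names and values (it mirrors the shape of the diff discussed below, not the final code; the review later suggests making these delays configurable):

```go
package broker

import "time"

// Illustrative values; initialNackReenqueueDelay appears in the diff below,
// while subsequentNackReenqueueDelay and its value are assumptions.
const (
	initialNackReenqueueDelay    = 1 * time.Second
	subsequentNackReenqueueDelay = 20 * time.Second
)

// nackReenqueueDelay returns how long to wait before re-enqueuing a Nacked
// evaluation: nothing before the first dequeue, a small delay after the
// first Nack, and a delay that compounds with every further Nack.
func nackReenqueueDelay(prevDequeues int) time.Duration {
	var delay time.Duration
	switch {
	case prevDequeues <= 0:
		// Never dequeued yet: no back-pressure.
	case prevDequeues == 1:
		delay = initialNackReenqueueDelay
	default:
		delay = time.Duration(prevDequeues-1) * subsequentNackReenqueueDelay
	}
	return delay
}
```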
nomad/eval_broker.go
var delay time.Duration

switch {
case prevDequeues <= 0:
Can we make each case just return an explicit value? That makes it easier to follow, especially since we don't post-process the value.
Not sure what you mean here. The number of retries is a config option, so I'm not sure how you'd enumerate all the possibilities.
I mean instead of setting `delay` in the outer block with an empty case clause, do an explicit `return` in each case and remove the outer variable.
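A minimal sketch of the refactor being suggested, reusing the illustrative constants from the sketch above: each case returns an explicit value and the outer `delay` variable disappears.

```go
// nackReenqueueDelay, refactored so each case returns directly; assumes the
// same illustrative constants as the earlier sketch.
func nackReenqueueDelay(prevDequeues int) time.Duration {
	switch {
	case prevDequeues <= 0:
		return 0
	case prevDequeues == 1:
		return initialNackReenqueueDelay
	default:
		// Compound the delay for every Nack after the first.
		return time.Duration(prevDequeues-1) * subsequentNackReenqueueDelay
	}
}
```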
nomad/eval_broker.go
// initialNackReenqueueDelay is the delay applied before re-enqueuing a
// Nacked evaluation for the first time
initialNackReenqueueDelay = time.Second
Let's thread these through a config like `EvalNackTimeout`; that way there isn't a weird hack for testing either.
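A hedged sketch of what threading these through config might look like; the `Config` fields, the `EvalBroker` fields, and the `NewEvalBroker` signature here are assumptions for illustration, not the real Nomad API.

```go
package broker

import "time"

// Config carries the broker tunables so tests can shrink them instead of
// patching package-level constants. Field names are illustrative.
type Config struct {
	// NackInitialReenqueueDelay is applied before re-enqueuing an
	// evaluation after its first Nack.
	NackInitialReenqueueDelay time.Duration

	// NackSubsequentReenqueueDelay is compounded for every Nack after
	// the first.
	NackSubsequentReenqueueDelay time.Duration
}

// EvalBroker is a pared-down stand-in for the real broker struct.
type EvalBroker struct {
	initialNackDelay    time.Duration
	subsequentNackDelay time.Duration
}

// NewEvalBroker copies the configured delays onto the broker, mirroring how
// EvalNackTimeout is already threaded through.
func NewEvalBroker(c *Config) *EvalBroker {
	return &EvalBroker{
		initialNackDelay:    c.NackInitialReenqueueDelay,
		subsequentNackDelay: c.NackSubsequentReenqueueDelay,
	}
}
```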
nomad/leader.go
// failedEvalFollowUpWaitRange defines the range of additional time from
// the minimum in which to wait before retrying a failed evaluation. A value
// from this range should be selected using a uniform distribution.
failedEvalFollowUpWaitRange = 9 * time.Minute
This seems like a really wide window. I would start more conservatively, say a 1 minute baseline with a 5 minute max.
We might consider threading these through config as well. I can see potentially wanting to tune this.
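A small sketch of picking the follow-up wait uniformly from the suggested window; the names are illustrative, and the 1 minute baseline / 5 minute max come from the comment above, so the extra range is 4 minutes.

```go
package leader

import (
	"math/rand"
	"time"
)

const (
	// Reviewer-suggested window: 1 minute baseline, 5 minute max, so the
	// additional range is 4 minutes. Names are illustrative.
	failedEvalFollowUpBaselineWait = 1 * time.Minute
	failedEvalFollowUpWaitRange    = 4 * time.Minute
)

// followUpEvalWait picks the wait before retrying a failed evaluation
// uniformly from [baseline, baseline+range].
func followUpEvalWait() time.Duration {
	return failedEvalFollowUpBaselineWait +
		time.Duration(rand.Int63n(int64(failedEvalFollowUpWaitRange)))
}
```

Spreading the retries uniformly over the window keeps a burst of failed evaluations from all waking the scheduler again at the same instant.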
nomad/structs/structs.go
EvalTriggerNodeUpdate     = "node-update"
EvalTriggerScheduled      = "scheduled"
EvalTriggerRollingUpdate  = "rolling-update"
EvalTriggerFailedFollowUp = "failed-eval-follow-up"
I think you can omit "eval" and use "failed-follow-up"; since it's on an eval, that's implied.
nomad/leader.go
// Update via Raft
req := structs.EvalUpdateRequest{
-	Evals: []*structs.Evaluation{newEval},
+	Evals: []*structs.Evaluation{newEval, followupEval},
Can you rename `newEval` to `updateEval`? It took me a while to realize it's not a "new" eval, just an update of the existing one.
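To make the diff above concrete, here is a simplified sketch of reaping a failed evaluation: the existing eval is updated (the `updateEval` rename suggested above) and a follow-up eval is created in the same Raft request. The `Evaluation` and `EvalUpdateRequest` types are pared-down stand-ins, not the real nomad/structs types; `followUpEvalWait` is the helper from the earlier sketch, and the trigger string uses the shorter name suggested above.

```go
package leader

import "time"

// Pared-down stand-ins for the nomad/structs types referenced in the diff.
type Evaluation struct {
	ID          string
	JobID       string
	Status      string
	TriggeredBy string
	Wait        time.Duration
}

type EvalUpdateRequest struct {
	Evals []*Evaluation
}

// reapFailedEval marks the exhausted eval as failed and creates a follow-up
// eval with a randomized wait so the job still makes eventual progress.
func reapFailedEval(eval *Evaluation, newID func() string) *EvalUpdateRequest {
	// updateEval is the existing eval with its status updated, not a new one.
	updateEval := *eval
	updateEval.Status = "failed"

	followupEval := &Evaluation{
		ID:          newID(),
		JobID:       eval.JobID,
		Status:      "pending",
		TriggeredBy: "failed-follow-up",
		Wait:        followUpEvalWait(),
	}

	// Submit both evals in one Raft update, mirroring the diff above.
	return &EvalUpdateRequest{
		Evals: []*Evaluation{&updateEval, followupEval},
	}
}
```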
Left some comments, but LGTM!