Restart unhealthy tasks #3105
Conversation
I would vote to make the grace period just a grace period, not including the interval; that's more straightforward. How long you want to wait before checking is very different from what interval you want once you are ready.
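To make the distinction concrete, here is a toy sketch where `grace` is measured purely from task start and the unhealthy window (`limit * interval`) is tracked separately; all names are hypothetical, not the PR's code:

```go
package main

import (
	"fmt"
	"time"
)

// restartEligible treats grace as a pure startup grace period, separate
// from the polling interval (hypothetical helper, not the PR's code).
func restartEligible(taskStart, unhealthySince, now time.Time, grace, timeLimit time.Duration) bool {
	if now.Before(taskStart.Add(grace)) {
		return false // still inside the startup grace period
	}
	// After grace, the check must stay unhealthy for the full time limit.
	return !now.Before(unhealthySince.Add(timeLimit))
}

func main() {
	start := time.Now().Add(-3 * time.Minute)
	unhealthy := time.Now().Add(-90 * time.Second)
	fmt.Println(restartEligible(start, unhealthy, time.Now(), 90*time.Second, 60*time.Second)) // true
}
```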
client/task_runner.go
@@ -1674,6 +1693,25 @@ func (r *TaskRunner) Restart(source, reason string) {
	}
}

// RestartBy deadline. Restarts a task iff the last time it was started was
// before the deadline. Returns true if restart occurs; false if skipped.
func (r *TaskRunner) RestartBy(deadline time.Time, source, reason string) {
it doesn't return anything?
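One way the mismatch could be resolved is returning the bool the comment promises; a sketch assuming hypothetical `startedAt`/`startedLock` fields (the PR may instead have simply corrected the comment):

```go
// RestartBy restarts the task iff it was last started before deadline.
// Returns true if a restart was triggered, false if it was skipped.
func (r *TaskRunner) RestartBy(deadline time.Time, source, reason string) bool {
	r.startedLock.Lock() // hypothetical mutex guarding startedAt
	started := r.startedAt
	r.startedLock.Unlock()

	if !started.Before(deadline) {
		return false // task was (re)started after the deadline; skip
	}
	r.Restart(source, reason)
	return true
}
```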
Should we wait for this PR to be merged before the next release? It's a really important feature and I'm sure it's blocking some users from starting to use Docker.
We decided kind of the opposite. :) Because this is such an important feature we didn't want to rush it into the 0.6.1 point release. The delay since then was caused by me being on vacation, but rest assured this feature is my highest priority.
@@ -439,7 +439,8 @@ func (tm *TaskTemplateManager) handleTemplateRerenders(allRenderedTime time.Time
	}

	if restart {
-		tm.config.Hooks.Restart(consulTemplateSourceName, "template with change_mode restart re-rendered")
+		const failure = false
+		tm.config.Hooks.Restart(consulTemplateSourceName, "template with change_mode restart re-rendered", failure)
My little const pattern is pretty funky, and I'd be happy to remove it.
I do it because I hate seeing method calls with literal booleans in them and having no idea what those booleans do without looking at the method signature and/or docs.
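For illustration, a self-contained toy version of the pattern (the `restart` function is a stand-in, not Nomad's API):

```go
package main

import "fmt"

// restart stands in for the Restart hook discussed above (hypothetical).
func restart(source, reason string, failure bool) {
	fmt.Printf("restart(%s, %s, failure=%v)\n", source, reason, failure)
}

func main() {
	// Named constant: the call site documents what the boolean means.
	const failure = false
	restart("consul-template", "template re-rendered", failure)

	// Bare literal: the reader must consult the signature to decode it.
	restart("consul-template", "template re-rendered", false)
}
```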
@@ -1251,8 +1269,7 @@ func TestTaskRunner_Template_NewVaultToken(t *testing.T) {
	})

	// Error the token renewal
-	vc := ctx.tr.vaultClient.(*vaultclient.MockVaultClient)
-	renewalCh, ok := vc.RenewTokens[token]
+	renewalCh, ok := ctx.vault.RenewTokens[token]
Sorry for the unrelated test changes. Just made the mocks easier to access.
@@ -25,3 +27,119 @@ func (m *MockCatalog) Service(service, tag string, q *api.QueryOptions) ([]*api.
	m.logger.Printf("[DEBUG] mock_consul: Service(%q, %q, %#v) -> (nil, nil, nil)", service, tag, q)
	return nil, nil, nil
}

// MockAgent is a fake in-memory Consul backend for ServiceClient.
type MockAgent struct {
Not new; moved from unit_test.go and exported for use in client/.
// Must test >= because if limit=1, restartAt == first failure
if now.Equal(restartAt) || now.After(restartAt) {
	// hasn't become healthy by deadline, restart!
	c.logger.Printf("[DEBUG] consul.health: restarting alloc %q task %q due to unhealthy check %q", c.allocID, c.taskName, c.checkName)
INFO or WARN level? Left at DEBUG since Task Events are probably a more accessible source for this information.
		execs: make(chan int, 100),
	}
}

// fakeConsul is a fake in-memory Consul backend for ServiceClient.
Exported as MockAgent
- `limit` `(int: 0)` - Restart task after `limit` failing health checks. For
  example 1 causes a restart on the first failure. The default, `0`, disables
  healtcheck based restarts. Failures must be consecutive. A single passing
healtcheck -> health check
In this example the `mysqld` task has `90s` from startup to begin passing
healthchecks. After the grace period if `mysqld` would remain unhealthy for
`60s` (as determined by `limit * interval`) it would be restarted after `8s`
restarted after 8s isn't really accurate. It would be killed and then wait 8s till starting again. Where is the .25 coming from?
@@ -162,6 +168,72 @@ scripts.
- `tls_skip_verify` `(bool: false)` - Skip verifying TLS certificates for HTTPS
  checks. Requires Consul >= 0.7.2.

#### `check_restart` Stanza
This is probably worth pulling out into its own page and sidebar?
nomad/structs/structs.go
@@ -2916,6 +2988,9 @@ type Service struct {

	Tags   []string        // List of tags for the service
	Checks []*ServiceCheck // List of checks associated with the service

	// CheckRestart will be propagated to Checks if set.
	CheckRestart *CheckRestart
This should only exist on checks
nomad/structs/structs.go
// If CheckRestart is set propagate it to checks
if s.CheckRestart != nil {
	for _, check := range s.Checks {
This logic should be done at the API layer. Internal structs should represent how it is actually used (check_restart only on checks). See how the update stanza is handled.
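A sketch of what the API-layer approach could look like; the method body is hypothetical and only mirrors the description above, not the merged code:

```go
// Canonicalize pushes a service-level check_restart down onto each check
// so the internal structs only ever carry it on checks (sketch).
func (s *Service) Canonicalize() {
	if s.CheckRestart == nil {
		return
	}
	for _, check := range s.Checks {
		if check.CheckRestart == nil {
			cr := *s.CheckRestart
			check.CheckRestart = &cr
		}
	}
	s.CheckRestart = nil // internal representation: checks only
}
```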
client/restarts.go
@@ -196,6 +209,25 @@ func (r *RestartTracker) handleWaitResult() (string, time.Duration) {
	return structs.TaskRestarting, r.jitter()
}

// handleFailure returns the new state and potential wait duration for
// restarting the task due to a failure like an unhealthy Consul check.
func (r *RestartTracker) handleFailure() (string, time.Duration) {
Can you refactor the three handle methods to use one common base?
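One possible shape for the shared base, reusing identifiers that appear elsewhere in this PR; `ReasonWithinPolicy` and `structs.TaskNotRestarting` are assumptions:

```go
// restartWithinPolicy holds the attempt-counting and delay logic the
// three handle* methods share (a sketch of the suggested refactor).
func (r *RestartTracker) restartWithinPolicy() (string, time.Duration) {
	if r.count > r.policy.Attempts {
		if r.policy.Mode == structs.RestartPolicyModeFail {
			return structs.TaskNotRestarting, 0 // assumed state name
		}
		// Delay mode: wait out the rest of the interval, then restart.
		r.reason = ReasonDelay
		return structs.TaskRestarting, r.getDelay()
	}
	r.reason = ReasonWithinPolicy // hypothetical reason constant
	return structs.TaskRestarting, r.jitter()
}
```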
client/task_runner.go
} else {
	// Since the restart isn't from a failure, restart immediately
	// and don't count against the restart policy
	r.restartTracker.SetRestartTriggered()
Would it be cleaner for SetRestartTriggered to just take the failure as a parameter?
command/agent/consul/client.go
}

// Update all watched checks as CheckRestart fields aren't part of ID
if check.Watched() {
Get a test ensuring this
func TestConsul_ChangeTags(t *testing.T) {
	ctx := setupFake()

	allocID := "allocid"
-	if err := ctx.ServiceClient.RegisterTask(allocID, ctx.Task, nil, nil); err != nil {
+	if err := ctx.ServiceClient.RegisterTask(allocID, ctx.Task, ctx.Restarter, nil, nil); err != nil {
Where are any of the tests ensuring that the watch gets created?
Added to TestConsul_ChangeChecks and TestConsul_RegServices
		t.Errorf("expected check 2 to not be restarted but found %d", n)
	}
}
Test checking that Watch() with updated check works
	remove: true,
}
select {
case w.watchCh <- &c:
create an unwatchCh and just pass the cid?
That would create a race between adding and removing watches since select cases are randomly selected. Instead I'll make watchCh -> watchUpdateCh and create a small checkRestartUpdate struct to handle the add vs remove case.
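A minimal sketch of that struct (field names are guesses, not the merged code):

```go
// checkRestartUpdate carries both adds and removes over one channel so
// their relative order is preserved (field names are guesses).
type checkRestartUpdate struct {
	checkID      string
	remove       bool
	checkRestart *checkRestart // nil when remove is true
}
```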
for {
	// Don't start watching until we actually have checks that
	// trigger restarts.
	for len(checks) == 0 {
Let's just create a single select block. We can start the timer when we get a watch: https://github.com/hashicorp/nomad/blob/master/client/alloc_runner_health_watcher.go#L447
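Roughly what that single select block could look like, assuming the hypothetical `checkRestartUpdate` struct above and made-up names like `w.checkUpdateCh` and `w.pollInterval`:

```go
for {
	select {
	case update := <-w.checkUpdateCh: // hypothetical channel name
		if update.remove {
			delete(checks, update.checkID)
			continue
		}
		checks[update.checkID] = update.checkRestart
		if len(checks) == 1 {
			checkTimer.Reset(w.pollInterval) // first watch: start polling
		}
	case <-checkTimer.C:
		// Poll Consul check statuses here, then re-arm the timer.
		checkTimer.Reset(w.pollInterval)
	}
}
```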
checks := map[string]*checkRestart{}

// timer for check polling
checkTimer := time.NewTimer(0)
A select on this will fire immediately. You need to stop it first and select from the channel
Fixed below
// Begin polling
if !checkTimer.Stop() {
	<-checkTimer.C
Do this with a select with a default
This is straight from the docs: https://golang.org/pkg/time/#Timer.Reset
...but I always forget this caveat on Stop: "assuming the program has not received from t.C already". Oof, this API gets me every time.
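Combining the two suggestions gives a drain that is safe even if the channel was already received from; a sketch (`pollInterval` is a stand-in name):

```go
// Safe re-arm: Stop reports false when the timer already fired, but the
// value may have been consumed elsewhere, so drain without blocking.
if !checkTimer.Stop() {
	select {
	case <-checkTimer.C: // a fire was pending; drain it
	default: // already received from; nothing to drain
	}
}
checkTimer.Reset(pollInterval)
```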
c.logger.Printf("[DEBUG] consul.health: alloc %q task %q check %q became unhealthy. Restarting in %s if not healthy",
	c.allocID, c.taskName, c.checkName, c.timeLimit)
}
c.unhealthyStart = now
Can you rename unhealthyStart to unhealthyCheck?
At no other point do I refer to a single run of a check as a check in this file, so I'm not sure I understand the renaming. We don't even have visibility into individual runs of a check since we poll statuses at a fixed interval.
restartDelay time.Duration
grace        time.Duration
interval     time.Duration
timeLimit    time.Duration
document non-obvious fields: interval, timeLimit
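For illustration, the comments might look something like this (wording is a guess, not the merged code):

```go
type checkRestart struct {
	restartDelay time.Duration // restart policy's delay before the task starts again
	grace        time.Duration // startup grace before failures start counting
	interval     time.Duration // how often check statuses are polled
	timeLimit    time.Duration // limit * interval; how long a check may stay unhealthy
}
```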
Treat warnings as unhealthy by default
Reusing checkRestart for both adds/removes and the main check restarting logic was confusing.
Before this commit, if a task had 2 checks causing restarts at the same time, both would trigger restarts of the task! This change removes all checks for a task whenever one of them triggers a restart.
Watched was a silly name
All 3 error/failure cases share restart logic, but 2 of them have special cased conditions.
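A sketch of that removal behavior (the map layout and the `triggered` variable are assumptions):

```go
// After one unhealthy check triggers a restart, unwatch every check on
// the same task so a second check can't restart it again.
for id, c := range checks {
	if c.allocID == triggered.allocID && c.taskName == triggered.taskName {
		delete(checks, id)
	}
}
```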
api/tasks.go
s.CheckRestart.Canonicalize()

for _, c := range s.Checks {
Place a comment
client/restarts.go
	return r
}

// SetRestartTriggered is used to mark that the task has been signalled to be
// restarted
-func (r *RestartTracker) SetRestartTriggered() *RestartTracker {
+func (r *RestartTracker) SetRestartTriggered(failure bool) *RestartTracker {
Comment on the param
@@ -99,7 +99,7 @@ func TestClient_RestartTracker_RestartTriggered(t *testing.T) {
	p := testPolicy(true, structs.RestartPolicyModeFail)
	p.Attempts = 0
	rt := newRestartTracker(p, structs.JobTypeService)
-	if state, when := rt.SetRestartTriggered().GetState(); state != structs.TaskRestarting && when != 0 {
+	if state, when := rt.SetRestartTriggered(false).GetState(); state != structs.TaskRestarting && when != 0 {
Unit test the new case
Added cde908e
client/restarts.go
-r.reason = ReasonDelay
-return structs.TaskRestarting, r.getDelay()
+// Handle restarts due to failures
+if r.failure {
Invert it:
if !r.failure {
return "", 0
}
...
client/restarts.go
-} else {
-	r.reason = ReasonDelay
-	return structs.TaskRestarting, r.getDelay()
+if r.count > r.policy.Attempts {
Can you comment how you get to this case.
check_restart {
  limit = 3
  grace = "90s"

Remove space
- `limit` `(int: 0)` - Restart task when a health check has failed `limit`
  times. For example 1 causes a restart on the first failure. The default,
  `0`, disables healtcheck based restarts. Failures must be consecutive. A
healtcheck. Can you run a spell check on the doc sections
```

The [`restart` stanza][restart_stanza] controls the restart behavior of the
task. In this case it will wait 10 seconds before restarting. Note that even if
"In this case it will stop the task and then wait 10 seconds before starting it again."
Delete the subsequent sentence.
	}
}

// Unwatch a task.
Unwatch a check
	}
}

// Watch a task and restart it if unhealthy.
Watch a check and restart it's task if it becomes unhealthy
@@ -30,6 +30,8 @@ unhealthy for the `limit` specified in a `check_restart` stanza, it is
restarted according to the task group's [`restart` policy][restart_stanza]. The
`check_restart` settings apply to [`check`s][check_stanza], but may also be
placed on [`service`s][service_stanza] to apply to all checks on a service.
`check_restart` settings on `service` will only overwrite unset `check_restart`
If `check_restart` is set on both the check and the service, the stanzas are merged with the check values taking precedence.
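A hedged sketch of that merge in Go, with hypothetical field names mirroring the stanza's `limit` and `grace` keys:

```go
// mergeCheckRestart overlays service-level values onto a check's, with
// the check's own settings winning (not necessarily Nomad's real code).
func mergeCheckRestart(check, service *CheckRestart) *CheckRestart {
	if check == nil {
		return service
	}
	if service == nil {
		return check
	}
	merged := *check
	if merged.Limit == 0 {
		merged.Limit = service.Limit
	}
	if merged.Grace == 0 {
		merged.Grace = service.Grace
	}
	return &merged
}
```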
Amazing @schmichael !
Fixes #876
Unhealthy checks can now restart tasks. See docs in PR for intended behavior.