Restart unhealthy tasks #3105
Conversation
I would vote to make the grace period just a grace period, not including the interval; that's more straightforward. How long you want to wait before checking is very different from what interval you want once you are ready.
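To make the distinction concrete, here is a toy sketch where `grace` is measured purely from task start and the unhealthy window (`limit * interval`) is tracked separately; all names are hypothetical, not the PR's code:

```go
package main

import (
	"fmt"
	"time"
)

// restartEligible treats grace as a pure startup grace period, separate
// from the polling interval (hypothetical helper, not the PR's code).
func restartEligible(taskStart, unhealthySince, now time.Time, grace, timeLimit time.Duration) bool {
	if now.Before(taskStart.Add(grace)) {
		return false // still inside the startup grace period
	}
	// After grace, the check must stay unhealthy for the full time limit.
	return !now.Before(unhealthySince.Add(timeLimit))
}

func main() {
	start := time.Now().Add(-3 * time.Minute)
	unhealthy := time.Now().Add(-90 * time.Second)
	fmt.Println(restartEligible(start, unhealthy, time.Now(), 90*time.Second, 60*time.Second)) // true
}
```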
client/task_runner.go
@@ -1674,6 +1693,25 @@ func (r *TaskRunner) Restart(source, reason string) {
	}
}

// RestartBy deadline. Restarts a task iff the last time it was started was
// before the deadline. Returns true if restart occurs; false if skipped.
func (r *TaskRunner) RestartBy(deadline time.Time, source, reason string) {
it doesn't return anything?
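One way the mismatch could be resolved is returning the bool the comment promises; a sketch assuming hypothetical `startedAt`/`startedLock` fields (the PR may instead have simply corrected the comment):

```go
// RestartBy restarts the task iff it was last started before deadline.
// Returns true if a restart was triggered, false if it was skipped.
func (r *TaskRunner) RestartBy(deadline time.Time, source, reason string) bool {
	r.startedLock.Lock() // hypothetical mutex guarding startedAt
	started := r.startedAt
	r.startedLock.Unlock()

	if !started.Before(deadline) {
		return false // task was (re)started after the deadline; skip
	}
	r.Restart(source, reason)
	return true
}
```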
Should we wait for this PR to be merged before the next release? It's a really important feature and I'm sure it's blocking some users from starting to use Docker.
We decided kind of the opposite. :) Because this is such an important feature we didn't want to rush it into the 0.6.1 point release. The delay since then was caused by me being on vacation, but rest assured this feature is my highest priority.
@@ -439,7 +439,8 @@ func (tm *TaskTemplateManager) handleTemplateRerenders(allRenderedTime time.Time
	}

	if restart {
-		tm.config.Hooks.Restart(consulTemplateSourceName, "template with change_mode restart re-rendered")
+		const failure = false
+		tm.config.Hooks.Restart(consulTemplateSourceName, "template with change_mode restart re-rendered", failure)
My little const pattern is pretty funky, and I'd be happy to remove it.
I do it because I hate seeing method calls with literal booleans in them and having no idea what those booleans do without looking at the method signature and/or docs.
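For illustration, a self-contained toy version of the pattern (the `restart` function is a stand-in, not Nomad's API):

```go
package main

import "fmt"

// restart stands in for the Restart hook discussed above (hypothetical).
func restart(source, reason string, failure bool) {
	fmt.Printf("restart(%s, %s, failure=%v)\n", source, reason, failure)
}

func main() {
	// Named constant: the call site documents what the boolean means.
	const failure = false
	restart("consul-template", "template re-rendered", failure)

	// Bare literal: the reader must consult the signature to decode it.
	restart("consul-template", "template re-rendered", false)
}
```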
@@ -1251,8 +1269,7 @@ func TestTaskRunner_Template_NewVaultToken(t *testing.T) {
	})

	// Error the token renewal
-	vc := ctx.tr.vaultClient.(*vaultclient.MockVaultClient)
-	renewalCh, ok := vc.RenewTokens[token]
+	renewalCh, ok := ctx.vault.RenewTokens[token]
Sorry for the unrelated test changes. Just made the mocks easier to access.
@@ -25,3 +27,119 @@ func (m *MockCatalog) Service(service, tag string, q *api.QueryOptions) ([]*api.
	m.logger.Printf("[DEBUG] mock_consul: Service(%q, %q, %#v) -> (nil, nil, nil)", service, tag, q)
	return nil, nil, nil
}

// MockAgent is a fake in-memory Consul backend for ServiceClient.
type MockAgent struct {
Not new; moved from unit_test.go and exported for use in client/.
// Must test >= because if limit=1, restartAt == first failure
if now.Equal(restartAt) || now.After(restartAt) {
	// hasn't become healthy by deadline, restart!
	c.logger.Printf("[DEBUG] consul.health: restarting alloc %q task %q due to unhealthy check %q", c.allocID, c.taskName, c.checkName)
INFO or WARN level? Left at DEBUG since Task Events are probably a more accessible source for this information.
		execs: make(chan int, 100),
	}
}

// fakeConsul is a fake in-memory Consul backend for ServiceClient.
Exported as MockAgent
- `limit` `(int: 0)` - Restart task after `limit` failing health checks. For
  example 1 causes a restart on the first failure. The default, `0`, disables
  healtcheck based restarts. Failures must be consecutive. A single passing
healtcheck -> health check
In this example the `mysqld` task has `90s` from startup to begin passing
healthchecks. After the grace period if `mysqld` would remain unhealthy for
`60s` (as determined by `limit * interval`) it would be restarted after `8s`
restarted after 8s isn't really accurate. It would be killed and then wait 8s till starting again. Where is the .25 coming from?
@@ -162,6 +168,72 @@ scripts.
- `tls_skip_verify` `(bool: false)` - Skip verifying TLS certificates for HTTPS
  checks. Requires Consul >= 0.7.2.

#### `check_restart` Stanza
This is probably worth pulling out into its own page and sidebar?
nomad/structs/structs.go
@@ -2916,6 +2988,9 @@ type Service struct {

	Tags   []string        // List of tags for the service
	Checks []*ServiceCheck // List of checks associated with the service

	// CheckRestart will be propagated to Checks if set.
	CheckRestart *CheckRestart
This should only exist on checks
nomad/structs/structs.go
// If CheckRestart is set propagate it to checks
if s.CheckRestart != nil {
	for _, check := range s.Checks {
This logic should be done at the API layer. Internal structs should represent how it is actually used (check_restart only on checks). See how the update stanza is handled.
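A sketch of what the API-layer approach could look like; the method body is hypothetical and only mirrors the description above, not the merged code:

```go
// Canonicalize pushes a service-level check_restart down onto each check
// so the internal structs only ever carry it on checks (sketch).
func (s *Service) Canonicalize() {
	if s.CheckRestart == nil {
		return
	}
	for _, check := range s.Checks {
		if check.CheckRestart == nil {
			cr := *s.CheckRestart
			check.CheckRestart = &cr
		}
	}
	s.CheckRestart = nil // internal representation: checks only
}
```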
client/restarts.go
@@ -196,6 +209,25 @@ func (r *RestartTracker) handleWaitResult() (string, time.Duration) {
	return structs.TaskRestarting, r.jitter()
}

// handleFailure returns the new state and potential wait duration for
// restarting the task due to a failure like an unhealthy Consul check.
func (r *RestartTracker) handleFailure() (string, time.Duration) {
Can you refactor the three handle methods to use one common base?
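One possible shape for the shared base, reusing identifiers that appear elsewhere in this PR; `ReasonWithinPolicy` and `structs.TaskNotRestarting` are assumptions:

```go
// restartWithinPolicy holds the attempt-counting and delay logic the
// three handle* methods share (a sketch of the suggested refactor).
func (r *RestartTracker) restartWithinPolicy() (string, time.Duration) {
	if r.count > r.policy.Attempts {
		if r.policy.Mode == structs.RestartPolicyModeFail {
			return structs.TaskNotRestarting, 0 // assumed state name
		}
		// Delay mode: wait out the rest of the interval, then restart.
		r.reason = ReasonDelay
		return structs.TaskRestarting, r.getDelay()
	}
	r.reason = ReasonWithinPolicy // hypothetical reason constant
	return structs.TaskRestarting, r.jitter()
}
```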
client/task_runner.go
} else {
	// Since the restart isn't from a failure, restart immediately
	// and don't count against the restart policy
	r.restartTracker.SetRestartTriggered()
Would it be cleaner for SetRestartTriggered to just take the failure as a parameter?
command/agent/consul/client.go
}

// Update all watched checks as CheckRestart fields aren't part of ID
if check.Watched() {
Get a test ensuring this
func TestConsul_ChangeTags(t *testing.T) {
	ctx := setupFake()

	allocID := "allocid"
-	if err := ctx.ServiceClient.RegisterTask(allocID, ctx.Task, nil, nil); err != nil {
+	if err := ctx.ServiceClient.RegisterTask(allocID, ctx.Task, ctx.Restarter, nil, nil); err != nil {
Where are any of the tests ensuring that the watch gets created?
Added to TestConsul_ChangeChecks and TestConsul_RegServices
		t.Errorf("expected check 2 to not be restarted but found %d", n)
	}
}
Test checking that Watch() with updated check works
	remove: true,
}
select {
case w.watchCh <- &c:
create an unwatchCh and just pass the cid?
That would create a race between adding and removing watches since select cases are randomly selected. Instead I'll make watchCh -> watchUpdateCh and create a small checkRestartUpdate struct to handle the add vs remove case.
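A minimal sketch of that struct (field names are guesses, not the merged code):

```go
// checkRestartUpdate carries both adds and removes over one channel so
// their relative order is preserved (field names are guesses).
type checkRestartUpdate struct {
	checkID      string
	remove       bool
	checkRestart *checkRestart // nil when remove is true
}
```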
for {
	// Don't start watching until we actually have checks that
	// trigger restarts.
	for len(checks) == 0 {
Let's just create a single select block. We can start the timer when we get a watch: https://github.com/hashicorp/nomad/blob/master/client/alloc_runner_health_watcher.go#L447
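Roughly what that single select block could look like, assuming the hypothetical `checkRestartUpdate` struct above and made-up names like `w.checkUpdateCh` and `w.pollInterval`:

```go
for {
	select {
	case update := <-w.checkUpdateCh: // hypothetical channel name
		if update.remove {
			delete(checks, update.checkID)
			continue
		}
		checks[update.checkID] = update.checkRestart
		if len(checks) == 1 {
			checkTimer.Reset(w.pollInterval) // first watch: start polling
		}
	case <-checkTimer.C:
		// Poll Consul check statuses here, then re-arm the timer.
		checkTimer.Reset(w.pollInterval)
	}
}
```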
checks := map[string]*checkRestart{}

// timer for check polling
checkTimer := time.NewTimer(0)
A select on this will fire immediately. You need to stop it first and select from the channel
Fixed below
// Begin polling
if !checkTimer.Stop() {
	<-checkTimer.C
Do this with a select with a default
This is straight from the docs: https://golang.org/pkg/time/#Timer.Reset
...but I always forget this caveat on Stop: "assuming the program has not received from t.C already". Oof, this API gets me every time.
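Combining the two suggestions gives a drain that is safe even if the channel was already received from; a sketch (`pollInterval` is a stand-in name):

```go
// Safe re-arm: Stop reports false when the timer already fired, but the
// value may have been consumed elsewhere, so drain without blocking.
if !checkTimer.Stop() {
	select {
	case <-checkTimer.C: // a fire was pending; drain it
	default: // already received from; nothing to drain
	}
}
checkTimer.Reset(pollInterval)
```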
c.logger.Printf("[DEBUG] consul.health: alloc %q task %q check %q became unhealthy. Restarting in %s if not healthy",
	c.allocID, c.taskName, c.checkName, c.timeLimit)
}
c.unhealthyStart = now
Can you rename unhealthyStart to unhealthyCheck?
At no other point do I refer to a single run of a check as a check in this file, so I'm not sure I understand the renaming. We don't even have visibility into individual runs of a check since we poll statuses at a fixed interval.
restartDelay time.Duration
grace        time.Duration
interval     time.Duration
timeLimit    time.Duration
document non-obvious fields: interval, timeLimit
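For illustration, the comments might look something like this (wording is a guess, not the merged code):

```go
type checkRestart struct {
	restartDelay time.Duration // restart policy's delay before the task starts again
	grace        time.Duration // startup grace before failures start counting
	interval     time.Duration // how often check statuses are polled
	timeLimit    time.Duration // limit * interval; how long a check may stay unhealthy
}
```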
Treat warnings as unhealthy by default
Reusing checkRestart for both adds/removes and the main check restarting logic was confusing.
Before this commit, if a task had 2 checks causing restarts at the same time, both would trigger restarts of the task! This change removes all checks for a task whenever one of them triggers a restart.
Watched was a silly name
All 3 error/failure cases share restart logic, but 2 of them have special cased conditions.
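A sketch of that removal behavior (the map layout and the `triggered` variable are assumptions):

```go
// After one unhealthy check triggers a restart, unwatch every check on
// the same task so a second check can't restart it again.
for id, c := range checks {
	if c.allocID == triggered.allocID && c.taskName == triggered.taskName {
		delete(checks, id)
	}
}
```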
api/tasks.go
s.CheckRestart.Canonicalize()

for _, c := range s.Checks {
Place a comment
client/restarts.go
	return r
}

// SetRestartTriggered is used to mark that the task has been signalled to be
// restarted
-func (r *RestartTracker) SetRestartTriggered() *RestartTracker {
+func (r *RestartTracker) SetRestartTriggered(failure bool) *RestartTracker {
Comment on the param
@@ -99,7 +99,7 @@ func TestClient_RestartTracker_RestartTriggered(t *testing.T) {
	p := testPolicy(true, structs.RestartPolicyModeFail)
	p.Attempts = 0
	rt := newRestartTracker(p, structs.JobTypeService)
-	if state, when := rt.SetRestartTriggered().GetState(); state != structs.TaskRestarting && when != 0 {
+	if state, when := rt.SetRestartTriggered(false).GetState(); state != structs.TaskRestarting && when != 0 {
Unit test the new case
Added cde908e
client/restarts.go
-r.reason = ReasonDelay
-return structs.TaskRestarting, r.getDelay()
+// Handle restarts due to failures
+if r.failure {
Invert it:
if !r.failure {
return "", 0
}
...
client/restarts.go
-} else {
-	r.reason = ReasonDelay
-	return structs.TaskRestarting, r.getDelay()
+if r.count > r.policy.Attempts {
Can you comment how you get to this case.
check_restart {
  limit = 3
  grace = "90s"

Remove space
- `limit` `(int: 0)` - Restart task when a health check has failed `limit`
  times. For example 1 causes a restart on the first failure. The default,
  `0`, disables healtcheck based restarts. Failures must be consecutive. A
healtcheck. Can you run a spell check on the doc sections
```

The [`restart` stanza][restart_stanza] controls the restart behavior of the
task. In this case it will wait 10 seconds before restarting. Note that even if
"In this case it will stop the task and then wait 10 seconds before starting it again."
Delete the subsequent sentence.
	}
}

// Unwatch a task.
Unwatch a check
	}
}

// Watch a task and restart it if unhealthy.
Watch a check and restart it's task if it becomes unhealthy
@@ -30,6 +30,8 @@ unhealthy for the `limit` specified in a `check_restart` stanza, it is
restarted according to the task group's [`restart` policy][restart_stanza]. The
`check_restart` settings apply to [`check`s][check_stanza], but may also be
placed on [`service`s][service_stanza] to apply to all checks on a service.
`check_restart` settings on `service` will only overwrite unset `check_restart`
If `check_restart` is set on both the check and the service, the stanzas are merged with the check values taking precedence.
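A hedged sketch of that merge in Go, with hypothetical field names mirroring the stanza's `limit` and `grace` keys:

```go
// mergeCheckRestart overlays service-level values onto a check's, with
// the check's own settings winning (not necessarily Nomad's real code).
func mergeCheckRestart(check, service *CheckRestart) *CheckRestart {
	if check == nil {
		return service
	}
	if service == nil {
		return check
	}
	merged := *check
	if merged.Limit == 0 {
		merged.Limit = service.Limit
	}
	if merged.Grace == 0 {
		merged.Grace = service.Grace
	}
	return &merged
}
```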
Amazing @schmichael !
Fixes #876
Unhealthy checks can now restart tasks. See docs in PR for intended behavior.