Restart unhealthy tasks #3105
Merged
+1,632
−230
Changes from 7 commits
33 commits
a720bb5 Add restart fields (schmichael)
bd1a342 Nest restart fields in CheckRestart (schmichael)
1608e59 Add check watcher for restarting unhealthy tasks (schmichael)
ebbf87f Use existing restart policy infrastructure (schmichael)
555d1e2 on_warning=false -> ignore_warnings=false (schmichael)
c2d895d Add comments and move delay calc to TaskRunner (schmichael)
78c72f8 Default grace period to 1s (schmichael)
7e103f6 Document new check_restart stanza (schmichael)
850d991 Add changelog entry for #3105 (schmichael)
3db835c Improve check watcher logging and add tests (schmichael)
526528c Removed partially implemented allocLock (schmichael)
568b963 Remove unused lastStart field (schmichael)
9fb2865 Fix whitespace (schmichael)
092057a Canonicalize and Merge CheckRestart in api (schmichael)
8b8c164 Wrap check watch updates in a struct (schmichael)
237c096 Simplify from 2 select loops to one (schmichael)
f8e872c RestartDelay isn't needed as checks are re-added on restarts (schmichael)
40ed262 Handle multiple failing checks on a single task (schmichael)
5cd1d57 Watched -> TriggersRestart (schmichael)
10dc1c7 DRY up restart handling a bit. (schmichael)
3c0a42b Rename unhealthy var and fix test indeterminism (schmichael)
6f72270 Test check watch updates (schmichael)
a508bb9 Fold SetFailure into SetRestartTriggered (schmichael)
5141c95 Add check_restart to jobspec tests (schmichael)
1564e1c Move check_restart to its own section. (schmichael)
8014762 Add comments (schmichael)
cde908e Cleanup and test restart failure code (schmichael)
fa836d8 Name const after what it represents (schmichael)
924813d Test converting CheckRestart from api->structs (schmichael)
6bcf019 Test CheckRestart.Validate (schmichael)
10ae18c Minor corrections to check_restart docs (schmichael)
967825d Fix comments: task -> check (schmichael)
3d7446d @dadgar is better at words than me (schmichael)
@@ -74,7 +74,9 @@ func (r *RestartTracker) SetWaitResult(res *dstructs.WaitResult) *RestartTracker
 }

 // SetRestartTriggered is used to mark that the task has been signalled to be
-// restarted
+// restarted. Setting the failure to true restarts according to the restart
+// policy. When failure is false the task is restarted without considering the
+// restart policy.
 func (r *RestartTracker) SetRestartTriggered(failure bool) *RestartTracker {
Review comment: Comment on the param
 	r.lock.Lock()
 	defer r.lock.Unlock()

@@ -143,39 +145,42 @@ func (r *RestartTracker) GetState() (string, time.Duration) {
 	}

 	// Handle restarts due to failures
-	if r.failure {
-		if r.startErr != nil {
-			// If the error is not recoverable, do not restart.
-			if !structs.IsRecoverable(r.startErr) {
-				r.reason = ReasonUnrecoverableErrror
-				return structs.TaskNotRestarting, 0
-			}
-		} else if r.waitRes != nil {
-			// If the task started successfully and restart on success isn't specified,
-			// don't restart but don't mark as failed.
-			if r.waitRes.Successful() && !r.onSuccess {
-				r.reason = "Restart unnecessary as task terminated successfully"
-				return structs.TaskTerminated, 0
-			}
-		}
+	if !r.failure {
+		return "", 0
+	}

-		if r.count > r.policy.Attempts {
-			if r.policy.Mode == structs.RestartPolicyModeFail {
-				r.reason = fmt.Sprintf(
-					`Exceeded allowed attempts %d in interval %v and mode is "fail"`,
-					r.policy.Attempts, r.policy.Interval)
-				return structs.TaskNotRestarting, 0
-			} else {
-				r.reason = ReasonDelay
-				return structs.TaskRestarting, r.getDelay()
-			}
-		}
+	if r.startErr != nil {
+		// If the error is not recoverable, do not restart.
+		if !structs.IsRecoverable(r.startErr) {
+			r.reason = ReasonUnrecoverableErrror
+			return structs.TaskNotRestarting, 0
+		}
+	} else if r.waitRes != nil {
+		// If the task started successfully and restart on success isn't specified,
+		// don't restart but don't mark as failed.
+		if r.waitRes.Successful() && !r.onSuccess {
+			r.reason = "Restart unnecessary as task terminated successfully"
+			return structs.TaskTerminated, 0
+		}
+	}

-		r.reason = ReasonWithinPolicy
-		return structs.TaskRestarting, r.jitter()
-	}
+	// If this task has been restarted due to failures more times
+	// than the restart policy allows within an interval fail
+	// according to the restart policy's mode.
+	if r.count > r.policy.Attempts {
+		if r.policy.Mode == structs.RestartPolicyModeFail {
+			r.reason = fmt.Sprintf(
+				`Exceeded allowed attempts %d in interval %v and mode is "fail"`,
+				r.policy.Attempts, r.policy.Interval)
+			return structs.TaskNotRestarting, 0
+		} else {
+			r.reason = ReasonDelay
+			return structs.TaskRestarting, r.getDelay()
+		}
+	}

-	return "", 0
+	r.reason = ReasonWithinPolicy
+	return structs.TaskRestarting, r.jitter()
 }

 // getDelay returns the delay time to enter the next interval.

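The failure branch of `GetState` in the diff above can be modelled in isolation. The following is a minimal sketch, not the real `RestartTracker`: the `tracker` and `policy` types, the state and mode constants, and the `untilNextInterval` field are all hypothetical stand-ins, and the start-error and wait-result checks are left out. It only illustrates the early-return shape and the attempts/mode decision.

```go
package main

import "time"

// Hypothetical stand-ins for the task states and policy modes in the diff.
const (
	stateRestarting    = "restarting"
	stateNotRestarting = "not-restarting"
	modeFail           = "fail"
	modeDelay          = "delay"
)

// policy is a simplified, assumed restart policy.
type policy struct {
	Attempts int
	Delay    time.Duration
	Mode     string
}

// tracker is a toy model of the failure handling; it is not Nomad's type.
type tracker struct {
	policy            policy
	count             int           // restarts already made in the current interval
	failure           bool          // true when the restart was triggered as a failure
	untilNextInterval time.Duration // assumed wait until the next interval begins
}

// nextState returns the next state and how long to wait before acting on it.
func (t *tracker) nextState() (string, time.Duration) {
	// Nothing to decide unless a failure was recorded.
	if !t.failure {
		return "", 0
	}
	// More restarts than the policy allows: act according to the mode.
	if t.count > t.policy.Attempts {
		if t.policy.Mode == modeFail {
			return stateNotRestarting, 0
		}
		return stateRestarting, t.untilNextInterval
	}
	// Still within policy: restart after the configured delay.
	return stateRestarting, t.policy.Delay
}
```

With `Attempts: 2` and mode `fail`, a tracker whose `count` is 1 restarts after `Delay`, while one whose `count` is 3 stops restarting.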
@@ -30,6 +30,8 @@ unhealthy for the `limit` specified in a `check_restart` stanza, it is
 restarted according to the task group's [`restart` policy][restart_stanza]. The
 `check_restart` settings apply to [`check`s][check_stanza], but may also be
 placed on [`service`s][service_stanza] to apply to all checks on a service.
+`check_restart` settings on `service` will only overwrite unset `check_restart`
Review comment: If
+settings on `checks.`

 ```hcl
 job "mysql" {
@@ -66,7 +68,6 @@ job "mysql" {
   check_restart {
     limit = 3
     grace = "90s"
-
     ignore_warnings = false
   }
 }
@@ -78,7 +79,7 @@ job "mysql" {

 - `limit` `(int: 0)` - Restart task when a health check has failed `limit`
   times. For example 1 causes a restart on the first failure. The default,
-  `0`, disables healtcheck based restarts. Failures must be consecutive. A
+  `0`, disables health check based restarts. Failures must be consecutive. A
   single passing check will reset the count, so flapping services may not be
   restarted.

@@ -124,8 +125,8 @@ restart {
 ```

 The [`restart` stanza][restart_stanza] controls the restart behavior of the
-task. In this case it will wait 10 seconds before restarting. Note that even if
-the check passes in this time the restart will still occur.
+task. In this case it will stop the task and then wait 10 seconds before
+starting it again.

 Once the task restarts Nomad waits the `grace` period again before starting to
 check the task's health.

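The `limit` and `grace` semantics documented above (consecutive failures, reset on a passing result, `0` disables, grace applied again after each restart) can be sketched as a small counter. This is an illustration under stated assumptions, not the check watcher from this pull request: `checkState`, `observe`, and `restarted` are hypothetical names, and treating a `warning` result as healthy when `ignore_warnings` is true is an assumption, since the documentation excerpt does not spell that out.

```go
package main

import "time"

// Health check statuses, named after the values Consul reports.
const (
	statusPassing  = "passing"
	statusWarning  = "warning"
	statusCritical = "critical"
)

// checkRestart mirrors the three settings shown in the stanza.
type checkRestart struct {
	Limit          int
	Grace          time.Duration
	IgnoreWarnings bool
}

// checkState is a hypothetical per-check counter.
type checkState struct {
	cfg       checkRestart
	startedAt time.Time // when the task last started or restarted
	failures  int       // consecutive unhealthy results
}

// observe records one check result and reports whether a restart is due.
func (c *checkState) observe(status string, now time.Time) bool {
	if c.cfg.Limit == 0 {
		return false // the default disables health check based restarts
	}
	if now.Before(c.startedAt.Add(c.cfg.Grace)) {
		return false // results inside the grace period are ignored
	}
	// Assumption: warnings count as healthy only when IgnoreWarnings is set.
	healthy := status == statusPassing ||
		(status == statusWarning && c.cfg.IgnoreWarnings)
	if healthy {
		c.failures = 0 // a single passing result resets the count
		return false
	}
	c.failures++
	return c.failures >= c.cfg.Limit
}

// restarted resets the state so the grace period applies again.
func (c *checkState) restarted(now time.Time) {
	c.startedAt = now
	c.failures = 0
}
```

With `Limit: 3`, the sequence critical, critical, passing, critical does not trigger a restart because the passing result resets the count, which is why a flapping service may never be restarted.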