Skip to content

Commit

Permalink
health: detect missing task checks
Browse files Browse the repository at this point in the history
Fixes a bug where an allocation is considered healthy if some of the
tasks are being restarted and as such, their checks aren't tracked by
consul agent client.

The underlying problem is that allocation registration in consul
agent/client code is mutable: tasks get removed as services from consul,
prior to stopping/restarting to allow for graceful removal from LBs.
The downside is that the health tracker may consider the allocation as
healthy if one of the task is down.

This uses the simplest approach to patch the problem by detecting the
number of expected checks against the registered checks.

I don't anticipate disrepency of counters.  `sreg.Checks` should only
contain checks that nomad agent explicitly registered and filter out
unexpected or unrelated checks:
https://github.com/hashicorp/nomad/blob/0ecda992317d3300e1c1da05170f8bba18410357/command/agent/consul/client.go#L1138-L1147
.

A better approach would have been to strictly compare the found check
IDs against an immutable list of expected IDs.  This sadly requires
significant code changes both to task runner service hooks and consul
hooks, that I'm not comfortable so close to cutting a new release.
  • Loading branch information
Mahmood Ali committed Mar 17, 2020
1 parent b9e3e12 commit 0047882
Showing 1 changed file with 11 additions and 0 deletions.
11 changes: 11 additions & 0 deletions client/allochealth/tracker.go
Original file line number Diff line number Diff line change
Expand Up @@ -372,10 +372,14 @@ OUTER:
// Detect if all the checks are passing
passed := true

foundChecks := 0

CHECKS:
for _, treg := range allocReg.Tasks {
for _, sreg := range treg.Services {
for _, check := range sreg.Checks {
foundChecks++

if check.Status == api.HealthPassing {
continue
}
Expand All @@ -387,6 +391,13 @@ OUTER:
}
}

if foundChecks < t.consulCheckCount {
// didn't see all the checks we expect, task might be slow to start
// or restarting tasks that had their service hooks deregistered
passed = false
t.setCheckHealth(false)
}

if !passed {
// Reset the timer since we have transitioned back to unhealthy
if primed {
Expand Down

0 comments on commit 0047882

Please sign in to comment.