Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consul: correctly interpret missing consul checks as unhealthy #15822

Merged
merged 4 commits into from
Jan 19, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .changelog/15822.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
```release-note:bug
consul: correctly interpret missing consul checks as unhealthy
```
89 changes: 61 additions & 28 deletions client/allochealth/tracker.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import (

"github.com/hashicorp/consul/api"
"github.com/hashicorp/go-hclog"
"github.com/hashicorp/go-set"
"github.com/hashicorp/nomad/client/serviceregistration"
"github.com/hashicorp/nomad/client/serviceregistration/checks/checkstore"
cstructs "github.com/hashicorp/nomad/client/structs"
Expand Down Expand Up @@ -257,10 +258,9 @@ func (t *Tracker) setTaskHealth(healthy, terminal bool) {
// setCheckHealth is used to mark the checks as either healthy or unhealthy.
// returns true if health is propagated and no more health monitoring is needed
//
// todo: this is currently being shared by watchConsulEvents and watchNomadEvents,
//
// and must be split up if/when we support registering services (and thus checks)
// of different providers.
// todo: this is currently being shared by watchConsulEvents and watchNomadEvents
// and must be split up if/when we support registering services (and thus checks)
// of different providers.
func (t *Tracker) setCheckHealth(healthy bool) bool {
t.lock.Lock()
defer t.lock.Unlock()
Expand Down Expand Up @@ -437,6 +437,7 @@ func (h *healthyFuture) C() <-chan time.Time {
//
// Does not watch Nomad service checks; see watchNomadEvents for those.
func (t *Tracker) watchConsulEvents() {

// checkTicker is the ticker that triggers us to look at the checks in Consul
checkTicker := time.NewTicker(t.checkLookupInterval)
defer checkTicker.Stop()
Expand Down Expand Up @@ -502,30 +503,9 @@ OUTER:
// Detect if all the checks are passing
passed := true

CHECKS:
for _, treg := range allocReg.Tasks {
for _, sreg := range treg.Services {
for _, check := range sreg.Checks {
onUpdate := sreg.CheckOnUpdate[check.CheckID]
switch check.Status {
case api.HealthPassing:
continue
case api.HealthWarning:
if onUpdate == structs.OnUpdateIgnoreWarn || onUpdate == structs.OnUpdateIgnore {
continue
}
case api.HealthCritical:
if onUpdate == structs.OnUpdateIgnore {
continue
}
default:
}

passed = false
t.setCheckHealth(false)
break CHECKS
}
}
// scan for missing or unhealthy consul checks
if !evaluateConsulChecks(t.tg, allocReg) {
passed = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it meaningful that t.setCheckHealth(false) is no longer called alongside this bool being set? I see that it is called elsewhere, but haven't chased down the potential implications of its removal here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch, and I believe this is important in the edge case where

  1. checks become healthy
  2. tasks become healthy
  3. start minimum healthy time
  4. checks become unhealthy
  5. minimum healthy time ends <- we never updated check status to unhealthy, so we'll incorrectly report they are healthy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this back in, and also added TestTracker_ConsulChecks_HealthToUnhealthy to cover the case where health checks transition from healthy to unhealthy during the minimum health period.

}

if !passed {
Expand All @@ -543,6 +523,59 @@ OUTER:
}
}

func evaluateConsulChecks(tg *structs.TaskGroup, registrations *serviceregistration.AllocRegistration) bool {
// First, identify any case where a check definition is missing or outdated
// on the Consul side. Note that because check names are not unique, we must
// also keep track of the counts on each side and make sure those also match.
services := tg.ConsulServices()
expChecks := set.New[string](10)
expCount := 0
regChecks := set.New[string](10)
regCount := 0
for _, service := range services {
for _, check := range service.Checks {
expChecks.Insert(check.Name)
expCount++
}
}
for _, task := range registrations.Tasks {
for _, service := range task.Services {
for _, check := range service.Checks {
regChecks.Insert(check.Name)
regCount++
}
}
}
if expCount != regCount {
return false
}
if !expChecks.Equal(regChecks) {
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit-picking, but do we need a set and counter here? I think using a map[service name]: count would be enough and maybe easier to read?


// Now we can simply scan the status of each Check reported by Consul.
for _, task := range registrations.Tasks {
for _, service := range task.Services {
for _, check := range service.Checks {
onUpdate := service.CheckOnUpdate[check.CheckID]
switch check.Status {
case api.HealthWarning:
if onUpdate != structs.OnUpdateIgnoreWarn && onUpdate != structs.OnUpdateIgnore {
return false
}
case api.HealthCritical:
if onUpdate != structs.OnUpdateIgnore {
return false
}
}
}
}
}

// All checks are present and healthy.
return true
}

// watchNomadEvents is a watcher for the health of the allocation's Nomad checks.
// If all checks report healthy the watcher will exit after the MinHealthyTime has
// been reached, otherwise the watcher will continue to check unhealthy checks until
Expand Down
Loading