-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consul: correctly interpret missing consul checks as unhealthy #15822
Conversation
This PR fixes a bug where Nomad assumed any registered Checks would exist in the service registration coming back from Consul. In some cases, the Consul may be slow in processing the check registration, and the response object would not contain checks. Nomad would then scan the empty response looking for Checks with failing health status, finding none, and then marking a task/alloc as healthy. In reality, we must always use Nomad's view of what checks should exist as the source of truth, and compare that with the response Consul gives us, making sure they match, before scanning the Consul response for failing check statuses. Fixes #15536
50ec1f9
to
dcc19d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Minor coding nit-picking, but not a blocker 👍
client/allochealth/tracker.go
Outdated
expChecks := set.New[string](10) | ||
expCount := 0 | ||
regChecks := set.New[string](10) | ||
regCount := 0 | ||
for _, service := range services { | ||
for _, check := range service.Checks { | ||
expChecks.Insert(check.Name) | ||
expCount++ | ||
} | ||
} | ||
for _, task := range registrations.Tasks { | ||
for _, service := range task.Services { | ||
for _, check := range service.Checks { | ||
regChecks.Insert(check.Name) | ||
regCount++ | ||
} | ||
} | ||
} | ||
if expCount != regCount { | ||
return false | ||
} | ||
if !expChecks.Equal(regChecks) { | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit-picking, but do we need a set and counter here? I think using a map[service name]: count
would be enough and maybe easier to read?
} | ||
// scan for missing or unhealthy consul checks | ||
if !evaluateConsulChecks(t.tg, allocReg) { | ||
passed = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it meaningful that t.setCheckHealth(false)
is no longer called alongside this bool being set? I see that it is called elsewhere, but haven't chased down the potential implications of its removal here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch, and I believe this is important in the edge case where
- checks become healthy
- tasks become healthy
- start minimum healthy time
- checks become unhealthy
- minimum healthy time ends <- we never updated check status to unhealthy, so we'll incorrectly report they are healthy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this back in, and also added TestTracker_ConsulChecks_HealthToUnhealthy
to cover the case where health checks transition from healthy to unhealthy during the minimum health period.
This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, the
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.
In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.
Fixes #15536