consul: correctly interpret missing consul checks as unhealthy #15822

shoenig · 2023-01-19T15:16:04Z

This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, Consul
may be slow in processing the check registration, and the response object
would not contain the checks. Nomad would then scan the empty response
looking for Checks with a failing health status, find none, and then
mark the task/alloc as healthy.

In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.
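
For readers following along, the ordering described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `Check` type, the `checksHealthy` function, and the check names are hypothetical, and it assumes any status other than "passing" counts as failing.

```go
package main

import "fmt"

// Check is a simplified, hypothetical stand-in for a check in Consul's
// registration response. It is not the real Nomad or Consul type.
type Check struct {
	Name   string
	Status string // assumed values: "passing", "warning", "critical"
}

// checksHealthy treats the expected check names as the source of truth.
// A check that is missing from the response counts as unhealthy, so an
// empty response can never be reported as healthy.
func checksHealthy(expected []string, registered []Check) bool {
	status := make(map[string]string, len(registered))
	for _, c := range registered {
		status[c.Name] = c.Status
	}
	for _, name := range expected {
		s, ok := status[name]
		if !ok {
			return false // Consul has not reported this check yet
		}
		if s != "passing" {
			return false // check is present but failing
		}
	}
	return true
}

func main() {
	expected := []string{"http-alive", "tcp-ready"}

	// Consul has not processed the registration yet: empty response.
	fmt.Println(checksHealthy(expected, nil)) // false

	// All expected checks are present and passing.
	fmt.Println(checksHealthy(expected, []Check{
		{Name: "http-alive", Status: "passing"},
		{Name: "tcp-ready", Status: "passing"},
	})) // true
}
```

The first call is the case from the bug report: scanning only for failing statuses would find nothing wrong in an empty response, whereas starting from the expected names does.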

Fixes #15536


lgfa29

LGTM!

Minor coding nit-picking, but not a blocker 👍

lgfa29 · 2023-01-19T17:04:54Z

client/allochealth/tracker.go

+	expChecks := set.New[string](10)
+	expCount := 0
+	regChecks := set.New[string](10)
+	regCount := 0
+	for _, service := range services {
+		for _, check := range service.Checks {
+			expChecks.Insert(check.Name)
+			expCount++
+		}
+	}
+	for _, task := range registrations.Tasks {
+		for _, service := range task.Services {
+			for _, check := range service.Checks {
+				regChecks.Insert(check.Name)
+				regCount++
+			}
+		}
+	}
+	if expCount != regCount {
+		return false
+	}
+	if !expChecks.Equal(regChecks) {
+		return false
+	}


Minor nit-picking, but do we need a set and counter here? I think using a map[service name]: count would be enough and maybe easier to read?
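
A rough sketch of what the map-based version could look like, for illustration only. The function name `sameChecks` and the choice to key the map by check name are assumptions; the refactor that actually landed may differ.

```go
package main

import "fmt"

// sameChecks reports whether the expected and registered check names
// match exactly, including how many times each name occurs.
// Hypothetical helper, keyed by check name.
func sameChecks(expected, registered []string) bool {
	counts := make(map[string]int, len(expected))
	for _, name := range expected {
		counts[name]++
	}
	for _, name := range registered {
		counts[name]--
	}
	// Any non-zero entry means a check is missing or unexpected.
	for _, n := range counts {
		if n != 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(sameChecks([]string{"a", "b"}, []string{"b", "a"})) // true
	fmt.Println(sameChecks([]string{"a", "b"}, nil))                // false
}
```

A single per-name count covers both membership and multiplicity, so the separate set and counter are not needed.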

client/serviceregistration/service_registration.go

gulducat · 2023-01-19T17:30:42Z

client/allochealth/tracker.go

-			}
+		// scan for missing or unhealthy consul checks
+		if !evaluateConsulChecks(t.tg, allocReg) {
+			passed = false


Is it meaningful that t.setCheckHealth(false) is no longer called alongside this bool being set? I see that it is called elsewhere, but haven't chased down the potential implications of its removal here.

Ah good catch, and I believe this is important in the edge case where

1. checks become healthy
2. tasks become healthy
3. start minimum healthy time
4. checks become unhealthy
5. minimum healthy time ends <- we never updated check status to unhealthy, so we'll incorrectly report they are healthy

Added this back in, and also added TestTracker_ConsulChecks_HealthToUnhealthy to cover the case where health checks transition from healthy to unhealthy during the minimum health period.
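
The sequence above can be modeled with a small sketch. This is a stripped-down, hypothetical tracker and not Nomad's implementation: only the name `setCheckHealth` comes from the discussion, while the struct fields, `observe`, and `healthyAtDeadline` are made up for illustration.

```go
package main

import "fmt"

// tracker is a hypothetical, minimal model of an alloc health tracker.
type tracker struct {
	tasksHealthy  bool
	checksHealthy bool
}

// setCheckHealth records the latest known check health.
func (t *tracker) setCheckHealth(healthy bool) {
	t.checksHealthy = healthy
}

// observe handles one poll of Consul. The point of the fix is that a
// failing poll is written back too, not only a passing one.
func (t *tracker) observe(passed bool) {
	t.setCheckHealth(passed)
}

// healthyAtDeadline is what would be reported when the minimum healthy
// time ends.
func (t *tracker) healthyAtDeadline() bool {
	return t.tasksHealthy && t.checksHealthy
}

func main() {
	t := &tracker{tasksHealthy: true}
	t.observe(true)  // checks become healthy, minimum healthy time starts
	t.observe(false) // checks become unhealthy before the timer fires
	fmt.Println(t.healthyAtDeadline()) // false
}
```

If `observe` skipped the `setCheckHealth(false)` call on a failing poll, the stale `true` from the earlier poll would still be in place when the timer fired, which is the incorrect report described above.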

vercel bot deployed to Preview – nomad-storybook-and-ui January 19, 2023 15:19

shoenig force-pushed the health-deployment branch from 50ec1f9 to dcc19d4 Compare January 19, 2023 15:25

shoenig changed the title ~~[no ci] consul: correctly understand missing consul checks as unhealthy~~ consul: correctly understand missing consul checks as unhealthy Jan 19, 2023

shoenig changed the title ~~consul: correctly understand missing consul checks as unhealthy~~ consul: correctly interpret missing consul checks as unhealthy Jan 19, 2023

vercel bot deployed to Preview – nomad-storybook-and-ui January 19, 2023 15:29

shoenig added this to the 1.4.x milestone Jan 19, 2023

shoenig added the backport/1.4.x backport to 1.4.x release line label Jan 19, 2023

shoenig marked this pull request as ready for review January 19, 2023 15:44

shoenig requested review from lgfa29, gulducat and angrycub January 19, 2023 15:44

lgfa29 approved these changes Jan 19, 2023

consul: minor CR refactor using maps not sets

d103467

vercel bot deployed to Preview – nomad-storybook-and-ui January 19, 2023 17:29

gulducat reviewed Jan 19, 2023

shoenig added 2 commits January 19, 2023 13:28

consul: observe transition from healthy to unhealthy checks

e16c489

consul: spell healthy correctly

2be222b

vercel bot deployed to Preview – nomad-storybook-and-ui January 19, 2023 19:35

shoenig merged commit c56a4bc into main Jan 19, 2023

shoenig deleted the health-deployment branch January 19, 2023 20:01

hc-github-team-nomad-core mentioned this pull request Jan 19, 2023

Backport of consul: correctly interpret missing consul checks as unhealthy into release/1.4.x #15826

Merged

Blefish mentioned this pull request Mar 8, 2023

Nomad does not mark deployment as successful with dynamic service names #16382

Closed
