Allocs are healthy if service checks get healthy before task health #7944

notnoop · 2020-05-13T12:10:31Z

This fixes a bug where an alloc is considered unhealthy if the alloc service checks pass before the tasks start up. This may occur in a case where a task takes a relatively long time to start (e.g. large image to download).

The bug was that the check health detector loop wouldn't propagate the check health again. This change ensures that we keep checking check health until the final alloc health outcome is determined.
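
To make the sequencing concrete, here is a minimal toy model of the race. It is a sketch, not the Nomad implementation: the `tracker` type, its fields, and `simulate` are hypothetical, and it assumes that check health only counts once tasks are already healthy. With `retry` off, the check watcher reports passing checks only once and the alloc never turns healthy; with `retry` on, it keeps reporting until the outcome is final.

```go
package main

import "fmt"

// tracker is a toy stand-in for the alloc health tracker.
// All names here are illustrative assumptions, not Nomad's actual code.
type tracker struct {
	tasksHealthy  bool
	checksHealthy bool
	final         bool // true once the alloc health outcome is decided
}

// setTaskHealth records whether the tasks are healthy.
func (t *tracker) setTaskHealth(healthy bool) {
	t.tasksHealthy = healthy
}

// setCheckHealth records check health. In this model, checks only count
// as healthy when the tasks are healthy too (an assumption of the sketch).
// It returns true once the final health outcome has been set.
func (t *tracker) setCheckHealth(healthy bool) bool {
	t.checksHealthy = healthy && t.tasksHealthy
	if t.checksHealthy {
		t.final = true
	}
	return t.final
}

// simulate runs three ticks in which the checks pass on every tick but the
// tasks only become healthy on tick 2. When retry is false, passing checks
// are reported once and never again; when true, they are reported on every
// tick until the outcome is final.
func simulate(retry bool) bool {
	t := &tracker{}
	reported := false
	for tick := 1; tick <= 3; tick++ {
		if tick == 2 {
			t.setTaskHealth(true)
		}
		if retry || !reported {
			reported = true
			if t.setCheckHealth(true) {
				// final health set and propagated
				return true
			}
		}
	}
	return t.final
}

func main() {
	fmt.Println("report once:      healthy =", simulate(false))
	fmt.Println("report until set: healthy =", simulate(true))
}
```

In this model the single-report variant ends with the alloc still not healthy, while the retrying variant marks it healthy on the tick where the tasks come up.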

I've added a failing test first, so you can see the failure in https://circleci.com/gh/hashicorp/nomad/65986 . The test passes with the last commit.

tgross · 2020-05-13T12:36:10Z

client/allochealth/tracker.go

-			t.setCheckHealth(true)
+			if t.setCheckHealth(true) {
+				// final health set and propagated
+				return


The docstring for this function says this is a "long lived goroutine". If we return here, how does the tracker detect service check failures that happen later on?

"long lived goroutine" is a bit of a misnomer: it runs until the alloc is marked healthy or terminally unhealthy. Once an alloc is marked healthy, the tracker never marks it unhealthy.

The goroutine already exits today. setCheckHealth and setTaskHealth call t.cancelFn(), which causes both watchTaskEvents and watchConsulEvents to exit, since they handle t.ctx.Done().

Ah, gotcha.

tgross

LGTM

github-actions · 2023-01-06T02:16:51Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

Mahmood Ali added 2 commits May 13, 2020 07:43

tests: tests for health check sequencing

d4e4563

Add a failing test to show that if an alloc's checks are marked healthy before the alloc's tasks start up, the alloc may be considered unhealthy forever.

allochealth: Fix when check health precedes task health

22b65f2

Fix a bug where if the alloc check becomes healthy before the task health, the alloc may never be considered healthy.

notnoop requested review from lgfa29 and tgross May 13, 2020 12:10

notnoop self-assigned this May 13, 2020

tgross reviewed May 13, 2020

tgross approved these changes May 13, 2020

notnoop merged commit 3cb5551 into master May 13, 2020

notnoop deleted the b-health-checks-after-task-health branch May 13, 2020 13:34

github-actions bot locked as resolved and limited conversation to collaborators Jan 6, 2023
