-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allocs are healthy if service checks get healthy before task health #7944
Conversation
Add a failing tests to show that if an alloc checks is marked healthy before the alloc tasks start up, the alloc may be forever considered unhealthy.
Fix a bug where if the alloc check becomes healthy before the task health, the alloc may never be considered healthy.
t.setCheckHealth(true) | ||
if t.setCheckHealth(true) { | ||
// final health set and propagated | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring for this function says this is a "long lived goroutine". If we return here, how does the tracker detect service check failures that happen later on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"long lived goroutine" is a bit misnomer - it runs until the alloc is marked healthy or terminally unhealthy. Once an alloc is marked healthy, it never marks it unhealthy.
The exiting goroutine happens today already. setCheckHealth
and setTaskHealth
call t.cancelFn()
which will cause both watchTaskEvents
and watchConsulEvents
as they handle t.ctx.Done()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gotcha.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This fixes a bug where an alloc is considered unhealthy if the alloc service checks pass before the tasks start up. This may occur in a case where a task takes a relatively long time to start (e.g. large image to download).
The bug was due to the check health detector loop wouldn't propagate the checks health again. This change here, ensures that we keep checking checks health until the final alloc health outcome is determined.
I've added a failing test first, so you can see the failure in https://circleci.com/gh/hashicorp/nomad/65986 . The test passes with the last commit.