-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix task leak during client restore when allocrunner prerun hook fails #17104
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. We need a changelog entry via make cl
, but also my comment about the commit description applies generally. If we squash-merge this as we typically do, we'll end up with a PR description that's just a reference somewhere else, instead of being self-contained within the commit message.
"github.com/hashicorp/nomad/client/allocrunner/interfaces" | ||
) | ||
|
||
var ErrFailHookError = errors.New("failed successfully") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
foundMessages := make(map[string]bool) | ||
for _, event := range state.Events { | ||
foundMessages[event.DisplayMessage] = true | ||
} | ||
test.True(t, foundMessages[reason], test.Sprintf("expected '%s' in events: %#v", reason, foundMessages)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Map contains value that meets this condition" seems like it'd be nice new frequently-used assertion for @shoenig's test library. But out of scope for this PR.
to avoid leaking task resources (e.g. containers, iptables) if allocRunner prerun fails during restore on client restart. now if prerun fails, TaskRunner.MarkFailedKill() will only emit an event, mark the task as failed, and cancel the tr's killCtx, so then ar.runTasks() -> tr.Run() can take care of the actual cleanup. removed from (formerly) tr.MarkFailedDead(), now handled by tr.Run(): * set task state as dead * save task runner local state * task stop hooks also done in tr.Run() now that it's not skipped: * handleKill() to kill tasks while respecting their shutdown delay, and retrying as needed * also includes task preKill hooks * clearDriverHandle() to destroy the task and associated resources * task exited hooks
ad89b8f
to
ae7016d
Compare
Thanks! I added the changelog, and pre-squashed my commits with a more verbose message. How's it look? Generally I do heavily edit the squash commit message like that when I go to merge, but it occurs to me that you couldn't know that, and can't pre-review what I may fill in there before merge 😋 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Fixes #17102 -- I describe the issue more fully there.
My approach here is to stop skipping over
allocRunner.runTasks()
on prerun error. That way, instead of duplicating more cleanup code, which may change in the future, the sameTaskRunner.Run()
code that usually handles task cleanup can do what it needs to as appropriate with tasks that fail prerun during the alloc restore process.In pursuit of that, I made an error-inducing
FailHook
and added the ability to include it as part of client Config for the client integration test. I could remove the non-Prerun interface implementations, but I figured while I'm at it, may as well make a thing that can be induced to fail at any stage in case it's useful?