Fix task leak during client restore when allocrunner prerun hook fails #17104

gulducat · 2023-05-06T02:58:43Z

Fixes #17102 -- I describe the issue more fully there.

My approach here is to stop skipping over allocRunner.runTasks() on prerun error. That way, instead of duplicating more cleanup code, which may change in the future, the same TaskRunner.Run() code that usually handles task cleanup can do what it needs to as appropriate with tasks that fail prerun during the alloc restore process.

In pursuit of that, I made an error-inducing FailHook and added the ability to include it as part of client Config for the client integration test. I could remove the non-Prerun interface implementations, but I figured while I'm at it, may as well make a thing that can be induced to fail at any stage in case it's useful?

tgross

LGTM overall. We need a changelog entry via make cl, but also my comment about the commit description applies generally. If we squash-merge this as we typically do, we'll end up with a PR description that's just a reference somewhere else, instead of being self-contained within the commit message.

client/allocrunner/taskrunner/task_runner.go

tgross · 2023-05-08T14:03:29Z

client/allocrunner/fail_hook.go

+	"github.com/hashicorp/nomad/client/allocrunner/interfaces"
+)
+
+var ErrFailHookError = errors.New("failed successfully")


tgross · 2023-05-08T14:11:11Z

client/allocrunner/taskrunner/task_runner_test.go

+	foundMessages := make(map[string]bool)
+	for _, event := range state.Events {
+		foundMessages[event.DisplayMessage] = true
+	}
+	test.True(t, foundMessages[reason], test.Sprintf("expected '%s' in events: %#v", reason, foundMessages))


"Map contains value that meets this condition" seems like it'd be a nice new frequently-used assertion for @shoenig's test library. But out of scope for this PR.
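To make the suggestion concrete, a generic helper along these lines could express "some value in the map satisfies a condition". `mapContainsValue` is a hypothetical name used only for this sketch and is not an existing function in the test library.

```go
package main

import "fmt"

// mapContainsValue reports whether at least one value in m satisfies pred.
// Hypothetical helper for illustration only.
func mapContainsValue[K comparable, V any](m map[K]V, pred func(V) bool) bool {
	for _, v := range m {
		if pred(v) {
			return true
		}
	}
	return false
}

func main() {
	// Example data; the keys and messages are made up.
	events := map[string]string{
		"e1": "Task started",
		"e2": "prerun hook failed",
	}
	found := mapContainsValue(events, func(msg string) bool {
		return msg == "prerun hook failed"
	})
	fmt.Println(found)
}
```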

to avoid leaking task resources (e.g. containers, iptables) if allocRunner prerun fails during restore on client restart.

now if prerun fails, TaskRunner.MarkFailedKill() will only emit an event, mark the task as failed, and cancel the tr's killCtx, so then ar.runTasks() -> tr.Run() can take care of the actual cleanup.

removed from (formerly) tr.MarkFailedDead(), now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks

also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting their shutdown delay, and retrying as needed
  * also includes task preKill hooks
* clearDriverHandle() to destroy the task and associated resources
* task exited hooks
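The control flow described above can be sketched with a heavily simplified model. Everything here is an assumption made for illustration: the `taskRunner` struct, its fields, and the lowercase `markFailedKill` and `run` methods are stand-ins and do not mirror Nomad's real TaskRunner. The point is only that the failure path records state and cancels a context, while a single run path owns cleanup.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// taskRunner is a simplified, hypothetical stand-in for a task runner.
type taskRunner struct {
	killCtx       context.Context
	killCtxCancel context.CancelFunc
	failed        bool
	events        []string
	state         string
	cleanedUp     bool
}

func newTaskRunner() *taskRunner {
	ctx, cancel := context.WithCancel(context.Background())
	return &taskRunner{killCtx: ctx, killCtxCancel: cancel, state: "pending"}
}

// markFailedKill emits an event, marks the task failed, and cancels killCtx.
// It deliberately performs no cleanup itself.
func (tr *taskRunner) markFailedKill(reason string) {
	tr.events = append(tr.events, reason)
	tr.failed = true
	tr.killCtxCancel()
}

// run is always invoked, even after a prerun failure, so cleanup lives in one place.
func (tr *taskRunner) run() {
	select {
	case <-tr.killCtx.Done():
		// killCtx already cancelled: do not start the task.
	default:
		tr.state = "running" // the task would run here
	}
	// Stands in for kill handling, driver handle cleanup, and exited/stop hooks.
	tr.cleanedUp = true
	tr.state = "dead"
}

// prerun simulates an alloc-level prerun hook that fails.
func prerun() error { return errors.New("prerun hook failed") }

func main() {
	tr := newTaskRunner()
	if err := prerun(); err != nil {
		tr.markFailedKill(err.Error())
	}
	tr.run() // not skipped on prerun error
	fmt.Println(tr.state, tr.failed, tr.cleanedUp)
}
```

Keeping cleanup in the run path means the failure path cannot drift out of sync with it, which is the duplication concern raised in the PR description.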

gulducat · 2023-05-08T15:16:34Z

Thanks! I added the changelog, and pre-squashed my commits with a more verbose message. How's it look?

Generally I do heavily edit the squash commit message like that when I go to merge, but it occurs to me that you couldn't know that, and can't pre-review what I may fill in there before merge 😋

tgross

LGTM!

gulducat requested review from shoenig and tgross May 6, 2023 02:58

tgross reviewed May 8, 2023

gulducat force-pushed the b-restore-prerun-error-task-leak branch from ad89b8f to ae7016d Compare May 8, 2023 15:16

vercel bot deployed to Preview – nomad-storybook-and-ui May 8, 2023 15:22

tgross approved these changes May 8, 2023

gulducat added the type/bug, backport/1.3.x, backport/1.4.x, and backport/1.5.x labels May 8, 2023

gulducat merged commit c2dc1c5 into main May 8, 2023

gulducat deleted the b-restore-prerun-error-task-leak branch May 8, 2023 18:17
