Fail alloc if alloc runner prestart hooks fail #5905
Conversation
```go
// MarkFailedDead marks a task as failed and not to run. It is meant to be
// invoked when alloc runner prestart hooks fail.
// It should never be called together with Run().
func (tr *TaskRunner) MarkFailedDead(reason string) {
```
Here we introduce another function to call instead of Run(), breaking an invariant; but I had a very hard time rationalizing reusing Run() when we never want to run any of its logic and would still need some way to signal that the task should fail. I felt that MarkFailedDead is a reasonable compromise that does the bare minimum to mark the task as failed.
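For context, a minimal sketch of the shape that takes; the event type, helpers, and waitCh field below are illustrative assumptions, not necessarily the PR's exact code:

```go
// Sketch only: mark the task failed and dead without going through Run().
// Names (waitCh, UpdateState, TaskSetupFailure) are assumptions.
func (tr *TaskRunner) MarkFailedDead(reason string) {
	// Unblock anything waiting on the task, e.g. TR.Shutdown().
	defer close(tr.waitCh)

	// Record a terminal, failed state so the alloc becomes terminal and
	// the server can reschedule it instead of leaving it pending forever.
	event := structs.NewTaskEvent(structs.TaskSetupFailure).
		SetDisplayMessage(reason).
		SetFailsTask()
	tr.UpdateState(structs.TaskStateDead, event)
}
```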
Great job tying a lot of loose threads together, and the test looks great. I'm all for getting this merged ASAP as it closes a huge gap in AR's lifecycle management, but there's one case that still concerns me:

On client agent restart, if AR.prerun() errors, tasks may be leaked along with logmon and consul services.

Two solutions come to mind, but I'm sure there are more approaches:

- Go back to always calling TR.Run() and rely on a dead/terminal-esque check to immediately try to clean up any existing driver handle and run stop hooks before exiting.
- Add a TR.Destroy method, called from AR.Destroy, that does a final cleanup pass (stop hooks and driver handles).

Option 1 is closer to the 0.9 status quo, but seems fragile, confusing, and possibly tricky to implement correctly.

Option 2 is closer to the 0.8 status quo IIRC, and I think it was dropped in favor of ~always calling TR.Run and using stop hooks.

TR.Destroy would be a best-effort cleanup: run stop hooks, call DestroyTask, and ignore any errors, since in the common case all of those operations would already have been done. That makes testing it a little annoying, as it probably should never return an error; but we can either have Destroy() wrap a destroyImpl() error and test the impl, or just assert that the expected actions took place. A rough sketch follows.
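To make option 2 concrete, this is roughly the shape I'm picturing; tr.stop, getDriverHandle, and the DestroyTask call are stand-ins for whatever the real hooks and handles look like:

```go
import multierror "github.com/hashicorp/go-multierror"

// Destroy is a hypothetical best-effort final cleanup pass called from
// AR.Destroy. It swallows errors because in the common case everything
// below already ran; destroyImpl is split out so tests can assert on it.
func (tr *TaskRunner) Destroy() {
	if err := tr.destroyImpl(); err != nil {
		tr.logger.Warn("best-effort task cleanup failed", "error", err)
	}
}

func (tr *TaskRunner) destroyImpl() error {
	var merr multierror.Error

	// Run stop hooks (logmon, consul deregistration, etc.); these should
	// be idempotent no-ops if they already ran.
	if err := tr.stop(); err != nil {
		merr.Errors = append(merr.Errors, err)
	}

	// Kill any leaked driver handle, e.g. a task restored on agent
	// restart before AR.prerun() failed.
	if handle := tr.getDriverHandle(); handle != nil {
		if err := tr.driver.DestroyTask(handle.ID(), true); err != nil {
			merr.Errors = append(merr.Errors, err)
		}
	}

	return merr.ErrorOrNil()
}
```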
When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are:

* Lockup in the GC process, as reported in #5861
* Lockup in the shutdown process, as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in a pending state)
* Unexpected restart of the alloc on a client restart, potentially days/weeks after the alloc's expected start time!

Here, we treat all tasks as failed if an alloc runner prestart hook fails. This fixes the lockups and permits the alloc to be rescheduled on another node.

While it's desirable to retry the alloc runner on such failures, I opted to treat that as out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow-up PR.

This might be one of the root causes for #5840.
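To sketch the resulting control flow in the alloc runner (names are abbreviated and illustrative, not the exact diff):

```go
// Illustrative sketch of the alloc runner's Run flow with this change.
func (ar *allocRunner) Run() {
	defer close(ar.waitCh)

	// If a prestart hook fails, never start the task runners: mark every
	// task failed and dead so the alloc turns terminal and the server can
	// reschedule it, instead of leaving the tasks pending forever.
	if err := ar.prerun(); err != nil {
		ar.logger.Error("prerun failed", "error", err)
		for _, tr := range ar.tasks {
			tr.MarkFailedDead(fmt.Sprintf("failed to setup alloc: %v", err))
		}
		return
	}

	// Happy path: run each task runner as before.
	for _, tr := range ar.tasks {
		go tr.Run()
	}
}
```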
Handle a failed prestart while restoring a task, to prevent accidentally leaking consul/logmon processes.
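A hedged sketch of what that restore-path handling might look like; the terminal-status check and handleDestroy helper are guesses at the mechanism, not the commit's exact code:

```go
// Sketch only: after restoring task state on agent restart, if the alloc
// is already terminal (e.g. a failed prestart marked its tasks dead),
// tear down any reattached driver handle so logmon processes and consul
// registrations aren't leaked.
func (tr *TaskRunner) Restore() error {
	// ...restore persisted local state and reattach to the driver handle...

	if tr.Alloc().TerminalStatus() {
		if handle := tr.getDriverHandle(); handle != nil {
			tr.handleDestroy(handle) // hypothetical cleanup helper
		}
	}
	return nil
}
```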
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.