
Fail alloc if alloc runner prestart hooks fail #5905

Merged
notnoop merged 2 commits into master from b-ar-failed-prestart on Jul 2, 2019
Conversation

@notnoop notnoop (Contributor) commented Jun 29, 2019

When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:

  • Lockup in the GC process, as reported in #5861 (client: fix gc deadlock when ar.prerun errors)
  • Lockup in the shutdown process, as TR.Shutdown() waits for WaitCh to be closed
  • Alloc not being restarted/rescheduled to another node (as it's still in the pending state)
  • Unexpected restart of the alloc on a client restart, potentially days/weeks after the alloc's expected start time!

Here, we treat all tasks as failed if an alloc runner prestart hook fails.
This fixes the lockups and permits the alloc to be rescheduled on another node.
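
A minimal sketch of that approach, assuming the alloc runner keeps its task runners in an ar.tasks map (the field and error message below are illustrative, not the exact code in this PR):

// In the alloc runner's Run path: if prestart hooks fail, don't start the
// task runners; mark each task failed-dead so the alloc leaves the pending
// state and the server can reschedule it on another node.
if err := ar.prerun(); err != nil {
	ar.logger.Error("prerun failed", "error", err)
	for _, tr := range ar.tasks {
		tr.MarkFailedDead(fmt.Sprintf("failed to setup alloc: %v", err))
	}
	return
}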

While it's desirable to retry the alloc runner on such failures, I opted to treat that as out of scope. I'm afraid there are some subtleties around alloc and task runners and their idempotency that are better handled in a follow-up PR.

This might be one of the root causes for #5840.

// Mark a task as failed and not to run. Meant to be invoked when alloc runner
// prestart hooks fail.
// Should never be called alongside Run().
func (tr *TaskRunner) MarkFailedDead(reason string) {
@notnoop notnoop (Contributor, Author) commented:
Here, we introduce another function to be called instead of Run(), breaking an invariant; but I had a very hard time justifying reusing Run() when we never want any of its logic to execute and we would need some way to signal that the task should fail.

I felt that MarkFailedDead is a reasonable compromise that does the bare minimum to mark the task as failed.
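
One way that bare minimum could look (a sketch; the signature is from the diff above, the body is an assumption, not the merged code):

// MarkFailedDead skips Run() entirely: emit a terminal, task-failing event,
// jump straight to the dead state, and unblock anything waiting on WaitCh
// (e.g. alloc runner shutdown and GC).
func (tr *TaskRunner) MarkFailedDead(reason string) {
	defer close(tr.waitCh)

	event := structs.NewTaskEvent(structs.TaskSetupFailure).
		SetDisplayMessage(reason).
		SetFailsTask()
	tr.UpdateState(structs.TaskStateDead, event)
}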

@schmichael schmichael (Member) left a comment:

Great job tying a lot of loose threads together, and the test looks great. I'm all for getting this merged ASAP as it's a huge gap in AR's lifecycle management, but there's one case that still concerns me:

On client agent restart, if AR.prerun() errors, tasks may be leaked along with logmon and consul services.

Two solutions come to mind, but I'm sure there are more approaches:

  1. Go back to always calling TR.Run() and rely on a dead/terminal-esque check to immediately try to clean up any existing driver handle and run stop hooks before exiting.
  2. Add a TR.Destroy method, called from AR.Destroy, that does a final cleanup pass (stop hooks and driver handles).

Option 1 is more the 0.9 status quo, but it seems fragile, confusing, and possibly tricky to implement right.

Option 2 is more the 0.8 status quo IIRC, and I think it was dropped in favor of ~always calling TR.Run and using stop hooks.

TR.Destroy would be a best-effort cleanup: run stop hooks, call DestroyTask, and ignore any errors, since in the common case all of those operations would have already been done. This makes testing a little annoying, as Destroy probably should never return an error, but we can either have Destroy() wrap a destroyImpl() that returns an error and test the impl, or just assert that the expected actions took place.
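
A sketch of option 2 under those assumptions (destroyImpl and runStopHooks are made-up names; none of this is merged code):

// Destroy is best-effort and returns nothing; in the common case every step
// below has already run, so failures are only logged.
func (tr *TaskRunner) Destroy() {
	if err := tr.destroyImpl(); err != nil {
		tr.logger.Warn("error cleaning up task", "error", err)
	}
}

// destroyImpl returns an error so tests can exercise it directly.
func (tr *TaskRunner) destroyImpl() error {
	tr.runStopHooks() // hypothetical final pass over stop hooks (logmon, consul, ...)
	if handle := tr.getDriverHandle(); handle != nil {
		// Force-kill any driver task that is somehow still running.
		return tr.driver.DestroyTask(handle.ID(), true)
	}
	return nil
}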

Mahmood Ali added 2 commits July 2, 2019 18:35
When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state.
Handle the case where prestart fails while restoring a task, to prevent accidentally leaking consul/logmon processes.
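
A sketch of what that restore-path handling could mean (the Restored accessor and the Kill usage here are illustrative assumptions, not the literal commit):

// On a client restart, prestart hooks can fail for an alloc whose task
// handles were already restored. Such tasks may still be running, with live
// logmon processes and consul registrations, so they need a real shutdown
// pass (which runs stop hooks) rather than only being marked failed-dead.
if err := ar.prerun(); err != nil {
	for _, tr := range ar.tasks {
		if tr.Restored() { // assumed accessor: task handle survived the restart
			event := structs.NewTaskEvent(structs.TaskSetupFailure).
				SetDisplayMessage("alloc runner prestart failed").
				SetFailsTask()
			tr.Kill(context.TODO(), event)
		} else {
			tr.MarkFailedDead("failed to setup alloc")
		}
	}
	return
}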
@notnoop notnoop force-pushed the b-ar-failed-prestart branch from 0398196 to 9980239 on July 2, 2019 10:39
@notnoop notnoop merged commit 72f67c5 into master Jul 2, 2019
@notnoop notnoop deleted the b-ar-failed-prestart branch July 2, 2019 12:25
@github-actions github-actions bot commented Feb 7, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023