Prevent checkwatcher from deadlocking from too many check restarts #5957

Conversation

@byronwolfman (Contributor) commented Jul 15, 2019

Fixes #5395

Each time the consul check watcher runs its main loop, it has three
choices: 1. update watched checks, 2. check & restart failed allocs,
or 3. exit. Updating watched checks drains checkUpdateCh; restarts
synchronously fill checkUpdateCh. If too many checks require a
restart during the main loop, the checkUpdateCh channel will become
full and block. Once this happens, the channel cannot be drained, and
the check watcher deadlocks.

A check restart travels through several functions, so here's a simplified
view of the run loop:

while {
  dequeueUpdates from w.checkUpdateCh {
    ...
  }
  checkTimer {
    for all check watches {
      restart if unhealthy via apply()
        -> apply() calls Restart()
          -> Restart() calls preKill()
            -> preKill() calls PreKilling() hook
              -> PreKilling() hook calls deregister()
                -> deregister() calls RemoveTask()
                  -> RemoveTask() calls Unwatch()
                    -> Unwatch() pushes messages into w.checkUpdateCh
                       // deadlock if channel is full before we exit the check watch for loop
    }
  }
}
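
For anyone who wants to see the failure mode in isolation, here is a minimal, standalone Go sketch; it is not Nomad's actual code, and the channel name and buffer size are made up. A single goroutine drains the channel at the top of each loop pass, but the restarts it performs mid-pass push back into that same channel, so once the buffer fills the send blocks and nothing is left to drain it:

package main

import "fmt"

func main() {
	// Small buffer standing in for checkUpdateCh.
	updateCh := make(chan string, 2)

	for {
		// 1. Drain any pending updates (non-blocking).
		for more := true; more; {
			select {
			case msg := <-updateCh:
				fmt.Println("drained:", msg)
			default:
				more = false
			}
		}

		// 2. "Restart" unhealthy checks. Each restart synchronously pushes
		// an unwatch message back into the channel we just drained. With
		// more restarts than buffer slots in a single pass, the send below
		// blocks forever: nothing drains the channel until this pass ends,
		// and this send is exactly what keeps the pass from ending.
		for c := 0; c < 3; c++ {
			updateCh <- fmt.Sprintf("unwatch check %d", c)
		}
	}
}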

To fix this, we will pass along a bool to the prekill hook
indicating that the restart is due to a check restart, so that during
task deregistration we do not attempt to insert messages into the
checkUpdateCh channel. These checks are removed by the check watcher
once the task runner's Restart function returns anyway.
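
Roughly, the change has this shape; the types, names, and signatures below are illustrative stand-ins rather than Nomad's real hook interfaces. The deregistration path learns whether it was reached via a check restart and, if so, skips the Unwatch push:

package main

import "fmt"

// checkWatcher is a stand-in for the real check watcher.
type checkWatcher struct {
	checkUpdateCh chan string
}

// Unwatch queues a check for removal in the watcher's run loop.
func (w *checkWatcher) Unwatch(checkID string) {
	w.checkUpdateCh <- "unwatch " + checkID
}

// serviceClient is a stand-in for the Consul service client.
type serviceClient struct {
	watcher *checkWatcher
}

// RemoveTask deregisters a task's services and checks. When the removal was
// triggered by the check watcher itself (failedDueToCheck), it must not push
// into checkUpdateCh: the watcher is mid-iteration and cannot drain it, and
// it removes the check on its own once Restart returns.
func (c *serviceClient) RemoveTask(taskID string, failedDueToCheck bool) {
	if !failedDueToCheck {
		c.watcher.Unwatch(taskID + ":check")
	}
	fmt.Println("deregistered services for", taskID)
}

func main() {
	w := &checkWatcher{checkUpdateCh: make(chan string, 1)}
	sc := &serviceClient{watcher: w}

	sc.RemoveTask("task-a", false) // normal stop: safe to notify the watcher
	sc.RemoveTask("task-b", true)  // check-triggered restart: skip the push
}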

Here's a job file to help test for the deadlock:

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    count = 7

    restart {
      attempts = 10
      interval = "5m"
      delay = "10s"
      mode = "delay"
    }

    task "redis" {
      driver = "docker"

      shutdown_delay = "10s"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 100
        memory = 32
        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"
        check {
          type = "script"
          command = "false"
          interval = "2s"
          timeout = "1s"
          check_restart = {
            limit = 3
            grace = "10s"
          }
        }
      }
    }
  }
}

A count of 7 is overkill but it helps speed things up. The job has
an intentionally faulty healthcheck. Run the job against a single node
using the 0.9.3 agent and then walk away for 5 minutes. The check
watcher should be deadlocked, and attempting to stop the job or drain
the node will fail/block. The same job file with this PR's changes
should be able to restart to its heart's content.

Additionally, the deregister function no longer removes Consul tasks
twice. It used to do this when canary tasks had a different generated
Consul ID, but this is no longer the case, and the second task removal
is unnecessary, only generating errors in the Nomad and Consul logs.
While this does not itself cause the deadlock, it did speed the
deadlock up while it was present by inserting twice as many messages
into the channel.

A final addition: alloc restarts due to check restarts will no longer
wait through the shutdown delay. Unhealthy allocations should probably
just be restarted right away, and the use of the shutdown delay in the
prekill hook is probably an unintentional change due to the giant 0.9
refactor and alloc lifecycle changes.
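
Sketched in the same spirit (again with invented names rather than the real task runner plumbing), the kill path now only honors the delay for ordinary kills:

package main

import (
	"fmt"
	"time"
)

// killTask sketches a pre-kill path that only waits out shutdown_delay for
// ordinary kills; restarts triggered by failing checks proceed immediately.
func killTask(shutdownDelay time.Duration, dueToCheckRestart bool) {
	if dueToCheckRestart {
		fmt.Println("check-triggered restart: skipping shutdown_delay")
	} else if shutdownDelay > 0 {
		fmt.Println("waiting out shutdown_delay:", shutdownDelay)
		time.Sleep(shutdownDelay)
	}
	fmt.Println("killing task")
}

func main() {
	killTask(10*time.Millisecond, false) // normal kill: honor the delay
	killTask(10*time.Millisecond, true)  // unhealthy check: restart right away
}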

Tests have been adjusted to account for this where I could find them.

I don't really write golang so I apologize if there are better or more
idiomatic approaches to this sort of change. Same goes for tests
because I am definitely lost when it comes to the testing apparatus on
a giant golang project like this.

schmichael added a commit that referenced this pull request Jul 17, 2019
Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts.
This matches the pre-0.9 behavior where TaskRunner.Restart was an
asynchronous signal. The check-based restarting code was not designed
to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and
could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and
reproducer!

I created this alternative as changing the functionality of
TaskRunner.Restart has a much larger impact. This approach reverts to
old known-good behavior and minimizes the number of places changes are
made.
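
In rough terms, the alternative keeps the watcher's run loop free by making the restart call fire-and-forget; here is a simplified sketch of that shape (not the actual #5975 diff):

package main

import (
	"fmt"
	"sync"
	"time"
)

// restartTask stands in for the blocking TaskRunner.Restart call: it runs
// pre-kill hooks, stops the task, waits for it to exit, and starts it again.
func restartTask(id string) {
	time.Sleep(50 * time.Millisecond) // simulate the slow, blocking restart
	fmt.Println("restarted", id)
}

func main() {
	var wg sync.WaitGroup

	// Rather than calling restartTask inline from the watcher's run loop
	// (where blocking keeps checkUpdateCh from being drained), fire it off
	// in a goroutine so the loop can immediately go back to draining updates.
	for _, id := range []string{"alloc-1:redis", "alloc-2:redis"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			restartTask(id)
		}(id)
	}

	wg.Wait() // only for this demo; the real watcher keeps looping
}
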
@schmichael (Member) commented

Amazing investigative work and a great PR!

I created #5975 as an alternative since changing the behavior of TaskRunner.Restart has a large impact across many subsystems. It's what caused this bug to begin with!

The idea behind #5975 is to restore the 0.8 behavior of Restart calls being asynchronous. This allows checkRestarter.Run to continue to apply updates without risking blocking.

Thanks for your hard work investigating this. It made it extremely easy for me to reproduce and diagnose on my end!

I'll leave this PR open until the team decides which approach is preferred.

@byronwolfman (Contributor, Author) commented

Thanks for the alternative PR @schmichael! Looks much cleaner than mine so I'm willing to bet it will have fewer unintended consequences. :)

@preetapan removed the 0.9.4 label Jul 18, 2019
@byronwolfman (Contributor, Author) commented

Closing this in favor of #5975.
