Prevent checkwatcher from deadlocking from too many check restarts #5957

Conversation

@byronwolfman (Contributor) commented Jul 15, 2019

Fixes #5395

Each time the consul check watcher runs its main loop, it has three
choices: 1. update watched checks, 2. check & restart failed allocs,
or 3. exit. Updating watched checks drains checkUpdateCh; restarts
synchronously fill checkUpdateCh. If too many checks require a
restart during the main loop, the checkUpdateCh channel will become
full and block. Once this happens, the channel cannot be drained, and
the check watcher deadlocks.

A check restart travels through several functions, so here's a simplified
view of the run loop:

while {
  dequeueUpdates from w.checkUpdateCh {
    ...
  }
  checkTimer {
    for all check watches {
      restart if unhealthy via apply()
        -> apply() calls Restart()
          -> Restart() calls preKill()
            -> preKill() calls PreKilling() hook
              -> PreKilling() hook calls deregister()
                -> deregister() calls RemoveTask()
                  -> RemoveTask() calls Unwatch()
                    -> Unwatch() pushes messages into w.checkUpdateCh
                       // deadlock if channel is full before we exit the check watch for loop
    }
  }
}
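
For anyone who wants to see the failure mode in isolation, here is a minimal, standalone Go sketch; it is not Nomad's actual code, and the channel name and buffer size are made up. A single goroutine drains the channel at the top of each loop pass, but the restarts it performs mid-pass push back into that same channel, so once the buffer fills the send blocks and nothing is left to drain it:

package main

import "fmt"

func main() {
	// Small buffer standing in for checkUpdateCh.
	updateCh := make(chan string, 2)

	for {
		// 1. Drain any pending updates (non-blocking).
		for more := true; more; {
			select {
			case msg := <-updateCh:
				fmt.Println("drained:", msg)
			default:
				more = false
			}
		}

		// 2. "Restart" unhealthy checks. Each restart synchronously pushes
		// an unwatch message back into the channel we just drained. With
		// more restarts than buffer slots in a single pass, the send below
		// blocks forever: nothing drains the channel until this pass ends,
		// and this send is exactly what keeps the pass from ending.
		for c := 0; c < 3; c++ {
			updateCh <- fmt.Sprintf("unwatch check %d", c)
		}
	}
}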

To fix this, we will pass along a bool to the prekill hook
indicating that the restart is due to a check restart, so that during
task deregistration we do not attempt to insert messages into the
checkUpdateCh channel. These checks are removed by the check watcher
once the task runner's Restart function returns anyway.
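
Roughly, the change has this shape; the types, names, and signatures below are illustrative stand-ins rather than Nomad's real hook interfaces. The deregistration path learns whether it was reached via a check restart and, if so, skips the Unwatch push:

package main

import "fmt"

// checkWatcher is a stand-in for the real check watcher.
type checkWatcher struct {
	checkUpdateCh chan string
}

// Unwatch queues a check for removal in the watcher's run loop.
func (w *checkWatcher) Unwatch(checkID string) {
	w.checkUpdateCh <- "unwatch " + checkID
}

// serviceClient is a stand-in for the Consul service client.
type serviceClient struct {
	watcher *checkWatcher
}

// RemoveTask deregisters a task's services and checks. When the removal was
// triggered by the check watcher itself (failedDueToCheck), it must not push
// into checkUpdateCh: the watcher is mid-iteration and cannot drain it, and
// it removes the check on its own once Restart returns.
func (c *serviceClient) RemoveTask(taskID string, failedDueToCheck bool) {
	if !failedDueToCheck {
		c.watcher.Unwatch(taskID + ":check")
	}
	fmt.Println("deregistered services for", taskID)
}

func main() {
	w := &checkWatcher{checkUpdateCh: make(chan string, 1)}
	sc := &serviceClient{watcher: w}

	sc.RemoveTask("task-a", false) // normal stop: safe to notify the watcher
	sc.RemoveTask("task-b", true)  // check-triggered restart: skip the push
}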

Here's a job file to help test for the deadlock:

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    count = 7

    restart {
      attempts = 10
      interval = "5m"
      delay = "10s"
      mode = "delay"
    }

    task "redis" {
      driver = "docker"

      shutdown_delay = "10s"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 100
        memory = 32
        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"
        check {
          type = "script"
          command = "false"
          interval = "2s"
          timeout = "1s"
          check_restart = {
            limit = 3
            grace = "10s"
          }
        }
      }
    }
  }
}

A count of 7 is overkill but it helps speed things up. The job has
an intentionally faulty healthcheck. Run the job against a single node
using the 0.9.3 agent and then walk away for 5 minutes. The check
watcher should be deadlocked, and attempting to stop the job or drain
the node will fail/block. The same job file with this PR's changes
should be able to restart to its heart's content.

Additionally, the deregister function no longer removes Consul tasks
twice. It used to do this when canary tasks had a different generated
Consul ID, but this is no longer the case, and the second task removal
is unnecessary, only generating errors in the Nomad and Consul logs.
While this does not itself cause the deadlock, it did speed the
deadlock up while it was present by inserting twice as many messages
into the channel.

A final addition: alloc restarts due to check restarts will no longer
wait through the shutdown delay. Unhealthy allocations should probably
just be restarted right away, and the use of the shutdown delay in the
prekill hook is probably an unintentional change due to the giant 0.9
refactor and alloc lifecycle changes.
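
Sketched in the same spirit (again with invented names rather than the real task runner plumbing), the kill path now only honors the delay for ordinary kills:

package main

import (
	"fmt"
	"time"
)

// killTask sketches a pre-kill path that only waits out shutdown_delay for
// ordinary kills; restarts triggered by failing checks proceed immediately.
func killTask(shutdownDelay time.Duration, dueToCheckRestart bool) {
	if dueToCheckRestart {
		fmt.Println("check-triggered restart: skipping shutdown_delay")
	} else if shutdownDelay > 0 {
		fmt.Println("waiting out shutdown_delay:", shutdownDelay)
		time.Sleep(shutdownDelay)
	}
	fmt.Println("killing task")
}

func main() {
	killTask(10*time.Millisecond, false) // normal kill: honor the delay
	killTask(10*time.Millisecond, true)  // unhealthy check: restart right away
}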

Tests have been adjusted to account for this where I could find them.

I don't really write golang so I apologize if there are better or more
idiomatic approaches to this sort of change. Same goes for tests
because I am definitely lost when it comes to the testing apparatus on
a giant golang project like this.

schmichael added a commit that referenced this pull request Jul 17, 2019
Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts.
This matches the pre-0.9 behavior where TaskRunner.Restart was an
asynchronous signal. The check-based restarting code was not designed
to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and
could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and
reproducer!

I created this alternative as changing the functionality of
TaskRunner.Restart has a much larger impact. This approach reverts to
old known-good behavior and minimizes the number of places changes are
made.
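
In rough terms, the alternative keeps the watcher's run loop free by making the restart call fire-and-forget; here is a simplified sketch of that shape (not the actual #5975 diff):

package main

import (
	"fmt"
	"sync"
	"time"
)

// restartTask stands in for the blocking TaskRunner.Restart call: it runs
// pre-kill hooks, stops the task, waits for it to exit, and starts it again.
func restartTask(id string) {
	time.Sleep(50 * time.Millisecond) // simulate the slow, blocking restart
	fmt.Println("restarted", id)
}

func main() {
	var wg sync.WaitGroup

	// Rather than calling restartTask inline from the watcher's run loop
	// (where blocking keeps checkUpdateCh from being drained), fire it off
	// in a goroutine so the loop can immediately go back to draining updates.
	for _, id := range []string{"alloc-1:redis", "alloc-2:redis"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			restartTask(id)
		}(id)
	}

	wg.Wait() // only for this demo; the real watcher keeps looping
}
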
@schmichael (Member) commented

Amazing investigative work and a great PR!

I created #5975 as an alternative since changing the behavior of TaskRunner.Restart has a large impact across many subsystems. It's what caused this bug to begin with!

The idea behind #5975 is to restore the 0.8 behavior of Restart calls being asynchronous. This allows checkRestarter.Run to continue to apply updates without risking blocking.

Thanks for your hard work investigating this. It made it extremely easy for me to reproduce and diagnose on my end!

I'll leave this PR open until the team decides which approach is preferred.

@byronwolfman (Contributor, Author) commented

Thanks for the alternative PR @schmichael! Looks much cleaner than mine so I'm willing to bet it will have fewer unintended consequences. :)

@preetapan removed the 0.9.4 label Jul 18, 2019
@byronwolfman (Contributor, Author) commented

Closing this in favor of #5975.
