
coordinate restarts across clients (template rerender, check restart, etc.) #10920

valodzka opened this issue Jul 21, 2021 · 7 comments
valodzka commented Jul 21, 2021

Proposal

It should be possible to restart tasks one by one without causing downtime (similar to how it can be done during a deploy with max_parallel = 1).

Use-cases

Job example:

job "test" {
  group "main" {
    count = 2
    update {
      max_parallel = 1
    }
    service {
      name = "srv"
      check { ... }
    }
    task "a" {
      driver = "docker"
      ...
      template {
        env  = true
        data = "X={{ key \"test/x\" }}"
      }
    }
  }
}

Task “a” takes significant time to start up (for example, 10 minutes). That isn’t an issue during deployment, because instance “a1” handles requests until “a2” starts and passes its checks, and vice versa. But if the value of the Consul key “test/x” changes, or the “srv” check becomes unhealthy for both task allocations (a soft fail, where a restart is recommended but not strictly required), I get 10 minutes of downtime while both services restart.

Attempted Solutions

One solution is to use the splay setting, but it doesn't guarantee that allocations won't restart at the same time. Also, if a restart takes a long time (5-10 minutes), the splay would have to be huge (a few hours) to make overlapping downtime reasonably rare.
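For reference, the splay-based workaround looks roughly like this (a sketch; the 2h window is illustrative, not a recommendation):

```hcl
template {
  env  = true
  data = "X={{ key \"test/x\" }}"

  # On a change, wait a random period between 0 and 2h before restarting.
  # The window has to dwarf the 5-10 min restart time for overlapping
  # downtime to be rare, which is why it ends up so large.
  splay = "2h"
}
```

The randomness only makes a simultaneous restart unlikely, never impossible, which is the limitation described above.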

Another solution is to change the application's stop process: handle SIGTERM, check whether instance “a2” is restarting, and if so, delay the stop for up to 10 minutes (and configure kill_timeout to 10 minutes accordingly). But this approach is complex, error-prone, and affects shutdown in other cases.

Without Nomad this can be done in a simple way: `consul lock restart-lock restart-service-and-wait-healthy`. With Nomad I don't see any simple solution.
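Expanded, that one-liner is something like the following sketch (the unit name and health endpoint are hypothetical; it requires a running Consul agent):

```shell
# `consul lock` serializes the quoted body across hosts: only one node
# holds "restart-lock" at a time, so the other instance keeps serving
# traffic while this one restarts and becomes healthy again.
consul lock restart-lock '
  systemctl restart myapp                           # hypothetical unit
  until curl -fsS http://localhost:8080/health; do  # hypothetical check
    sleep 5
  done
'
```

The lock is released only after the health check passes, so the next node's restart cannot begin until this one is serving again.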

@pznamensky

See also:
#6821
#6392
#6151

@valodzka

also #10957

tgross commented Jan 12, 2022

@valodzka is this feature request actually related to template restart only? Imagine the scenario where we have two clients, each with one allocation for the same job. And through bad luck both of the containers crash at the same time (ex. they OOM). Each client restarts its own allocations without coordination with the server, which allows us to keep the application up as much as possible.

But it sounds like you're really only concerned about restarts specific to updates from templates. In which case this issue is a duplicate of #6151

valodzka commented Jan 12, 2022

@tgross restarts related to templates are my main pain point currently.

But I can imagine other cases when it would be reasonable:

  1. If an administrator triggers a restart manually, then other restarts (manual or via template change) should be blocked.
  2. If a deploy is in progress and one of the instances is restarting, then other restarts (manual or via template change) should be blocked.
  3. If a restart is in progress for any reason, the deploy's restart should be blocked until it completes.
  4. If one instance is restarting due to a crash and the template changes at the same time, the restart of the second instance should be postponed until the first is ready.

A crash is a special case: we cannot prevent apps from stopping, so new starts should of course not be blocked. The same exception should probably apply to something like manual restarts with the -force flag. In other words, there should be blocking for stopping an allocation, but not for starting a new one if the old one has already stopped.

I consider #6151 a special case of this, but if implemented it would cover most of the issues arising in practice, because restarts due to template changes constantly happen at the same time, while the cases I described occur infrequently.

tgross commented Jan 12, 2022

A crash is a special case: we cannot prevent apps from stopping, so new starts should of course not be blocked. The same exception should probably apply to something like manual restarts with the -force flag. In other words, there should be blocking for stopping an allocation, but not for starting a new one if the old one has already stopped.

It seems like we're mixing up a bunch of different concerns here:

  • Deployments don't restart or reschedule allocations at all; they create new allocations and terminate old allocations. If allocations fail during a deployment, they may be restarted/rescheduled but that's not driven from the deployment.
  • Rescheduling happens when a client is disconnected, when an allocation can no longer be restarted, or when an allocation is intentionally stopped (nomad alloc stop). This creates an evaluation that goes all the way through the scheduling pipeline to create a new allocation.
  • Restarting happens when a task stops unexpectedly or when the task is restarted by a change signal (nomad alloc signal or a template rendering).
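
For concreteness, the three operations above correspond to these Nomad CLI entry points (shown as a sketch; `<alloc-id>` is a placeholder, and the commands require a running cluster):

```shell
nomad alloc stop <alloc-id>              # reschedule: new evaluation, replacement allocation
nomad alloc signal -s SIGHUP <alloc-id>  # deliver a signal to the running tasks in place
nomad alloc restart <alloc-id>           # in-place restart on the same client
```

Only the first goes through the server-side scheduling pipeline; the other two are handled by the client that owns the allocation, which is why they happen without cross-client coordination.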

What you're asking seems to suggest that we should freeze all deployments and scheduling operations for all allocations for a job when any of those 3 operations above happen. This isn't likely something we're going to want to do, as it would cause deployments to become extremely brittle -- any failure would cause the entire deployment to fail.

@valodzka

I may be confusing Nomad terminology, so sorry for that.

we should freeze all deployments and scheduling operations for all allocations for a job when any of those 3 operations above happen

Deployments of groups in Nomad are currently independent of one another, so I think it's reasonable to do the same here: freeze at the group level, not all allocations for the job.

it would cause deployments to become extremely brittle

Okay, I underestimated the side effects of my proposal. It can probably be reduced to the restart case, that is: template rendering, alloc signal, alloc restart, and check_restart.

tgross commented Feb 8, 2022

Ok, I'm going to rename this issue a bit and put it on the backlog for further discussion.

@tgross tgross changed the title Restart tasks without downtime, one by one or with max_parallel setting coordinate restarts across clients (template rerender, check restart, etc.) Feb 8, 2022
@tgross tgross removed their assignment Feb 8, 2022
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024