
coordinate restarts across clients (template rerender, check restart, etc.) #10920

valodzka opened this issue Jul 21, 2021 · 7 comments
valodzka commented Jul 21, 2021

Proposal

It should be possible to restart tasks one by one without causing downtime (similar to how it can be done during a deploy with max_parallel = 1).

Use-cases

Job example:

job "test" {
  group "main" {
    count = 2
    update {
      max_parallel = 1
    }
    service {
      name = "srv"
      check { ... }
    }
    task "a" {
      driver = "docker"
      ...
      template {
        env  = true
        data = "X={{ key \"test/x\" }}"
      }
    }
  }
}

Task “a” takes significant time to start up (for example, 10 minutes). That isn’t an issue during deployment, because instance “a1” handles requests until “a2” starts and passes its checks, and vice versa. But if the value of the Consul key “test/x” changes, or the “srv” check becomes unhealthy for both task allocations (a soft fail, where a restart is recommended but not strictly required), I get 10 minutes of downtime while both services restart.

Attempted Solutions

One solution is to use the splay setting, but it doesn't guarantee that allocations won't restart at the same time. Also, if a restart takes a long time (5-10 minutes), the splay would have to be huge (a few hours) to make overlapping downtime reasonably rare.
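For reference, the splay-based workaround looks roughly like this (a sketch; the 2h window is illustrative, not a recommendation):

```hcl
template {
  env  = true
  data = "X={{ key \"test/x\" }}"

  # On a change, wait a random period between 0 and 2h before restarting.
  # The window has to dwarf the 5-10 min restart time for overlapping
  # downtime to be rare, which is why it ends up so large.
  splay = "2h"
}
```

The randomness only makes a simultaneous restart unlikely, never impossible, which is the limitation described above.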

Another solution is to change the application's stop process: handle SIGTERM, check whether instance “a2” is restarting, and if so, delay the stop for up to 10 minutes (and configure kill_timeout to 10 minutes accordingly). But this approach is complex, error-prone, and affects shutdown in other cases.

Without Nomad this can be done in a simple way: `consul lock restart-lock restart-service-and-wait-healthy`. With Nomad I don't see any simple solution.
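Expanded, that one-liner is something like the following sketch (the unit name and health endpoint are hypothetical; it requires a running Consul agent):

```shell
# `consul lock` serializes the quoted body across hosts: only one node
# holds "restart-lock" at a time, so the other instance keeps serving
# traffic while this one restarts and becomes healthy again.
consul lock restart-lock '
  systemctl restart myapp                           # hypothetical unit
  until curl -fsS http://localhost:8080/health; do  # hypothetical check
    sleep 5
  done
'
```

The lock is released only after the health check passes, so the next node's restart cannot begin until this one is serving again.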

@pznamensky

See also:
#6821
#6392
#6151

@valodzka

also #10957

tgross commented Jan 12, 2022

@valodzka is this feature request actually related to template restart only? Imagine the scenario where we have two clients, each with one allocation for the same job. And through bad luck both of the containers crash at the same time (ex. they OOM). Each client restarts its own allocations without coordination with the server, which allows us to keep the application up as much as possible.

But it sounds like you're really only concerned about restarts specific to updates from templates. In which case this issue is a duplicate of #6151

valodzka commented Jan 12, 2022

@tgross restarts related to templates are my main pain point currently.

But I can imagine other cases when it would be reasonable:

  1. If an administrator triggers a restart manually, then other restarts (manual or via template change) should be blocked.
  2. If a deploy is in progress and one of the instances is restarting, then other restarts (manual or via template change) should be blocked.
  3. If a restart is in progress for any reason, the deploy's restart should be blocked until it completes.
  4. If one instance is restarting due to a crash and the template changes at the same time, the restart of the second instance should be postponed until the first is ready.

A crash is a special case: we cannot prevent apps from stopping, so new starts should of course not be blocked. The same exception should probably apply to something like manual restarts with the -force flag. In other words, there should be blocking for stopping an allocation, but not for starting a new one if the old one has already stopped.

I consider #6151 a special case of this, but if implemented it would cover most of the issues arising in practice, because restarts due to template changes constantly happen at the same time, while the cases I described occur infrequently.

tgross commented Jan 12, 2022

A crash is a special case: we cannot prevent apps from stopping, so new starts should of course not be blocked. The same exception should probably apply to something like manual restarts with the -force flag. In other words, there should be blocking for stopping an allocation, but not for starting a new one if the old one has already stopped.

It seems like we're mixing up a bunch of different concerns here:

  • Deployments don't restart or reschedule allocations at all; they create new allocations and terminate old allocations. If allocations fail during a deployment, they may be restarted/rescheduled but that's not driven from the deployment.
  • Rescheduling happens when a client is disconnected, when an allocation can no longer be restarted, or when an allocation is intentionally stopped (nomad alloc stop). This creates an evaluation that goes all the way through the scheduling pipeline to create a new allocation.
  • Restarting happens when a task stops unexpectedly or when the task is restarted by a change signal (nomad alloc signal or a template rendering).
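
For concreteness, the three operations above correspond to these Nomad CLI entry points (shown as a sketch; `<alloc-id>` is a placeholder, and the commands require a running cluster):

```shell
nomad alloc stop <alloc-id>              # reschedule: new evaluation, replacement allocation
nomad alloc signal -s SIGHUP <alloc-id>  # deliver a signal to the running tasks in place
nomad alloc restart <alloc-id>           # in-place restart on the same client
```

Only the first goes through the server-side scheduling pipeline; the other two are handled by the client that owns the allocation, which is why they happen without cross-client coordination.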

What you're asking seems to suggest that we should freeze all deployments and scheduling operations for all allocations for a job when any of those 3 operations above happen. This isn't likely something we're going to want to do, as it would cause deployments to become extremely brittle -- any failure would cause the entire deployment to fail.

@valodzka

I may be confusing Nomad terminology, so sorry for that.

we should freeze all deployments and scheduling operations for all allocations for a job when any of those 3 operations above happen

Deployments of groups in Nomad are currently independent of one another, so I think it's reasonable to do the same here: freeze at the group level, not all allocations for the job.

it would cause deployments to become extremely brittle

Okay, I underestimated the side effects of my proposal. It can probably be reduced to the restart case, that is: template rendering, alloc signal, alloc restart, and check_restart.

tgross commented Feb 8, 2022

Ok, I'm going to rename this issue a bit and put it on the backlog for further discussion.

@tgross tgross changed the title Restart tasks without downtime, one by one or with max_parallel setting coordinate restarts across clients (template rerender, check restart, etc.) Feb 8, 2022
@tgross tgross removed their assignment Feb 8, 2022
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024