coordinate restarts across clients (template rerender, check restart, etc.) #10920
also #10957
@valodzka is this feature request actually related to template restart only? Imagine the scenario where we have two clients, each with one allocation for the same job. And through bad luck both of the containers crash at the same time (e.g. they OOM). Each client restarts its own allocations without coordination with the server, which allows us to keep the application up as much as possible. But it sounds like you're really only concerned about restarts specific to updates from templates. In which case this issue is a duplicate of #6151
@tgross restarts related to templates are my main pain point currently. But I can imagine other cases where it would be reasonable:
A crash is a special case because we cannot prevent apps from stopping, so new starts should not be blocked, of course. The same exception should probably be made for something like manual restarts with the -force flag. In other words, there should be blocking for stopping an allocation, but not for starting a new one if it has already stopped. I consider #6151 a special case of this, but if implemented it will cover most issues arising in practice, because restarts due to template changes constantly happen at the same time, while the cases I described are infrequent.
It seems like we're mixing up a bunch of different concerns here:
What you're asking seems to suggest that we should freeze all deployments and scheduling operations for all allocations for a job when any of those 3 operations above happen. This isn't likely something we're going to want to do, as it would cause deployments to become extremely brittle -- any failure would cause the entire deployment to fail.
I might be confusing Nomad terminology, so sorry for that.
Deployments of groups in Nomad are currently independent of one another, so I think it's reasonable to do the same here: freeze at the group level, not all allocations for the job.
Okay, I underestimated the side effects of my proposal. Then it can probably be reduced to the restart case, that is, template rendering.
Ok, I'm going to rename this issue a bit and put it on the backlog for further discussion.
Proposal
It should be possible to restart tasks one by one without causing downtime (similar to how it can be done during a deployment with max_parallel = 1).
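For comparison, this is roughly how the existing rolling-update behaviour is expressed in a job spec; the proposal asks for similar one-at-a-time coordination for in-place restarts (template re-renders, check_restart, manual restarts), which today are not serialized across clients. The values below are only illustrative:

```hcl
group "a" {
  # During a deployment, Nomad already replaces allocations one at a
  # time and waits for each to become healthy before moving on.
  update {
    max_parallel     = 1
    min_healthy_time = "30s"
    healthy_deadline = "15m"
  }
}
```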
Use-cases
Job example:
Task “a” takes a significant time to start up (for example, 10 minutes). That isn't an issue during deployment, because instance “a1” handles requests until “a2” starts and passes its checks, and vice versa. But if the value of the config key “test/x” changes, or the “srv” check becomes unhealthy for both task allocations (a soft failure where a restart is recommended but not strictly required), I get 10 minutes of downtime until both services have restarted.
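The original job spec is not reproduced in this thread. The following is a hypothetical sketch consistent with the description above; task “a”, the Consul key `test/x`, and the “srv” check come from the text, while the driver, image, port, and check details are assumptions:

```hcl
job "example" {
  group "a" {
    count = 2

    network {
      port "http" {}
    }

    task "a" {
      driver = "docker"

      config {
        image = "example/app:latest" # assumed image
        ports = ["http"]
      }

      # Re-renders and restarts the task whenever the Consul key changes;
      # both allocations restart at roughly the same time, hence the downtime.
      template {
        data        = "x = {{ key \"test/x\" }}"
        destination = "local/app.conf"
        change_mode = "restart"
      }

      service {
        name = "srv"
        port = "http"

        check {
          type     = "http"
          path     = "/health" # assumed endpoint
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```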
Attempted Solutions
One solution is to use the splay setting, but it doesn't guarantee that allocations won't restart at the same time. Also, if a restart takes a long time (5-10 minutes), the splay has to be huge (a few hours) to make overlapping downtime reasonably rare.

Another solution is to change the application's stop process: handle SIGTERM, check whether instance “a2” is restarting, and if it is, block the stop for 10 minutes (and configure kill_timeout to 10 minutes). But this solution is complex, error prone, and affects stop behaviour in other cases.
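For reference, a minimal sketch of where the splay workaround mentioned above would go in the template block; the one-hour value is only illustrative:

```hcl
template {
  data        = "x = {{ key \"test/x\" }}"
  destination = "local/app.conf"
  change_mode = "restart"

  # Wait a random duration between 0 and 1h before invoking change_mode,
  # so the two allocations are unlikely (but not guaranteed) to restart together.
  splay = "1h"
}
```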
Without Nomad it can be done in a simple way: `consul lock restart-lock restart-service-and-wait-healthy`, but with Nomad I don't see any simple solution.