
Deployments that download large Docker images/artifacts transition into a permanent unhealthy state due to low healthy_deadline #3121

Closed
wuub opened this issue Aug 29, 2017 · 10 comments


@wuub
Contributor

wuub commented Aug 29, 2017

Nomad version

0.6.2

Operating system and Environment details

Linux

Issue

Docker image downloading counts towards healthy_deadline, whose default of "5m" is low when an allocation lands on a fresh node.
https://www.nomadproject.io/docs/job-specification/update.html

In the case of a sudden rush of allocations, when a node is downloading several Docker images at once and is IO-starved (which makes the images download even slower), almost all allocations transition into a permanent unhealthy state.
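
For reference, those built-in defaults correspond roughly to the following update stanza (a sketch; the values match the DefaultUpdateStrategy shown in a diff later in this thread):

update {
  max_parallel     = 1
  health_check     = "checks"
  min_healthy_time = "10s"
  healthy_deadline = "5m"   # image pull time counts against this deadline
  auto_revert      = false
  canary           = 0
}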

Reproduction steps

  1. Use a low healthy_deadline to launch a Docker job with a largish image (a sketch of such a jobspec follows this list).
  2. The allocation and deployment will transition into an unhealthy state at some point.
  3. But the allocation will launch correctly at a later time.
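
A minimal sketch of such a reproducer jobspec (the image name and registry are placeholders; any image whose pull outlasts the deadline on the target node will do):

job "big-image-repro" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    max_parallel     = 1
    min_healthy_time = "10s"
    healthy_deadline = "1m"   # deliberately low so the docker pull exceeds it
  }

  group "app" {
    count = 1

    task "large-image-task" {
      driver = "docker"

      config {
        # placeholder image: pulling it must take longer than healthy_deadline
        image = "registry.example.internal/large-image:latest"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}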


Expected behaviour

Since the pace at which Docker downloads images is not entirely under the job's control (it's outside of resource allocation guarantees), I would assume it should not count towards the allocated startup time.

OR

If the alloc finally starts, I would expect the deployment to resume even from a failed state.

OR

The default value of healthy_deadline should be set to a much higher level.

@wuub
Contributor Author

wuub commented Aug 29, 2017

I just want to stress that this is super-not-fun to deal with, to put it mildly. Our deployments look like this right now:


and we have basically lost the ability to reliably release new versions.

Re-running ALL jobs with healthy_deadline changed in their Nomad job files would take ages.
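
(For clarity, that per-job change amounts to adding something like the stanza below to every single jobspec; the "30m" value here is just illustrative:)

update {
  # would have to be added or raised in every job file individually
  healthy_deadline = "30m"
}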

@dadgar question:

diff --git a/nomad/structs/structs.go b/nomad/structs/structs.go
index 03e4701..133a01d 100644
--- a/nomad/structs/structs.go
+++ b/nomad/structs/structs.go
@@ -2015,7 +2015,7 @@ var (
                MaxParallel:     1,
                HealthCheck:     UpdateStrategyHealthCheck_Checks,
                MinHealthyTime:  10 * time.Second,
-               HealthyDeadline: 5 * time.Minute,
+               HealthyDeadline: 25 * time.Minute,
                AutoRevert:      false,
                Canary:          0,
        }

If we run a v0.6.2 tag with a diff like this on the Nomad servers, will it change the global default for jobs without healthy_deadline specified in the jobspec?

@dadgar
Contributor

dadgar commented Aug 29, 2017

@wuub That would only impact newly submitted jobs.

@wuub
Contributor Author

wuub commented Aug 29, 2017

@dadgar ok, if so we will run a different patch on a scheduling node for now instead.

diff --git a/jobspec/parse.go b/jobspec/parse.go
index 14815ab..d119dfd 100644
--- a/jobspec/parse.go
+++ b/jobspec/parse.go
@@ -1176,6 +1176,9 @@ func parseUpdate(result **api.UpdateStrategy, list *ast.ObjectList) error {
        if err := checkHCLKeys(o.Val, valid); err != nil {
                return err
        }
+       if _, ok := m["healthy_deadline"]; !ok {
+               m["healthy_deadline"] = "30m"
+       }
 
        dec, err := mapstructure.NewDecoder(&mapstructure.DecoderConfig{
                DecodeHook:       mapstructure.StringToTimeDurationHookFunc(),

@wuub
Contributor Author

wuub commented Aug 30, 2017

I can confirm that with the change to the default healthy_deadline, a significant set of new deployments completed successfully without additional changes to the jobspec files.

@shantanugadgil
Contributor

Not sure if the last comment I wrote actually got posted (or if it was on some other issue):

I am seeing similar behavior on my local setup; the worker nodes and the Docker Registry are both on the intranet.
Eventually the task starts, but usually I see a couple of failures in the log messages.
Nomad being able to "do whatever it takes to start the job" means that the overall experience is just a delayed start, but yes, I wish there were no error messages to begin with!

Regards,
Shantanu

@wuub
Contributor Author

wuub commented Sep 2, 2017

that the overall experience is just a delayed start

This is not entirely true: if you're using rolling upgrades without auto-revert, the overall experience is much worse. This issue causes a stalled deployment, which in turn leads to a non-homogeneous environment, with max_parallel allocations running the new job version and count - max_parallel allocations still running the previous one (e.g. with count = 10 and max_parallel = 2, eight allocations stay on the old version).

What's worse, Nomad does not heal such a failed deployment on its own; it requires operator intervention.

@shantanugadgil
Contributor

@wuub ok, got it. Sounds quite serious for your use case.
Though I am using Nomad quite a bit, I haven't moved my setups to use rolling upgrades yet.
Still putting the various components through their paces.

Cheers,
Shantanu

@stale

stale bot commented May 10, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Jun 9, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed Jun 9, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 22, 2022