
Deployments that download large Docker images/artifacts transition into a permanent unhealthy state due to low healthy_deadline #3121

Closed
wuub opened this issue Aug 29, 2017 · 10 comments


@wuub
Contributor

wuub commented Aug 29, 2017

Nomad version

0.6.2

Operating system and Environment details

Linux

Issue

Docker image downloading counts towards healthy_deadline, whose default of "5m" is low when an allocation lands on a fresh node.
https://www.nomadproject.io/docs/job-specification/update.html

In the case of a sudden rush of allocations, when a node is downloading several Docker images at once and is IO-starved (which makes the images download even slower), almost all allocations transition into a permanent unhealthy state.
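
For reference, those built-in defaults correspond roughly to the following update stanza (a sketch; the values match the DefaultUpdateStrategy shown in a diff later in this thread):

update {
  max_parallel     = 1
  health_check     = "checks"
  min_healthy_time = "10s"
  healthy_deadline = "5m"   # image pull time counts against this deadline
  auto_revert      = false
  canary           = 0
}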

Reproduction steps

  1. Use a low healthy_deadline to launch a Docker job with a largish image (a sketch of such a jobspec follows this list).
  2. The allocation and deployment will transition into an unhealthy state at some point.
  3. But the allocation will launch correctly at a later time.
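
A minimal sketch of such a reproducer jobspec (the image name and registry are placeholders; any image whose pull outlasts the deadline on the target node will do):

job "big-image-repro" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    max_parallel     = 1
    min_healthy_time = "10s"
    healthy_deadline = "1m"   # deliberately low so the docker pull exceeds it
  }

  group "app" {
    count = 1

    task "large-image-task" {
      driver = "docker"

      config {
        # placeholder image: pulling it must take longer than healthy_deadline
        image = "registry.example.internal/large-image:latest"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}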


Expected behaviour

Since the pace at which Docker downloads images is not entirely under the job's control (it's outside of resource allocation guarantees), I would assume it should not count towards the allocated startup time.

OR

If the alloc finally starts, I would expect the deployment to resume even from a failed state.

OR

The default value of healthy_deadline should be set to a much higher level.

@wuub
Contributor Author

wuub commented Aug 29, 2017

I just want to stress that this is super-not-fun to deal with, to put it mildly. Our deployments look like this right now:


and we have basically lost the ability to reliably release new versions.

Re-running ALL jobs with healthy_deadline changed in their Nomad job files would take ages.
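
(For clarity, that per-job change amounts to adding something like the stanza below to every single jobspec; the "30m" value here is just illustrative:)

update {
  # would have to be added or raised in every job file individually
  healthy_deadline = "30m"
}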

@dadgar question:

diff --git a/nomad/structs/structs.go b/nomad/structs/structs.go
index 03e4701..133a01d 100644
--- a/nomad/structs/structs.go
+++ b/nomad/structs/structs.go
@@ -2015,7 +2015,7 @@ var (
                MaxParallel:     1,
                HealthCheck:     UpdateStrategyHealthCheck_Checks,
                MinHealthyTime:  10 * time.Second,
-               HealthyDeadline: 5 * time.Minute,
+               HealthyDeadline: 25 * time.Minute,
                AutoRevert:      false,
                Canary:          0,
        }

If we run a v0.6.2 tag with a diff like this on the Nomad servers, will it change the global default for jobs without healthy_deadline specified in the jobspec?

@dadgar
Contributor

dadgar commented Aug 29, 2017

@wuub That would only impact newly submitted jobs.

@wuub
Contributor Author

wuub commented Aug 29, 2017

@dadgar ok, if so we will run a different patch on a scheduling node for now instead.

diff --git a/jobspec/parse.go b/jobspec/parse.go
index 14815ab..d119dfd 100644
--- a/jobspec/parse.go
+++ b/jobspec/parse.go
@@ -1176,6 +1176,9 @@ func parseUpdate(result **api.UpdateStrategy, list *ast.ObjectList) error {
        if err := checkHCLKeys(o.Val, valid); err != nil {
                return err
        }
+       if _, ok := m["healthy_deadline"]; !ok {
+               m["healthy_deadline"] = "30m"
+       }
 
        dec, err := mapstructure.NewDecoder(&mapstructure.DecoderConfig{
                DecodeHook:       mapstructure.StringToTimeDurationHookFunc(),

@wuub
Contributor Author

wuub commented Aug 30, 2017

I can confirm that with the change to the default healthy_deadline, a significant set of new deployments completed successfully without additional changes to the jobspec files.

@shantanugadgil
Contributor

Not sure if the last comment I wrote actually got posted (or if it was on some other issue):

I am seeing similar behavior on my local setup; the worker nodes and the Docker Registry are both on the intranet.
Eventually the task starts, but usually I see a couple of failures in the log messages.
Nomad being able to "do whatever it takes to start the job" means that the overall experience is just a delayed start, but yes, I wish there were no error messages to begin with!

Regards,
Shantanu

@wuub
Contributor Author

wuub commented Sep 2, 2017

that the overall experience is just a delayed start

This is not entirely true: if you're using rolling upgrades without auto-revert, the overall experience is much worse. This issue causes a stalled deployment, which in turn leads to a non-homogeneous environment, with max_parallel allocations running the new job version and count - max_parallel allocations still running the previous one (e.g. with count = 10 and max_parallel = 2, eight allocations stay on the old version).

What's worse, Nomad does not heal such a failed deployment on its own; it requires operator intervention.

@shantanugadgil
Contributor

@wuub ok, got it. Sounds quite serious for your use case.
Though I am using Nomad quite a bit, I haven't moved my setups to use rolling upgrades yet.
Still putting the various components through their paces.

Cheers,
Shantanu

@stale

stale bot commented May 10, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Jun 9, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed Jun 9, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 22, 2022