Deployments that download large docker images/artifacts transition into a permanent unhealthy state due to low healthy_deadline #3121
Comments
I just want to stress that this is super-not-fun to deal with, to put it mildly. Our deployments look like this right now, and we have basically lost the ability to reliably release new versions. Re-running ALL jobs with healthy_deadline changed in the nomad files would take ages. @dadgar, question:

```diff
diff --git a/nomad/structs/structs.go b/nomad/structs/structs.go
index 03e4701..133a01d 100644
--- a/nomad/structs/structs.go
+++ b/nomad/structs/structs.go
@@ -2015,7 +2015,7 @@ var (
 		MaxParallel:     1,
 		HealthCheck:     UpdateStrategyHealthCheck_Checks,
 		MinHealthyTime:  10 * time.Second,
-		HealthyDeadline: 5 * time.Minute,
+		HealthyDeadline: 25 * time.Minute,
 		AutoRevert:      false,
 		Canary:          0,
 	}
```

If we run a v0.6.2 tag with a diff like this on the nomad servers, will it change the global default for jobs w/o healthy_deadline specified in the jobspec?
@wuub That would only impact newly submitted jobs.
@dadgar ok — if so, we will run a different patch instead for now on a scheduling node.

```diff
diff --git a/jobspec/parse.go b/jobspec/parse.go
index 14815ab..d119dfd 100644
--- a/jobspec/parse.go
+++ b/jobspec/parse.go
@@ -1176,6 +1176,9 @@ func parseUpdate(result **api.UpdateStrategy, list *ast.ObjectList) error {
 	if err := checkHCLKeys(o.Val, valid); err != nil {
 		return err
 	}
+	if _, ok := m["healthy_deadline"]; !ok {
+		m["healthy_deadline"] = "30m"
+	}
 	dec, err := mapstructure.NewDecoder(&mapstructure.DecoderConfig{
 		DecodeHook: mapstructure.StringToTimeDurationHookFunc(),
```
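To illustrate the effect of that parser patch from the jobspec side — a minimal sketch, not from the issue (job name and values are illustrative, and it assumes jobs are submitted through a node running the patched binary): an update stanza that simply omits healthy_deadline would be parsed with a 30-minute deadline instead of falling back to the stock default.

```hcl
job "example" {
  datacenters = ["dc1"]

  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"
    # healthy_deadline is deliberately omitted: with the patched
    # jobspec/parse.go above it is filled in as "30m"; with stock
    # 0.6.2 it falls back to the 5m server-side default.
  }

  # ... groups/tasks elided ...
}
```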
I can confirm that with a change to the default
Not sure if my last comment got posted (or if it was on some other issue): I am seeing similar behavior on my local setup; the worker nodes and the Docker Registry are both on the intranet. Regards,
this is not entirely true if you're using rolling upgrades without … What's worse, Nomad does not heal such a failed deployment on its own, and it requires operator intervention.
@wuub ok, got it. Sounds quite serious for your use case. Cheers,
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at it. Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.6.2
Operating system and Environment details
Linux
Issue
Docker image downloading counts towards healthy_deadline, whose default of "5m" is too low when an allocation lands on a fresh node.
https://www.nomadproject.io/docs/job-specification/update.html
In case of a sudden rush of allocations, when a node is downloading several docker images at once and is IO-starved (which causes the images to download even more slowly), almost all allocations transition into a permanent unhealthy state.
Reproduction steps
Expected behaviour
Since the pace at which docker downloads images is not entirely under the job's control (it is outside of the resource allocation guarantees), I would assume it should not count towards the allocation's allowed startup time.
OR
If the alloc finally starts, I would expect the deployment to resume, even from a failed state.
OR
The default value of this variable should be set to a much higher level.
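In the meantime, the available mitigation appears to be raising healthy_deadline per job in the update stanza documented at the link above. A minimal sketch — the value is illustrative and should exceed the worst-case image download time on a fresh, IO-starved node:

```hcl
update {
  max_parallel     = 1
  min_healthy_time = "10s"
  # Raise this well above the slowest expected docker pull; the
  # 0.6.2 default of "5m" is what triggers this issue.
  healthy_deadline = "30m"
}
```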