Unhealthy task is only restarted once despite restart policy #9176
Comments
Hi @naag! I was able to confirm this same behavior using that job file, and I could also reproduce what you saw when moving the `service` stanza to the task level. I'll mark this as a bug and start digging in a bit more. (Also, I've slightly edited your post to wrap the logs in a collapsible block.)
Thanks @tgross! Please let me know if there's anything else I can do to assist.
In order to narrow down the behavior, I moved the `service` stanza down to the task level (with the network in the task's `resources`), as in the job below. But now that's making me realize that the documented behavior of `check_restart` only covers task-level services. Which suggests it's not intended to be used at the group level at all, and that you should be getting an error when the job is validated. (And restarting once is also a weird behavior, for sure.)

```hcl
job "fail-service" {
  datacenters = ["dc1"]

  reschedule {
    delay          = "15s"
    delay_function = "constant"
    unlimited      = true
  }

  group "cache" {
    restart {
      attempts = 3
      interval = "30m"
      delay    = "5s"
      mode     = "fail"
    }

    task "main" {
      driver = "docker"

      config {
        image = "thobe/fail_service:v0.1.0"
        port_map {
          http = 8080
        }
      }

      env {
        HEALTHY_FOR   = 20
        UNHEALTHY_FOR = 120
      }

      service {
        name = "fail-service"
        port = "http"

        check_restart {
          limit           = 4
          grace           = "10s"
          ignore_warnings = false
        }

        check {
          type     = "http"
          port     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }
    }
  }
}
```
Oh, it appears someone smarter than me has thought of that already. We'll want a `task` field on the group-level `service` stanza:

```hcl
service {
  name = "fail-service"
  port = "http"
  task = "main"
}
```
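For context, here is a rough sketch of how that `task` field sits alongside a group-level `network` stanza. The values just mirror the job file above, so treat this as illustrative rather than a verified fix; as the rest of this thread shows, the check-triggered restart still only fires once for group services until the underlying bug is fixed:

```hcl
group "cache" {
  # Group-level network so the port label is visible to the group service.
  network {
    mode = "bridge"
    port "http" {
      to = 8080
    }
  }

  service {
    name = "fail-service"
    port = "http"
    task = "main" # ties the group service (and its checks) to this task

    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"

      check_restart {
        limit = 4
        grace = "10s"
      }
    }
  }

  task "main" {
    driver = "docker"

    config {
      image = "thobe/fail_service:v0.1.0"
      ports = ["http"]
    }
  }
}
```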
Ok, I've figured out why the behavior is different in the two cases. The problem isn't the `check_restart` configuration itself, but the way the service lifecycle hooks are wired up.

Not every hook implements every event. In this case, group-level services are managed by the allocation's group service hook rather than by the task's own hooks. In a job with a service at the task level, when the check fails and the task is restarted, the task-level hooks fire around the restart and the Consul check watcher is set up again. But the hooks for group services fire only for a subset of those lifecycle events, not for task restarts, so the watcher that was removed on restart never gets reinstated. So unfortunately there's a little architecture problem here with the group service hooks supporting check restarts. I think at this point I want to get a second opinion from a few of my colleagues on the Nomad team to figure out the best path forward.
That's a very thorough investigation, thanks for taking the time @tgross 👍! However, I didn't yet mention our intention to use Consul Connect, and it seems that this requires the `service` stanza to be defined at the group level. As a workaround, we're currently considering using …
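For reference on why the group level matters here: Consul Connect sidecars are declared under a group-level `service` with a bridge-mode group network, roughly like the following (a minimal sketch with illustrative values, not taken from the original job):

```hcl
group "cache" {
  network {
    mode = "bridge"
  }

  service {
    name = "fail-service"
    port = "8080" # port the application listens on inside the network namespace

    connect {
      # Registers an Envoy sidecar proxy for this service in the Consul mesh.
      sidecar_service {}
    }
  }

  task "main" {
    driver = "docker"

    config {
      image = "thobe/fail_service:v0.1.0"
    }
  }
}
```

Nomad doesn't support `connect` on task-level services, which is why simply moving the `service` stanza down to the task isn't an option in this setup.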
I think you'll want to leave out the …
Did you have the chance to discuss this with your colleagues? Is there any chance of a PR being accepted, considering it's an architectural issue as you say?
If I understand correctly, the …
Add a warning about check_restart being limited to task networks and link to the relevant issue: #9176.
Apologies for the delay... I'm starting on the patch this week.
Correct, which is why that's not a very good workaround and this isn't something we intended to leave unfixed.
Work in progress PR is #9869. Needs a good deal of testing and refactoring yet, but this should be on the right track now.
#9869 has been merged and will go out with the next point release.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.12.6 (a8ea7c5f421297db434b45046fca7a9deef6df85)
Operating system and Environment details
Issue
We've observed that unhealthy tasks aren't restarted according to the `check_restart` and `restart` stanzas in Nomad 0.12.6, but since this is our very first Nomad installation, we cannot say whether this is really limited to that particular version. For testing, we use the Docker image `thobe/fail_service` to simulate a service that starts up healthy, stays healthy for 20 seconds, and then turns unhealthy for the next 120 seconds. Given our job file (see below), we assume this would trigger the following sequence of events (the relevant stanzas are condensed in the sketch after this description):

1. The task starts and its health check initially passes.
2. Once the check starts failing, `check_restart` restarts the task after the configured number of failed checks.
3. When the local restart attempts are exhausted, the `restart` stanza with `mode = "fail"` fails the allocation.
4. In combination with the `reschedule` stanza, the allocation is then rescheduled, and the story resumes at step 1 indefinitely.

Our hunch after some local debugging is that this is caused by the Consul health watcher being removed on task restart and never being reinstated. This in turn seems to be happening only when defining the `service` stanza at the group level. However, when moving it to the task level, the check cannot find the port label defined in the `network` stanza at the group level.

Please note that we've also opened a thread at discuss.hashicorp.com, but after talking to some people on Gitter, it seems that this is not a misunderstanding of Nomad but more likely a bug somewhere, which is why we opted to also open an issue here.
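To make the expected sequence above concrete, here is a condensed view of the three stanzas involved, using the values from the job file shown earlier in this thread (a sketch for illustration, not the full job file):

```hcl
# Service-level: after `limit` consecutive failed health checks (once the
# `grace` period following a task (re)start has passed), check_restart asks
# Nomad to restart the task.
check_restart {
  limit = 4
  grace = "10s"
}

# Group-level: up to 3 local restarts within the interval; after that,
# mode = "fail" marks the allocation as failed instead of retrying locally.
restart {
  attempts = 3
  interval = "30m"
  delay    = "5s"
  mode     = "fail"
}

# Job-level: failed allocations are replaced every 15 seconds, indefinitely,
# so the whole cycle should start over on a fresh allocation.
reschedule {
  delay          = "15s"
  delay_function = "constant"
  unlimited      = true
}
```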
Reproduction steps
Start Nomad and Consul, and run the job. Then observe the output of `nomad alloc status ...`. The task will eventually restart, but only once.

Job file
Nomad Client / Server logs
Consul Server logs