Nomad alloc stuck in pending state on a new job #9375
Here is another job that Nomad seems unable to restart or kill if the task gets a "Restart Signaled | Template with change_mode restart re-rendered" event.
Here is an alloc that got a template restart but did not actually restart.
The job spec only wants a count of 2 running, but you can see on this host I have 3 Docker jobs running, and each one of these is dead.
I am also using 1.0.0-beta3 and many of my jobs are getting stuck in "Pending"; some of them stay there for two days and then overnight move to "Running".
@marcchua Thanks for reporting the issue. This ticket is my top priority for the week. Having more data will be extremely helpful. In particular, I'd appreciate capturing the following when an alloc is marked in a pending state:
You can email them to [email protected] if they contain some sensitive info that you prefer not to post here. The data will help us pinpoint the problem quicker. Thank you so much!
From what I have identified, once I correct the group network deprecation configuration documented here https://www.nomadproject.io/docs/upgrade/upgrade-specific#mbits-and-task-network-resource-deprecation and add the kill_signal to the Docker task, things seem to be working a lot better.
Without these changes I have had up to 4 versions of a job running and was unable to clean up the jobs without purging them or restarting servers.
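For illustration, a minimal sketch of the two changes described above, i.e. moving the network stanza to the group level and setting an explicit kill_signal on the Docker task. The group, task, and port names here are placeholders, not the actual job:

group "cache" {
  network {
    port "db" {
      to = 6379
    }
  }

  task "redis" {
    driver      = "docker"
    # explicit kill signal so Docker sends the signal the app expects
    kill_signal = "SIGTERM"

    config {
      image = "redis:3.2"
      ports = ["db"]
    }
  }
}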
Thanks @oopstrynow for the info. If you are able, can you reproduce the issue on your end and send us the info I requested in my last comment? Also, did you run into this issue with 1.0.0-beta1 (or an earlier version)?
@oopstrynow I'm seeing this line in the client logs:
I did a little tracing through the code and that "reason" message looks like we should only see it if there was a request to the HTTP API (ref). While I definitely can't begin to suggest that's a root cause, is there any chance you have automation that's cleaning up allocations?
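For context, the kind of out-of-band automation being asked about would be something issuing allocation stops through the API or the equivalent CLI command, roughly like the sketch below (the alloc ID is a placeholder):

# hypothetical cleanup script; a stop issued this way goes through the HTTP API
# and would produce an API-initiated "reason" on the allocation
nomad alloc stop -detach <alloc-id>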
Quick status update on where we are with this investigation. I've been trying to reproduce locally based on the information we have so far. I haven't been able to do so yet, but I do at least have part of a proposed mechanism at this point:
What I don’t know yet:
jobspec for testing

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    network {
      port "db" {
        to = 6379
      }
    }

    task "sidecar" {
      driver = "docker"

      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      config {
        image   = "docker.elastic.co/beats/filebeat:7.10.0"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
        #args = ["-e"]
        volumes = [
          "local/filebeat.yml:/usr/share/filebeat/filebeat.yml",
        ]
      }

      template {
        change_mode = "restart"
        data        = <<EOH
#!/bin/bash
trap 'echo "received SIGINT"' INT
filebeat -e
EOH
        destination = "local/script.sh"
      }

      template {
        change_mode = "restart"
        data        = <<EOH
http.enabled: false
monitoring.enabled: false
filebeat.inputs:
- type: log
  paths:
    - /alloc/logs/redis.stderr.[0-9]*
    - /alloc/logs/redis.stdout.[0-9]*
processors:
  - decode_json_fields:
      fields: ["message"]
      target: "json"
      overwrite_keys: true
output.console:
  pretty: true
EOH
        destination = "local/filebeat.yml"
      }

      env {
        KEY = "X"
      }

      resources {
        cpu    = 256
        memory = 128
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        ports = ["db"]
      }

      env {
        KEY = "X"
      }

      resources {
        cpu    = 256
        memory = 128
      }

      template {
        change_mode = "restart"
        data        = <<EOF
{{ range service "nomad-client" }}
server {{ .Name }} {{ .Address }}:{{ .Port }}{{ end }}
EOF
        destination = "local/template1"
      }

      template {
        change_mode = "restart"
        data        = <<EOF
{{ range service "nomad" }}
server {{ .Name }} {{ .Address }}:{{ .Port }}{{ end }}
EOF
        destination = "local/template2"
      }
    }
  }
}
I do not have anything cleaning up old allocations. Maybe a new task is attempting to remove the older version of the task?
Sorry I did not get back to you sooner. My goal is to work on reproducing this with a 0.12.8 server/client and a job file that does not include the 0.12.8 network changes that move the network to the group, then update the server to 1.0.0-beta3; I suspect this will reproduce the alloc problems I found. Below is what my Fabio config looks like, minus some Vault parts I removed.
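For reference, a sketch of the older, now-deprecated task-level network block (with mbits) that the upgrade guide linked above replaces with a group-level network stanza; this is illustrative only, not the actual Fabio job:

task "fabio" {
  driver = "docker"

  resources {
    cpu    = 200
    memory = 128

    # deprecated pre-0.12 style: network defined inside the task's resources
    network {
      mbits = 10
      port "lb" {
        static = 9999
      }
    }
  }
}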
Yeah, with some fresh eyes I'm seeing that we can hit that if the server gives the client instructions to remove it in
@oopstrynow I'm digging into the interaction between Nomad and Docker and its signal handling a bit and I want to mirror the OS/
Here is the kernel and Docker version we have.
[root@ip-172-31-2-207 ~]# uname -a
[root@ip-172-31-2-207 ~]# docker --version
Ok, so at this point I have a partial reproduction and one newly found Nomad bug to fix. Our Docker driver has a method. As it turns out, there are recently-opened bugs for this version of Docker that involve just that!
With that in hand, I deployed a build of Nomad. The allocations that are restored after the client restart all have multiple
Including their Docker containers:
This state persists for as long as the deployment progress deadline (10min), at which point the Nomad servers intervene and force everything to get cleaned up. I took a goroutine dump of Nomad in this state and I find that I have two task runners running, one of them stuck on
It's not a 100% reproduction, because Nomad does eventually seem to get its act together in this repro. So either the job I'm using doesn't quite replicate the circumstances, or maybe we need to block forever at the Docker API to replicate it. I'll work up a PR to add the timeout we need to the Docker driver... this is a longstanding bug, but it may be that @oopstrynow is hitting it now because of an unlucky combination of the Docker bug and our changing the default signal handling behavior, which impacted his application. I'm not convinced we've got the entire story yet, but we're getting closer.
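As a rough illustration of the kind of timeout being described (a generic Go sketch, not Nomad's actual driver code; stopContainer is a stand-in for the blocking Docker API call):

package main

import (
	"context"
	"fmt"
	"time"
)

// stopContainer is a stand-in for a Docker API call that can block
// indefinitely when the daemon misbehaves.
func stopContainer(id string) error {
	time.Sleep(5 * time.Minute) // simulate a hung daemon
	return nil
}

func main() {
	// bound how long we are willing to wait on the Docker API
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	errCh := make(chan error, 1)
	go func() { errCh <- stopContainer("placeholder-id") }()

	select {
	case err := <-errCh:
		fmt.Println("container stopped:", err)
	case <-ctx.Done():
		// give up instead of leaving the task runner stuck forever
		fmt.Println("timed out waiting for Docker:", ctx.Err())
	}
}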
@tgross wow, this is awesome. Thank you for digging into this issue!
I've just merged #9502, which will ship in the release candidate we're building later today for likely release tomorrow.
@notnoop and I had a discussion with @oopstrynow and I think we're in a good spot with this and #9376 now:
I'm going to close this issue and #9376. Thanks for your help @oopstrynow and please let us know if you see anything like this again!
Hi @tgross, this just happened to us. An allocation stayed in the pending state on a new deployment because another one was not stopping. Nomad version: 1.2.6. Here are the allocation events:
Relevant logs:
At this point, after a few minutes, I ran:
I don't know if this is a Docker problem, but it was killed correctly with the docker rm. I got a dump of the goroutines before the docker rm and noticed a couple that I think are relevant.
Could an exec session have been left there that is blocking the container from stopping?
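For anyone wanting to capture the same kind of goroutine dump, one way (assuming the agent was started with enable_debug = true and is listening on the default address) is the pprof endpoint:

# dumps all goroutine stacks from a local Nomad agent with debug enabled
curl -s "http://127.0.0.1:4646/debug/pprof/goroutine?debug=2" > goroutines.txt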
@jorgemarey can you open a new issue for this?
Yes, of course!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad alloc stuck in pending state
Nomad version
1.0.0-beta3
Operating system and Environment details
Amazon Linux 2
Linux ip-172-31-3-147.service.consul 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Issue
When I deploy a new job, the old allocs are not getting removed. I see the same issue with check_restart not restarting the jobs correctly.
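For reference, a minimal sketch of the check_restart behavior mentioned above; the service name, port, and values are illustrative only, not the reporter's actual job:

service {
  name = "example"
  port = "db"

  check {
    type     = "tcp"
    interval = "10s"
    timeout  = "2s"

    check_restart {
      limit = 3     # restart the task after 3 consecutive failed checks
      grace = "30s" # wait 30s after task start before counting failures
    }
  }
}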
Reproduction steps
Job file (if appropriate)
Nomad Client logs (if appropriate)