Rogue Processes #6980
Hi @vincenthuynh! Thanks for reporting this! Nomad supervises … This looks like a bug. If you were to do the same to the …
Oh, also noting here that this is reproducible on any recent version of Nomad 0.10.x as well!
Wanted to follow up here with some info. We don't expect … We'd love some info about how the executor dies. Would you be able to dig into the Nomad logs to find any indication of how it's killed, e.g. whether it panicked vs. got killed by the OOM killer? The cause can help us come up with some targeted remediation. Can you also inspect where … This reminds me of #5434; I'll resurrect the PR and aim to include it to address this bug. Thanks for raising it.
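(Not from the thread itself, but since this question comes up repeatedly below, here is a rough sketch of the kind of checks being asked for, assuming a systemd-based Linux client with the agent running as a `nomad` unit; unit names, paths, and the example PID are placeholders to adjust for your setup.)

```bash
# Did the kernel OOM-kill anything recently? An OOM-killed executor or task
# usually leaves a trace in the kernel log.
dmesg -T | grep -iE 'out of memory|oom-kill'

# Agent-side evidence of the executor dying (the "nomad" unit name is an assumption):
journalctl -u nomad --since "2 days ago" | grep -iE 'executor|plugin exited|panic'

# For a suspected rogue process, check its parent and cgroup. A task process
# whose parent is PID 1 but whose cgroup still references a Nomad alloc/task
# is a likely orphan.
pid=12345   # hypothetical PID of the suspect process
ps -o pid,ppid,etime,cmd -p "$pid"
cat "/proc/$pid/cgroup"
```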
We went through and reaped any runaway processes this morning. I'll keep an eye on it again after the weekend and grab some logs if I find any.
I was able to find a runaway process on one of our other nodes:
I will dump the Nomad agent logs (at debug level!) and send them over.
Hi again. Below is a job that uses a static port. It unexpectedly terminates and is then unable to restart on the same client due to a port conflict. The original process is left running on the client despite the parent process being killed. Alloc ID:
Relevant client logs:
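(Not part of the original report: if you hit the same symptom, a quick way to confirm that an orphaned task is still holding the static port is something like the following; the port number is a placeholder.)

```bash
port=8080   # hypothetical static port from the job file

# Who is listening on the port, and who is its parent?
ss -ltnp "sport = :$port"

# Extract the listener's PID and inspect it; a PPID of 1 (instead of a
# nomad executor) suggests it was orphaned by a dead executor.
# If nothing is listening, $pid will be empty.
pid=$(ss -ltnpH "sport = :$port" | grep -oP 'pid=\K[0-9]+' | head -n 1)
ps -o pid,ppid,etime,cmd -p "$pid"
```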
Hi @tgross @notnoop, in the meantime we are going to stop using the Java driver until this is resolved, as the unregistered services are draining resources from our clients. Thanks!
Just had the same issue for an …
In case anyone is interested... we're alleviating it using this script until we get a proper fix:
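(The script itself didn't survive into this excerpt, so purely as a starting point, here is a minimal sketch of this kind of stop-gap reaper. Assumptions: a Linux client where task cgroup paths contain "nomad", and that you review the matches before actually killing anything; note that after an agent restart, legitimate executors can also show PID 1 as their parent.)

```bash
#!/usr/bin/env bash
# Sketch of a stop-gap reaper: list processes that were reparented to init
# but whose cgroup still references Nomad, and optionally SIGTERM them.
# Illustrative only; not the script referenced in the comment above.
set -euo pipefail

mode=${1:-report}   # pass "kill" to actually send SIGTERM

ps -eo pid=,ppid= | awk '$2 == 1 {print $1}' | while read -r pid; do
    cg="/proc/$pid/cgroup"
    [ -r "$cg" ] || continue
    # Task cgroups contain "nomad"; skip the agent's own nomad.service cgroup,
    # since the agent itself is also a child of PID 1 under systemd.
    if grep -q 'nomad' "$cg" && ! grep -q 'nomad\.service' "$cg"; then
        echo "possible rogue process:"
        ps -o pid,ppid,etime,cmd -p "$pid"
        if [ "$mode" = "kill" ]; then
            kill -TERM "$pid" || true
        fi
    fi
done
```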
So after a lot of fighting this issue and cleaning up after Nomad, here's some code that helps prevent the issue when you're using bash for your command and have trouble getting all its child processes to die:

```bash
cleanup() {
    # kill all processes whose parent is this process
    pkill -P $$
}

for sig in INT QUIT HUP TERM; do
  trap "
    cleanup
    trap - $sig EXIT
    kill -s $sig "'"$$"' "$sig"
done
trap cleanup EXIT
```

Credit: https://aweirdimagination.net/2020/06/28/kill-child-jobs-on-script-exit/

I still think that this should be handled better by Nomad; the process tree under the Nomad executor should hopefully be relatively easy to identify and force-kill.
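(For context on how a snippet like the one above gets used, per my reading of the linked post rather than anything prescribed in this thread: it goes at the top of the wrapper script that the task's `command` runs, and the real workload is started in the background so the wrapper stays in the foreground to receive the executor's signals. Roughly, with a hypothetical service binary:)

```bash
#!/usr/bin/env bash
# wrapper.sh -- what an exec/raw_exec task's command would invoke.
# The service binary below is a placeholder.

cleanup() {
    # kill all processes whose parent is this wrapper
    pkill -P $$
}

for sig in INT QUIT HUP TERM; do
  trap "
    cleanup
    trap - $sig EXIT
    kill -s $sig "'"$$"' "$sig"
done
trap cleanup EXIT

# Run the workload as a child of this script; when the wrapper receives a
# signal (or exits), the traps above take its child processes down with it.
/usr/local/bin/my-service &   # hypothetical binary
wait $!
```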
Think I've seen the same issue on Nomad 1.1.2 with an exec task (on CentOS 7.4). I'll try to keep an eye out and get some more information if it happens again.
We've had this happen for a couple of years; we're currently on Nomad 1.4.2. It happens fairly consistently with a few (like 5 out of 40) exec tasks that are using …

This only happens when our agent certificates get rotated by Ansible, which replaces the certificate files on disk (before the old ones become invalid) and restarts those Nomad agents that had their certificates changed. The restart is done one agent at a time using …

I have rarely had this happen when doing manual restarts, but for those few tasks it happens almost every time Ansible does the restart. Still, it is random and not 100% consistent. Here's a …
For now I've just cleaned up the current tasks and don't have timestamps on hand for the last time this happened (to check the logs), but if there's something I should check that would help with this, the next time it should happen is in 2-3 weeks. Running on RHEL 8.7.
The Nomad task for the aforementioned process that seems to get its executor killed most of the time is:

```hcl
task "zeoserver" {
  driver = "exec"

  resources {
    cpu        = 100
    memory     = 512
    memory_max = 768
  }

  volume_mount {
    volume      = "var"
    destination = "/var/lib/plone"
    read_only   = false
  }

  env {
    PLONE_ZEOSERVER_ADDRESS = "${NOMAD_ADDR_zeo}"
    PLONE_VAR               = "/var/lib/plone"
  }

  config {
    command = "/usr/local/bin/tini"
    args = [
      "--",
      "/usr/local/bin/sh",
      "-c",
      <<EOH
set -e
if [ ! -e /bin ]; then
  ln -s /usr/local/bin /bin
fi
exec plonectl-zeoserver
EOH
    ]
  }

  artifact {
    source      = "${local.repository}/${var.artifact}/${var.artifact}-${var.version}.tar.gz"
    destination = "/"
  }
}
```
Other runaway tasks that have been caught include Nginx, Redis, and some Java apps run in the same fashion.
@zemm give the … Ref: https://github.com/krallin/tini#process-group-killing
I just realized you are using …

This will ensure that the plonectl-zeoserver process stays in the same process group as your … Anyway, my 2c!
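(For anyone following along: the tini option the linked README section describes is `-g`, which makes tini signal its child's whole process group rather than only the direct child. In this job that would just mean adding `-g` before `--`; the resulting command line would look roughly like the sketch below, which is illustrative and untested against this job.)

```bash
# Same command as in the config block above, with tini's -g flag added
# (process-group killing).
/usr/local/bin/tini -g -- /usr/local/bin/sh -c '
  set -e
  if [ ! -e /bin ]; then
    ln -s /usr/local/bin /bin
  fi
  exec plonectl-zeoserver
'
```

Since the inline script `exec`s into plonectl-zeoserver, tini's direct child is already the server process; `-g` mainly helps if that process forks workers of its own within the same process group.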
I see in 1.4.2 that Docker orphans its containers. I am not sure if it is related to a restart of the Nomad client service (not the node itself), but we definitely see this when containers use static ports, and obviously the port is already in use.
Note for internal HashiCorp people: see NOMAD-146 for more info from a customer. For anybody following: we're likely picking this up next week.
Noting what seems to be another report in #19062
Further research
NOTE: there is a significant comment embedded in the struct (this is the `Pdeathsig` field of Go's `syscall.SysProcAttr`):

```go
// Pdeathsig, if non-zero, is a signal that the kernel will send to
// the child process when the creating thread dies. Note that the signal
// is sent on thread termination, which may happen before process termination.
// There are more details at https://go.dev/issue/27505.
```

However, this seems to be the correct way to watchdog the child work using the OS. There is prior art in containerd/go-runc, containerd/containerd (a test), and moby/moby.

os/exec

…

For libcontainer

There are further comments of note in libcontainer in the …, echoed later in ….

Unfortunately I ran out of time to work further on this issue during this period. Hopefully these additional notes might help the next person who picks up this issue.
Client: clean up old allocs before running new ones using the `exec` task driver (#20500)

Whenever the `exec` task driver is being used, Nomad runs a plugin that in turn runs the task in a container under the hood. If for any reason the executor is killed, the task is reparented to the init process and won't be stopped by Nomad in case of a job update or stop. This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre-start cleanup of the processes in the container, ensuring only the ones the executor runs are present at any given time.
Client: clean up old allocs before running new ones using the `exec` task driver (#20500) (#20584)

Co-authored-by: Juana De La Cuesta <[email protected]>
…ones using the `exec` task driver, into release/1.6.x (#20583)

* [gh-6980] Client: clean up old allocs before running new ones using the `exec` task driver. (#20500)
* fix: add the missing cgroups functions

Co-authored-by: Juana De La Cuesta <[email protected]>
Co-authored-by: Juanadelacuesta <[email protected]>
Nomad version
Nomad v0.9.6 (1f8eddf)
Operating system and Environment details
Debian 9.11
Issue
We are noticing that on some long-running clients there are rogue processes that have become disassociated from the Nomad process. We first discovered this when we had reports of inconsistency in our production application which could only be explained by an old version of an application running somewhere.
The rogue processes were all from Service jobs, using the Java driver.
Here is an example with a process tree from one of our Nomad clients:
VM uptime: 59+ days
Nomad agent uptime: since end of Dec 2019
The Nomad agent was not restarted or killed, which is what we thought could have explained the runaway processes.
The rogue processes (5 of them) are at the very bottom of the list:
Reproduction steps
N/A
Job file (if appropriate)
Nomad Client logs (if appropriate)
Nomad Server logs (if appropriate)