Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rogue Processes #6980

Closed
vincenthuynh opened this issue Jan 24, 2020 · 20 comments · Fixed by #20500
Closed

Rogue Processes #6980

vincenthuynh opened this issue Jan 24, 2020 · 20 comments · Fixed by #20500
Assignees
Labels
hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/driver/exec theme/driver/java type/bug

Comments

@vincenthuynh
Copy link

Nomad version

Nomad v0.9.6 (1f8eddf)

Operating system and Environment details

Debian 9.11

Issue

We are noticing on some long running clients, there were rogue processes which have been disassociated from the Nomad process. We first discovered this when we had reports of inconsistency in our production application which could only be explained by an old version of an application running somewhere.

The rogue processes were all from Service jobs, using the Java driver.

Here is an example with a process tree from one of our Nomad clients:
VM uptime: 59+ days
Nomad agent uptime: Since end of Dec 2019

The nomad agent was not restarted/killed - which we thought could explain the run-away processes.

The rogue processes (5 of them) are at the very bottom of the list:

root       726  2.2  0.1 2324544 102664 ?      Ssl   2019 1884:38 /usr/bin/dockerd -H fd:// -H tcp://127.0.0.1:2376
root       834  0.6  0.0 979068 17808 ?        Ssl   2019 586:06  \_ docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/contain
root     30890  0.2  0.0 356032  3852 ?        Sl   Jan11  43:35  |   \_ docker-containerd-shim a35d82c4ad8e3f68f06afc4cbaa293041c2491661c51b70ed0d6e53e4985e11a /var/run/docker/libcontainerd/a35d82c4ad8e3f68f06afc4cbaa293041c2491661c51b70e
consul   30909  0.0  0.0   4332   716 ?        Ss   Jan11   0:00  |   |   \_ /bin/sh -c /entrypoint.sh
consul   30934  0.0  0.0  21772  3172 ?        S    Jan11   0:00  |   |       \_ /bin/bash -x /entrypoint.sh
consul   30981  3.6  6.8 10644212 3671764 ?    Sl   Jan11 642:56  |   |           \_ /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -server -Xms4g -Xmx4g -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshol
root      1195  0.0  0.0 413372  3820 ?        Sl   Jan11   0:41  |   \_ docker-containerd-shim 82877e0bd7ecee15d090a1d8f7e7ab255756c4403c2b51591d1612ce0a191a1d /var/run/docker/libcontainerd/82877e0bd7ecee15d090a1d8f7e7ab255756c4403c2b5159
root      1213  0.9  0.1 858964 67304 ?        Ssl  Jan11 174:22  |   |   \_ filebeat -e
root      1389  0.0  0.0 415292  3896 ?        Sl   Jan14   0:08  |   \_ docker-containerd-shim 194ff7254722d84d6330b5228695b16788504fb10719ee35ae3f8dd428f0181f /var/run/docker/libcontainerd/194ff7254722d84d6330b5228695b16788504fb10719ee35
mnguyen   1407  0.0  0.0  18052  2964 ?        Ss   Jan14   0:00  |       \_ /bin/bash /opt/bin/entry_point.sh
mnguyen   1452  0.0  0.0   4504  1640 ?        S    Jan14   0:00  |           \_ /bin/sh /usr/bin/xvfb-run -n 99 --server-args=-screen 0 1360x1020x24 -ac +extension RANDR java -jar /opt/selenium/selenium-server-standalone.jar -role node -h
mnguyen   1463  0.0  0.0 250068 42996 ?        Sl   Jan14   1:07  |               \_ Xvfb :99 -screen 0 1360x1020x24 -ac +extension RANDR -nolisten tcp -auth /tmp/xvfb-run.IvXjC8/Xauthority
mnguyen   1478  0.1  0.7 19186364 399060 ?     Sl   Jan14  24:29  |               \_ java -jar /opt/selenium/selenium-server-standalone.jar -role node -hub http://selenium-hub.service.consul:4444/grid/register -nodeConfig /opt/selenium/con
mnguyen   3122  0.0  0.0 106920  7020 ?        Sl   Jan17   0:00  |                   \_ /opt/selenium/chromedriver-2.29 --port=23631
root     30871  0.0  0.0 118340  3560 ?        Sl   Jan11   0:00  \_ /usr/bin/docker-proxy -proto tcp -host-ip 10.201.20.39 -host-port 8983 -container-ip 172.17.0.2 -container-port 8983
root     30883  0.0  0.0 190408  3392 ?        Sl   Jan11   0:00  \_ /usr/bin/docker-proxy -proto udp -host-ip 10.201.20.39 -host-port 8983 -container-ip 172.17.0.2 -container-port 8983
root      1369  0.0  0.0 108480  3536 ?        Sl   Jan14   0:00  \_ /usr/bin/docker-proxy -proto tcp -host-ip 10.201.20.39 -host-port 25597 -container-ip 172.17.0.3 -container-port 25597
root      1380  0.0  0.0 190408  3388 ?        Sl   Jan14   0:00  \_ /usr/bin/docker-proxy -proto udp -host-ip 10.201.20.39 -host-port 25597 -container-ip 172.17.0.3 -container-port 25597
root       732  0.0  0.0 118640 15560 ?        Ssl   2019  10:12 /usr/bin/google_osconfig_agent
ntp        787  0.0  0.0  97896  3684 ?        Ssl   2019   5:15 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -c /run/ntp.conf.dhcp -u 106:110
consul    1030  0.8  0.1 182772 61556 ?        Ssl   2019 729:04 /usr/local/bin/consul agent -config-dir=/etc/consul
root     12377  2.7  0.1 3922380 97740 ?       Ssl   2019 1674:46 /usr/local/bin/nomad agent -config=/etc/nomad/config.hcl
root      9011  0.3  0.0 665396 17472 ?        Sl   Jan11  63:06  \_ /usr/local/bin/nomad logmon
root      9025  0.2  0.0 1100776 27644 ?       Ssl  Jan11  42:36  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/959fd52e-5658-24af-13e6-f39229e54b00/statsite/executor.out","LogLevel":"debug","FSIsolation":true}
nobody    9041  0.1  0.0 207192  2508 ?        Ss   Jan11  19:45  |   \_ /local/statsite -f local/statsite.conf
root     10058  0.3  0.0 664884 17276 ?        Sl   Jan11  63:50  \_ /usr/local/bin/nomad logmon
root     10085  0.2  0.0 1362920 27676 ?       Ssl  Jan11  43:28  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/406d6e65-8241-b879-8b9e-0dc18f06970b/cloud-config/executor.out","LogLevel":"debug","FSIsolation":true}
nobody   10102  0.7  1.4 4705208 755628 ?      Ssl  Jan11 124:37  |   \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx512m -server -jar local/cloud-config-server-1.0.0-2a64a95.jar
root     30366  0.4  0.0 1141212 18164 ?       Sl   Jan11  78:23  \_ /usr/local/bin/nomad logmon
root     30938  0.1  0.0 581804 18548 ?        Sl   Jan11  21:03  \_ /usr/local/bin/nomad docker_logger
root      1129  0.3  0.0 729268 17688 ?        Sl   Jan11  55:08  \_ /usr/local/bin/nomad logmon
root      1241  0.0  0.0 507816 18136 ?        Sl   Jan11   0:10  \_ /usr/local/bin/nomad docker_logger
root     12209  0.3  0.0 598452 18128 ?        Sl   Jan11  62:31  \_ /usr/local/bin/nomad logmon
root     12443  0.2  0.0 1305580 28372 ?       Ssl  Jan11  44:02  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/e2eacc16-de4b-416d-95c6-b4af7644672d/cobwebs-operations/executor.out","LogLevel":"debug","FSIsolation":true
nobody   12460  0.3  1.9 10270904 1021288 ?    Ssl  Jan11  57:15  |   \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx4096m -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/cobwebs-operations-gateway-7.0.31
root       569  0.3  0.0 665140 18620 ?        Sl   Jan14  44:25  \_ /usr/local/bin/nomad logmon
root      1435  0.0  0.0 509736 18656 ?        Sl   Jan14   0:04  \_ /usr/local/bin/nomad docker_logger
root     15625  0.4  0.0 655792 17268 ?        Sl   Jan21  12:45  \_ /usr/local/bin/nomad logmon
root     15638  0.2  0.0 1231848 28036 ?       Ssl  Jan21   7:56  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/f64e8b49-df3d-bad2-fb82-a11bfb549f56/gcsproxy/executor.out","LogLevel":"debug","FSIsolation":true}
nobody   15653  0.0  0.0 481244 14596 ?        Ssl  Jan21   0:11  |   \_ /local/gcsproxy -b 127.0.0.1:38080
root      2300  0.4  0.0 663732 18188 ?        Sl   Jan22   7:53  \_ /usr/local/bin/nomad logmon
root     28687  0.2  0.0 1051088 27224 ?       Ssl  16:51   0:01  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/921c59b3-d46a-6091-699f-6e36e3a7981d/report-web-9-avatica/executor.out","LogLevel":"debug","FSIsolation":tr
nobody   28703 21.0  1.3 9999776 716224 ?      Ssl  16:51   1:54      \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx4g -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -jar local/report-web-9.2.1-440a00da2-b150.jar
nobody    4928  0.4  4.5 10318440 2417572 ?    Ssl   2019 209:13 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx4096m -server -Dnetworkaddress.cache.ttl=30 -Dnewrelic.config.max_stack_trace_lines=200 -javaagent:local/newrelic/newrelic.jar -jar loc
root     24087  0.0  0.0 776188  4524 ?        Ssl   2019   7:54 /usr/sbin/collectd
root     25192  0.1  0.1 1093904 56120 ?       Ssl   2019  78:59 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /etc/stackdriver/collectd.conf -P /var/run/stackdriver-agent.pid
nobody    7514  0.9  2.3 10324996 1259880 ?    Ssl   2019 492:31 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx4096m -server -Dnetworkaddress.cache.ttl=30 -Dnewrelic.config.max_stack_trace_lines=200 -javaagent:local/newrelic/newrelic.jar -jar loc
nobody   28793  3.2  1.8 8024688 982448 ?      Ssl  Jan11 580:25 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.76.0.jar
nobody    6505  0.0  1.3 6328908 698560 ?      Ssl  Jan11  15:39 /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Dnetworkaddress.cache.ttl=30 -Xmx1024m -jar local/search-service-gateway-1.75.0.jar
nobody   18169  0.2  1.3 7307912 706768 ?      Ssl  Jan17  19:14 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.76.0.jar
nobody    6299  0.2  1.2 7307912 686040 ?      Ssl  Jan11  18:58 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.75.0.jar

Reproduction steps

N/A

Job file (if appropriate)

Nomad Client logs (if appropriate)

Nomad Server logs (if appropriate)

@tgross
Copy link
Member

tgross commented Jan 24, 2020

Hi @vincenthuynh! Thanks for reporting this!

Nomad supervises exec and java with an executor process (you can see this in the lines for /usr/local/bin/nomad executor). It looks like if this executor process dies unexpectedly, its child process -- the task workload -- is re-parented to PID1 and not reaped. You can reproduce this pretty easily by running kill -9 against the executor.

This looks like a bug. If you were to do the same to the container-shim process running for a Docker task, you'd see that containerd remains running and becomes the parent of a new container-shim process when Nomad reconciles the state of the world and restarts the allocation. I'll mark this as a bug and make sure it gets into our schedule.

@tgross
Copy link
Member

tgross commented Jan 24, 2020

Oh, also noting here that this is reproducible on any recent version of Nomad 0.10.x as well!

@notnoop
Copy link
Contributor

notnoop commented Jan 24, 2020

Wanted to follow up here with some info. We don't expect executor process to die unexpectedly in normal operations. And if executor dies, I'd expect the java process stdout|stderr to close and for the process to receive a SIGPIPE signal on the next write and die (unless they have custom signal handler).

Would love some info about how executor dies? Would you be able to dig into nomad logs to find any indication of how it's killed e.g. it panicked vs got killed by OOM? The cause can help us have some targeted remediation. Can you inspect where stdout is pointing to (e.g. sudo ls -lha /proc/18169/fd/)? Does it not emit any logs to stdout|stderr?

This reminds me of #5434 - I'll resurrect the PR and aim to include it to address this bug. Thanks for raising it.

@vincenthuynh
Copy link
Author

We went through and reaped any run-away processes this morning. I'll keep an eye on it again after the weekend and grab some logs if I find any.

@tgross tgross self-assigned this Jan 27, 2020
@vincenthuynh
Copy link
Author

I was able to find a run-away process on one of our other nodes:

$ps
...
nobody   30815  0.1  1.0 6982276 554908 ?      Ssl  Jan11  45:51 /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx1024m -jar local/email-gateway-1.28.0.jar

$ls -lha /proc/30815/fd/
lrwx------ 1 nobody nogroup 64 Jan 11 11:01 0 -> /null
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 1 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stdout.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 10 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 11 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 12 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 13 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 14 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 15 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunec.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 16 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunjce_provider.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 17 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/resources.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 18 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/charsets.jar (deleted)
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 2 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stderr.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 3 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/rt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 4 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/jfxrt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 5 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 6 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 7 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jce.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 8 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jsse.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 9 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 91 -> /local/email-gateway-1.28.0.jar (deleted)

I will dump the nomad agent logs (in debug level!) and send it over.

@vincenthuynh
Copy link
Author

Hi again,
We have some more information that might help with debugging the orphaned processes:

Below is a job that uses a static port. It unexpectedly terminates and then is unable to restart on the same client due to a port conflict. The original process is left running on the client despite the parent process being killed.

Alloc id: 5b3d3abb

Recent Events:
Time                       Type            Description
2020-02-27T11:01:46-05:00  Killing         Sent interrupt. Waiting 5s before force killing
2020-02-27T11:01:46-05:00  Not Restarting  Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2020-02-27T11:01:46-05:00  Terminated      Exit Code: 1
2020-02-27T11:01:25-05:00  Started         Task started by client
2020-02-27T11:00:59-05:00  Restarting      Task restarting in 18.599244521s
2020-02-27T11:00:59-05:00  Terminated      Exit Code: 1
2020-02-27T11:00:30-05:00  Started         Task started by client
2020-02-27T11:00:12-05:00  Restarting      Task restarting in 18.212077249s
2020-02-27T11:00:12-05:00  Terminated      Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"
2020-02-23T08:06:08-05:00  Started         Task started by client

Relevant client logs:

Feb 27 11:00:11 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:11.398-0500 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=319.024µs
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.160-0500 [WARN]  client.alloc_runner.task_runner.task_hook.script_checks: tasklet timed out: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app timeout=3s
Feb 27 11:00:12 nmd-z8g1 consul[1070]:     2020/02/27 11:00:12 [INFO] agent: Synced check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: panic: send on closed channel: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: : alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: goroutine 11935 [running]:: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).handleExecWait(0xc00014c620, 0xc000468ba0, 0xc00010c640): alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:549 +0x11c: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: created by github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).Exec: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:470 +0x2ae: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.635-0500 [ERROR] client.driver_mgr.java: error receiving stream from Stats executor RPC, closing stream: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app error="rpc error: code = Unavailable desc = transport is closing"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.635-0500 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /tmp/plugin732811969: connect: connection refused""
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.635-0500 [DEBUG] client.driver_mgr.java.executor: plugin process exited: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app path=/usr/local/bin/nomad pid=25671 error="exit status 2"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.637-0500 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app reason="Restart within policy" delay=18.212077249s
Feb 27 11:00:12 nmd-z8g1 consul[1070]:     2020/02/27 11:00:12 [INFO] agent: Deregistered service "_nomad-task-5b3d3abb-4175-c539-7a6d-07d6c2713d3e-timespent-app-timespent-actuator-http"
Feb 27 11:00:12 nmd-z8g1 consul[1070]:     2020/02/27 11:00:12 [INFO] agent: Deregistered check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"

@tgross tgross removed their assignment Mar 5, 2020
@vincenthuynh
Copy link
Author

Hi @tgross @notnoop
Could someone shed some light on what may be causing this error with the Java driver? We tend to see the rouge processes happen on a client when an allocation has this event message.

In the meantime, we are going to stop using the Java driver until this is resolved as the unregistered services are draining the resources from our clients.

Thanks!

2020-02-27T11:00:12-05:00  Terminated      Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"

@danlsgiga
Copy link
Contributor

Just had the same issue for an exec task running on Nomad v0.12.5

@manveru
Copy link
Contributor

manveru commented Jan 15, 2021

In case anyone is interested... alleviating it using this script until we get a proper fix:

#!/usr/bin/env bash
set -exuo pipefail

while true; do
  for pid in $(ps -eo 'ppid= uid= pid=' | egrep '^\s*1 65534' | awk '{ print $3 }'); do
    echo "killing $pid"
    ps "$pid"
    kill "$pid"
    sleep 30
    kill -9 "$pid" || true
  done

  sleep 60
done

@tgross tgross removed this from the unscheduled milestone Feb 12, 2021
@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. and removed stage/needs-investigation labels Mar 4, 2021
@manveru
Copy link
Contributor

manveru commented Mar 12, 2021

So after a lot of fighting this issue and cleaning up after Nomad, here's some code that helps prevent the issue when you're using bash for your command and have trouble getting all its child processes to die:

cleanup() {
    # kill all processes whose parent is this process
    pkill -P $$
}

for sig in INT QUIT HUP TERM; do
  trap "
    cleanup
    trap - $sig EXIT
    kill -s $sig "'"$$"' "$sig"
done
trap cleanup EXIT

credit: https://aweirdimagination.net/2020/06/28/kill-child-jobs-on-script-exit/

I still think that this should be handled better by Nomad, the process tree under the nomad executor should hopefully be relatively easy to identify and force kill.

@martinmcnulty
Copy link

Think I've seen the same issue on Nomad 1.1.2 with an exec task (on CentOS 7.4). I'll try to keep an eye out and get some more information if it happens again.

@louievandyke louievandyke added the hcc/cst Admin - internal label Jun 29, 2022
@zemm
Copy link

zemm commented Jan 23, 2023

We've had this happen for couple of years, currently on Nomad 1.4.2.

It happens fairly consistently with few (like 5 out of 40) exec-tasks that are using bash-script or tini running bash-script as command. tini as an entrypoint was a failed attempt to mitigate this problem before understanding it any better, and even though it did not help, it was left in use. At least two of these tasks are always the same.

This only happens when our agent certificates get rotated by Ansible, which replaces the certificate files on the disk (before the old got invalid), and restarts those Nomad agents that had their certificates changed. The restart is done one agent at a time using service=name state=restarted, which I believe does systemctl restart. Nomad agents are running as SystemD services using mostly the SystemD service unit from the Nomad Deployment Guide.

I have rarely had this happen to me when doing manual restarts, but for those few tasks it happens on almost every time Ansible does the restart. But it still is random, and not 100% consistent.

Here's a pstree -p showing that the executor has died from 4199:

           ├─nomad(3762493)─┬─nomad(505783)─┬─{nomad}(505784)
           │                │               ├─{nomad}(505785)
           │                │               ├─{nomad}(505786)
...
           │                ├─nomad(505910)─┬─tini(505936)───.graphql-oidc-a(505943)
           │                │               ├─{nomad}(505911)
           │                │               ├─{nomad}(505912)
...
           │                ├─nomad(505956)─┬─tini(505969)───python(505975)
           │                │               ├─{nomad}(505957)
           │                │               ├─{nomad}(505958)
...
           |-tini(4058)---python(4199)
...

For now I've just cleaned the current tasks, and don't have timestamps on hand for the last time this happened (to check the logs), but if there's something I should check that would help with this, the next time this should happen is in 2-3 weeks.

Running on RHEL 8.7.

@zemm
Copy link

zemm commented Jan 23, 2023

The Nomad task for the forementioned process that most of the time seems to get it's executor killed is:

    task "zeoserver" {
      driver = "exec"
      resources {
        cpu = 100
        memory = 512
        memory_max = 768
      }
      volume_mount {
        volume = "var"
        destination = "/var/lib/plone"
        read_only = false
      }
      env {
        PLONE_ZEOSERVER_ADDRESS = "${NOMAD_ADDR_zeo}"
        PLONE_VAR = "/var/lib/plone"
      }
      config {
        command = "/usr/local/bin/tini"
        args = [
          "--",
          "/usr/local/bin/sh",
          "-c",
          <<EOH
set -e

if [ ! -e /bin ]; then
  ln -s /usr/local/bin /bin
fi

exec plonectl-zeoserver
EOH
        ]
      }
      artifact {
        source = "${local.repository}/${var.artifact}/${var.artifact}-${var.version}.tar.gz"
        destination = "/"
      }
    }

The /bin linking is because we are using extremely stripped down chroot on Nomad, since all of our artifacts are created with Nix, and are mostly self-contained. As said earlier, the run-away problem also happened before tini when we used /usr/local/bin/sh as command.

plonectl-zeoserver at the end is in itself a wrapper bash script running the actual exec /nix/store/.../bin/python -m plonectl.cli ... at the end of itself.

Other run-away tasks that have been caught includes Nginx, Redis and some Java apps run in the same fashion.

@danlsgiga
Copy link
Contributor

@zemm give the -g flag in tini a try. This will propagate any signals to all processes in the same process group. This will potentially prevent these rogue processes from being orphaned and become zombies.

Ref: https://github.com/krallin/tini#process-group-killing

@danlsgiga
Copy link
Contributor

I just realized you are using exec, which has a different behaviour compared to running the process in the background with &. Try adding the -g flag to tini and change exec plonectl-zeoserver to

plonectl-zeoserver &

wait

This will ensure that the plonectl-zeoserver process stays in the same process group as your sh so when tini propagates the Nomad signal, it is properly handled!

Anyway, my .2c!

@rismoney
Copy link

rismoney commented Jan 31, 2023

I see in 1.4.2 that docker orphans its containers. I am not sure if it is related to a restart of the nomad service client (not the node itself), but we definitely see this when containers are static ports, and obviously the port is already in use.

@mikenomitch
Copy link
Contributor

mikenomitch commented Oct 27, 2023

Note for internal HashiCorp people, see NOMAD-146 in for more info from a customer.

For anybody following: we're likely picking this up next week.

@lgfa29
Copy link
Contributor

lgfa29 commented Nov 22, 2023

Noting what seems to be another report in #19062

@angrycub angrycub self-assigned this Nov 22, 2023
@angrycub
Copy link
Contributor

angrycub commented Jan 5, 2024

Further research

  • I was able to verify that when the executor process is killed the workload process continues running and is reparented to systemd.

  • This issue does not occur on -dev nodes because they have special logic to completely kill workloads started by the node before exiting.

    • This behavior is intentionally not triggered when Nomad is running in production mode in order to enable Nomad agent upgrades without disrupting the running workload.
  • There appears to be configuration options that can be passed to both golang's os.Exec and libcontainer's Run calls.

The syscall.SysProcAttr struct contains a Pdeathsig Signal field that requests the OS to signal any children spawned by the process when the parent dies and provides the Signal to send to them.

NOTE: There is a significant comment embedded in the struct

// Pdeathsig, if non-zero, is a signal that the kernel will send to  
// the child process when the creating thread dies. Note that the signal.
// is sent on thread termination, which may happen before process termination.
// There are more details at https://go.dev/issue/27505.

However, this seems to be the correct way to watchdog the child work using the OS. There is prior art in containerd/go-runc, containerd/containerd (a test), & moby/moby

os/exec

For os/exec.Start, the provided Cmd a SysProcAttr field that takes a pointer to a syscall.SysProcAttr struct.

libcontainer

The libcontainer/configs.Config struct contains a ParentDeathSignal field that performs the same function. It appears that (l *LibcontainerExecutor) newLibcontainerConfig is where we build the Config that is ultimately run.

There are further comments of note in libcontainer in the newParentProcess function:

https://github.com/opencontainers/runc/blob/ab146f23357d867f3ed9a15c4e15507f78eba3b2/libcontainer/container_linux.go#L512-L514

echoed later in

https://github.com/opencontainers/runc/blob/ab146f23357d867f3ed9a15c4e15507f78eba3b2/libcontainer/container_linux.go#L521-L525

Unfortunately I ran out of time to work further on this issue during this period. Hopefully these additional notes might help the next person who picks up this issue.

@angrycub angrycub removed their assignment Jan 5, 2024
@tgross tgross assigned tgross and unassigned tgross Jan 10, 2024
@Juanadelacuesta Juanadelacuesta self-assigned this Mar 27, 2024
Juanadelacuesta added a commit that referenced this issue May 14, 2024
…he `exec` task driver. (#20500)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
Juanadelacuesta added a commit that referenced this issue May 14, 2024
…he `exec` task driver. (#20500)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
Juanadelacuesta added a commit that referenced this issue May 14, 2024
…he `exec` task driver. (#20500)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
Juanadelacuesta added a commit that referenced this issue May 15, 2024
…he `exec` task driver. (#20500) (#20584)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.

Co-authored-by: Juana De La Cuesta <[email protected]>
Juanadelacuesta added a commit that referenced this issue May 15, 2024
…ones using the `exec` task driver. into release/1.6.x (#20583)

* [gh-6980] Client: clean up old allocs before running new ones  using the `exec` task driver. (#20500)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.

* fix: add the missing cgroups functions

---------

Co-authored-by: Juana De La Cuesta <[email protected]>
Co-authored-by: Juanadelacuesta <[email protected]>
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/driver/exec theme/driver/java type/bug
Projects
None yet