zombies #276

Closed

vixns opened this issue Jul 22, 2016 · 11 comments

@vixns
Contributor

vixns commented Jul 22, 2016

I'm using the latest Docker image and the Mesos containerizer.

On each topology change, the old haproxy process becomes a zombie.

19517 ?        Ss     0:00  |       |   \_ sh -c /marathon-lb/run sse --marathon ****
19519 ?        S      0:00  |       |   |   \_ /bin/bash /marathon-lb/run sse --marathon ****
19523 ?        S      0:00  |       |   |       \_ /usr/bin/runsv /marathon-lb/service/haproxy
19527 ?        S      0:00  |       |   |       |   \_ /bin/bash ./run
24062 ?        S      0:00  |       |   |       |       \_ sleep 0.5
19524 ?        Sl     0:00  |       |   |       \_ python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/n
19590 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
23612 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
23658 ?        Ss     0:00  |       |   \_ haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 662
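
For context (illustrative only, not marathon-lb code): a <defunct> entry is a child that has already exited but whose parent has not yet called wait() on it. A minimal Python sketch that reproduces the state:

import os
import subprocess
import time

# Fork a child that exits immediately. The parent deliberately never calls
# os.wait(), so the child stays in state Z (<defunct>) until it is reaped.
pid = os.fork()
if pid == 0:
    os._exit(0)                      # child: exit right away

time.sleep(0.5)                      # give the child time to exit
subprocess.run(["ps", "-o", "pid,stat,comm", "-p", str(pid)])
# STAT shows "Z" here; reaping with waitpid() removes the <defunct> entry.
os.waitpid(pid, 0)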
@brndnmtthws
Contributor

Oh dear. I wonder if d94b5fc or e36e8db introduced this.

@brndnmtthws
Contributor

I checked all the MLBs in my soak cluster and I'm not seeing this:

root@ip-10-0-6-34:/marathon-lb# ps waux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  20256  3060 ?        Ss   15:46   0:00 /bin/bash /marathon-lb/run sse -m http://master.mesos:8080 --health-check --haproxy
root         8  0.0  0.0   4088   712 ?        S    15:46   0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root         9  0.1  0.1 142676 23500 ?        Sl   15:46   0:00 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /mar
root        10  0.0  0.0  20264  3068 ?        S    15:46   0:00 /bin/bash ./run
root       490  0.1  0.0  40556 11556 ?        Ss   15:46   0:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 252
root      1255  0.0  0.0  20332  3360 ?        Ss+  15:52   0:00 /bin/bash
root      1325  0.3  0.0  20332  3356 ?        Ss   15:52   0:00 /bin/bash
root      1343  0.0  0.0   4224   716 ?        S    15:52   0:00 sleep 0.5
root      1344  0.0  0.0  34492  2848 ?        R+   15:52   0:00 ps waux
root@ip-10-0-6-34:/marathon-lb# ps waux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  20256  3060 ?        Ss   15:46   0:00 /bin/bash /marathon-lb/run sse -m http://master.mesos:8080 --health-check --haproxy
root         8  0.0  0.0   4088   712 ?        S    15:46   0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root         9  0.1  0.1 142676 23496 ?        Sl   15:46   0:00 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /mar
root        10  0.1  0.0  20320  3128 ?        S    15:46   0:00 /bin/bash ./run
root      1255  0.0  0.0  20332  3360 ?        Ss+  15:52   0:00 /bin/bash
root      1325  0.0  0.0  20332  3356 ?        Ss   15:52   0:00 /bin/bash
root      1675  0.0  0.0  40504 10636 ?        Ss   15:53   0:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1519
root      1870  0.0  0.0   4224   704 ?        S    15:53   0:00 sleep 0.5
root      1871  0.0  0.0  34492  2824 ?        R+   15:53   0:00 ps waux
root@ip-10-0-6-34:/marathon-lb#

@vixns
Contributor Author

vixns commented Jul 22, 2016

The zombies seem related to mesos-executor, not marathon-lb. I also had a sleep show up as a zombie once while testing killing processes from the namespace.

Host view:

 7716 ?        Ssl    0:00  |       \_ mesos-executor --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox --user=root --working_directory=/marathon-lb --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491
 7813 ?        Ss     0:00  |       |   \_ sh -c /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050
 7823 ?        S      0:00  |       |   |   \_ /bin/bash /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --ssl-certs /certs --max-serv-port-ip-per-task 20050
 7827 ?        S      0:00  |       |   |       \_ /usr/bin/runsv /marathon-lb/service/haproxy
 7829 ?        S      0:00  |       |   |       |   \_ /bin/bash ./run
 8879 ?        S      0:00  |       |   |       |       \_ sleep 0.5
 7828 ?        Sl     0:00  |       |   |       \_ python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --max-serv-port-ip-per-task 20050
 7906 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
 8628 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
 8722 ?        Ss     0:00  |       |   \_ haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 144 52

From the namespace:

    1 ?        Ssl    0:01 mesos-executor --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox --user=root --working_directory=/marathon-lb --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491
   19 ?        Ss     0:00 sh -c /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050
   20 ?        S      0:00  \_ /bin/bash /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --ssl-certs /certs --max-serv-port-ip-per-task 20050
   22 ?        S      0:00      \_ /usr/bin/runsv /marathon-lb/service/haproxy
   24 ?        S      0:00      |   \_ /bin/bash ./run
 1140 ?        S      0:00      |       \_ sleep 0.5
   23 ?        Sl     0:00      \_ python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --max-serv-port-ip-per-task 20050
   52 ?        Zs     0:00 [haproxy] <defunct>
  144 ?        Zs     0:00 [haproxy] <defunct>
  181 ?        Ss     0:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 144 52

pidof haproxy also returns the zombies. Does it matter to have multiple PIDs, including defunct ones, as haproxy -sf arguments?
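
For illustration only (hypothetical, not what marathon-lb actually does): one way to keep defunct PIDs out of the -sf list would be to drop any process whose state in /proc/<pid>/stat is Z before building the command:

import subprocess

def live_haproxy_pids():
    """Return haproxy PIDs whose state in /proc/<pid>/stat is not Z (zombie)."""
    try:
        pids = subprocess.check_output(["pidof", "haproxy"]).decode().split()
    except subprocess.CalledProcessError:
        return []                    # pidof found no haproxy processes at all
    live = []
    for pid in pids:
        try:
            with open("/proc/%s/stat" % pid) as f:
                # third field of /proc/<pid>/stat is the process state; the
                # simple split works here because "haproxy" contains no spaces
                state = f.read().split()[2]
        except OSError:
            continue                 # process went away in the meantime
        if state != "Z":
            live.append(pid)
    return live

# e.g. the soft-reload command would then only list live PIDs:
print(["haproxy", "-p", "/tmp/haproxy.pid", "-f", "/marathon-lb/haproxy.cfg",
       "-D", "-sf"] + live_haproxy_pids())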

@brndnmtthws
Contributor

That's quite strange. What version of Mesos?

I'm still not seeing the same thing:

ip-10-0-6-34 ~ # ps waux | grep haproxy
root     17189  0.0  0.0  20256  3060 ?        Ss   15:46   0:00 /bin/bash /marathon-lb/run sse -m http://master.mesos:8080 --health-check --haproxy-map --group external
root     17196  0.0  0.0   4088   712 ?        S    15:46   0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root     17197  0.0  0.1 144812 23736 ?        Sl   15:46   0:01 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg --ssl-certs /etc/ssl/cert.pem --command sv reload /marathon-lb/service/haproxy --sse -m http://master.mesos:8080 --health-check --haproxy-map --group external
root     19815  0.1  0.0  40560 11764 ?        Ss   15:53   0:09 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1519
root     22953  0.0  0.0   4404   696 pts/0    S+   18:10   0:00 grep --colour=auto haproxy
ip-10-0-6-34 ~ #

@vixns
Contributor Author

vixns commented Jul 22, 2016

Mesos is compiled from git master (1.1.0) with ../configure --enable-ssl --enable-libevent --prefix=/usr --enable-optimize --enable-silent-rules --enable-xfs-disk-isolator

I just recompiled and tested a few minutes ago; same bug.

Mesos isolators: namespaces/pid,cgroups/cpu,cgroups/mem,filesystem/linux,docker/runtime,network/cni,docker/volume

CNI: simple loopback + bridge

ip a from the container namespace:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if515: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 10.xx.xx.xx/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f8e2:xxxx:xxxx:xxxx/64 scope link
       valid_lft forever preferred_lft forever

Marathon configuration:

{
  "id": "/internal/proxy/external",
  "cmd": "/marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050",
  "cpus": 0.01,
  "mem": 128,
  "disk": 0,
  "instances": 2,
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "/certs",
        "hostPath": "/config/haproxy/certs",
        "mode": "RO"
      },
      {
        "containerPath": "/marathon-lb/templates",
        "hostPath": "/config/haproxy/templates",
        "mode": "RO"
      }
    ],
    "docker": {
      "image": "mesosphere/marathon-lb:latest",
      "forcePullImage": true
    }
  },
  "env": {
    "PORTS": "9090"
  },
  "healthChecks": [
    {
      "path": "/_haproxy_health_check",
      "protocol": "HTTP",
      "gracePeriodSeconds": 10,
      "intervalSeconds": 10,
      "timeoutSeconds": 2,
      "maxConsecutiveFailures": 3,
      "ignoreHttp1xx": false,
      "port": 9090
    }
  ],
  "portDefinitions": [],
  "ipAddress": {
    "groups": [],
    "labels": {},
    "discovery": {
      "ports": [
        {
          "number": 9090,
          "name": "admin",
          "protocol": "tcp",
          "labels": {}
        }
      ]
    },
    "networkName": "vlan"
  }
}

@brndnmtthws
Contributor

I think it's worth filing an issue over at https://issues.apache.org/jira/secure/Dashboard.jspa. I suspect this is related to Mesos, rather than MLB specifically.
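
Background on why this points at Mesos: with the namespaces/pid isolator, mesos-executor is PID 1 inside the task's PID namespace, and orphaned children are re-parented to it; if PID 1 never wait()s for them, they stay <defunct>. A minimal sketch (Python, purely illustrative, not actual Mesos code) of the reaping loop a namespace init is expected to run:

import os
import signal

def reap(signum, frame):
    # Reap every exited child that has been re-parented to us; without this,
    # PID 1 in the namespace accumulates <defunct> entries.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return                   # no children left at all
        if pid == 0:
            return                   # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap)
# ... PID 1 would then launch and supervise the actual task command ...
signal.pause()                       # wait for signals instead of busy-looping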

@vixns
Contributor Author

vixns commented Jul 23, 2016

@brndnmtthws
Contributor

I'm going to close this for now, as I suspect it's a core Mesos issue.

brndnmtthws added commits that referenced this issue on Sep 26 and Sep 29, 2016:
This is to address issues #5, #71, #267, #276, and #318.
@robsonpeixoto
Contributor

How long does it take to remove the old processes?
I have a process that has been running for more than 10 minutes and "should" be dead.

root@mesos-lb-1:/marathon-lb# pgrep -a haproxy
82 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf
346 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 82
383 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 346
855 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 816
1301 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1256
root@mesos-lb-1:/marathon-lb# ps aux  |grep 82
root         82  2.5  0.6  45116 11972 ?        Ss   11:53   0:24 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf
root        346  2.6  0.7  44832 13812 ?        Ss   11:55   0:23 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 82
root       2264  0.0  0.0  11100   708 ?        S+   12:10   0:00 grep 82
root@mesos-lb-1:/marathon-lb# date
Tue Oct  4 12:10:11 UTC 2016

@vixns
Contributor Author

vixns commented Oct 4, 2016

With -sf, the old haproxy only exits when all of its connections are closed; it does not terminate open connections.
If you have server keepalive or long-lived TCP services, old processes will keep running for as long as needed.
See #318 and #321
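
To confirm that an old -sf process is just draining rather than leaking, you can check whether it still holds open sockets. A small sketch, assuming Linux /proc and root access; PID 82 is simply taken from the pgrep output above:

import os

def open_sockets(pid):
    """Count socket file descriptors still held by a process (Linux /proc)."""
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                count += 1
        except OSError:
            continue                 # fd closed while we were looking
    return count

# e.g. the old haproxy at PID 82 from the pgrep output above:
print(open_sockets(82))
# The old process exits on its own once its remaining connections close.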

@robsonpeixoto
Contributor

Thanks @vixns
