zombies #276

Closed

vixns opened this issue Jul 22, 2016 · 11 comments

@vixns
Contributor

vixns commented Jul 22, 2016

I'm using the latest Docker image and the Mesos containerizer.

On each topology change, the old haproxy process becomes a zombie.

19517 ?        Ss     0:00  |       |   \_ sh -c /marathon-lb/run sse --marathon ****
19519 ?        S      0:00  |       |   |   \_ /bin/bash /marathon-lb/run sse --marathon ****
19523 ?        S      0:00  |       |   |       \_ /usr/bin/runsv /marathon-lb/service/haproxy
19527 ?        S      0:00  |       |   |       |   \_ /bin/bash ./run
24062 ?        S      0:00  |       |   |       |       \_ sleep 0.5
19524 ?        Sl     0:00  |       |   |       \_ python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/n
19590 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
23612 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
23658 ?        Ss     0:00  |       |   \_ haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 662
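
For context (illustrative only, not marathon-lb code): a <defunct> entry is a child that has already exited but whose parent has not yet called wait() on it. A minimal Python sketch that reproduces the state:

import os
import subprocess
import time

# Fork a child that exits immediately. The parent deliberately never calls
# os.wait(), so the child stays in state Z (<defunct>) until it is reaped.
pid = os.fork()
if pid == 0:
    os._exit(0)                      # child: exit right away

time.sleep(0.5)                      # give the child time to exit
subprocess.run(["ps", "-o", "pid,stat,comm", "-p", str(pid)])
# STAT shows "Z" here; reaping with waitpid() removes the <defunct> entry.
os.waitpid(pid, 0)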
@brndnmtthws
Contributor

Oh dear. I wonder if d94b5fc or e36e8db introduced this.

@brndnmtthws
Contributor

I checked all the MLBs in my soak cluster and I'm not seeing this:

root@ip-10-0-6-34:/marathon-lb# ps waux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  20256  3060 ?        Ss   15:46   0:00 /bin/bash /marathon-lb/run sse -m http://master.mesos:8080 --health-check --haproxy
root         8  0.0  0.0   4088   712 ?        S    15:46   0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root         9  0.1  0.1 142676 23500 ?        Sl   15:46   0:00 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /mar
root        10  0.0  0.0  20264  3068 ?        S    15:46   0:00 /bin/bash ./run
root       490  0.1  0.0  40556 11556 ?        Ss   15:46   0:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 252
root      1255  0.0  0.0  20332  3360 ?        Ss+  15:52   0:00 /bin/bash
root      1325  0.3  0.0  20332  3356 ?        Ss   15:52   0:00 /bin/bash
root      1343  0.0  0.0   4224   716 ?        S    15:52   0:00 sleep 0.5
root      1344  0.0  0.0  34492  2848 ?        R+   15:52   0:00 ps waux
root@ip-10-0-6-34:/marathon-lb# ps waux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  20256  3060 ?        Ss   15:46   0:00 /bin/bash /marathon-lb/run sse -m http://master.mesos:8080 --health-check --haproxy
root         8  0.0  0.0   4088   712 ?        S    15:46   0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root         9  0.1  0.1 142676 23496 ?        Sl   15:46   0:00 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /mar
root        10  0.1  0.0  20320  3128 ?        S    15:46   0:00 /bin/bash ./run
root      1255  0.0  0.0  20332  3360 ?        Ss+  15:52   0:00 /bin/bash
root      1325  0.0  0.0  20332  3356 ?        Ss   15:52   0:00 /bin/bash
root      1675  0.0  0.0  40504 10636 ?        Ss   15:53   0:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1519
root      1870  0.0  0.0   4224   704 ?        S    15:53   0:00 sleep 0.5
root      1871  0.0  0.0  34492  2824 ?        R+   15:53   0:00 ps waux
root@ip-10-0-6-34:/marathon-lb#

@vixns
Contributor Author

vixns commented Jul 22, 2016

The zombies seem related to mesos-executor, not marathon-lb. I also had a sleep show up as a zombie once while testing killing processes from the namespace.

Host view:

 7716 ?        Ssl    0:00  |       \_ mesos-executor --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox --user=root --working_directory=/marathon-lb --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491
 7813 ?        Ss     0:00  |       |   \_ sh -c /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050
 7823 ?        S      0:00  |       |   |   \_ /bin/bash /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --ssl-certs /certs --max-serv-port-ip-per-task 20050
 7827 ?        S      0:00  |       |   |       \_ /usr/bin/runsv /marathon-lb/service/haproxy
 7829 ?        S      0:00  |       |   |       |   \_ /bin/bash ./run
 8879 ?        S      0:00  |       |   |       |       \_ sleep 0.5
 7828 ?        Sl     0:00  |       |   |       \_ python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --max-serv-port-ip-per-task 20050
 7906 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
 8628 ?        Zs     0:00  |       |   \_ [haproxy] <defunct>
 8722 ?        Ss     0:00  |       |   \_ haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 144 52

From the namespace:

    1 ?        Ssl    0:01 mesos-executor --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox --user=root --working_directory=/marathon-lb --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491
   19 ?        Ss     0:00 sh -c /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050
   20 ?        S      0:00  \_ /bin/bash /marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --ssl-certs /certs --max-serv-port-ip-per-task 20050
   22 ?        S      0:00      \_ /usr/bin/runsv /marathon-lb/service/haproxy
   24 ?        S      0:00      |   \_ /bin/bash ./run
 1140 ?        S      0:00      |       \_ sleep 0.5
   23 ?        Sl     0:00      \_ python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 --auth-credentials user:pass --group external --max-serv-port-ip-per-task 20050
   52 ?        Zs     0:00 [haproxy] <defunct>
  144 ?        Zs     0:00 [haproxy] <defunct>
  181 ?        Ss     0:00 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 144 52

pidof haproxy also returns the zombies. Does it matter to have multiple PIDs, including defunct ones, as haproxy -sf arguments?
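
For illustration only (hypothetical, not what marathon-lb actually does): one way to keep defunct PIDs out of the -sf list would be to drop any process whose state in /proc/<pid>/stat is Z before building the command:

import subprocess

def live_haproxy_pids():
    """Return haproxy PIDs whose state in /proc/<pid>/stat is not Z (zombie)."""
    try:
        pids = subprocess.check_output(["pidof", "haproxy"]).decode().split()
    except subprocess.CalledProcessError:
        return []                    # pidof found no haproxy processes at all
    live = []
    for pid in pids:
        try:
            with open("/proc/%s/stat" % pid) as f:
                # third field of /proc/<pid>/stat is the process state; the
                # simple split works here because "haproxy" contains no spaces
                state = f.read().split()[2]
        except OSError:
            continue                 # process went away in the meantime
        if state != "Z":
            live.append(pid)
    return live

# e.g. the soft-reload command would then only list live PIDs:
print(["haproxy", "-p", "/tmp/haproxy.pid", "-f", "/marathon-lb/haproxy.cfg",
       "-D", "-sf"] + live_haproxy_pids())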

@brndnmtthws
Contributor

That's quite strange. What version of Mesos?

I'm still not seeing the same thing:

ip-10-0-6-34 ~ # ps waux | grep haproxy
root     17189  0.0  0.0  20256  3060 ?        Ss   15:46   0:00 /bin/bash /marathon-lb/run sse -m http://master.mesos:8080 --health-check --haproxy-map --group external
root     17196  0.0  0.0   4088   712 ?        S    15:46   0:00 /usr/bin/runsv /marathon-lb/service/haproxy
root     17197  0.0  0.1 144812 23736 ?        Sl   15:46   0:01 python3 /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config /marathon-lb/haproxy.cfg --ssl-certs /etc/ssl/cert.pem --command sv reload /marathon-lb/service/haproxy --sse -m http://master.mesos:8080 --health-check --haproxy-map --group external
root     19815  0.1  0.0  40560 11764 ?        Ss   15:53   0:09 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1519
root     22953  0.0  0.0   4404   696 pts/0    S+   18:10   0:00 grep --colour=auto haproxy
ip-10-0-6-34 ~ #

@vixns
Contributor Author

vixns commented Jul 22, 2016

Mesos is compiled from git master (1.1.0) with ../configure --enable-ssl --enable-libevent --prefix=/usr --enable-optimize --enable-silent-rules --enable-xfs-disk-isolator

I just recompiled and tested a few minutes ago; same bug.

Mesos isolators: namespaces/pid,cgroups/cpu,cgroups/mem,filesystem/linux,docker/runtime,network/cni,docker/volume

CNI: simple loopback + bridge

ip a from the container namespace:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if515: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 10.xx.xx.xx/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f8e2:xxxx:xxxx:xxxx/64 scope link
       valid_lft forever preferred_lft forever

Marathon configuration:

{
  "id": "/internal/proxy/external",
  "cmd": "/marathon-lb/run sse --marathon https://marathon:8443 --auth-credentials user:pass --group 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050",
  "cpus": 0.01,
  "mem": 128,
  "disk": 0,
  "instances": 2,
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "/certs",
        "hostPath": "/config/haproxy/certs",
        "mode": "RO"
      },
      {
        "containerPath": "/marathon-lb/templates",
        "hostPath": "/config/haproxy/templates",
        "mode": "RO"
      }
    ],
    "docker": {
      "image": "mesosphere/marathon-lb:latest",
      "forcePullImage": true
    }
  },
  "env": {
    "PORTS": "9090"
  },
  "healthChecks": [
    {
      "path": "/_haproxy_health_check",
      "protocol": "HTTP",
      "gracePeriodSeconds": 10,
      "intervalSeconds": 10,
      "timeoutSeconds": 2,
      "maxConsecutiveFailures": 3,
      "ignoreHttp1xx": false,
      "port": 9090
    }
  ],
  "portDefinitions": [],
  "ipAddress": {
    "groups": [],
    "labels": {},
    "discovery": {
      "ports": [
        {
          "number": 9090,
          "name": "admin",
          "protocol": "tcp",
          "labels": {}
        }
      ]
    },
    "networkName": "vlan"
  }
}

@brndnmtthws
Contributor

I think it's worth filing an issue over at https://issues.apache.org/jira/secure/Dashboard.jspa. I suspect this is related to Mesos, rather than MLB specifically.
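
Background on why this points at Mesos: with the namespaces/pid isolator, mesos-executor is PID 1 inside the task's PID namespace, and orphaned children are re-parented to it; if PID 1 never wait()s for them, they stay <defunct>. A minimal sketch (Python, purely illustrative, not actual Mesos code) of the reaping loop a namespace init is expected to run:

import os
import signal

def reap(signum, frame):
    # Reap every exited child that has been re-parented to us; without this,
    # PID 1 in the namespace accumulates <defunct> entries.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return                   # no children left at all
        if pid == 0:
            return                   # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap)
# ... PID 1 would then launch and supervise the actual task command ...
signal.pause()                       # wait for signals instead of busy-looping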

@vixns
Contributor Author

vixns commented Jul 23, 2016

@brndnmtthws
Contributor

I'm going to close this for now, as I suspect it's a core Mesos issue.

brndnmtthws added commits that referenced this issue on Sep 26 and Sep 29, 2016:
This is to address issues #5, #71, #267, #276, and #318.
@robsonpeixoto
Contributor

How long does it take to remove the old processes?
I have a process that has been running for more than 10 minutes and "should" be dead.

root@mesos-lb-1:/marathon-lb# pgrep -a haproxy
82 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf
346 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 82
383 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 346
855 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 816
1301 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 1256
root@mesos-lb-1:/marathon-lb# ps aux  |grep 82
root         82  2.5  0.6  45116 11972 ?        Ss   11:53   0:24 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf
root        346  2.6  0.7  44832 13812 ?        Ss   11:55   0:23 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 82
root       2264  0.0  0.0  11100   708 ?        S+   12:10   0:00 grep 82
root@mesos-lb-1:/marathon-lb# date
Tue Oct  4 12:10:11 UTC 2016

@vixns
Contributor Author

vixns commented Oct 4, 2016

With -sf, the old haproxy only exits when all of its connections are closed; it does not terminate open connections.
If you have server keepalive or long-lived TCP services, old processes will keep running for as long as needed.
See #318 and #321
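
To confirm that an old -sf process is just draining rather than leaking, you can check whether it still holds open sockets. A small sketch, assuming Linux /proc and root access; PID 82 is simply taken from the pgrep output above:

import os

def open_sockets(pid):
    """Count socket file descriptors still held by a process (Linux /proc)."""
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                count += 1
        except OSError:
            continue                 # fd closed while we were looking
    return count

# e.g. the old haproxy at PID 82 from the pgrep output above:
print(open_sockets(82))
# The old process exits on its own once its remaining connections close.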

@robsonpeixoto
Contributor

Thanks @vixns
