
Zombie processes on 1688.5.3 #2410

Closed
louismunro opened this issue Apr 18, 2018 · 2 comments

louismunro commented Apr 18, 2018

Issue Report

Zombie processes (sshd) increasing until no new process can be created.

Bug

I have a container running the alpine-sshd image (OpenSSH 7.5), used to transfer files that are then written to an EFS volume. Every time a connection to its sshd is opened and then closed, a zombie sshd process is left behind.
I have reproduced the problem using another Alpine-based image running OpenSSH 7.7, as well as a CentOS-based image running OpenSSH 7.4.

Furthermore, this problem does not exist for the same images on previous CoreOS versions, as tested with 1520.8.0 (ami-a89d3ad2 in us-east1).

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1688.5.3
VERSION_ID=1688.5.3
BUILD_ID=2018-04-03-0547
PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

AWS EC2 (ami-3f061b45 in us-east1)
Kubernetes 1.7.10
nodeInfo:
architecture: amd64
containerRuntimeVersion: docker://17.12.1-ce
kernelVersion: 4.14.32-coreos
kubeProxyVersion: v1.7.10
kubeletVersion: v1.7.10
operatingSystem: linux
osImage: Container Linux by CoreOS 1688.5.3 (Rhyolite)

Expected Behavior

sshd should reap its children and the zombie count should stay at or near 0.

Actual Behavior

Defunct sshd processes accumulate in the container, one for every closed session.

Reproduction Steps

  1. kubectl run testzombies --image sickp/alpine-sshd (or run it in docker directly; feel free to build your own image, I tested that too).
  2. Connect to the sshd daemon (even an incorrect login will trigger this).
  3. Inside the container, check the status of the sshd processes. There will be one zombie for each closed connection (see the sketch below).

Killing the container will reap the zombies created by sshd.
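
Roughly, the reproduction condenses to the following commands (a sketch; it assumes kubectl run labels the pod with run=testzombies, which is its default behaviour, and that /proc is readable inside the container):

$ kubectl run testzombies --image sickp/alpine-sshd
$ POD=$(kubectl get pods -l run=testzombies -o jsonpath='{.items[0].metadata.name}')
$ kubectl port-forward "$POD" 2222:22 &
$ for i in $(seq 1 5); do ssh -p 2222 -o BatchMode=yes -o StrictHostKeyChecking=no root@localhost; done
$ # count processes in state Z inside the container; the count grows by one per closed session
$ kubectl exec "$POD" -- sh -c 'grep -l "^State:.*Z (zombie)" /proc/[0-9]*/status | wc -l'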

Other Information

As mentioned above, this does not happen for the same container image running on CoreOS 1520.8.0.
I can't tell if the issue is with docker or CoreOS, but since the parent sshd process presumably wait()s correctly for its children, I believe the kernel must be broken.

euank commented Apr 18, 2018

This is actually a kinda weird Kubernetes interaction, not a change in docker or the kernel's behaviour (I think).

On both AMIs referenced, the following will produce zombies with that container:

$ docker run --name pause -d gcr.io/google_containers/pause-amd64:3.0
$ docker run --pid=container:pause --rm --publish=2222:22 -it sickp/alpine-sshd:7.5-r2

$ for i in $(seq 1 10); do ssh -p 2222 -o BatchMode=yes -o StrictHostKeyChecking=no root@localhost; done
$ ps aux | grep sshd 
# many defunct processes

What I suspect you're observing here is that the Kubernetes shared-pid feature turns itself on when it detects a Docker version >= 1.13.1, so on the newer AMI Kubernetes launches pods in a different way (similar to the above).
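
To check whether a given container on the node joined another container's PID namespace (a sketch; the container ID is a placeholder):

$ # a value of the form "container:<id>" means the container shares the
$ # pause container's PID namespace; an empty value means it has its own
$ docker inspect --format '{{.HostConfig.PidMode}}' <sshd-container-id>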

This can be most easily worked around in one of the following ways:

  1. Update the pause container to gcr.io/google_containers/pause-amd64:3.1 to get this code change; this can be configured with the --pod-infra-container-image flag (see the sketch after this list).
  2. Update to k8s 1.10, where the above pause container was made the default.
  3. Pass --docker-disable-shared-pid to the kubelet to opt out of the different behaviour.
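
For options 1 and 3 the relevant flag is added to the kubelet's existing command line, roughly as follows (a sketch; how kubelet arguments are managed, e.g. via a systemd drop-in or a wrapper script, depends on your deployment):

# option 1: point the kubelet at the newer pause image, which reaps orphaned children
--pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.1

# option 3: opt out of the shared PID namespace behaviour entirely
--docker-disable-shared-pid

After restarting the kubelet, existing pods will likely need to be recreated before they pick up the new pause container.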

louismunro (Author) commented

Looks like you are right. Upgrading the pause container fixes this.
I had always assumed (incorrectly) that the pause container reaped defunct processes.
I hadn't realized that this was not the case prior to version 3.1.

Thank you for your help.
