[bitnami/etcd] healthcheck.sh leaving behind etcdctl zombies on timeout #13989
Labels: etcd, solved, stale (15 days without activity), tech-issues (the user has a technical issue about an application), triage (triage is needed)
Name and Version
bitnami/etcd 8.5.8
What steps will reproduce the bug?
We run a kubeadm cluster on physical machines, using Calico as the networking layer. We recently had some Calico issues that caused networking interruptions between pods. When that happens, we end up with a bunch of etcdctl zombies.
My guess is that, because of the networking issues, the etcdctl command takes a long time, which causes healthcheck.sh to be killed when it times out. For reasons unclear to me, that does not kill the etcdctl command itself. Because its parent process is gone, etcdctl is then reparented to the etcd process running as PID 1, which does not reap zombies. I suspect this is very similar to the issue seen with redis: #5328
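The reaping behaviour suspected above can be reproduced in isolation. The following is a minimal, Linux-only sketch in Python and is not taken from the chart or from healthcheck.sh: a child that has exited stays in state `Z` until its parent calls `wait()`, which is the step etcd running as PID 1 is presumed not to perform for processes reparented to it. The helper names `proc_state`, `spawn_unreaped_child` and `wait_for_state` are hypothetical and exist only for this illustration.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the
    process no longer exists. Linux only."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split after the last ')' and take the next field.
    return data.rsplit(")", 1)[1].split()[0]


def spawn_unreaped_child():
    """Fork a child that exits immediately. The caller deliberately does not
    wait() on it, which stands in for a PID 1 that never reaps."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit at once without running cleanup handlers
    return pid


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) == wanted:
            return True
        time.sleep(0.01)
    return False


if __name__ == "__main__":
    child = spawn_unreaped_child()
    wait_for_state(child, "Z")
    print("before wait():", proc_state(child))
    os.waitpid(child, 0)  # reaping removes the zombie entry
    print("after wait(): ", proc_state(child))
```

If the diagnosis in this issue is right, each timed-out healthcheck leaves one such unreaped entry behind, because nothing in the container ever performs the `waitpid` step for the orphaned etcdctl.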
Based on the redis issue, I think that enabling shareProcessNamespace on the etcd pod would already fix this, but the chart currently doesn't allow setting it.
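For reference, this is roughly what the setting looks like at the pod level. It is a sketch of a plain Kubernetes pod spec, not a value the chart exposes, and the pod name is hypothetical. With `shareProcessNamespace: true`, the containers in the pod share one PID namespace and the pod's pause container runs as PID 1, so orphaned processes are reparented to it and reaped there instead of landing on etcd.

```yaml
# Illustrative pod spec excerpt; the name is hypothetical and the image tag is omitted.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-example
spec:
  shareProcessNamespace: true   # pause container becomes PID 1 and reaps orphans
  containers:
    - name: etcd
      image: bitnami/etcd
```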
Are you using any custom parameters or values?
We use a replicaCount of 3
What is the expected behavior?
No zombies
What do you see instead?
An etcdctl zombie for every timed-out healthcheck