[bitnami/etcd] healthcheck.sh leaving behind etcdctl zombies on timeout #13989

Closed
RobinGeuze opened this issue Dec 16, 2022 · 3 comments · Fixed by #14018
Labels
etcd, solved, stale (15 days without activity), tech-issues (The user has a technical issue about an application), triage (Triage is needed)

Comments

@RobinGeuze
Contributor

RobinGeuze commented Dec 16, 2022

Name and Version

bitnami/etcd 8.5.8

What steps will reproduce the bug?

We run a kubeadm cluster on physical machines, using Calico as the networking layer. We've recently had some Calico issues that caused networking interruptions between pods. Whenever that happens, we end up with a bunch of etcdctl zombies.

My guess is that, because of the networking issues, the etcdctl command takes a long time, which causes healthcheck.sh to time out and get killed. For reasons unclear to me, killing the script doesn't kill the etcdctl command it started. With its parent gone, etcdctl is reparented to the etcd process running as PID 1, which doesn't reap its children, so etcdctl stays behind as a zombie once it exits. I suspect this is very similar to this issue seen with redis: #5328
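
To illustrate the mechanism I'm guessing at, here is a minimal Linux-only sketch (it reads `/proc`). It is not the chart's code: the forked child only stands in for a finished etcdctl, the script itself plays the parent that never calls `wait()`, and all function names are made up for the example.

```python
import os
import time
from typing import Tuple


def proc_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Format is "pid (comm) state ..."; comm may itself contain ")" or spaces.
    return data.rsplit(")", 1)[1].split()[0]


def make_unreaped_child() -> int:
    """Fork a child that exits at once; the parent deliberately does not wait()."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: stands in for an etcdctl that has finished
    return pid


def wait_for_state(pid: int, wanted: str, timeout: float = 5.0) -> str:
    """Poll until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != wanted and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    return state


def demo() -> Tuple[str, bool]:
    pid = make_unreaped_child()
    # The child has exited, but nobody collected its status: state "Z" (zombie).
    state_before = wait_for_state(pid, "Z")
    # Collecting the exit status is what a zombie-reaping parent would do.
    os.waitpid(pid, 0)
    still_listed = os.path.exists(f"/proc/{pid}")
    return state_before, still_listed


if __name__ == "__main__":
    print(demo())
```

The entry only disappears once some parent calls `wait()`. If the parent is a PID 1 that never does, the zombie stays for the lifetime of the container.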

Based on the Redis issue, I think enabling shareProcessNamespace on the etcd pod would already fix this, but the chart currently doesn't allow setting it.
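
For context on why that setting should help: with shareProcessNamespace enabled, the pod's pause container becomes PID 1 instead of etcd, and it reaps orphaned children. The sketch below shows roughly what such a reaping loop looks like. It is only an illustration under that assumption, not the pause container's actual implementation, and the helper names are hypothetical.

```python
import os
import time


def reap_exited_children() -> list:
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def spawn_short_lived(n: int) -> set:
    """Fork n children that exit immediately (stand-ins for orphaned etcdctl)."""
    pids = set()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        pids.add(pid)
    return pids


def demo(n: int = 3, timeout: float = 5.0):
    expected = spawn_short_lived(n)
    reaped = set()
    deadline = time.monotonic() + timeout
    # An init-style process would run this on SIGCHLD; here we simply poll.
    while not expected.issubset(reaped) and time.monotonic() < deadline:
        reaped.update(reap_exited_children())
        time.sleep(0.01)
    return expected, reaped


if __name__ == "__main__":
    expected, reaped = demo()
    print(f"reaped {len(reaped & expected)} of {len(expected)} children")
```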

Are you using any custom parameters or values?

We use a replicaCount of 3

What is the expected behavior?

No zombies

What do you see instead?

An etcdctl zombie for every timed out healthcheck

@RobinGeuze RobinGeuze added the tech-issues The user has a technical issue about an application label Dec 16, 2022
@github-actions github-actions bot added the triage Triage is needed label Dec 16, 2022
@carrodher carrodher added the etcd label Dec 19, 2022
@carrodher
Member

This seems to be a very specific use case that is difficult to reproduce on our side and closely tied to your scenario.

For information regarding the application itself, customization of the content within the application, or questions about the use of the technology or infrastructure, we highly recommend checking the forums and user guides made available by the project behind the application or the technology.

That said, we will keep this ticket open until the stale bot closes it just in case someone from the community adds some valuable info.

If you think there is something fixable at the Helm chart level and you would like to contribute by creating a PR to solve the issue, the Bitnami team will be happy to review it and provide feedback. Here you can find the contributing guidelines.

@RobinGeuze
Contributor Author

Hey @carrodher, I've created a pull request for a potential solution here: #14018

@github-actions

github-actions bot commented Jan 4, 2023

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
