Failed health checks leaves zombie processes #136

Open
azatoth opened this issue Jun 27, 2024 · 3 comments

azatoth commented Jun 27, 2024

When the health check fails, defunct (zombie) processes are left behind.

Expected Behavior

The parent process should reap exited health check processes so that they do not linger in the process table.

Current Behavior

Every time the health check fails, a zombie is left behind.

root@ip-10-0-0-39:/workspace# THC_PORT=8080 THC_PATH=/actuator/health /layers/paketo-buildpacks_health-checker/thc/bin/thc
Error:
request error: http://localhost:8080/actuator/health: Network Error: Network Error: Error encountered in the status line: timed out reading response
root@ip-10-0-0-39:/workspace# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
cnb          1     0  7 07:41 ?        00:01:04 java org.springframework.boot.loader.launch.JarLauncher
root        42     0  0 07:41 ?        00:00:00 /managed-agents/execute-command/amazon-ssm-agent
root       112    42  0 07:41 ?        00:00:00 /managed-agents/execute-command/ssm-agent-worker
cnb        185     1  0 07:42 ?        00:00:00 [thc] <defunct>
root       186   112  0 07:43 ?        00:00:01 /managed-agents/execute-command/ssm-session-worker ecs-execute-command-s44anltuwjhhfu3vtvoadatnyu
root       194   186  0 07:43 pts/0    00:00:00 sh
cnb        201     1  0 07:43 ?        00:00:00 [thc] <defunct>
root       203   194  0 07:43 pts/0    00:00:00 bash
cnb        214     1  0 07:43 ?        00:00:00 [thc] <defunct>
cnb        224     1  0 07:43 ?        00:00:00 [thc] <defunct>
cnb        232     1  0 07:44 ?        00:00:00 [thc] <defunct>
cnb        239     1  0 07:44 ?        00:00:00 [thc] <defunct>
cnb        248     1  0 07:44 ?        00:00:00 [thc] <defunct>
cnb        256     1  0 07:45 ?        00:00:00 [thc] <defunct>
cnb        265     1  0 07:45 ?        00:00:00 [thc] <defunct>
cnb        274     1  0 07:45 ?        00:00:00 [thc] <defunct>
cnb        281     1  0 07:46 ?        00:00:00 [thc] <defunct>
cnb        481     1  0 07:53 ?        00:00:00 [thc] <defunct>
cnb        491     1  0 07:53 ?        00:00:00 [thc] <defunct>
cnb        527     1  0 07:54 ?        00:00:00 [thc] <defunct>
cnb        538     1  0 07:55 ?        00:00:00 [thc] <defunct>
cnb        546     1  0 07:55 ?        00:00:00 [thc] <defunct>
cnb        560     1  0 07:55 ?        00:00:00 [thc] <defunct>
cnb        568     1  0 07:56 ?        00:00:00 /layers/paketo-buildpacks_health-checker/thc/bin/thc
root       569   203  0 07:56 pts/0    00:00:00 ps -ef

dmikusa commented Jun 27, 2024

The JVM is running as PID 1, and it's also the parent of the health check processes being run (I'm guessing just because it's PID 1). At the same time, I doubt the JVM is set up to handle the responsibilities of PID 1. PID 1 is special: it has to handle signal propagation and reap zombie processes.
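
To illustrate what "reaping" means here, the following is a minimal Linux-only sketch, not code from the JVM or from thc. It forks a child that exits immediately, observes that the child stays in the zombie state (`Z` in `/proc/<pid>/stat`, shown as `<defunct>` by `ps`) until the parent calls `waitpid`, and then checks that the entry is gone. The helper names `process_state` and `zombie_demo` are made up for this example.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only); 'Z' means zombie."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field follows the command name, which is wrapped in parentheses.
    return data[data.rindex(")") + 2]


def zombie_demo():
    """Fork a short-lived child and return (state before reaping, gone after reaping)."""
    pid = os.fork()
    if pid == 0:
        # Child: exit right away with a failure code, like a failed health check.
        os._exit(1)

    # Parent: poll until the child has exited, deliberately without calling wait().
    deadline = time.monotonic() + 5.0
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)

    # Reaping: collecting the exit status removes the entry from the process table.
    os.waitpid(pid, 0)
    gone = not os.path.exists(f"/proc/{pid}")
    return state, gone
```

Calling `zombie_demo()` on Linux should report the `Z` state before the `waitpid` call and no `/proc` entry afterwards. A parent that never makes that call, which is what appears to be happening with the JVM here, leaves every exited child in the `Z` state.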

The only option I know of would be to insert a process like tini that would handle the PID 1 responsibilities. That's something that would have to be coordinated with the Java buildpack, because that is what's setting the start command here, and it would need a way for other buildpacks to signal that it should include a process like tini. If it exposed an option like that, then we could make the health checker buildpack tell it to include tini.
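
As a rough sketch of what such a process does, and not tini's actual implementation, the following Python function starts the real command as a child, forwards `SIGTERM` and `SIGINT` to it, and reaps every child that exits until the main one is done. The names `run_as_init` and `exit_code_from_status` are hypothetical. Orphaned processes are only re-parented to this loop when it really runs as PID 1 (or as a subreaper), which is an assumption about how it would be deployed.

```python
import os
import signal


def exit_code_from_status(status):
    """Translate a waitpid() status into a shell-style exit code."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)


def run_as_init(argv):
    """Run argv as a child, forward signals to it, and reap all exited children."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; 127 mirrors the shell's "not found" code.
            os._exit(127)

    def forward(signum, _frame):
        # Pass the signal on to the main child; ignore it if the child is already gone.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass

    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        while True:
            # -1 waits for any child, including orphans re-parented to this process.
            pid, status = os.waitpid(-1, 0)
            if pid == child:
                return exit_code_from_status(status)
            # Any other pid was an exited child that has now been reaped.
    finally:
        # Restore the handlers that were in place before this function ran.
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

In a container this would wrap the real start command, in the same way `tini -- <command>` is typically used as an entrypoint, so that the JVM no longer runs as PID 1.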

Can you elaborate on the impact here, and can you share the Docker health check options you're using?

I was also thinking that if the health check fails, usually the container would be restarted. Just trying to understand the specifics of your setup. Thanks.

azatoth commented Jul 2, 2024

The impact is pretty small since, as you said, the container would usually be restarted. I noticed it during the initial grace period, which we have set to 10m, and as it didn't look right I thought it best to report it.

dmikusa commented Jul 5, 2024

Thanks for the report, much appreciated.

Given the impact seems to be low and the effort to resolve this would be high, I'm going to leave this as is for now. I will leave this issue open though. If others are having this issue and the impact is higher, please reach out.
