--health-on-failure=restart doesn't restart container? #17777
Comments
Thanks for reaching out, @GaryRevell! How do you determine whether the container got restarted? Running a simple example:
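The example referred to here was not preserved in this excerpt. As a minimal sketch only (container name and image are placeholders, not the original example), one way to tell whether a container was restarted is to watch its restart count and start time:

```bash
# Illustrative only: name and image are placeholders.
podman run -d --name hc-demo \
  --health-cmd false \
  --health-interval 10s \
  --health-retries 1 \
  --health-on-failure restart \
  quay.io/libpod/alpine sleep 1d

# After the health check has had time to fail, RestartCount should
# have increased and StartedAt should have moved forward if the
# container really was restarted.
sleep 30
podman inspect --format '{{.RestartCount}} {{.State.StartedAt}}' hc-demo
```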
Thanks, Valentin. I'll try this, but I'm having some problems (proxy?) running this example on my system(s). Once I get that sorted, I'll confirm all is OK.
Thanks, @GaryRevell! Looking forward to hearing back from you.
Hi @vrothberg, OK, so I've been doing some more testing on my Mac using a simple bash script and the output it generates. Podman version: 4.3.1. Here's the script:
And here's the output it generates:
So, I create two containers, one with each true/false health command, an interval of 15 seconds, on-failure set to restart, and retries=2. However, after 15 seconds the false_container becomes unhealthy as expected but doesn't restart as requested. Can you tell me what I'm doing wrong so that it does restart? I think we're mostly there; the Podman documentation is vague, to say the least, and there aren't a great number of working examples to crib from. I look forward to your comments. Thanks! Gary
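The script and its output were not preserved in this excerpt. A minimal reconstruction matching the description above (the image and container names are assumptions, not the original script) might look like this:

```bash
#!/usr/bin/env bash
# Reconstruction for illustration only, not the original script.
IMAGE=quay.io/libpod/alpine   # placeholder image

# Health check that always succeeds.
podman run --replace -d --name true_container \
  --health-cmd true \
  --health-interval 15s \
  --health-retries 2 \
  --health-on-failure restart \
  "$IMAGE" sleep 1d

# Health check that always fails; this container should be restarted.
podman run --replace -d --name false_container \
  --health-cmd false \
  --health-interval 15s \
  --health-retries 2 \
  --health-on-failure restart \
  "$IMAGE" sleep 1d

# Observe the health status and any restarts over a few intervals.
for _ in 1 2 3 4; do
  sleep 15
  podman ps -a --format '{{.Names}}: {{.Status}}'
done
```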
Thanks, @GaryRevell! I can reproduce the issue and will look into it.
I opened #17830 to fix the issue. @GaryRevell, since you are running Podman on RHEL, please open a Bugzilla in case you desire a backport to RHEL.
Bugzilla created. Thanks for your work on this, @vrothberg.
As described in containers#17777, the `restart` on-failure action did not behave correctly when the health check is run by a transient systemd unit. It ran just fine when executed outside such a unit, for instance manually or, as done in the system tests, in a scripted fashion.

There were two issues causing the `restart` on-failure action to misbehave:

1) The transient systemd units used the default `KillMode=cgroup`, which will nuke all processes in the specific cgroup, including the recently restarted container/conmon, once the main `podman healthcheck run` process exits.

2) Podman attempted to remove the transient systemd unit and timer during restart. That is perfectly fine when manually restarting the container, but not when the restart itself is being executed inside such a transient unit. Ultimately, Podman tried to shoot itself in the foot.

Fix both issues by moving the restart logic into the cleanup process. Instead of restarting the container, `healthcheck run` will just stop the container, and the cleanup process will restart the container once it has turned unhealthy.

Fixes: containers#17777
Signed-off-by: Valentin Rothberg <[email protected]>
The fix was backported in three follow-up commits, each a backport of commit 9563415 carrying the same description as above; between them they additionally reference:

Fixes: containers#17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180125
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180126
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108
Signed-off-by: Valentin Rothberg <[email protected]>
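To make point (1) above concrete: with systemd's default `KillMode=cgroup`, every process left in a unit's cgroup is killed when the unit stops, so anything the health-check unit spawned (such as a freshly restarted container/conmon) goes down with it. A standalone sketch, unrelated to Podman's actual unit names, showing the difference:

```bash
# Illustrative only; unit names are arbitrary.

# Default KillMode (cgroup): the background 'sleep' is killed as soon
# as the transient unit's main process exits.
systemd-run --unit=demo-killmode-cgroup --wait \
  /bin/sh -c 'sleep 300 & echo "main process exiting"'

# KillMode=process: only the main process is stopped, so the
# background 'sleep' outlives the unit.
systemd-run --unit=demo-killmode-process --wait -p KillMode=process \
  /bin/sh -c 'sleep 300 & echo "main process exiting"'

# Check whether the background sleep survived in each case.
pgrep -af 'sleep 300'
```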
Issue Description
This is an RFI and potentially a bug report.
I've been working on setting up health checks for our podman containers and have followed the instructions on this page:
https://www.redhat.com/sysadmin/podman-edge-healthcheck
It's mentioned that one of the `--health-on-failure=` options is `restart`, so I tried it rather than `kill`, which is given in the example.
However, it never appears to restart the container when the current one is marked unhealthy. Is this a bug, or am I not using the option correctly?
$ podman run --replace -d --name test-container --health-cmd /healthcheck --health-on-failure=restart --health-retries=1 health-check-action
When I use the `kill` option this works, as does `none`, from recollection. Some example commands are below:
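The example commands were not included in this excerpt. The following are illustrative reconstructions only, derived from the command above (same placeholder image and health-check path), not the original examples:

```bash
# Illustrative variants derived from the command above.

# With the kill action (this worked):
podman run --replace -d --name test-container \
  --health-cmd /healthcheck \
  --health-on-failure=kill \
  --health-retries=1 health-check-action

# With no on-failure action (the default):
podman run --replace -d --name test-container \
  --health-cmd /healthcheck \
  --health-on-failure=none \
  --health-retries=1 health-check-action
```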
Steps to reproduce the issue
Describe the results you received
The container was not restarted, even though I expected it to be.
Describe the results you expected
I expected the container to be restarted and in a healthy state.
podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
Yes
Additional environment details
Additional information
Happy to provide any extra information and screenshots needed.
I'm running these tests as root, as I was getting a podman build error when using my own account.