-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix --health-on-failure=restart in transient unit #17830
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vrothberg The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kill Mode=None is deprecated: systemd/systemd#15928
We should not use it for new units IMO. As I see the problem is that we start the container from within the healthcheck command, wouldn't it make more sense to just stop it and then let the podman cleanup process start it again. i.e. like the regular restart policy.
I am not that concerned about it.
That is an interesting idea! I will think about it over the weekend but also want others to chime in and see what they think. @mheon @giuseppe WDYT? |
I see no reason not to let the cleanup process handle this, it works quite well for restart policy. |
I moved the restart logic into the cleanup process and adjusted the tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks but you forgot to update the commit message to reflect the current change
libpod/healthcheck.go
Outdated
if err != nil { | ||
return false, err | ||
} | ||
return healthCheck.Status == "unhealthy", nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a const that you can use for this string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea! Yes, there's define.HealthCheckUnhealthy
. Will update.
"Fix both issues by moving the restart logic in the cleanup process. Please let me know what you think is missing in the commit message. |
You still have |
As described in containers#17777, the `restart` on-failure action did not behave correctly when the health check is being run by a transient systemd unit. It ran just fine when being executed outside such a unit, for instance, manually or, as done in the system tests, in a scripted fashion. There were two issue causing the `restart` on-failure action to misbehave: 1) The transient systemd units used the default `KillMode=cgroup` which will nuke all processes in the specific cgroup including the recently restarted container/conmon once the main `podman healthcheck run` process exits. 2) Podman attempted to remove the transient systemd unit and timer during restart. That is perfectly fine when manually restarting the container but not when the restart itself is being executed inside such a transient unit. Ultimately, Podman tried to shoot itself in the foot. Fix both issues by moving the restart logic in the cleanup process. Instead of restarting the container, the `healthcheck run` will just stop the container and the cleanup process will restart the container once it has turned unhealthy. Fixes: containers#17777 Signed-off-by: Valentin Rothberg <[email protected]>
Thanks for catching! I may need my afternoon coffee :^) |
@containers/podman-maintainers PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/lgtm |
As described in #17777, the
restart
on-failure action did not behave correctly when the health check is being run by a transient systemd unit. It ran just fine when being executed outside such a unit, for instance, manually or, as done in the system tests, in a scripted fashion.There were two issue causing the
restart
on-failure action to misbehave:The transient systemd units used the default
KillMode=cgroup
whichwill nuke all processes in the specific cgroup including the recently
restarted container/conmon once the main
podman healthcheck run
process exits. Setting the kill mode to none addresses this problem.
Podman attempted to remove the transient systemd unit and timer
during restart. That is perfectly fine when manually restarting the
container but not when the restart itself is being executed inside
such a transient unit. Ultimately, Podman tried to shoot itself in
the foot.
Fix both issues by moving the restart logic in the cleanup process.
Instead of restarting the container, the
healthcheck run
will juststop the container and the cleanup process will restart the container
once it has turned unhealthy.
Fixes: #17777
Does this PR introduce a user-facing change?