[v4.3.1-rhel] fix --health-on-failure=restart in transient unit #17864
As described in #17777, the `restart` on-failure action did not behave correctly when the health check was run by a transient systemd unit. It ran fine when executed outside such a unit, for instance manually or, as done in the system tests, in a scripted fashion.

There were two issues causing the `restart` on-failure action to misbehave:

1. The transient systemd units used the default `KillMode=cgroup`, which kills all processes in the unit's cgroup, including the recently restarted container/conmon, once the main `podman healthcheck run` process exits.
2. Podman attempted to remove the transient systemd unit and timer during restart. That is perfectly fine when manually restarting the container but not when the restart itself is executed inside such a transient unit. Ultimately, Podman tried to shoot itself in the foot.

Fix both issues by moving the restart logic into the cleanup process. Instead of restarting the container, `healthcheck run` will just stop the container, and the cleanup process will restart the container once it has turned unhealthy.

Backport of commit 9563415.
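The split of responsibilities described above can be illustrated with a minimal Go sketch. It is a simplified model, not Podman's actual code: the `Container` type and the `healthCheckRun` and `cleanup` functions are hypothetical names introduced for illustration only. It assumes the cleanup step runs outside the transient health-check unit, so a restart performed there is not affected by that unit's kill mode.

```go
package main

import "fmt"

// Container is a hypothetical, simplified stand-in for a container's state.
// It is not Podman's real data structure.
type Container struct {
	Name            string
	Running         bool
	Unhealthy       bool
	OnFailureAction string // e.g. "restart" or "none"
	Restarts        int
}

// healthCheckRun models the health check executed inside the transient unit.
// On failure with the "restart" action it only marks the container unhealthy
// and stops it; it never restarts the container itself.
func healthCheckRun(c *Container, checkPassed bool) {
	if checkPassed {
		c.Unhealthy = false
		return
	}
	c.Unhealthy = true
	if c.OnFailureAction == "restart" {
		c.Running = false // stop only; the restart is left to cleanup
	}
}

// cleanup models the cleanup step that runs after the container has stopped.
// It restarts the container only if it was stopped because it turned unhealthy.
func cleanup(c *Container) {
	if c.Running {
		return
	}
	if c.Unhealthy && c.OnFailureAction == "restart" {
		c.Unhealthy = false
		c.Running = true
		c.Restarts++
	}
}

func main() {
	c := &Container{Name: "web", Running: true, OnFailureAction: "restart"}

	healthCheckRun(c, false) // failing check: container is stopped, not restarted
	fmt.Printf("after health check: running=%v unhealthy=%v\n", c.Running, c.Unhealthy)

	cleanup(c) // cleanup performs the restart
	fmt.Printf("after cleanup: running=%v restarts=%d\n", c.Running, c.Restarts)
}
```

The point of the sketch is only the ordering: the process inside the transient unit never starts a new container, so nothing it leaves behind can be killed when that unit's main process exits.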
Fixes: #17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108
Does this PR introduce a user-facing change?
@TomSweeneyRedHat @Luap99 @mheon @rhatdan PTAL