[v4.3.1-rhel] fix --health-on-failure=restart in transient unit #17864

vrothberg · 2023-03-21T08:31:33Z

As described in #17777, the restart on-failure action did not behave correctly when the health check was run by a transient systemd unit. It worked fine when executed outside such a unit, for instance, manually or, as done in the system tests, in a scripted fashion.

There were two issues causing the restart on-failure action to misbehave:

1. The transient systemd units used the default KillMode=cgroup which
   will nuke all processes in the specific cgroup including the recently
   restarted container/conmon once the main podman healthcheck run
   process exits.
2. Podman attempted to remove the transient systemd unit and timer
   during restart. That is perfectly fine when manually restarting the
   container but not when the restart itself is being executed inside
   such a transient unit. Ultimately, Podman tried to shoot itself in
   the foot.
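
The first issue can be illustrated with a toy model. This is only a sketch, not systemd or Podman code: the `Unit` class, the `survivors` helper, and the process names are hypothetical. It assumes the kill semantics documented by systemd, where the default mode (spelled `control-group` in the systemd documentation) kills every process remaining in the unit's cgroup once the unit stops, while `process` kills only the main process.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Toy stand-in for a transient systemd unit (hypothetical, not a real API)."""
    kill_mode: str                          # "control-group" or "process"
    main: str                               # name of the unit's main process
    cgroup: set = field(default_factory=set)

    def spawn(self, name: str) -> None:
        # Processes started from inside the unit land in the unit's cgroup.
        self.cgroup.add(name)

    def main_exits(self) -> set:
        """Return the processes that survive once the main process exits."""
        self.cgroup.discard(self.main)
        if self.kill_mode == "control-group":
            # Every process still in the cgroup is killed with the unit.
            self.cgroup.clear()
        return set(self.cgroup)


def survivors(kill_mode: str) -> set:
    unit = Unit(kill_mode=kill_mode, main="podman healthcheck run")
    unit.spawn(unit.main)
    # A restart performed from inside the unit starts conmon in the same cgroup.
    unit.spawn("conmon (restarted container)")
    return unit.main_exits()
```

In this model, `survivors("control-group")` is empty: the freshly restarted container is killed together with the health check process, which matches the behavior reported in #17777.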

Fix both issues by moving the restart logic into the cleanup process. Instead of restarting the container itself, the healthcheck run now just stops the container once it has turned unhealthy, and the cleanup process then restarts it.
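
The split of responsibilities described above could be sketched roughly as follows. This is a simplified illustration and not the actual Podman implementation: the `Container` class, its fields, and the functions `healthcheck_run` and `cleanup` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    """Hypothetical container state; field names are illustrative only."""
    on_failure: str = "restart"     # value passed to --health-on-failure
    health: str = "healthy"
    running: bool = True
    events: List[str] = field(default_factory=list)


def healthcheck_run(ctr: Container, check_passed: bool) -> None:
    """Runs inside the transient unit: records the result and, at most, stops."""
    if check_passed:
        ctr.health = "healthy"
        return
    ctr.health = "unhealthy"
    if ctr.on_failure == "restart":
        # No restart here: the unit's cgroup would take the new container down.
        ctr.running = False
        ctr.events.append("stop")


def cleanup(ctr: Container) -> None:
    """Runs after the container has stopped, outside the transient unit."""
    if not ctr.running and ctr.health == "unhealthy" and ctr.on_failure == "restart":
        ctr.running = True
        ctr.health = "starting"     # assumption: health state resets on restart
        ctr.events.append("restart")
```

Because the restart happens in the cleanup step, it no longer runs inside the unit that is about to be torn down, so neither the kill mode nor the removal of the unit and timer can interfere with it.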

Backport of commit 9563415.

Fixes: #17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108

Does this PR introduce a user-facing change?

Fix a bug where --health-on-failure=restart did not restart the container when its health state turned unhealthy.

@TomSweeneyRedHat @Luap99 @mheon @rhatdan PTAL

As described in containers#17777, the `restart` on-failure action did not behave correctly when the health check was run by a transient systemd unit. It worked fine when executed outside such a unit, for instance, manually or, as done in the system tests, in a scripted fashion.

There were two issues causing the `restart` on-failure action to misbehave:

1) The transient systemd units used the default `KillMode=cgroup` which will nuke all processes in the specific cgroup including the recently restarted container/conmon once the main `podman healthcheck run` process exits.

2) Podman attempted to remove the transient systemd unit and timer during restart. That is perfectly fine when manually restarting the container but not when the restart itself is being executed inside such a transient unit. Ultimately, Podman tried to shoot itself in the foot.

Fix both issues by moving the restart logic into the cleanup process. Instead of restarting the container itself, the `healthcheck run` now just stops the container once it has turned unhealthy, and the cleanup process then restarts it.

Backport of commit 9563415.

Fixes: containers#17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108

Signed-off-by: Valentin Rothberg <[email protected]>

openshift-ci · 2023-03-21T08:31:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

~~OWNERS~~ [vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rhatdan · 2023-03-21T11:23:35Z

/lgtm

openshift-ci bot added the release-note label Mar 21, 2023

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 21, 2023

vrothberg mentioned this pull request Mar 21, 2023

fix --health-on-failure=restart in transient unit #17830

Merged

openshift-ci bot assigned rhatdan Mar 21, 2023

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Mar 21, 2023

openshift-merge-robot merged commit 158e332 into containers:v4.3.1-rhel Mar 21, 2023

vrothberg deleted the v4.3.1-rhel-backport-fix-17777 branch March 21, 2023 11:54

This was referenced Mar 29, 2023

Experimental workaround for cdn03.quay.io flake #17505

Merged

placeholder issue for quay.io flakes #16973

Closed

github-actions bot added the locked - please file new issue/PR label (assist humans wanting to comment on an old issue or PR with locked comments) Sep 5, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023
