[v4.3.1-rhel] fix --health-on-failure=restart in transient unit #17864

vrothberg
Member

As described in #17777, the restart on-failure action did not behave correctly when the health check was run by a transient systemd unit. It worked fine when executed outside such a unit, for instance manually or, as in the system tests, from a script.

There were two issues causing the restart on-failure action to misbehave:

  1. The transient systemd units used the default KillMode=cgroup which
    will nuke all processes in the specific cgroup including the recently
    restarted container/conmon once the main podman healthcheck run
    process exits.

  2. Podman attempted to remove the transient systemd unit and timer
    during restart. That is perfectly fine when manually restarting the
    container but not when the restart itself is being executed inside
    such a transient unit. Ultimately, Podman tried to shoot itself in
    the foot.
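To illustrate the first issue, here is a toy model in Python of what happens to the processes in a unit's cgroup once the main process exits. It is only a sketch and does not call systemd or Podman: the function `survivors`, the PIDs, and the two-process cgroup are all made up for illustration. Note that systemd's own documentation spells the default kill mode `control-group`, which is the value used below.

```python
# Hypothetical PIDs, chosen only for this illustration.
HEALTHCHECK_PID = 100  # main process: `podman healthcheck run`
CONMON_PID = 101       # conmon of the freshly restarted container


def survivors(kill_mode, main_pid, cgroup_pids):
    """Return the PIDs left alive after the unit's main process exits.

    Toy model of two systemd kill modes:
      - "control-group": every remaining process in the cgroup is killed
      - "process": only the main process is affected
    """
    remaining = set(cgroup_pids) - {main_pid}
    if kill_mode == "control-group":
        return set()
    if kill_mode == "process":
        return remaining
    raise ValueError(f"kill mode not modelled here: {kill_mode}")


cgroup = {HEALTHCHECK_PID, CONMON_PID}

# With the default kill mode, the restarted container's conmon is killed too.
print(survivors("control-group", HEALTHCHECK_PID, cgroup))

# With KillMode=process, conmon would outlive the health check process.
print(survivors("process", HEALTHCHECK_PID, cgroup))
```

The sketch only shows why a restart performed from inside the transient unit cannot survive with the default kill mode. It does not model the second issue, where the unit and timer are removed while the restart is still running inside that unit.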

Fix both issues by moving the restart logic into the cleanup process. Instead of restarting the container, the healthcheck run will just stop the container, and the cleanup process will restart the container once it has turned unhealthy.
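The split of responsibilities described in the fix can be sketched as a small state model. This is not Podman's code: `ToyContainer`, `healthcheck_run`, and `cleanup` are hypothetical names, and the assumption that the restarted container reports the `starting` health status is made only for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class ToyContainer:
    """Minimal stand-in for a container with an on-failure action."""
    on_failure: str = "restart"
    health: str = "healthy"
    running: bool = True
    restarts: int = 0


def healthcheck_run(ctr, check_passed):
    """Models the health check: on failure it only stops the container."""
    if check_passed:
        ctr.health = "healthy"
        return
    ctr.health = "unhealthy"
    if ctr.on_failure == "restart":
        # No restart here; the health check just stops the container.
        ctr.running = False


def cleanup(ctr):
    """Models the cleanup process: it restarts a stopped, unhealthy container."""
    if not ctr.running and ctr.on_failure == "restart" and ctr.health == "unhealthy":
        ctr.running = True
        ctr.restarts += 1
        ctr.health = "starting"  # assumed status after a restart


ctr = ToyContainer()
healthcheck_run(ctr, check_passed=False)
print(ctr.running, ctr.restarts)  # stopped, not yet restarted
cleanup(ctr)
print(ctr.running, ctr.restarts)  # running again after cleanup
```

The point of the split is that the process doing the restart is no longer the `podman healthcheck run` process itself, so neither the kill mode of the transient unit nor the removal of the unit and timer can interrupt it.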

Backport of commit 9563415.

Fixes: #17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108

Does this PR introduce a user-facing change?

Fix a bug where --health-on-failure=restart did not restart the container when its health state turned unhealthy.

@TomSweeneyRedHat @Luap99 @mheon @rhatdan PTAL

As described in containers#17777, the `restart` on-failure action did not
behave correctly when the health check was run by a transient systemd
unit.  It worked fine when executed outside such a unit, for instance
manually or, as in the system tests, from a script.

There were two issues causing the `restart` on-failure action to
misbehave:

1) The transient systemd units used the default `KillMode=cgroup` which
   will nuke all processes in the specific cgroup including the recently
   restarted container/conmon once the main `podman healthcheck run`
   process exits.

2) Podman attempted to remove the transient systemd unit and timer
   during restart.  That is perfectly fine when manually restarting the
   container but not when the restart itself is being executed inside
   such a transient unit.  Ultimately, Podman tried to shoot itself in
   the foot.

Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, the `healthcheck run` will just
stop the container, and the cleanup process will restart the container
once it has turned unhealthy.

Backport of commit 9563415.

Fixes: containers#17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108
Signed-off-by: Valentin Rothberg <[email protected]>
@openshift-ci
Contributor

openshift-ci bot commented Mar 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 21, 2023
@rhatdan
Member

rhatdan commented Mar 21, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 21, 2023
@openshift-merge-robot openshift-merge-robot merged commit 158e332 into containers:v4.3.1-rhel Mar 21, 2023
@vrothberg vrothberg deleted the v4.3.1-rhel-backport-fix-17777 branch March 21, 2023 11:54
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
release-note