fix --health-on-failure=restart in transient unit #17830

Merged: 1 commit, Mar 20, 2023

Conversation

vrothberg
Member

@vrothberg vrothberg commented Mar 17, 2023

As described in #17777, the restart on-failure action did not behave correctly when the health check was run by a transient systemd unit. It ran just fine when executed outside such a unit, for instance, manually or, as done in the system tests, in a scripted fashion.

There were two issues causing the restart on-failure action to misbehave:

  1. The transient systemd units used the default KillMode=cgroup which
    will nuke all processes in the specific cgroup including the recently
    restarted container/conmon once the main podman healthcheck run
    process exits. Setting the kill mode to none addresses this problem.

  2. Podman attempted to remove the transient systemd unit and timer
    during restart. That is perfectly fine when manually restarting the
    container but not when the restart itself is being executed inside
    such a transient unit. Ultimately, Podman tried to shoot itself in
    the foot.

Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, the healthcheck run will just
stop the container and the cleanup process will restart the container
once it has turned unhealthy.
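
The division of labor described above can be sketched roughly as follows. This is a minimal illustration, not Podman's actual code: the `container` type, the `onFailureAction` values, and the method names are all hypothetical stand-ins, and resetting the status to "starting" after a restart is an assumption of the sketch.

```go
package main

import "fmt"

// onFailureAction is a hypothetical stand-in for the configured
// --health-on-failure action; it is not Podman's real type.
type onFailureAction int

const (
	actionNone onFailureAction = iota
	actionRestart
)

const statusUnhealthy = "unhealthy"

// container is a toy model that only tracks what this sketch needs.
type container struct {
	healthStatus string
	onFailure    onFailureAction
	running      bool
	restarts     int
}

// healthCheckFailed models the health check run (inside the transient
// unit) after the fix: it marks the container unhealthy and, for the
// restart action, only stops it. It never starts the container itself.
func (c *container) healthCheckFailed() {
	c.healthStatus = statusUnhealthy
	if c.onFailure == actionRestart {
		c.running = false
	}
}

// shouldRestartAfterCleanup reports whether the cleanup process should
// bring the container back up.
func (c *container) shouldRestartAfterCleanup() bool {
	return c.onFailure == actionRestart && c.healthStatus == statusUnhealthy
}

// cleanup models the cleanup process, which runs outside the transient
// unit and is therefore free to restart the container.
func (c *container) cleanup() {
	if c.running {
		return // nothing to clean up
	}
	if c.shouldRestartAfterCleanup() {
		c.running = true
		c.restarts++
		c.healthStatus = "starting" // assumption: status resets on restart
	}
}

func main() {
	c := &container{healthStatus: "healthy", onFailure: actionRestart, running: true}
	c.healthCheckFailed()
	fmt.Println("running after failed check:", c.running)
	c.cleanup()
	fmt.Println("running after cleanup:", c.running, "restarts:", c.restarts)
}
```

The point of the split is that the process living inside the transient unit only ever stops the container, so nothing started from that unit's cgroup needs to survive the unit's exit.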

Fixes: #17777

Does this PR introduce a user-facing change?

Fix a bug in --health-on-failure=restart not restarting the container when health state turns unhealthy.

@openshift-ci
Contributor

openshift-ci bot commented Mar 17, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2023
Member

@Luap99 Luap99 left a comment

KillMode=none is deprecated: systemd/systemd#15928

We should not use it for new units IMO. As I see it, the problem is that we start the container from within the healthcheck command. Wouldn't it make more sense to just stop it and then let the podman cleanup process start it again, i.e., like the regular restart policy?

@vrothberg
Member Author

KillMode=none is deprecated: systemd/systemd#15928

I am not that concerned about it. KillMode=none still works but was deprecated to discourage users from using it, since it was a major cause of dysfunctional services. In this case, we know what we are doing.

We should not use it for new units IMO. As I see it, the problem is that we start the container from within the healthcheck command. Wouldn't it make more sense to just stop it and then let the podman cleanup process start it again, i.e., like the regular restart policy?

That is an interesting idea! I will think about it over the weekend but also want others to chime in and see what they think. @mheon @giuseppe WDYT?

@mheon
Member

mheon commented Mar 17, 2023

I see no reason not to let the cleanup process handle this; it works quite well for the restart policy.

@vrothberg
Member Author

I moved the restart logic into the cleanup process and adjusted the tests.

Member

@Luap99 Luap99 left a comment

Thanks but you forgot to update the commit message to reflect the current change

	if err != nil {
		return false, err
	}
	return healthCheck.Status == "unhealthy", nil
Member

Is there a const that you can use for this string?

Member Author

Nice idea! Yes, there's define.HealthCheckUnhealthy. Will update.
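
For illustration, the suggested change might look roughly like the sketch below. It is self-contained, so it declares a local `healthCheckUnhealthy` constant that mirrors the value of define.HealthCheckUnhealthy rather than importing Podman; the `healthCheckResults` type, the `isUnhealthy` helper, and the `load` callback are hypothetical names introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// healthCheckUnhealthy is a local stand-in for define.HealthCheckUnhealthy,
// declared here so the sketch compiles without Podman as a dependency.
const healthCheckUnhealthy = "unhealthy"

// errNoHealthCheck is a hypothetical error used to exercise the error path.
var errNoHealthCheck = errors.New("no health check results available")

// healthCheckResults is a hypothetical, trimmed-down result type that
// carries only the status field used by the comparison.
type healthCheckResults struct {
	Status string
}

// isUnhealthy loads the health check results through the supplied callback
// and compares the status against the named constant instead of a string
// literal, so a typo becomes a compile error rather than a silent mismatch.
func isUnhealthy(load func() (healthCheckResults, error)) (bool, error) {
	healthCheck, err := load()
	if err != nil {
		return false, err
	}
	return healthCheck.Status == healthCheckUnhealthy, nil
}

func main() {
	unhealthy, err := isUnhealthy(func() (healthCheckResults, error) {
		return healthCheckResults{Status: "unhealthy"}, nil
	})
	fmt.Println(unhealthy, err)
}
```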

@vrothberg
Member Author

Thanks but you forgot to update the commit message to reflect the current change

"Fix both issues by moving the restart logic in the cleanup process.
Instead of restarting the container, the healthcheck run will just
stop the container and the cleanup process will restart the container
once it has turned unhealthy."

Please let me know what you think is missing in the commit message.

@Luap99
Member

Luap99 commented Mar 20, 2023

You still have "Setting the kill mode to none addresses this problem." in there, which is no longer correct, but yeah, you are right, the rest is good.

As described in containers#17777, the `restart` on-failure action did not behave
correctly when the health check is being run by a transient systemd
unit.  It ran just fine when being executed outside such a unit, for
instance, manually or, as done in the system tests, in a scripted
fashion.

There were two issues causing the `restart` on-failure action to
misbehave:

1) The transient systemd units used the default `KillMode=cgroup` which
   will nuke all processes in the specific cgroup including the recently
   restarted container/conmon once the main `podman healthcheck run`
   process exits.

2) Podman attempted to remove the transient systemd unit and timer
   during restart.  That is perfectly fine when manually restarting the
   container but not when the restart itself is being executed inside
   such a transient unit.  Ultimately, Podman tried to shoot itself in
   the foot.

Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, the `healthcheck run` will just
stop the container and the cleanup process will restart the container
once it has turned unhealthy.

Fixes: containers#17777
Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Member Author

You still have "Setting the kill mode to none addresses this problem." in there, which is no longer correct, but yeah, you are right, the rest is good.

Thanks for catching! I may need my afternoon coffee :^)

@vrothberg
Member Author

@containers/podman-maintainers PTAL

Member

@Luap99 Luap99 left a comment

LGTM

@rhatdan
Member

rhatdan commented Mar 20, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 20, 2023
@openshift-merge-robot openshift-merge-robot merged commit 23d97fc into containers:main Mar 20, 2023
@vrothberg vrothberg deleted the fix-17777 branch March 21, 2023 07:29
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
release-note
Development

Successfully merging this pull request may close these issues.

--health-on-failure=restart doesn't restart container?
5 participants