fix --health-on-failure=restart in transient unit #17830

vrothberg · 2023-03-17T13:10:10Z

As described in #17777, the restart on-failure action did not behave correctly when the health check is being run by a transient systemd unit. It ran just fine when being executed outside such a unit, for instance, manually or, as done in the system tests, in a scripted fashion.

There were two issues causing the restart on-failure action to misbehave:

1) The transient systemd units used the default KillMode=cgroup which
   will nuke all processes in the specific cgroup including the recently
   restarted container/conmon once the main podman healthcheck run
   process exits. Setting the kill mode to none addresses this problem.
2) Podman attempted to remove the transient systemd unit and timer
   during restart. That is perfectly fine when manually restarting the
   container but not when the restart itself is being executed inside
   such a transient unit. Ultimately, Podman tried to shoot itself in
   the foot.

Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, the healthcheck run will just
stop the container and the cleanup process will restart the container
once it has turned unhealthy.
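
The split described above can be sketched roughly as follows. This is a simplified model, not libpod's actual code: the `container` type, the `healthCheckRun` and `cleanup` methods, and the action and status constants are hypothetical names chosen for illustration, and the sketch ignores retries and other details of real health checks. It only shows the idea that the health check stops the container while the cleanup process, which runs outside the transient unit, performs the restart.

```go
package main

import "fmt"

// All names in this sketch are hypothetical stand-ins, not libpod's real types.

type onFailureAction string

const (
	actionNone    onFailureAction = "none"
	actionRestart onFailureAction = "restart"
)

const (
	statusHealthy   = "healthy"
	statusUnhealthy = "unhealthy"
	statusStarting  = "starting"
)

type container struct {
	running   bool
	health    string
	onFailure onFailureAction
	restarts  int
}

// healthCheckRun models the health check executed inside the transient unit.
// With the restart action, a failed check only stops the container; it never
// starts a new one from within the unit.
func (c *container) healthCheckRun(checkPassed bool) {
	if checkPassed {
		c.health = statusHealthy
		return
	}
	c.health = statusUnhealthy
	if c.onFailure == actionRestart {
		c.running = false
	}
}

// cleanup models the cleanup process that runs after the container stopped.
// It restarts the container only if it was stopped in the unhealthy state.
func (c *container) cleanup() {
	if c.running {
		return
	}
	if c.onFailure == actionRestart && c.health == statusUnhealthy {
		c.running = true
		c.health = statusStarting
		c.restarts++
	}
}

func main() {
	c := &container{running: true, health: statusHealthy, onFailure: actionRestart}
	c.healthCheckRun(false) // the check fails, so the container is stopped
	c.cleanup()             // the cleanup process brings it back up
	fmt.Println("running:", c.running, "restarts:", c.restarts)
}
```

Because the restart happens in the cleanup step rather than in the health check, nothing that the transient unit tears down on exit is responsible for the new container in this model.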

Fixes: #17777

Does this PR introduce a user-facing change?

Fix a bug in --health-on-failure=restart not restarting the container when health state turns unhealthy.

openshift-ci · 2023-03-17T13:11:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

Needs approval from an approver in each of these files:

~~OWNERS~~ [vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Luap99

KillMode=None is deprecated: systemd/systemd#15928

We should not use it for new units IMO. As I see it, the problem is that we start the container from within the healthcheck command. Wouldn't it make more sense to just stop it and then let the podman cleanup process start it again, i.e., like the regular restart policy?

vrothberg · 2023-03-17T16:15:20Z

> KillMode=None is deprecated: systemd/systemd#15928

I am not that concerned about it. None is still working but was deprecated to discourage users from doing it. It was a main cause of dysfunctional services. In this case, we know what we are doing.

> We should not use it for new units IMO. As I see it, the problem is that we start the container from within the healthcheck command. Wouldn't it make more sense to just stop it and then let the podman cleanup process start it again, i.e., like the regular restart policy?

That is an interesting idea! I will think about it over the weekend but also want others to chime in and see what they think. @mheon @giuseppe WDYT?

mheon · 2023-03-17T17:15:17Z

I see no reason not to let the cleanup process handle this, it works quite well for restart policy.

vrothberg · 2023-03-20T09:14:09Z

I moved the restart logic into the cleanup process and adjusted the tests.

Luap99

Thanks, but you forgot to update the commit message to reflect the current change.

Luap99 · 2023-03-20T12:47:37Z

libpod/healthcheck.go

+	if err != nil {
+		return false, err
+	}
+	return healthCheck.Status == "unhealthy", nil


Is there a const that you can use for this string?

Nice idea! Yes, there's define.HealthCheckUnhealthy. Will update.

vrothberg · 2023-03-20T12:49:23Z

> Thanks, but you forgot to update the commit message to reflect the current change.

"Fix both issues by moving the restart logic in the cleanup process.
Instead of restarting the container, the healthcheck run will just
stop the container and the cleanup process will restart the container
once it has turned unhealthy."

Please let me know what you think is missing in the commit message.

Luap99 · 2023-03-20T12:51:36Z

You still have "Setting the kill mode to none addresses this problem." in there, which is no longer correct, but yeah, you are right, the rest is good.

As described in containers#17777, the `restart` on-failure action did not behave correctly when the health check is being run by a transient systemd unit. It ran just fine when being executed outside such a unit, for instance, manually or, as done in the system tests, in a scripted fashion.

There were two issues causing the `restart` on-failure action to misbehave:

1) The transient systemd units used the default `KillMode=cgroup` which will nuke all processes in the specific cgroup including the recently restarted container/conmon once the main `podman healthcheck run` process exits.
2) Podman attempted to remove the transient systemd unit and timer during restart. That is perfectly fine when manually restarting the container but not when the restart itself is being executed inside such a transient unit. Ultimately, Podman tried to shoot itself in the foot.

Fix both issues by moving the restart logic in the cleanup process. Instead of restarting the container, the `healthcheck run` will just stop the container and the cleanup process will restart the container once it has turned unhealthy.

Fixes: containers#17777

Signed-off-by: Valentin Rothberg <[email protected]>

vrothberg · 2023-03-20T12:56:20Z

> You still have "Setting the kill mode to none addresses this problem." in there, which is no longer correct, but yeah, you are right, the rest is good.

Thanks for catching! I may need my afternoon coffee :^)

vrothberg · 2023-03-20T15:01:54Z

@containers/podman-maintainers PTAL

Luap99

LGTM

rhatdan · 2023-03-20T20:17:08Z

/lgtm

vrothberg · 2023-03-21T08:36:47Z

Backports:

openshift-ci bot added the release-note label Mar 17, 2023

vrothberg mentioned this pull request Mar 17, 2023

--health-on-failure=restart doesn't restart container? #17777

Closed

openshift-ci bot added the approved label Mar 17, 2023

Luap99 reviewed Mar 17, 2023

vrothberg force-pushed the fix-17777 branch from e676203 to 2cc3301 March 20, 2023 09:13

vrothberg force-pushed the fix-17777 branch from 2cc3301 to ce97dab March 20, 2023 12:37

Luap99 reviewed Mar 20, 2023

vrothberg force-pushed the fix-17777 branch from ce97dab to 0f0a099 March 20, 2023 12:51

vrothberg force-pushed the fix-17777 branch from 0f0a099 to 9563415 March 20, 2023 12:56

Luap99 reviewed Mar 20, 2023

openshift-ci bot assigned rhatdan Mar 20, 2023

openshift-ci bot added the lgtm label Mar 20, 2023

openshift-merge-robot merged commit 23d97fc into containers:main Mar 20, 2023

vrothberg deleted the fix-17777 branch March 21, 2023 07:29

github-actions bot added the locked - please file new issue/PR label Sep 5, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023
