auto-update: stop+start instead of restart systemd units #17959

Merged
1 commit merged into containers:main on Mar 29, 2023

Conversation

vrothberg
Member

It turns out the restart is not a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings.

Fixes: #17607

Does this PR introduce a user-facing change?

Fix a bug in `podman auto-update` where restarting systemd units may fail.
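
For illustration only, here is a rough sketch of the stop+start idea using the go-systemd/v22 D-Bus bindings; the function name and wiring are illustrative, not Podman's actual implementation:

```go
package autoupdate

import (
	"context"
	"fmt"

	"github.com/coreos/go-systemd/v22/dbus"
)

// stopStartUnit replaces a single restart call with an explicit stop
// followed by a start, so each phase reports its own job result and a
// failure can be attributed to either stopping or starting the unit.
func stopStartUnit(ctx context.Context, conn *dbus.Conn, unit string) error {
	ch := make(chan string)
	if _, err := conn.StopUnitContext(ctx, unit, "replace", ch); err != nil {
		return fmt.Errorf("stopping systemd unit %q: %w", unit, err)
	}
	if result := <-ch; result != "done" {
		return fmt.Errorf("stopping systemd unit %q: expected %q but received %q", unit, "done", result)
	}
	if _, err := conn.StartUnitContext(ctx, unit, "replace", ch); err != nil {
		return fmt.Errorf("starting systemd unit %q: %w", unit, err)
	}
	if result := <-ch; result != "done" {
		return fmt.Errorf("starting systemd unit %q: expected %q but received %q", unit, "done", result)
	}
	return nil
}
```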

@vrothberg vrothberg marked this pull request as ready for review March 28, 2023 12:56
@openshift-ci openshift-ci bot added the release-note and do-not-merge/work-in-progress labels and removed the do-not-merge/work-in-progress label Mar 28, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Mar 28, 2023
@vrothberg
Member Author

@edsantiago, fingers crossed!

@edsantiago
Member

All sys debian tests passed! 🥳

@vrothberg
Member Author

@containers/podman-maintainers PTAL

@edsantiago
Member

edsantiago commented Mar 28, 2023

@vrothberg I'm really, really sorry. That is from #17831, which includes this PR via cherry-pick. It's possible I did something wrong, though. Please ignore.

@edsantiago
Member

edsantiago commented Mar 28, 2023

Failed in f37 also. I am an idiot. Juggling too many PRs at once. Sorry.

@vrothberg
Member Author

vrothberg commented Mar 28, 2023

The error messages indicate that the changes of this PR were not part of the failures you point to:

Error: restarting unit container-c_local_X3CcJnS5eD.service during update: expected "done" but received "failed"

With the change from this PR, there should be a substring saying "error {starting,stopping} systemd unit".

@vrothberg
Member Author

Failed in f37 also. I am an idiot. Juggling too many PRs at once. Sorry.

Ah, no worries at all! I absolutely appreciated you testing the "flake PRs" directly. Thanks a lot!

@edsantiago
Member

Please ignore. That was a different PR (I'm juggling too many open ones).

Unfortunately this one looks real. It fails with a different symptom: "failed to start container".

Please double-check that #17831 includes your PR, though.

@vrothberg
Member Author

Error: restarting unit container-c_image_zFnGV6HaJi.service during update: error starting systemd unit "container-c_image_zFnGV6HaJi.service" expected "done" but received "failed"

D'oh! At that point I am clueless and can only think of a retry on failure -.-

@vrothberg
Member Author

/hold

Don't merge. The search continues.

@openshift-ci openshift-ci bot added the do-not-merge/hold label Mar 28, 2023
@edsantiago
Member

Would it help to add a journalctl or podman logs to the debug messages?

@vrothberg
Member Author

Would it help to add a journalctl or podman logs to the debug messages?

journalctl would be helpful. For some reason something goes south in the unit.

@edsantiago
Member

Someone just remembered that we save logs! Search in-page for nfeb, see if anything stands out. I see this:

podman[139930]: Error: remove /run/container-c_local_nFEbelOONN.service.ctr-id: no such file or directory

...but have no idea if it's meaningful.

@Luap99 Luap99 left a comment
Member

It turns out the restart is not a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

Can you link to any resources about that? If I search for it, I only find statements saying that restart equals stop && start. Given that, I see no problem changing to stop+start, but I am still interested in what the cause here is.

@Luap99
Member

Luap99 commented Mar 28, 2023

Someone just remembered that we save logs! Search in-page for nfeb, see if anything stands out. I see this:

podman[139930]: Error: remove /run/container-c_local_nFEbelOONN.service.ctr-id: no such file or directory

...but have no idea if it's meaningful.

The error could be thrown here:

if cidFile, ok := c.config.Spec.Annotations[define.InspectAnnotationCIDFile]; ok {
	if err := os.Remove(cidFile); err != nil {
		if cleanupErr == nil {
			cleanupErr = err
		}
	}
}

I guess the easiest would be to ignore the ENOENT error there.
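
As a sketch of that suggestion (it reuses the identifiers from the fragment quoted above, so it is not self-contained, and assumes the `errors` package is imported), the only addition would be an errors.Is check:

```go
if cidFile, ok := c.config.Spec.Annotations[define.InspectAnnotationCIDFile]; ok {
	// The cidfile may already have been removed by a concurrent cleanup
	// (e.g. the unit's `podman rm --cidfile` racing with `--rm`), so treat
	// a missing file as already cleaned up rather than as an error.
	if err := os.Remove(cidFile); err != nil && !errors.Is(err, os.ErrNotExist) {
		if cleanupErr == nil {
			cleanupErr = err
		}
	}
}
```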

@vrothberg
Member Author

It turns out the restart is not a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

Can you link to any resources about that? If I search for it, I only find statements saying that restart equals stop && start. Given that, I see no problem changing to stop+start, but I am still interested in what the cause here is.

From man systemctl:

Note that restarting a unit with this command does not necessarily flush out all of
the unit's resources before it is started again. For example, the per-service file
descriptor storage facility (see FileDescriptorStoreMax= in systemd.service(5)) will
remain intact as long as the unit has a job pending, and is only cleared when the unit
is fully stopped and no jobs are pending anymore. If it is intended that the file
descriptor store is flushed out, too, during a restart operation an explicit systemctl
stop command followed by systemctl start should be issued.

@vrothberg
Member Author

The error could be thrown here:

I agree. I wonder how this can happen. Will take a closer look tomorrow.

@Luap99
Member

Luap99 commented Mar 28, 2023

The error could be thrown here:

I agree. I wonder how this can happen. Will take a closer look tomorrow.

I think the issue was introduced by commit 3fee351

The problem happens because `podman rm --cidfile` will remove the cidfile, and `podman run --rm ...` will also try to remove the cidfile in the cleanup process, which happens in the background.
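
A toy illustration (not Podman code) of that race, using two goroutines as stand-ins for the two cleanup paths: both remove the same cidfile, the loser gets ENOENT, and ignoring ENOENT makes the outcome harmless either way:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"sync"
)

func main() {
	// Stand-in for the cidfile; the real path would be something like
	// /run/<unit>.service.ctr-id.
	f, err := os.CreateTemp("", "ctr-id")
	if err != nil {
		panic(err)
	}
	f.Close()

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			switch err := os.Remove(f.Name()); {
			case err == nil:
				fmt.Printf("cleaner %d: removed cidfile\n", id)
			case errors.Is(err, os.ErrNotExist):
				fmt.Printf("cleaner %d: cidfile already gone, safe to ignore\n", id)
			default:
				fmt.Printf("cleaner %d: unexpected error: %v\n", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```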

@vrothberg
Member Author

Mar 28 14:11:21 cirrus-task-5161641281585152 systemd[1]: Starting container-c_local_nFEbelOONN.service - Podman container-c_local_nFEbelOONN.service...
Mar 28 14:11:22 cirrus-task-5161641281585152 podman[139930]: 2023-03-28 14:11:22.020187211 +0000 UTC m=+0.049311562 container remove 7dae766f8dd4d6f7967b193d63dd8f366948211064e5af54f98a91d177b3b47f (image=quay.io/libpod/localtest:latest, name=c_local_nFEbelOONN, PODMAN_SYSTEMD_UNIT=container-c_local_nFEbelOONN.service, created_at=2022-10-18T16:28:08Z, created_by=test/system/build-testimage, io.buildah.version=1.28.0, io.containers.autoupdate=local)
Mar 28 14:11:22 cirrus-task-5161641281585152 podman[139930]: 2023-03-28 14:11:22.002146796 +0000 UTC m=+0.031271181 image pull 83f9ba45d987cf1a5a96c173125bf036e9f7a03f1fd49ba800c2ee3b30dde5bc quay.io/libpod/localtest:latest
Mar 28 14:11:22 cirrus-task-5161641281585152 podman[139930]: Error: remove /run/container-c_local_nFEbelOONN.service.ctr-id: no such file or directory
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: container-c_local_nFEbelOONN.service: Main process exited, code=exited, status=125/n/a
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: container-c_local_nFEbelOONN.service: Failed with result 'exit-code'.
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: Failed to start container-c_local_nFEbelOONN.service - Podman container-c_local_nFEbelOONN.service.
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: Stopped container-c_local_nFEbelOONN.service - Podman container-c_local_nFEbelOONN.service.

The above is the journal context around the suspicious log line.

@edsantiago
Member

Another one, and here is its journal, and the string to search for is "pne1"

@edsantiago
Member

I guess the easiest would be to ignore the ENOENT error there.

My hammer-sqlite PR, #17831, sees this auto-update flake All. The. Time. I have never seen a CI run in which one of those tests does not fail at least once. Often multiple times.

I added a klunky ignore-ENOENT patch on my PR. Have run two full CI passes with it. All have failed, because I disable ginkgo-retry .... but I have not seen the auto-update flake. I've just submitted a third run, but it's close to bedtime so I'll have to report back in the morning.

ITM I haven't looked closely at @Luap99's explanation, and honestly am unlikely to... but from my perspective the ignore-ENOENT approach looks promising as a way to fix the flake. Or at least sweep it under the rug.

@vrothberg
Member Author

The problem happens because `podman rm --cidfile` will remove the cidfile, and `podman run --rm ...` will also try to remove the cidfile in the cleanup process, which happens in the background.

That is still a bit surprising to me. It's locked, so if `--rm` kicked in before `podman rm`, then `podman rm` should just return.

@vrothberg
Member Author

ITM I haven't looked closely at @Luap99's explanation, and honestly am unlikely to... but from my perspective the ignore-ENOENT approach looks promising as a way to fix the flake. Or at least sweep it under the rug.

Yes, indeed!

My tired pair of eyes thought it would just be an error log but Podman indeed exits non-zero! Will add a fix to this PR.

@vrothberg
Member Author

Updated ✔️

It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Also ignore ENOENT for the CID file when removing a container, which has
been identified as the actual fix for containers#17607.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <[email protected]>

@Luap99 Luap99 left a comment
Member

LGTM, fingers crossed that the flake is now gone.

@vrothberg
Member Author

@edsantiago feel free to merge

@edsantiago
Member

I've cherry-picked onto #17831 (even though it's the same as my patch) and am making one last run. Will update in an hour.

@edsantiago
Member

All sys tests passed on the first try (the auto-update flake happens only in sys). Ergo:

/lgtm

Thank you @vrothberg and @Luap99 !

@openshift-ci openshift-ci bot added the lgtm label Mar 29, 2023
@vrothberg
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Mar 29, 2023
@openshift-merge-robot openshift-merge-robot merged commit d29a85b into containers:main Mar 29, 2023
@edsantiago
Member

Okay, stupid question and too late.... but did anyone reevaluate whether the unlink error was the ultimate root cause of this whole mess, and whether or not the stop/start thing was even needed?

@vrothberg vrothberg deleted the fix-17607 branch March 29, 2023 13:27
@vrothberg
Member Author

Okay, stupid question and too late.... but did anyone reevaluate whether the unlink error was the ultimate root cause of this whole mess,

Yes. The unlink error was the root cause. It ultimately led to errors when restarting the systemd units.

and whether or not the stop/start thing was even needed?

I think we still need this change. The comments in the systemctl man page about restart not entirely being stop+start worried me.

@Luap99
Member

Luap99 commented Mar 29, 2023

The benefit of the stop/start change is that if we run into new issues in the future, we will know exactly whether we failed to stop or failed to start the unit, and do not have to guess.

@vrothberg
Member Author

I think we still need this change. The comments in the systemctl man page about restart not entirely being stop+start worried me.

It turns out that replacing restart with stop+start is not always a good thing (see #18926): restart will take care of dependencies while stop+start does not.

@github-actions github-actions bot added the locked - please file new issue/PR label Sep 18, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 18, 2023