auto-update: stop+start instead of restart systemd units #17959

Merged
1 commit merged into containers:main on Mar 29, 2023

Conversation

vrothberg
Member

It turns out the restart is not a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings.

Fixes: #17607

Does this PR introduce a user-facing change?

Fix a bug in `podman auto-update` where restarting systemd units may fail.
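
For illustration only, here is a rough sketch of the stop+start idea using the go-systemd/v22 D-Bus bindings; the function name and wiring are illustrative, not Podman's actual implementation:

```go
package autoupdate

import (
	"context"
	"fmt"

	"github.com/coreos/go-systemd/v22/dbus"
)

// stopStartUnit replaces a single restart call with an explicit stop
// followed by a start, so each phase reports its own job result and a
// failure can be attributed to either stopping or starting the unit.
func stopStartUnit(ctx context.Context, conn *dbus.Conn, unit string) error {
	ch := make(chan string)
	if _, err := conn.StopUnitContext(ctx, unit, "replace", ch); err != nil {
		return fmt.Errorf("stopping systemd unit %q: %w", unit, err)
	}
	if result := <-ch; result != "done" {
		return fmt.Errorf("stopping systemd unit %q: expected %q but received %q", unit, "done", result)
	}
	if _, err := conn.StartUnitContext(ctx, unit, "replace", ch); err != nil {
		return fmt.Errorf("starting systemd unit %q: %w", unit, err)
	}
	if result := <-ch; result != "done" {
		return fmt.Errorf("starting systemd unit %q: expected %q but received %q", unit, "done", result)
	}
	return nil
}
```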

@vrothberg vrothberg marked this pull request as ready for review March 28, 2023 12:56
@openshift-ci openshift-ci bot added the release-note and do-not-merge/work-in-progress labels and removed the do-not-merge/work-in-progress label Mar 28, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Mar 28, 2023
@vrothberg
Member Author

@edsantiago, fingers crossed!

@edsantiago
Member

All sys debian tests passed! 🥳

@vrothberg
Member Author

@containers/podman-maintainers PTAL

@edsantiago
Member

edsantiago commented Mar 28, 2023

@vrothberg I'm really, really sorry. That is from #17831, which includes this PR via cherry-pick. It's possible I did something wrong, though. Please ignore.

@edsantiago
Member

edsantiago commented Mar 28, 2023

Failed in f37 also. I am an idiot. Juggling too many PRs at once. Sorry.

@vrothberg
Member Author

vrothberg commented Mar 28, 2023

The error messages indicate that the changes of this PR were not part of the failures you point to:

Error: restarting unit container-c_local_X3CcJnS5eD.service during update: expected "done" but received "failed"

With the change from this PR, there should be a substring saying "error {starting,stopping} systemd unit".

@vrothberg
Member Author

Failed in f37 also. I am an idiot. Juggling too many PRs at once. Sorry.

Ah, no worries at all! I absolutely appreciated you testing the "flake PRs" directly. Thanks a lot!

@edsantiago
Member

Please ignore. That was a different PR (I'm juggling too many open ones).

Unfortunately this one looks real. It fails with a different symptom: "failed to start container".

Please double-check that #17831 includes your PR, though.

@vrothberg
Member Author

Error: restarting unit container-c_image_zFnGV6HaJi.service during update: error starting systemd unit "container-c_image_zFnGV6HaJi.service" expected "done" but received "failed"

D'oh! At that point I am clueless and can only think of a retry on failure -.-

@vrothberg
Member Author

/hold

Don't merge. The search continues.

@openshift-ci openshift-ci bot added the do-not-merge/hold label Mar 28, 2023
@edsantiago
Member

Would it help to add a journalctl or podman logs to the debug messages?

@vrothberg
Member Author

Would it help to add a journalctl or podman logs to the debug messages?

journalctl would be helpful. For some reason something goes south in the unit.

@edsantiago
Member

Someone just remembered that we save logs! Search in-page for nfeb, see if anything stands out. I see this:

podman[139930]: Error: remove /run/container-c_local_nFEbelOONN.service.ctr-id: no such file or directory

...but have no idea if it's meaningful.

@Luap99 Luap99 left a comment
Member

It turns out the restart is not a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

Can you link to any resources about that? If I search for it, I only find statements saying that restart equals stop && start. Given that, I see no problem changing to stop+start, but I am still interested in what the cause here is.

@Luap99
Member

Luap99 commented Mar 28, 2023

Someone just remembered that we save logs! Search in-page for nfeb, see if anything stands out. I see this:

podman[139930]: Error: remove /run/container-c_local_nFEbelOONN.service.ctr-id: no such file or directory

...but have no idea if it's meaningful.

The error could be thrown here:

if cidFile, ok := c.config.Spec.Annotations[define.InspectAnnotationCIDFile]; ok {
	if err := os.Remove(cidFile); err != nil {
		if cleanupErr == nil {
			cleanupErr = err
		}
	}
}

I guess the easiest would be to ignore the ENOENT error there.
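
As a sketch of that suggestion (it reuses the identifiers from the fragment quoted above, so it is not self-contained, and assumes the `errors` package is imported), the only addition would be an errors.Is check:

```go
if cidFile, ok := c.config.Spec.Annotations[define.InspectAnnotationCIDFile]; ok {
	// The cidfile may already have been removed by a concurrent cleanup
	// (e.g. the unit's `podman rm --cidfile` racing with `--rm`), so treat
	// a missing file as already cleaned up rather than as an error.
	if err := os.Remove(cidFile); err != nil && !errors.Is(err, os.ErrNotExist) {
		if cleanupErr == nil {
			cleanupErr = err
		}
	}
}
```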

@vrothberg
Member Author

It turns out the restart is not a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

Can you link to any resources about that? If I search for it, I only find statements saying that restart equals stop && start. Given that, I see no problem changing to stop+start, but I am still interested in what the cause here is.

From man systemctl:

Note that restarting a unit with this command does not necessarily flush out all of
the unit's resources before it is started again. For example, the per-service file
descriptor storage facility (see FileDescriptorStoreMax= in systemd.service(5)) will
remain intact as long as the unit has a job pending, and is only cleared when the unit
is fully stopped and no jobs are pending anymore. If it is intended that the file
descriptor store is flushed out, too, during a restart operation an explicit systemctl
stop command followed by systemctl start should be issued.

@vrothberg
Member Author

The error could be thrown here:

I agree. I wonder how this can happen. Will take a closer look tomorrow.

@Luap99
Member

Luap99 commented Mar 28, 2023

The error could be thrown here:

I agree. I wonder how this can happen. Will take a closer look tomorrow.

I think the issue was introduced by commit 3fee351

The problem happens because `podman rm --cidfile` will remove the cidfile, and `podman run --rm ...` will also try to remove the cidfile in the cleanup process, which happens in the background.
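
A toy illustration (not Podman code) of that race, using two goroutines as stand-ins for the two cleanup paths: both remove the same cidfile, the loser gets ENOENT, and ignoring ENOENT makes the outcome harmless either way:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"sync"
)

func main() {
	// Stand-in for the cidfile; the real path would be something like
	// /run/<unit>.service.ctr-id.
	f, err := os.CreateTemp("", "ctr-id")
	if err != nil {
		panic(err)
	}
	f.Close()

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			switch err := os.Remove(f.Name()); {
			case err == nil:
				fmt.Printf("cleaner %d: removed cidfile\n", id)
			case errors.Is(err, os.ErrNotExist):
				fmt.Printf("cleaner %d: cidfile already gone, safe to ignore\n", id)
			default:
				fmt.Printf("cleaner %d: unexpected error: %v\n", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```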

@vrothberg
Member Author

Mar 28 14:11:21 cirrus-task-5161641281585152 systemd[1]: Starting container-c_local_nFEbelOONN.service - Podman container-c_local_nFEbelOONN.service...
Mar 28 14:11:22 cirrus-task-5161641281585152 podman[139930]: 2023-03-28 14:11:22.020187211 +0000 UTC m=+0.049311562 container remove 7dae766f8dd4d6f7967b193d63dd8f366948211064e5af54f98a91d177b3b47f (image=quay.io/libpod/localtest:latest, name=c_local_nFEbelOONN, PODMAN_SYSTEMD_UNIT=container-c_local_nFEbelOONN.service, created_at=2022-10-18T16:28:08Z, created_by=test/system/build-testimage, io.buildah.version=1.28.0, io.containers.autoupdate=local)
Mar 28 14:11:22 cirrus-task-5161641281585152 podman[139930]: 2023-03-28 14:11:22.002146796 +0000 UTC m=+0.031271181 image pull 83f9ba45d987cf1a5a96c173125bf036e9f7a03f1fd49ba800c2ee3b30dde5bc quay.io/libpod/localtest:latest
Mar 28 14:11:22 cirrus-task-5161641281585152 podman[139930]: Error: remove /run/container-c_local_nFEbelOONN.service.ctr-id: no such file or directory
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: container-c_local_nFEbelOONN.service: Main process exited, code=exited, status=125/n/a
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: container-c_local_nFEbelOONN.service: Failed with result 'exit-code'.
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: Failed to start container-c_local_nFEbelOONN.service - Podman container-c_local_nFEbelOONN.service.
Mar 28 14:11:22 cirrus-task-5161641281585152 systemd[1]: Stopped container-c_local_nFEbelOONN.service - Podman container-c_local_nFEbelOONN.service.

The above is the journal context around the suspicious log line.

@edsantiago
Member

Another one, and here is its journal, and the string to search for is "pne1"

@edsantiago
Member

I guess the easiest would be to ignore the ENOENT error there.

My hammer-sqlite PR, #17831, sees this auto-update flake All. The. Time. I have never seen a CI run in which one of those tests does not fail at least once. Often multiple times.

I added a klunky ignore-ENOENT patch on my PR. Have run two full CI passes with it. All have failed, because I disable ginkgo-retry .... but I have not seen the auto-update flake. I've just submitted a third run, but it's close to bedtime so I'll have to report back in the morning.

ITM I haven't looked closely at @Luap99's explanation, and honestly am unlikely to... but from my perspective the ignore-ENOENT approach looks promising as a way to fix the flake. Or at least sweep it under the rug.

@vrothberg
Member Author

The problem happens because `podman rm --cidfile` will remove the cidfile, and `podman run --rm ...` will also try to remove the cidfile in the cleanup process, which happens in the background.

That is still a bit surprising to me. It's locked, so if `--rm` kicked in before `podman rm`, then `podman rm` should just return.

@vrothberg
Member Author

ITM I haven't looked closely at @Luap99's explanation, and honestly am unlikely to... but from my perspective the ignore-ENOENT approach looks promising as a way to fix the flake. Or at least sweep it under the rug.

Yes, indeed!

My tired pair of eyes thought it would just be an error log but Podman indeed exits non-zero! Will add a fix to this PR.

@vrothberg
Member Author

Updated ✔️

It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Also ignore ENOENT for the CID file when removing a container, which has
been identified as the actual fix for containers#17607.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <[email protected]>

@Luap99 Luap99 left a comment
Member

LGTM, fingers crossed that the flake is now gone.

@vrothberg
Member Author

@edsantiago feel free to merge

@edsantiago
Member

I've cherry-picked onto #17831 (even though it's the same as my patch) and am making one last run. Will update in an hour.

@edsantiago
Member

All sys tests passed on the first try (the auto-update flake happens only in sys). Ergo:

/lgtm

Thank you @vrothberg and @Luap99 !

@openshift-ci openshift-ci bot added the lgtm label Mar 29, 2023
@vrothberg
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Mar 29, 2023
@openshift-merge-robot openshift-merge-robot merged commit d29a85b into containers:main Mar 29, 2023
@edsantiago
Member

Okay, stupid question and too late.... but did anyone reevaluate whether the unlink error was the ultimate root cause of this whole mess, and whether or not the stop/start thing was even needed?

@vrothberg vrothberg deleted the fix-17607 branch March 29, 2023 13:27
@vrothberg
Member Author

Okay, stupid question and too late.... but did anyone reevaluate whether the unlink error was the ultimate root cause of this whole mess,

Yes. The unlink error was the root cause. It ultimately led to errors when restarting the systemd units.

and whether or not the stop/start thing was even needed?

I think we still need this change. The comments in the systemctl man page about restart not entirely being stop+start worried me.

@Luap99
Member

Luap99 commented Mar 29, 2023

The benefit of the stop/start change is that if we run into new issues in the future, we will know exactly whether we failed to stop or failed to start the unit, and do not have to guess.

@vrothberg
Member Author

I think we still need this change. The comments in the systemctl man page about restart not entirely being stop+start worried me.

It turns out that replacing restart with stop+start is not always a good thing (see #18926): restart will take care of dependencies while stop+start does not.

@github-actions github-actions bot added the locked - please file new issue/PR label Sep 18, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 18, 2023