auto-update: stop+start instead of restart systemd units #17959
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vrothberg The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@edsantiago, fingers crossed ! |
All |
@containers/podman-maintainers PTAL |
The error messages indicate that the changes of this PR were not part of the failures you point to:
With the change from this PR, there should be a substring saying "error {starting,stopping} systemd unit" |
Ah, no worries at all! I absolutely appreciated you testing the "flake PRs" directly. Thanks a lot! |
D'oh! At that point I am clueless and can only think of a retry on failure -.- |
/hold Don't merge. The search continues. |
Would it help to add a |
|
Someone just remembered that we save logs! Search in-page for
...but have no idea if it's meaningful. |
It turns out the restart is not a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.
Can you link to any resources about that? When I search for it, I only find statements saying that restart equals stop && start. Given that, I see no problem changing to stop+start, but I am still interested in what the cause is here.
The error could be thrown here: Lines 813 to 816 in 365131e
I guess the easiest would be to ignore the ENOENT error there. |
From
|
I agree. I wonder how this can happen. Will take a closer look tomorrow. |
I think the issue was introduced by commit 3fee351 The problem happens because |
Context around the suspicious log |
Another one, and here is its journal, and the string to search for is "pne1" |
My hammer-sqlite PR, #17831, sees this auto-update flake All. The. Time. I have never seen a CI run in which one of those tests does not fail at least once. Often multiple times. I added a klunky ignore-ENOENT patch on my PR. Have run two full CI passes with it. All have failed, because I disable ginkgo-retry .... but I have not seen the auto-update flake. I've just submitted a third run, but it's close to bedtime so I'll have to report back in the morning. ITM I haven't looked closely at @Luap99's explanation, and honestly am unlikely to... but from my perspective the ignore-ENOENT approach looks promising as a way to fix the flake. Or at least sweep it under the rug. |
That is still a bit surprising to me. It's locked, so if |
Yes, indeed! My tired pair of eyes thought it would just be an error log but Podman indeed exits non-zero! Will add a fix to this PR. |
Updated ✔️ |
It turns out the restart is _not_ a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings. [NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause the failure/flake/race. Also ignore ENOENT errors for the CID file when removing a container, which has been identified as the actual fix for containers#17607. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <[email protected]>
LGTM, fingers crossed that the flake is now gone.
@edsantiago feel free to merge |
I've cherrypicked onto #17831 (even though it's the same as my patch) and am making one last run. Will update in an hour. |
All /lgtm Thank you @vrothberg and @Luap99 ! |
/hold cancel |
Okay, stupid question and too late.... but did anyone reevaluate whether the |
Yes. The
I think we still need this change. The comments in the systemctl man page on |
The benefit of the stop/start change is that if we run into new issues in the future we will know exactly if we failed to stop or failed to start the unit and do not have to guess. |
It turns out that |
It turns out the restart is not a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings.
Fixes: #17607