
test/e2e: check for stderr errors in cleanup() #18442

Merged: 1 commit into containers:main on Mar 4, 2024

Conversation

@Luap99 (Member) commented May 3, 2023

Note: I really do not expect this to pass; I am only submitting it to gather some data and see whether it shows us more bugs. In the long term something like this should be included. There are many code paths that only log via logrus but still exit 0, so this should catch more bugs.
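The core idea can be sketched in Go (a minimal sketch only; the helper name is hypothetical and this is not podman's actual test code). A cleanup hook in the e2e suite would fail the test whenever a command that exited 0 still wrote logrus warning or error lines to stderr:

```go
package main

import (
	"fmt"
	"strings"
)

// hasLogErrors reports whether a command's stderr contains logrus
// warning/error lines even though the command exited 0. An e2e
// cleanup hook could call this and fail the test when it returns true.
func hasLogErrors(stderr string) bool {
	for _, line := range strings.Split(stderr, "\n") {
		if strings.Contains(line, "level=warning") ||
			strings.Contains(line, "level=error") {
			return true
		}
	}
	return false
}

func main() {
	out := `time="2023-06-13T10:17:45Z" level=warning msg="lstat ...: no such file or directory"`
	fmt.Println(hasLogErrors(out)) // true
	fmt.Println(hasLogErrors(""))  // false
}
```

In a real ginkgo suite this check would live in the per-test cleanup path rather than in main.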

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none labels May 3, 2023
@openshift-ci bot (Contributor) commented May 3, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 3, 2023
@Luap99 (Member Author) commented May 3, 2023

@edsantiago WDYT? I don't expect this to work without further tweaks, but I think this is important to catch more bugs and flakes.

@edsantiago (Member)

I really like it! I think it might be too soon to put this in production (I've tried before and it has been impossible), but maybe once unlinkat/EBUSY gets fixed we can push this, thank you!

@Luap99 (Member Author) commented May 3, 2023

Looks actually doable: between 10 and 20 failures depending on which task you look at, and yes, several flakes as well.
Can you check this against your known flakes? Just to see what new failures we see here.

@edsantiago (Member)

Yow. Everything looks new. Since the hard failures seem consistent, I'm treating those as SEP and concentrating on the flakes only. The open pidfd one is nefarious: it has cropped up repeatedly since 2020, but never reliably enough to file an issue for. Here I see it in f37 root and in almost all the tests; I will file an issue for it later. IPAM/ips is a new one, will look closer at it tonight. And, sigh, unmount NS is back; I had hoped those would be gone with the ginkgo v2 system reset work.

@Luap99 (Member Author) commented May 4, 2023

And, sigh, unmount NS is back

It may be back, but this is a completely different failure: no such file or directory vs. invalid fs magic. From the other error reported there, and all the other errors in these tests, it is pretty safe to say that we are trying to Cleanup() twice.
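A double Cleanup() like this is typically fixed by making the teardown path idempotent. A minimal Go sketch of that idea (the Container type and field names here are hypothetical, not podman's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// Container sketches an idempotent cleanup: a second Cleanup() call
// must not re-run teardown, which is what produces spurious
// "no such file or directory" warnings on the second pass.
type Container struct {
	cleanupOnce sync.Once
	cleanups    int // counts how many times teardown actually ran
}

// Cleanup runs teardown at most once, no matter how often it is called.
func (c *Container) Cleanup() {
	c.cleanupOnce.Do(func() {
		// real code would unmount namespaces, remove cgroup scopes, etc.
		c.cleanups++
	})
}

func main() {
	c := &Container{}
	c.Cleanup()
	c.Cleanup() // no-op: teardown already ran
	fmt.Println(c.cleanups) // 1
}
```

An alternative is to track a "cleaned up" state flag under the container lock, which also lets the second caller return an informative error instead of silently succeeding.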

@Luap99 (Member Author) commented May 4, 2023

Filed #18460, trivial to reproduce. Still going through the logs to look for other hard failures.

@edsantiago (Member)

Much better! Only about ten flakes, all of them the pidfd one. I did not see the IPAM flake.

@github-actions bot commented Jun 4, 2023

A friendly reminder that this PR had no activity for 30 days.

@Luap99 (Member Author) commented Jun 13, 2023

OK, all remaining outstanding errors look like this:

           # podman [options] stop --all -t 0
           time="2023-06-13T10:17:45Z" level=warning msg="lstat /sys/fs/cgroup/devices/machine.slice/libpod-deb09b1224bf545821fb135679cd6d1ad2bbacd55415b32acb89f26b445ebe78.scope: no such file or directory"
           4e60fd09f05efc6ae524bd0a46e0f6494b09a45fd9ccd6e50e0ed28d2c7c9f12
           deb09b1224bf545821fb135679cd6d1ad2bbacd55415b32acb89f26b445ebe78
           # podman [options] pod rm -fa -t 0
           ca4099d1486c4a67d07d411bb0f7a7911fd53d2197829b1877c02b22212be2de
           # podman [options] rm -fa -t 0

           [FAILED] Failure recorded during attempt 1:
           stop --all -t0 error logged
           Expected
               <string>: time="2023-06-13T10:17:45Z" level=warning msg="lstat /sys/fs/cgroup/devices/machine.slice/libpod-deb09b1224bf545821fb135679cd6d1ad2bbacd55415b32acb89f26b445ebe78.scope: no such file or directory"
           to be empty

https://api.cirrus-ci.com/v1/artifact/task/6373332413579264/html/int-podman-debian-12-root-host-boltdb.log.html
(Note: the links at the bottom are broken, so you need to grep for [FAILED].)

Only Debian, but both root and rootless, so it could be something runc- or cgroup v1-specific?
@giuseppe Could you take a look? Maybe we should ignore ENOENT?
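Ignoring ENOENT here would mean treating a missing cgroup scope directory as a non-event: if the directory is already gone at stop time, there is nothing to warn about. A hedged Go sketch of that idea (the helper is hypothetical, not podman's or runc's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// warnOnStat returns a warning message when lstat fails, but stays
// silent on ENOENT: a cgroup scope that has already been removed is
// not worth logging during stop/cleanup.
func warnOnStat(path string) string {
	if _, err := os.Lstat(path); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return "" // already removed; ignore ENOENT
		}
		return err.Error() // other errors are still reported
	}
	return ""
}

func main() {
	// A long-gone scope path produces no warning.
	fmt.Println(warnOnStat("/sys/fs/cgroup/machine.slice/libpod-gone.scope"))
}
```

errors.Is with fs.ErrNotExist matches ENOENT regardless of how many path components are missing, which is exactly the "container already cleaned up" case.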

@edsantiago (Member)

Looks like #11784 which was closed ENOTHINGWECANDOABOUTIT

@Luap99 (Member Author) commented Jun 13, 2023

Looks like #11784 which was closed ENOTHINGWECANDOABOUTIT

Thanks, except this here is not a flake; it is actually trivial to reproduce when using runc:

$ sudo podman --runtime runc run -d --name test alpine top
23b74a32da9148ab003a57ea97b53a20e0539e4139fa49ad4171b0b9d11e9ec0
$ sudo podman --runtime runc run -d --pid container:test alpine top
ed3ebef0a4c988b3aedeef9633635002c6affabcb76581b4a9d7d5d527ab2a99
$ sudo podman stop --all
WARN[0000] freezer not supported: openat2 /sys/fs/cgroup/machine.slice/libpod-ed3ebef0a4c988b3aedeef9633635002c6affabcb76581b4a9d7d5d527ab2a99.scope/cgroup.freeze: no such file or directory 
WARN[0000] lstat /sys/fs/cgroup/machine.slice/libpod-ed3ebef0a4c988b3aedeef9633635002c6affabcb76581b4a9d7d5d527ab2a99.scope: no such file or directory 
23b74a32da9148ab003a57ea97b53a20e0539e4139fa49ad4171b0b9d11e9ec0
ed3ebef0a4c988b3aedeef9633635002c6affabcb76581b4a9d7d5d527ab2a99

@Luap99 (Member Author) commented Jun 13, 2023

In fact you do not even need two containers; using --pid host alone will trigger this error:

$ sudo podman --runtime runc run -d --pid host alpine top
d57fe8041fe014d0e85dba2e1aa1e0b903599531ac2067f15528acb4ee3756b7
$ sudo podman stop --all
WARN[0000] freezer not supported: openat2 /sys/fs/cgroup/machine.slice/libpod-d57fe8041fe014d0e85dba2e1aa1e0b903599531ac2067f15528acb4ee3756b7.scope/cgroup.freeze: no such file or directory 
WARN[0000] lstat /sys/fs/cgroup/machine.slice/libpod-d57fe8041fe014d0e85dba2e1aa1e0b903599531ac2067f15528acb4ee3756b7.scope: no such file or directory 
d57fe8041fe014d0e85dba2e1aa1e0b903599531ac2067f15528acb4ee3756b7

@Luap99 (Member Author) commented Jun 13, 2023

I tested with runc from the main branch and it no longer seems to emit these warnings. So I guess the answer is to wait a bit longer until this is released and packaged in Debian.

@giuseppe (Member)

--pid host with rootless on cgroup v1 should not be supported.

There is no reliable way to terminate the container since cgroups are not used. We shouldn't allow it, just as we shouldn't allow joining another PID namespace.

@Luap99 (Member Author) commented Jun 13, 2023

Note that I ran the reproducers locally on cgroup v2, so I assume it shouldn't matter for this specific warning.
In any case this seems to be fixed upstream, so I am fine with just waiting.

@github-actions github-actions bot removed the stale-pr label Jun 14, 2023
@github-actions bot commented Aug 5, 2023

A friendly reminder that this PR had no activity for 30 days.

edsantiago added a commit to edsantiago/libpod that referenced this pull request Sep 7, 2023
With few exceptions, commands that exit 0 should not emit any
messages with level=warning or =error. Let's start enforcing
that in run_podman.

Allow one-off exceptions, typically when we're testing an
actual warning condition (usual case: "podman stop" where it
times out to SIGKILL). Exceptions are specified via:

    run_podman 0+w subcommand...
               ^^^---- or, rarely, 0+e

"0" stands for "expect exit status 0", which is the default
so it's implicit anyway. The +w / +e (or even +we) is the
new part. I have added it to tests where necessary.

And, because life is what it is, add two global exceptions:

  - Debian. Because runc has too many flakes.
  - kube. Ditto. Kube commands emit lots of nasty error
    messages (yes, level=error) that don't seem to affect
    results.

Similar to containers#18442

Signed-off-by: Ed Santiago <[email protected]>
@rhatdan (Member) commented Mar 1, 2024

Can we close this or update it?

@Luap99 (Member Author) commented Mar 1, 2024

I still want this, though. I guess I can rebase and see what blows up.

@Luap99 (Member Author) commented Mar 1, 2024

Debian still seems to print these runc warnings for some tests.

I guess one way forward is to check stderr only for crun/Fedora? @edsantiago WDYT?

@edsantiago (Member)

I'm totally fine with skipping warning checks on Debian. There is precedent, see #21191

And needless to say I would LOVE to have this enabled wherever possible.

There are many code paths which only log via logrus but still exit 0, so this
should catch more bugs. Unfortunately runc logs way too much random stuff,
so we ignore this check for runc right now.

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 marked this pull request as ready for review March 4, 2024 10:28
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 4, 2024
@edsantiago (Member)

/lgtm

thank you!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 4, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 5a4864c into containers:main Mar 4, 2024
94 checks passed
@Luap99 Luap99 deleted the e2e-noerrout branch March 4, 2024 11:48
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jun 3, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jun 3, 2024