
lstat /sys/fs/cgroup/devices/machine.slice/libpod-SHA.scope: ENOENT #11784

Closed
edsantiago opened this issue Sep 29, 2021 · 17 comments
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago (Member)

New flake in f33-root:

[+0920s] not ok 236 podman selinux: shared context in (some) namespaces
         # (from function `is' in file test/system/helpers.bash, line 508,
         #  in test file test/system/410-selinux.bats, line 126)
         #   `is "$output" "$context_c1" "new container, run with --pid of existing one "' failed
         # # podman rm --all --force
         # # podman ps --all --external --format {{.ID}} {{.Names}}
         # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # quay.io/libpod/testimage:20210610 9f9ec7f2fdef
         # # podman run -d --name myctr quay.io/libpod/testimage:20210610 top
         # 1f043d30e46e9f85a55a13e7bd72f16316cfd56534e42e699d656ffd3d20da09
         # # podman exec myctr cat -v /proc/self/attr/current
         # system_u:system_r:container_t:s0:c364,c713^@
         # # podman run --name myctr2 --ipc container:myctr quay.io/libpod/testimage:20210610 cat -v /proc/self/attr/current
         # system_u:system_r:container_t:s0:c364,c713^@
         # # podman run --rm --pid container:myctr quay.io/libpod/testimage:20210610 cat -v /proc/self/attr/current
         # system_u:system_r:container_t:s0:c364,c713^@time="2021-09-28T17:08:59-05:00" level=warning msg="lstat /sys/fs/cgroup/devices/machine.slice/libpod-11f69a5b1d699bf9ab9e8b5fa8994e43b3ea7c3d0f0d1e1bc0d5bf33d37cccae.scope: no such file or directory"
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: new container, run with --pid of existing one 
         # #| expected: 'system_u:system_r:container_t:s0:c364,c713^@'
         # #|   actual: 'system_u:system_r:container_t:s0:c364,c713^@time="2021-09-28T17:08:59-05:00" level=warning msg="lstat /sys/fs/cgroup/devices/machine.slice/libpod-11f69a5b1d699bf9ab9e8b5fa8994e43b3ea7c3d0f0d1e1bc0d5bf33d37cccae.scope: no such file or directory"'
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As of this writing, this flake does not cause a CI failure, because system tests and integration tests do not check for extra cruft in command output. THIS IS GOING TO CHANGE, at least in system tests.

I cannot reproduce with podman-3.4.0-0.10.rc2.fc33 in 30 minutes of looping, but my cirrus-flake-grep tool shows this happening as far back as June (when I started collecting CI logs). All the instances I see are root; none rootless.

edsantiago added the flakes label on Sep 29, 2021
@edsantiago (Member, Author)

@giuseppe PTAL

@giuseppe (Member)

runc generates that error.

Not sure if it is a regression, but it appeared with opencontainers/runc@cbb0a79.

Simple reproducer (as root):

# podman run -d --name foo alpine top
# podman run --rm --pid container:foo alpine true
WARN[0000] lstat /sys/fs/cgroup/devices/machine.slice/libpod-f2498bb96e51a783698380494e772b9a13cf3d044fc229cc9e4710e4eb10f811.scope: no such file or directory 

@kolyshkin FYI
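
A minimal loop harness for the reproducer above, as a sketch: it assumes root, podman in $PATH, and the alpine image already pulled; the container name foo and the stop-on-first-hit behavior are arbitrary choices.

// flakeloop.go: repeatedly run the two-container reproducer and stop once the
// lstat warning shows up in the combined output.
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// run invokes podman with the given arguments and returns combined stdout+stderr
// (the warning is printed on stderr).
func run(args ...string) (string, error) {
	var out bytes.Buffer
	cmd := exec.Command("podman", args...)
	cmd.Stdout = &out
	cmd.Stderr = &out
	err := cmd.Run()
	return out.String(), err
}

func main() {
	for i := 1; ; i++ {
		run("rm", "-f", "foo") // clean up any leftover container from a previous iteration
		if _, err := run("run", "-d", "--name", "foo", "alpine", "top"); err != nil {
			fmt.Println("failed to start foo:", err)
			return
		}
		out, _ := run("run", "--rm", "--pid", "container:foo", "alpine", "true")
		if strings.Contains(out, "no such file or directory") {
			fmt.Printf("hit the warning on iteration %d:\n%s", i, out)
			return
		}
	}
}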

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Nov 2, 2021

Since this is not a Podman issue, should I close this?

@edsantiago (Member, Author)

cirrus-flake-grep reports that this is still happening. Only f33, which makes sense if it's a runc bug. Here are two recent examples: pr 11956 and pr 12107, both f33 root.

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member, Author)

Still happening.

Please remember that these do not cause actual CI failures, so my flake logger only catches them if they're present in an actual CI-failure-causing flake. The stats above are probably an underrepresentation.

vrothberg changed the title from "cgroupsv1(?): lstat /sys/fs/cgroup/devices/machine.slice/libpod-SHA.scope: ENOENT" to "lstat /sys/fs/cgroup/devices/machine.slice/libpod-SHA.scope: ENOENT" on Dec 14, 2021
@vrothberg (Member) commented Dec 14, 2021

There's a race condition (see below). It seems that the freezer state has already changed by the time runc attempts to freeze (i.e., it shouldn't be attempting to freeze at all, as far as I can see).

dda8f53a07a7a39771e646e66148bb4b4d3952db44439ea563b8396bb868ac7f
22ff177983d71ff968598f880053a937393b7c9ea3fd14a00ae36ab49db4b851
ERRO[0000] STATE: FROZEN
WARN[0000] freezer not supported: openat2 /sys/fs/cgroup/machine.slice/libpod-2f526821ca315a919d32aad899b9817121363405c3829650b03fd512817a3801.scope/cgroup.freeze: no such file or directory
ERRO[0000] STATE: THAWED
WARN[0000] lstat /sys/fs/cgroup/machine.slice/libpod-2f526821ca315a919d32aad899b9817121363405c3829650b03fd512817a3801.scope: no such file or directory

I used the following diff to get the error log:

diff --git a/libcontainer/cgroups/fs2/freezer.go b/libcontainer/cgroups/fs2/freezer.go
index 8917a6411d68..b3ed1626c851 100644
--- a/libcontainer/cgroups/fs2/freezer.go
+++ b/libcontainer/cgroups/fs2/freezer.go
@@ -12,9 +12,11 @@ import (

        "github.com/opencontainers/runc/libcontainer/cgroups"
        "github.com/opencontainers/runc/libcontainer/configs"
+       "github.com/sirupsen/logrus"
 )

 func setFreezer(dirPath string, state configs.FreezerState) error {
+       logrus.Errorf("STATE: %s", state)
        var stateStr string
        switch state {
        case configs.Undefined:

@kolyshkin @cyphar could you have a look? I am not familiar with the runc code and think you know where to poke.
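
For orientation, a standalone sketch of what an fs2-style setFreezer boils down to on cgroup v2 (simplified, not runc's actual code): it writes "1" or "0" into cgroup.freeze under the container's scope directory, so if systemd has already removed the transient libpod-*.scope, both the freeze/thaw writes and a later lstat fail with ENOENT, matching the warnings above. The libpod-example.scope path below is hypothetical.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setFreezer writes the requested freezer state into cgroup.freeze under
// dirPath ("1" = frozen, "0" = thawed), mirroring the cgroup v2 interface.
func setFreezer(dirPath string, frozen bool) error {
	val := "0"
	if frozen {
		val = "1"
	}
	return os.WriteFile(filepath.Join(dirPath, "cgroup.freeze"), []byte(val), 0o644)
}

func main() {
	// Hypothetical scope path, following the libpod-<ID>.scope naming from the logs.
	scope := "/sys/fs/cgroup/machine.slice/libpod-example.scope"

	if err := setFreezer(scope, true); err != nil {
		fmt.Println("freeze:", err) // ENOENT here if systemd already cleaned up the scope
	}
	// ... the cgroup update would happen here, between freeze and thaw ...
	if err := setFreezer(scope, false); err != nil {
		fmt.Println("thaw:", err)
	}
	if _, err := os.Lstat(scope); err != nil {
		fmt.Println("lstat:", err) // the "no such file or directory" warning from the logs
	}
}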

@kolyshkin (Contributor)

There's a race condition (see below). It seems that the freezer state has changed when attempting to freeze (i.e., it shouldn't attempt to freeze afaiks).

From what I see, this is just two calls to setFreezer -- the first one to freeze it, the second one to unfreeze it. No race here.

I will take a closer look later.

@vrothberg (Member)

Thanks! To elaborate on why I think there's a race: each time it fails, state is != configs.Undefined, which made me believe that the specific path isn't always present or that some condition must be waited on. But it's just an uninformed guess; I am not familiar with the runc code base.

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Jan 18, 2022

@kolyshkin Any progress on this?

@kolyshkin (Contributor)

Had no time to look at it, hopefully later this week (I have a separate browser window opened with this as a reminder 😁 )

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

@kolyshkin (Contributor)

Not stale

@giuseppe (Member)

I think the cgroup could have been cleaned up by systemd while runc is trying to use it.

Should we close this issue? I don't think there is anything we can do from the Podman side.
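
To illustrate that race: any path under the transient .scope unit can vanish between a check and its use, so the only robust behavior is to treat ENOENT as "container already gone" rather than as an error. A sketch (illustration only, not a proposed runc patch; the scope path is hypothetical):

package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// cgroupStillExists reports whether the scope directory is still present,
// treating a concurrent systemd cleanup (ENOENT) as a normal outcome rather
// than an error.
func cgroupStillExists(scopePath string) (bool, error) {
	_, err := os.Lstat(scopePath)
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil // systemd already removed the transient scope
	}
	return err == nil, err
}

func main() {
	ok, err := cgroupStillExists("/sys/fs/cgroup/machine.slice/libpod-example.scope")
	fmt.Println(ok, err)
}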

@vrothberg (Member)

I agree.

edsantiago added a commit to edsantiago/libpod that referenced this issue Jan 10, 2023
To silence my find-obsolete-skips script:
 - containers#11784 : issue closed wont-fix
 - containers#15013 : issue closed, we no longer test with runc
 - containers#15014 : bump timeout, see if that fixes things
 - containers#15025 : issue closed, we no longer test with runc

...and one FIXME not associated with an issue, ubuntu-related,
and we no longer test ubuntu.

Signed-off-by: Ed Santiago <[email protected]>
edsantiago added a commit to edsantiago/libpod that referenced this issue Jul 13, 2023
To silence my find-obsolete-skips script, remove the '#'
from the following issues in skip messages:

  containers#11784 containers#15013 containers#15025 containers#17433 containers#17436 containers#17456

Also update the messages to reflect the fact that the issues
will never be fixed.

Also remove ubuntu skips: we no longer test ubuntu.

Also remove one buildah skip that is no longer applicable:

Fixes: containers#17520

Signed-off-by: Ed Santiago <[email protected]>
github-actions bot added the locked - please file new issue/PR label on Sep 20, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Sep 20, 2023