pod resource limits: error creating cgroup path: subtree_control: ENOENT #15074

Open · lsm5 opened this issue Jul 26, 2022 · 19 comments
Labels
flakes (Flakes from Continuous Integration) · kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@lsm5 (Member) commented Jul 26, 2022

Is this a BUG REPORT or FEATURE REQUEST?

/kind bug

Description

aarch64 CI enablement in #14801 is hitting failures in the system tests. This issue is a placeholder for tracking those failures and for referencing in FIXME comments next to skip_if_aarch64.
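A hypothetical sketch of what such a skip might look like in a bats system test (skip_if_aarch64 is the helper name from the paragraph above; its exact signature in the test helpers may differ):

    # Hypothetical sketch only; not copied from the repository.
    @test "pod resource limits" {
        skip_if_aarch64 "FIXME: #15074 - subtree_control ENOENT on aarch64"
        # ...original test body (pod create with resource limits)...
    }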

The openshift-ci bot added the kind/bug label on Jul 26, 2022.
@edsantiago (Member):

The pod resource limits test is in code that @cdoern just merged last week:

# # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
# Error: error creating cgroup path /libpod_parent/e0024c8b8ccc24c247b62a422433c0b69d7c3f930bad3863563fcec0d0db43f1: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
# [ rc=125 (** EXPECTED 0 **) ]

@edsantiago (Member):

The sdnotify test involves systemd, so @vrothberg might be the best person to look at it; but it could also be crun, so pinging @giuseppe as well:

# # podman run -d --sdnotify=container quay.io/libpod/fedora:31 sh -c printenv NOTIFY_SOCKET;echo READY;systemd-notify --ready;while ! test -f /stop;do sleep 0.1;done
# 2ff76f9670f13c479196440ac93babe9fc4afa8cbb0e0b6799b73a3b59969292
# # podman logs 2ff76f9670f13c479196440ac93babe9fc4afa8cbb0e0b6799b73a3b59969292
# /run/notify/notify.sock
# READY
# Failed to notify init system: Permission denied

Lots more permission and SELinux errors make me strongly suspect that SELinux is broken on these systems. It might be that the only way to debug this is to ssh into one of them.
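For whoever ends up ssh'ing in, a minimal SELinux triage sketch (generic commands, not taken from the CI scripts):

    getenforce        # Fedora CI images are normally Enforcing
    sestatus          # policy name and current mode
    # recent AVC denials (needs the audit tools installed)
    sudo ausearch -m AVC -ts recent | tail -n 20
    # fallback if auditd is not running
    sudo dmesg | grep -i 'avc.*denied' | tail -n 20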

@edsantiago (Member):

@lsm5, hint for next time: file the issue first, then go to the broken PR, find links to all the failing logs, paste them into the issue, and then resubmit the PR with skips. It's almost impossible to find old Cirrus logs for a PR. (I scraped the above from comments I made in your PR, so no problem; just something to keep in mind.)

@cdoern (Contributor) commented Jul 26, 2022

> The pod resource limits test is in code that @cdoern just merged last week:
>
> # # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
> # Error: error creating cgroup path /libpod_parent/e0024c8b8ccc24c247b62a422433c0b69d7c3f930bad3863563fcec0d0db43f1: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
> # [ rc=125 (** EXPECTED 0 **) ]

The only reason this should fail is if arm does not have subtree control, which I find highly unlikely. The subtree_control file is less related to my resource-limits work and more to cgroup creation in general. I know where this is done in containers/common, but still, an issue like this makes me think the kernel is missing some things when compiled.

@giuseppe (Member):

> The only reason this should fail is if arm does not have subtree control, which I find highly unlikely. The subtree_control file is less related to my resource-limits work and more to cgroup creation in general. I know where this is done in containers/common, but still, an issue like this makes me think the kernel is missing some things when compiled.

It could also be that libpod_parent/ is missing.

@cdoern (Contributor) commented Jul 26, 2022

True, @giuseppe, but libpod_parent is created (if it does not exist) before subtree_control is written, I believe?
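For illustration, roughly the sequence of filesystem operations the cgroupfs path boils down to on cgroup v2 (the real logic is Go code in containers/common; this is only a sketch of the ordering under discussion, with placeholder names):

    parent=/sys/fs/cgroup/libpod_parent
    pod_id=example_pod   # placeholder, not the real naming scheme

    # 1. The parent cgroup directory has to exist first.
    mkdir -p "$parent"

    # 2. Controllers are then delegated to children by writing the parent's
    #    cgroup.subtree_control (they must already be enabled one level up,
    #    in /sys/fs/cgroup/cgroup.subtree_control). If step 1 was skipped,
    #    raced, or the directory was removed in between, this write fails
    #    with ENOENT, which is exactly the error in the logs above.
    echo "+cpu +memory +io +cpuset" > "$parent/cgroup.subtree_control"

    # 3. Only then is the per-pod child cgroup created underneath.
    mkdir -p "$parent/$pod_id"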

@giuseppe (Member):

then /sys/fs/cgroup might not be a cgroup v2 mount
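A couple of standard checks for that, for anyone ssh'd into one of the VMs:

    stat -fc %T /sys/fs/cgroup              # prints "cgroup2fs" on a pure v2 host
    findmnt -t cgroup2 /sys/fs/cgroup       # should list the unified mount
    cat /sys/fs/cgroup/cgroup.controllers   # controllers available at the root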

@edsantiago (Member):

It's v2. I'm doing the Cirrus rerun-with-terminal thing and trying to reproduce it, and can't: hack/bats 200:resource passes, as does manually recreating the fallocate, losetup, echo bfq, and podman pod create commands. This could be something context-sensitive, where a prior test sets the system up in a way that causes this test to fail.
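For reference, my rough reconstruction of those manual steps (the image path and size are guesses; the pod-create flags are copied from the failing log above):

    fallocate -l 10M /tmp/blkio.img
    dev=$(sudo losetup -f --show /tmp/blkio.img)     # e.g. /dev/loop0
    echo bfq | sudo tee /sys/block/$(basename "$dev")/queue/scheduler

    sudo podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs \
        --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 \
        --cpuset-cpus=0 --cpuset-mems=0 \
        --device-read-bps="$dev":1mb --device-write-bps="$dev":1mb \
        --blkio-weight-device="$dev":123 --blkio-weight=50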

@edsantiago (Member):

Still failing, but @lsm5 believes it might be a flake (which is consistent with my findings in the rerun terminal). I don't know if that's better or worse.

@edsantiago (Member):

I'll be darned. It is a flake.

@edsantiago (Member):

@cdoern @giuseppe please use @cevich's #15145 to spin up VMs and debug this.

@github-actions bot commented Sep 3, 2022

A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member):

pod resource limits still flaking

edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 21, 2022
Background: in order to add aarch64 tests, we had to add
emergency skips to a lot of failing tests. No attempt was
ever made to understand why they were failing.

Fast forward to today, I filed containers#15888 just to see if tests
are still failing. Looks like a number of them are fixed.
(Yes, magically). Remove those skips.

See: containers#15074, containers#15277

Signed-off-by: Ed Santiago <[email protected]>
edsantiago changed the title from "aarch64 CI - investigate system test failures" to "aarch64 CI - error creating cgroup path: subtree_control: ENOENT" on Sep 22, 2022.
@edsantiago (Member):

Still happening on f38:

[+1177s] not ok 317 pod resource limits
...
<+008ms> # # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight=50
<+209ms> # Error: creating cgroup path /libpod_parent/9f84a4a2767e6495567aaf02a54447213083db7484d539edae31add828221b45: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory

edsantiago added the flakes label on Jun 5, 2023.
@edsantiago (Member):

Seen just now on my RH laptop:

✗ pod resource limits
...
   [05:05:24.431787056] # .../bin/podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight=50
   [05:05:24.528324789] Error: creating cgroup path /libpod_parent/09404b9d6c87cce725635b445cfc3b5bf0f5fb654dfece8a15296915e6d71871: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
   [05:05:24.541146057] [ rc=125 (** EXPECTED 0 **) ]

Passed on rerun. Again, this is my RH laptop, not aarch64.

edsantiago changed the title from "aarch64 CI - error creating cgroup path: subtree_control: ENOENT" to "pod resource limits: error creating cgroup path: subtree_control: ENOENT" on Jul 11, 2023.
edsantiago added a commit to edsantiago/libpod that referenced this issue Jan 16, 2024
- containers#15074 ("subtree_control" flake). The flake is NOT FIXED, I
  saw it six months ago on my (non-aarch64) laptop. However,
  it looks like the frequent-flake-on-aarch64 bug is resolved.
  I've been testing in containers#17831 and have not seen it. So,
  tentatively remove the skip and see what happens.

- Closes: containers#19407 (broken tar, "duplicates of file paths")
  All Fedoras now have a fixed tar. Debian DOES NOT, but
  we're handling that in our build-ci-vm code. I.e., the
  Debian VM we're using has a working tar even though there's
  currently a broken tar out in the wild.

  Added distro-integration tag so we can catch future problems
  like this in OpenQA.

- Closes: containers#19471 (brq / blkio / loopbackfs in rawhide)
  Bug appears to be fixed in rawhide, at least in the VMs we're
  using now.

  Added distro-integration tag because this test obviously
  relies on other system stuff.

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member):

Seen after a long absence: f40 root, in parallel system tests, though I doubt the parallelism has anything to do with it.

@edsantiago (Member):

Ping, seeing this one often in parallel system tests.

sys(7)   podman(7)   fedora-40-aarch64(2)   root(7)   host(7)   sqlite(6)
                     rawhide(2)                                 boltdb(1)
                     fedora-40(2)
                     fedora-39(1)

@edsantiago (Member):

Continuing to see this often in parallel system tests

sys(12)   podman(12)   fedora-40(5)           root(12)   host(12)   sqlite(8)
                       fedora-40-aarch64(3)                         boltdb(4)
                       rawhide(2)
                       fedora-39(2)

@giuseppe (Member):

Adding some code through containers/common#2158 to help debug this issue.
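Until that lands, a rough idea of the state worth capturing at the moment of failure (my suggestion only, not necessarily what containers/common#2158 collects):

    ls -ld /sys/fs/cgroup/libpod_parent
    cat /sys/fs/cgroup/cgroup.subtree_control
    cat /sys/fs/cgroup/libpod_parent/cgroup.controllers
    cat /sys/fs/cgroup/libpod_parent/cgroup.subtree_control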
