podman run or restore: requested cgroup controller pids is not available #9752

Status: Closed
edsantiago opened this issue Mar 18, 2021 · 22 comments
Labels: flakes (Flakes from Continuous Integration), kind/bug (Categorizes issue or PR as related to a bug), locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)

@edsantiago (Member) commented:

Another one of those hard-to-track-down flakes that appear in different tests:

# podman-remote run [...]
Error: error preparing container <sha> for attach: the requested cgroup controller `pids` is not available: OCI runtime error

Only two instances so far, both in the last two days, both on Ubuntu 20.10:

sys: podman cp - will not recognize symlink pointing into host space

sys: Verify /run/.containerenv exist

edsantiago added the flakes, kind/bug, and remote (Problem is in podman-remote) labels on Mar 18, 2021
@edsantiago (Member Author) commented:

Two more instances, both on the same test run:

sys: podman run : --userns=keep-id: passwd file is modifiable

sys: podman run : add username to /etc/passwd if --userns=keep-id

@edsantiago (Member Author) commented:

And another: again in remote ubuntu-2010 root

@edsantiago (Member Author) commented:

And yet another:

@zhangguanzhang (Collaborator) commented:

I think some of the runner machines do not have the pids cgroup controller enabled.
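
(As a quick sanity check, one way to see which cgroup v2 controllers a machine actually makes available; paths assume the standard unified-hierarchy mount at /sys/fs/cgroup:)

$ cat /sys/fs/cgroup/cgroup.controllers        # controllers available at the root
$ cat /sys/fs/cgroup/cgroup.subtree_control    # controllers delegated to child cgroups; "pids" must appear here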

@edsantiago (Member Author) commented:

Another:

@edsantiago (Member Author) commented:

Two more:

sys: podman run : user namespace preserved root ownership

sys: podman run docker-archive

@edsantiago (Member Author) commented:

Now seeing it in buildah CI, in setup (the registry thing), in https://github.com/containers/buildah/pull/3186/checks?check_run_id=2450159721

@github-actions bot commented:

A friendly reminder that this issue had no activity for 30 days.

edsantiago removed the remote (Problem is in podman-remote) and stale-issue labels on Oct 13, 2021
@edsantiago (Member Author) commented:

I just saw this on my own laptop, testing main @ 192d16e6a3c4801dee468b6b7f4de52952a80b09, on a podman container restore.

# /home/esm/src/atomic/2018-02.podman/libpod/bin/podman container restore 342c2357fdd47755f5f6231b361968485bc343a05953fe2ccea6dcab1d9dcb6e
Error: OCI runtime error: the requested cgroup controller `pids` is not available

It had worked all day, and ran fine on !! (rerunning the same command), so it's a flake.

Local, not remote, so I've removed the remote tag.

Podman run [It] podman run with cgroups=split

@edsantiago (Member Author) commented:

Okay... so I'm working on #11957, and in my local (laptop) testing, I'm seeing this flake about once in every 5-10 runs. This means that, if my PR gets merged, it will flake in half of CI runs. This flake needs to be fixed. Pretty please? Here's the best reproducer I can offer:

$ while :; do
      sudo bin/podman run -d --name foo quay.io/libpod/testimage:20210610 \
          sh -c 'while :;do cat /proc/uptime;done'
      sudo bin/podman container checkpoint foo
      sudo bin/podman container logs foo >/dev/null
      sudo bin/podman container inspect foo >/dev/null
      sleep 0.5
      sudo bin/podman container restore foo || break
      sudo bin/podman container rm -f -t 0 foo
  done
...
4476405605bf413c7f2305ec9e19abba6044175514dcb4947b024daf3c97cfa3
Error: OCI runtime error: the requested cgroup controller `pids` is not available

It's a poor reproducer: in one attempt, it failed within seconds. On another, it ran fine for 15 minutes. I will try to work on a better one, but right now I need to move on for the day.

edsantiago changed the title from "remote: podman run: requested cgroup controller pids is not available" to "podman run or restore: requested cgroup controller pids is not available" on Oct 17, 2021
@edsantiago (Member Author) commented:

Here's a slightly different reproducer; this one has failed at 2s, 54s, 283s, 294s.

t0=$SECONDS
while :; do
    sudo bin/podman run -d --name foo quay.io/libpod/testimage:20210610 \
        sh -c "while :;do awk '{print $1}' </proc/uptime | tr -d .;sleep 0.1;done"
    sleep 0.1
    sudo bin/podman container logs foo >/dev/null
    sudo bin/podman container checkpoint foo
    sleep 0.4
    sudo bin/podman container restore foo || break
    sudo bin/podman container rm -f -t 0 foo
done
t1=$SECONDS; echo $((t1 - t0)) seconds

@edsantiago (Member Author) commented:

Yep, that works well enough. 2s, 20s, 177s, always less than 5 minutes.

One more data point: after this crash, retrying still fails, but in a different way:

$ sudo bin/podman ps -a
CONTAINER ID  IMAGE                              COMMAND               CREATED        STATUS                    PORTS       NAMES
373c121f8693  quay.io/libpod/testimage:20210610  sh -c while :;do ...  4 minutes ago  Exited (0) 4 minutes ago              foo
$ sudo bin/podman container restore foo
Error: OCI runtime error: sd-bus call: File exists

@cevich (Member) commented Oct 18, 2021:

> Yep, that works well enough. 2s, 20s, 177s, always less than 5 minutes.

In my experience, a major milestone in fixing races is getting a fast reproducer, so this is excellent. What sort of environment is that being done under? Always crun and never runc?

Eyeballing the environments above, it looks like a lot of Ubuntu 21.10 + crun. A few mentions of F34, which I assume are also crun.

Has @giuseppe taken a look at this?

@edsantiago (Member Author) commented:

Oops! I forgot to mention: that's on my laptop (f34) using main as of yesterday. And yes, crun.

@cevich (Member) commented Oct 18, 2021:

Oh, that's interesting: plenty of CPU and memory available, then. Yeah, I think this is most definitely @giuseppe territory.

@giuseppe (Member) commented:

It's an issue in crun (or rather, I think, in the kernel), but in any case we need to account for it in crun. I am still validating my patch; I'll open a PR as soon as I've finished testing it.

giuseppe added a commit to giuseppe/crun that referenced this issue Oct 19, 2021
It seems the kernel can return EBUSY when a process was moved to a
sub-cgroup and the controllers are enabled in its parent cgroup.

On EBUSY, retry a few times until the controller can be enabled.

Reported: containers/podman#9752

Signed-off-by: Giuseppe Scrivano <[email protected]>
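
(For illustration, a minimal shell sketch of the retry idea described in that commit message, against a hypothetical parent cgroup "parent.slice"; the actual fix lives in crun's C code:)

# try to enable the pids controller for children of a parent cgroup;
# the kernel may transiently return EBUSY, so retry a few times
for i in 1 2 3 4 5; do
    echo "+pids" > /sys/fs/cgroup/parent.slice/cgroup.subtree_control && break
    sleep 0.1
done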
@giuseppe (Member) commented:

PR here: containers/crun#758

@edsantiago (Member Author) commented:

The "requested cgroup controller 'pids' is not available" message also appeared (not recently) in the cgroups=split test:

Podman run [It] podman run with cgroups=split

@vrothberg (Member) commented:

containers/crun#758 merged on Oct 19, and since CI is using a newer crun, I think we're good to close. Please reopen if I am mistaken.
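
(One way to confirm which OCI runtime and version a host is actually using, assuming a reasonably recent podman:)

$ podman info --format '{{.Host.OCIRuntime.Name}} {{.Host.OCIRuntime.Version}}'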

@emansom (Contributor) commented Jun 5, 2022:

People are still running into this on RHEL 8 boxes; see #1897579 on the Red Hat Bugzilla.

Running the containers rootless under a linger-enabled user, with the podman and podman-restart user units enabled. The latter always fails on reboot, since no cgroup controllers have been set up for the user yet.
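
(For reference, roughly how that setup is created, assuming the stock user units shipped with podman and a hypothetical user "someuser":)

# loginctl enable-linger someuser
$ systemctl --user enable podman.service podman-restart.service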

cgroups v2 is enabled on the host via systemd.unified_cgroup_hierarchy=1 on the kernel cmdline.
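
(One way to verify the unified hierarchy is active, and to see which controllers the user's slice actually received:)

$ stat -fc %T /sys/fs/cgroup    # should print "cgroup2fs"
$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.controllers    # "pids" should be listed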

systemd is running with modified defaults in /etc/systemd/system.conf:

DefaultCPUAccounting=yes
DefaultIOAccounting=yes
DefaultMemoryAccounting=yes
DefaultTasksAccounting=yes

These are passed through to the user slices via /etc/systemd/system/user-.slice.d/override.conf:

[Slice]
CPUAccounting=yes
MemoryAccounting=yes
IOAccounting=yes
TasksAccounting=yes

Podman is configured to use crun by default in /etc/containers/containers.conf:

[containers]
runtime = "crun"

Running the default kernel (4.18.0-372.9.1.el8.x86_64) and the default systemd (systemd-239-58.el8.x86_64).

Doing a system-wide systemctl daemon-reload works as a workaround.

However, rootless containers are still broken after a host restart. Would it be possible for the podman user units to declare a systemd dependency on cgroup setup, to ensure the controllers exist before the containers start?
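
(If so, perhaps something like this purely hypothetical drop-in for the user unit, e.g. ~/.config/systemd/user/podman-restart.service.d/override.conf; the unit that actually guarantees controller delegation would still need to be identified:)

[Unit]
# order after the user session's basic setup
After=basic.target

[Service]
# retry instead of failing permanently when controllers are not yet delegated
Restart=on-failure
RestartSec=5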

@emansom (Contributor) commented Jun 6, 2022:

This seems related to systemd issues #9578 / #9512. Can that fix be backported to RHEL 8?

@mheon (Member) commented Jun 6, 2022:

Can you comment to that effect in the Bugzilla? We can swap it over to point at systemd, but having more context on what fix is necessary would be good.

github-actions bot added the locked - please file new issue/PR label on Sep 20, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Sep 20, 2023