Failed to add pause process to systemd sandbox cgroup #12065

Closed
cmurf opened this issue Oct 22, 2021 · 15 comments · Fixed by #12111, #12126 or #12323
Labels: kind/bug, locked - please file new issue/PR

Comments


cmurf commented Oct 22, 2021

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Steps to reproduce the issue:

  1. podman system reset

Describe the results you received:
[chris@fovo ~]$ podman --log-level debug system reset
INFO[0000] podman filtering at log level debug
DEBU[0000] Called reset.PersistentPreRunE(podman --log-level debug system reset)
DEBU[0000] Merged system config "/usr/share/containers/containers.conf"
DEBU[0000] Using conmon: "/usr/bin/conmon"
DEBU[0000] Initializing boltdb state at /home/chris/.local/share/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver btrfs
DEBU[0000] Using graph root /home/chris/.local/share/containers/storage
DEBU[0000] Using run root /run/user/1000/containers
DEBU[0000] Using static dir /home/chris/.local/share/containers/storage/libpod
DEBU[0000] Using tmp dir /run/user/1000/libpod/tmp
DEBU[0000] Using volume path /home/chris/.local/share/containers/storage/volumes
DEBU[0000] Set libpod namespace to ""
DEBU[0000] Not configuring container store
DEBU[0000] Initializing event backend journald
DEBU[0000] configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] Using OCI runtime "/usr/bin/crun"
INFO[0000] Found CNI network podman (type=bridge) at /home/chris/.config/cni/net.d/87-podman.conflist
DEBU[0000] Default CNI network name podman is unchangeable
INFO[0000] podman filtering at log level debug
DEBU[0000] Called reset.PersistentPreRunE(podman --log-level debug system reset)
DEBU[0000] cached value indicated that overlay is supported
DEBU[0000] Merged system config "/usr/share/containers/containers.conf"
DEBU[0000] cached value indicated that overlay is supported
DEBU[0000] Using conmon: "/usr/bin/conmon"
DEBU[0000] Initializing boltdb state at /home/chris/.local/share/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver btrfs
DEBU[0000] Using graph root /home/chris/.local/share/containers/storage
DEBU[0000] Using run root /run/user/1000/containers
DEBU[0000] Using static dir /home/chris/.local/share/containers/storage/libpod
DEBU[0000] Using tmp dir /run/user/1000/libpod/tmp
DEBU[0000] Using volume path /home/chris/.local/share/containers/storage/volumes
DEBU[0000] cached value indicated that overlay is supported
DEBU[0000] Set libpod namespace to ""
DEBU[0000] [graphdriver] trying provided driver "btrfs"
DEBU[0000] Initializing event backend journald
DEBU[0000] configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] Using OCI runtime "/usr/bin/crun"
INFO[0000] Found CNI network podman (type=bridge) at /home/chris/.config/cni/net.d/87-podman.conflist
DEBU[0000] Default CNI network name podman is unchangeable
DEBU[0000] Podman detected system restart - performing state refresh
INFO[0000] Setting parallel job count to 25

WARNING! This will remove:
- all containers
- all pods
- all images
- all build cache
Are you sure you want to continue? [y/N] y
DEBU[0001] cached value indicated that overlay is supported
DEBU[0001] Merged system config "/usr/share/containers/containers.conf"
DEBU[0001] cached value indicated that overlay is supported
DEBU[0001] Using conmon: "/usr/bin/conmon"
DEBU[0001] Initializing boltdb state at /home/chris/.local/share/containers/storage/libpod/bolt_state.db
DEBU[0001] Using graph driver btrfs
DEBU[0001] Using graph root /home/chris/.local/share/containers/storage
DEBU[0001] Using run root /run/user/1000/containers
DEBU[0001] Using static dir /home/chris/.local/share/containers/storage/libpod
DEBU[0001] Using tmp dir /run/user/1000/libpod/tmp
DEBU[0001] Using volume path /home/chris/.local/share/containers/storage/volumes
DEBU[0001] cached value indicated that overlay is supported
DEBU[0001] Set libpod namespace to ""
DEBU[0001] Initializing event backend journald
DEBU[0001] configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0001] configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0001] configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0001] Using OCI runtime "/usr/bin/crun"
INFO[0001] Found CNI network podman (type=bridge) at /home/chris/.config/cni/net.d/87-podman.conflist
DEBU[0001] Default CNI network name podman is unchangeable
DEBU[0001] [graphdriver] trying provided driver "btrfs"
A storage.conf file exists at /home/chris/.config/containers/storage.conf
You should remove this file if you did not modified the configuration.
WARN[0001] Failed to add pause process to systemd sandbox cgroup:
[chris@fovo ~]$

Describe the results you expected:

Should reset without error.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      3.4.1
API Version:  3.4.1
Go Version:   go1.16.8
Built:        Wed Oct 20 10:31:56 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

[chris@fovo ~]$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.30-2.fc35.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.30, commit: '
  cpus: 8
  distribution:
    distribution: fedora
    variant: workstation
    version: "35"
  eventLogger: journald
  hostname: fovo.local
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.15.0-0.rc6.47.fc36.x86_64+debug
  linkmode: dynamic
  logDriver: journald
  memFree: 11412983808
  memTotal: 16362147840
  ociRuntime:
    name: crun
    package: crun-1.2-1.fc35.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.2
      commit: 4f6c8e0583c679bfee6a899c05ac6b916022561b
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc35.x86_64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 6m 18.95s
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/chris/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /home/chris/.local/share/containers/storage
  graphStatus:
    Build Version: 'Btrfs v5.14.1 '
    Library Version: "102"
  imageStore:
    number: 0
  runRoot: /run/user/1000/containers
  volumePath: /home/chris/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.1
  Built: 1634740316
  BuiltTime: Wed Oct 20 10:31:56 2021
  GitCommit: ""
  GoVersion: go1.16.8
  OsArch: linux/amd64
  Version: 3.4.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.4.1-1.fc35.x86_64
systemd-249.4-2.fc35.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

Fedora 35 Workstation, installed on a Lenovo Thinkpad

$ rpm -qa | grep dbus | sort
abrt-dbus-2.14.6-9.fc35.x86_64
abrt-dbus-debuginfo-2.14.6-7.fc35.x86_64
dbus-1.12.20-5.fc35.x86_64
dbus-broker-29-4.fc35.x86_64
dbus-common-1.12.20-5.fc35.noarch
dbus-daemon-1.12.20-5.fc35.x86_64
dbus-debuginfo-1.12.20-4.fc35.x86_64
dbus-debugsource-1.12.20-4.fc35.x86_64
dbus-glib-0.112-2.fc35.x86_64
dbus-libs-1.12.20-5.fc35.x86_64
dbus-libs-debuginfo-1.12.20-4.fc35.x86_64
dbus-tools-1.12.20-5.fc35.x86_64
dleyna-connector-dbus-0.3.0-11.fc35.x86_64
python3-dbus-1.2.18-2.fc35.x86_64
xdg-dbus-proxy-0.1.2-5.fc35.x86_64
openshift-ci bot added the kind/bug label Oct 22, 2021

cmurf commented Oct 22, 2021

Interestingly, this doesn't always happen. I just tried it again and did not receive this error message. But the next two times I ran it, I did get the error.


cmurf commented Oct 22, 2021

I still get the error after rm -f /home/chris/.config/containers/storage.conf, which contained only:

[storage]

driver = "btrfs"


cmurf commented Oct 22, 2021

OK, so far it looks like I do not get this error if I do podman run hello-world in between each system reset. It only happens with back-to-back resets.


mheon commented Oct 25, 2021

Why does podman system reset even make a pause process? This seems unnecessary. @giuseppe WDYT?

giuseppe commented

> Why does podman system reset even make a pause process? This seems unnecessary. @giuseppe WDYT?

I've opened a PR to check that the pause pid file exists before attempting to move it to its own scope: #12111
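
For illustration, here is a minimal Go sketch of that approach, i.e. skipping the scope migration entirely when the rootless pause pid file is absent. The function name and path are stand-ins for podman's internals, not the actual PR code:

```go
package main

import (
	"fmt"
	"os"
)

// maybeMovePauseProcess skips the systemd scope migration when the
// rootless pause pid file does not exist (e.g. right after a reset).
func maybeMovePauseProcess(pausePidPath string) error {
	if _, err := os.Stat(pausePidPath); err != nil {
		if os.IsNotExist(err) {
			return nil // no pause process yet: nothing to migrate
		}
		return err
	}
	// Placeholder for the real work: read the pid and ask systemd to
	// move it into its own transient scope.
	fmt.Println("moving pause process recorded in", pausePidPath)
	return nil
}

func main() {
	if err := maybeMovePauseProcess("/run/user/1000/libpod/tmp/pause.pid"); err != nil {
		fmt.Fprintln(os.Stderr, "warning:", err)
	}
}
```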

giuseppe added a commit to giuseppe/libpod that referenced this issue Oct 27, 2021
check that the pause pid exists before trying to move it to a separate
scope.

Closes: containers#12065

Signed-off-by: Giuseppe Scrivano <[email protected]>
edsantiago commented

Reopening. Test just failed in PR #12121, which (I checked) is based on main @ a05a445, which includes #12111.

edsantiago reopened this Oct 27, 2021
giuseppe added a commit to giuseppe/libpod that referenced this issue Oct 28, 2021
commit 6b3b0a1 introduced a check for
the PID file before attempting to move the PID to a new scope.

This is still vulnerable to a TOCTOU race condition though, since the
PID file or the PID can be removed/killed after the check was
successful but before it was used.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a CI flake

Signed-off-by: Giuseppe Scrivano <[email protected]>
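
For illustration, a minimal Go sketch of one TOCTOU-tolerant pattern (the helper name, path, and error handling are this sketch's assumptions, not necessarily the #12126 change): rather than trusting an earlier existence check, treat a pid file or process that disappears at the point of use as a benign no-op. Like podman itself, this is Linux-specific:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// movePauseProcess tolerates the TOCTOU window: the pid file or the
// process itself may disappear between an earlier check and this call,
// and both cases are treated as "nothing to do" rather than an error.
func movePauseProcess(pausePidPath string) error {
	data, err := os.ReadFile(pausePidPath)
	if errors.Is(err, fs.ErrNotExist) {
		return nil // file removed after the check: benign
	}
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return err
	}
	// Signal 0 checks that the process exists without signaling it.
	if err := syscall.Kill(pid, 0); errors.Is(err, syscall.ESRCH) {
		return nil // process already gone: also benign
	}
	fmt.Println("would move pid", pid, "into its own transient scope")
	return nil
}

func main() {
	_ = movePauseProcess("/run/user/1000/libpod/tmp/pause.pid")
}
```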
giuseppe commented

> Reopening. Test just failed in PR #12121, which (I checked) is based on main @ a05a445, which includes #12111.

ouch, another attempt at fixing it: #12126

mheon pushed a commit to mheon/libpod that referenced this issue Nov 12, 2021
check that the pause pid exists before trying to move it to a separate
scope.

Closes: containers#12065

Signed-off-by: Giuseppe Scrivano <[email protected]>
mheon pushed a commit to mheon/libpod that referenced this issue Nov 12, 2021
commit 6b3b0a1 introduced a check for
the PID file before attempting to move the PID to a new scope.

This is still vulnerable to a TOCTOU race condition though, since the
PID file or the PID can be removed/killed after the check was
successful but before it was used.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a CI flake

Signed-off-by: Giuseppe Scrivano <[email protected]>
edsantiago commented

Reopening due to failure in bodhi gating tests (rootless):

not ok 221 podman corrupt images - system reset
# (from function `is' in file ./helpers.bash, line 491,
#  from function `_corrupt_image_test' in file ./330-corrupt-images.bats, line 85,
#  in test file ./330-corrupt-images.bats, line 121)
#   `_corrupt_image_test "system reset -f"' failed
# umount: /tmp/podman_corrupt_test.pVOFEY/root/overlay: No such file or directory
# $ podman load -i /tmp/podman_corrupt_test.pVOFEY/img.tar
# Getting image source signatures
# Copying blob sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0
# Copying config sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
# Writing manifest to image destination
# Storing signatures
# Loaded image(s): sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
# $ podman tag 961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4 quay.io/libpod/alpine:test
# removed '/tmp/podman_corrupt_test.pVOFEY/root/vfs-images/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4/manifest'
# $ podman images
# Error: error retrieving label for image "961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4": you may need to remove the image to resolve the error: fallback error checking whether image is a manifest list: error locating item named "manifest" for image with ID "961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4": file does not exist: error locating item named "manifest" for image with ID "961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4": file does not exist
# [ rc=125 (expected) ]
# $ podman system reset -f
# time="2021-11-12T21:42:17Z" level=warning msg="error determining if an image is a parent: error locating item named \"manifest\" for image with ID \"961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4\": file does not exist, ignoring the error"
# time="2021-11-12T21:42:17Z" level=warning msg="error determining parent of image: error locating item named \"manifest\" for image with ID \"961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4\": file does not exist, ignoring the error"
# $ podman images -a --noheading
# time="2021-11-12T21:42:17Z" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: podman images -a, after system reset -f, is empty
# #| expected: '[no output]'
# #|   actual: 'time="2021-11-12T21:42:17Z" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."'

Packages:

  Kernel: 5.16.0-0.rc0.20211111gitdebe436e77c7.11.fc36.x86_64
  conmon-2.0.30-2.fc36.x86_64
  container-selinux-2.171.0-1.fc36.noarch
  crun-1.3-1.fc36.x86_64
  iptables-legacy-1.8.7-13.fc35.x86_64
  iptables-legacy-libs-1.8.7-13.fc35.x86_64
  iptables-libs-1.8.7-13.fc35.x86_64
  libslirp-4.6.1-2.fc35.x86_64
  podman-3.4.2-1.fc36.x86_64
  podman-debuginfo-3.4.2-1.fc36.x86_64
  podman-debugsource-3.4.2-1.fc36.x86_64
  podman-docker-3.4.2-1.fc36.noarch
  podman-gvproxy-3.4.2-1.fc36.x86_64
  podman-gvproxy-debuginfo-3.4.2-1.fc36.x86_64
  podman-plugins-3.4.2-1.fc36.x86_64
  podman-plugins-debuginfo-3.4.2-1.fc36.x86_64
  podman-remote-3.4.2-1.fc36.x86_64
  podman-remote-debuginfo-3.4.2-1.fc36.x86_64
  podman-tests-3.4.2-1.fc36.x86_64
  rpm-plugin-systemd-inhibit-4.17.0-1.fc36.1.x86_64
  slirp4netns-1.1.12-2.fc36.x86_64
  systemd-249.6-2.fc36.x86_64
  systemd-libs-249.6-2.fc36.x86_64
  systemd-networkd-249.6-2.fc36.x86_64
  systemd-oomd-defaults-249.6-2.fc36.noarch
  systemd-pam-249.6-2.fc36.x86_64
  systemd-resolved-249.6-2.fc36.x86_64
  systemd-udev-249.6-2.fc36.x86_64

edsantiago reopened this Nov 15, 2021
edsantiago commented

Closing again. I think this is another one of those "fixed a long long time ago" bugs that hasn't actually propagated into whatever is being built in f36.

edsantiago commented

...and reopening again. This has just triggered in PR #11795 (the monster f35 Cirrus one), in f35 rootless:

# $ podman images -a --noheading
# time="2021-11-16T12:22:20-06:00" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: podman images -a, after system reset -f, is empty
# #| expected: '[no output]'
# #|   actual: 'time="2021-11-16T12:22:20-06:00" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."'
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(The PR is rebased on main @ f031bd2, which includes the purported fix, #12126.)

edsantiago reopened this Nov 16, 2021

cevich commented Nov 16, 2021

Oh, I guessed it was because of the 0.1.7 update PR. I rebased; let's see if it happens again.

edsantiago commented

Seen again in #11795.

edsantiago commented

...and again. At this point it is no longer a flake; it is a showstopper.

edsantiago commented

@giuseppe PTAL ASAP.

giuseppe commented

@edsantiago opened a PR: #12323

giuseppe added a commit to giuseppe/libpod that referenced this issue Nov 17, 2021
we try hard to re-use the existing podman-pause.scope name when it
already exists, causing all sorts of race errors when the already
existing scope is terminating.

There is no such requirement though, so just try with a random
name.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a race in the CI

Signed-off-by: Giuseppe Scrivano <[email protected]>
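
A minimal sketch of that idea (the name format and use of crypto/rand are this sketch's choices, not necessarily what the PR does): generate a fresh scope name per invocation, so a lingering, half-terminated podman-pause.scope can never collide with the new one.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomPauseScopeName returns a unique transient scope name such as
// "podman-pause-1a2b3c4d.scope", avoiding reuse of a fixed name that a
// previous, still-terminating scope might hold.
func randomPauseScopeName() (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf("podman-pause-%s.scope", hex.EncodeToString(buf)), nil
}

func main() {
	name, err := randomPauseScopeName()
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```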
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Nov 22, 2021
[Backport of containers#12323 into v3.4, to fix gating-test flakes]

we try hard to re-use the existing podman-pause.scope name when it
already exists, causing all sorts of race errors when the already
existing scope is terminating.

There is no such requirement though, so just try with a random
name.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a race in the CI

Signed-off-by: Giuseppe Scrivano <[email protected]>
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Nov 23, 2021
we try hard to re-use the existing podman-pause.scope name when it
already exists, causing all sorts of race errors when the already
existing scope is terminating.

There is no such requirement though, so just try with a random
name.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a race in the CI

Signed-off-by: Giuseppe Scrivano <[email protected]>
github-actions bot added the locked - please file new issue/PR label Sep 21, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023