Failed to add pause process to systemd sandbox cgroup #12065

Closed
cmurf opened this issue Oct 22, 2021 · 15 comments · Fixed by #12111, #12126 or #12323
Labels: kind/bug, locked - please file new issue/PR

Comments


cmurf commented Oct 22, 2021

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Steps to reproduce the issue:

  1. podman system reset

Describe the results you received:
[chris@fovo ~]$ podman --log-level debug system reset
INFO[0000] podman filtering at log level debug
DEBU[0000] Called reset.PersistentPreRunE(podman --log-level debug system reset)
DEBU[0000] Merged system config "/usr/share/containers/containers.conf"
DEBU[0000] Using conmon: "/usr/bin/conmon"
DEBU[0000] Initializing boltdb state at /home/chris/.local/share/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver btrfs
DEBU[0000] Using graph root /home/chris/.local/share/containers/storage
DEBU[0000] Using run root /run/user/1000/containers
DEBU[0000] Using static dir /home/chris/.local/share/containers/storage/libpod
DEBU[0000] Using tmp dir /run/user/1000/libpod/tmp
DEBU[0000] Using volume path /home/chris/.local/share/containers/storage/volumes
DEBU[0000] Set libpod namespace to ""
DEBU[0000] Not configuring container store
DEBU[0000] Initializing event backend journald
DEBU[0000] configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] Using OCI runtime "/usr/bin/crun"
INFO[0000] Found CNI network podman (type=bridge) at /home/chris/.config/cni/net.d/87-podman.conflist
DEBU[0000] Default CNI network name podman is unchangeable
INFO[0000] podman filtering at log level debug
DEBU[0000] Called reset.PersistentPreRunE(podman --log-level debug system reset)
DEBU[0000] cached value indicated that overlay is supported
DEBU[0000] Merged system config "/usr/share/containers/containers.conf"
DEBU[0000] cached value indicated that overlay is supported
DEBU[0000] Using conmon: "/usr/bin/conmon"
DEBU[0000] Initializing boltdb state at /home/chris/.local/share/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver btrfs
DEBU[0000] Using graph root /home/chris/.local/share/containers/storage
DEBU[0000] Using run root /run/user/1000/containers
DEBU[0000] Using static dir /home/chris/.local/share/containers/storage/libpod
DEBU[0000] Using tmp dir /run/user/1000/libpod/tmp
DEBU[0000] Using volume path /home/chris/.local/share/containers/storage/volumes
DEBU[0000] cached value indicated that overlay is supported
DEBU[0000] Set libpod namespace to ""
DEBU[0000] [graphdriver] trying provided driver "btrfs"
DEBU[0000] Initializing event backend journald
DEBU[0000] configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] Using OCI runtime "/usr/bin/crun"
INFO[0000] Found CNI network podman (type=bridge) at /home/chris/.config/cni/net.d/87-podman.conflist
DEBU[0000] Default CNI network name podman is unchangeable
DEBU[0000] Podman detected system restart - performing state refresh
INFO[0000] Setting parallel job count to 25

WARNING! This will remove:
- all containers
- all pods
- all images
- all build cache
Are you sure you want to continue? [y/N] y
DEBU[0001] cached value indicated that overlay is supported
DEBU[0001] Merged system config "/usr/share/containers/containers.conf"
DEBU[0001] cached value indicated that overlay is supported
DEBU[0001] Using conmon: "/usr/bin/conmon"
DEBU[0001] Initializing boltdb state at /home/chris/.local/share/containers/storage/libpod/bolt_state.db
DEBU[0001] Using graph driver btrfs
DEBU[0001] Using graph root /home/chris/.local/share/containers/storage
DEBU[0001] Using run root /run/user/1000/containers
DEBU[0001] Using static dir /home/chris/.local/share/containers/storage/libpod
DEBU[0001] Using tmp dir /run/user/1000/libpod/tmp
DEBU[0001] Using volume path /home/chris/.local/share/containers/storage/volumes
DEBU[0001] cached value indicated that overlay is supported
DEBU[0001] Set libpod namespace to ""
DEBU[0001] Initializing event backend journald
DEBU[0001] configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0001] configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0001] configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0001] Using OCI runtime "/usr/bin/crun"
INFO[0001] Found CNI network podman (type=bridge) at /home/chris/.config/cni/net.d/87-podman.conflist
DEBU[0001] Default CNI network name podman is unchangeable
DEBU[0001] [graphdriver] trying provided driver "btrfs"
A storage.conf file exists at /home/chris/.config/containers/storage.conf
You should remove this file if you did not modified the configuration.
WARN[0001] Failed to add pause process to systemd sandbox cgroup:
[chris@fovo ~]$

Describe the results you expected:

Should reset without error.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      3.4.1
API Version:  3.4.1
Go Version:   go1.16.8
Built:        Wed Oct 20 10:31:56 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

[chris@fovo ~]$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.30-2.fc35.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.30, commit: '
  cpus: 8
  distribution:
    distribution: fedora
    variant: workstation
    version: "35"
  eventLogger: journald
  hostname: fovo.local
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.15.0-0.rc6.47.fc36.x86_64+debug
  linkmode: dynamic
  logDriver: journald
  memFree: 11412983808
  memTotal: 16362147840
  ociRuntime:
    name: crun
    package: crun-1.2-1.fc35.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.2
      commit: 4f6c8e0583c679bfee6a899c05ac6b916022561b
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc35.x86_64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 6m 18.95s
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/chris/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /home/chris/.local/share/containers/storage
  graphStatus:
    Build Version: 'Btrfs v5.14.1 '
    Library Version: "102"
  imageStore:
    number: 0
  runRoot: /run/user/1000/containers
  volumePath: /home/chris/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.1
  Built: 1634740316
  BuiltTime: Wed Oct 20 10:31:56 2021
  GitCommit: ""
  GoVersion: go1.16.8
  OsArch: linux/amd64
  Version: 3.4.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.4.1-1.fc35.x86_64
systemd-249.4-2.fc35.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

Fedora 35 Workstation, installed on a Lenovo Thinkpad

$ rpm -qa | grep dbus | sort
abrt-dbus-2.14.6-9.fc35.x86_64
abrt-dbus-debuginfo-2.14.6-7.fc35.x86_64
dbus-1.12.20-5.fc35.x86_64
dbus-broker-29-4.fc35.x86_64
dbus-common-1.12.20-5.fc35.noarch
dbus-daemon-1.12.20-5.fc35.x86_64
dbus-debuginfo-1.12.20-4.fc35.x86_64
dbus-debugsource-1.12.20-4.fc35.x86_64
dbus-glib-0.112-2.fc35.x86_64
dbus-libs-1.12.20-5.fc35.x86_64
dbus-libs-debuginfo-1.12.20-4.fc35.x86_64
dbus-tools-1.12.20-5.fc35.x86_64
dleyna-connector-dbus-0.3.0-11.fc35.x86_64
python3-dbus-1.2.18-2.fc35.x86_64
xdg-dbus-proxy-0.1.2-5.fc35.x86_64
openshift-ci bot added the kind/bug label Oct 22, 2021

cmurf commented Oct 22, 2021

Interestingly, this doesn't always happen. I just tried it again and did not receive this error message. But the next two times I ran it, I did get the error.


cmurf commented Oct 22, 2021

I still get the error after rm -f /home/chris/.config/containers/storage.conf, which contained only:

[storage]

driver = "btrfs"


cmurf commented Oct 22, 2021

OK, so far it looks like I do not get this error if I do podman run hello-world in between each system reset. It only happens with back-to-back resets.


mheon commented Oct 25, 2021

Why does podman system reset even make a pause process? This seems unnecessary. @giuseppe WDYT?

giuseppe commented

> Why does podman system reset even make a pause process? This seems unnecessary. @giuseppe WDYT?

I've opened a PR to check that the pause pid file exists before attempting to move it to its own scope: #12111
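
For illustration, here is a minimal Go sketch of that approach, i.e. skipping the scope migration entirely when the rootless pause pid file is absent. The function name and path are stand-ins for podman's internals, not the actual PR code:

```go
package main

import (
	"fmt"
	"os"
)

// maybeMovePauseProcess skips the systemd scope migration when the
// rootless pause pid file does not exist (e.g. right after a reset).
func maybeMovePauseProcess(pausePidPath string) error {
	if _, err := os.Stat(pausePidPath); err != nil {
		if os.IsNotExist(err) {
			return nil // no pause process yet: nothing to migrate
		}
		return err
	}
	// Placeholder for the real work: read the pid and ask systemd to
	// move it into its own transient scope.
	fmt.Println("moving pause process recorded in", pausePidPath)
	return nil
}

func main() {
	if err := maybeMovePauseProcess("/run/user/1000/libpod/tmp/pause.pid"); err != nil {
		fmt.Fprintln(os.Stderr, "warning:", err)
	}
}
```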

giuseppe added a commit to giuseppe/libpod that referenced this issue Oct 27, 2021
check that the pause pid exists before trying to move it to a separate
scope.

Closes: containers#12065

Signed-off-by: Giuseppe Scrivano <[email protected]>
edsantiago commented

Reopening. Test just failed in PR #12121, which (I checked) is based on main @ a05a445, which includes #12111.

edsantiago reopened this Oct 27, 2021
giuseppe added a commit to giuseppe/libpod that referenced this issue Oct 28, 2021
commit 6b3b0a1 introduced a check for
the PID file before attempting to move the PID to a new scope.

This is still vulnerable to a TOCTOU race condition though, since the
PID file or the PID can be removed/killed after the check was
successful but before it was used.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a CI flake

Signed-off-by: Giuseppe Scrivano <[email protected]>
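
For illustration, a minimal Go sketch of one TOCTOU-tolerant pattern (the helper name, path, and error handling are this sketch's assumptions, not necessarily the #12126 change): rather than trusting an earlier existence check, treat a pid file or process that disappears at the point of use as a benign no-op. Like podman itself, this is Linux-specific:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// movePauseProcess tolerates the TOCTOU window: the pid file or the
// process itself may disappear between an earlier check and this call,
// and both cases are treated as "nothing to do" rather than an error.
func movePauseProcess(pausePidPath string) error {
	data, err := os.ReadFile(pausePidPath)
	if errors.Is(err, fs.ErrNotExist) {
		return nil // file removed after the check: benign
	}
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return err
	}
	// Signal 0 checks that the process exists without signaling it.
	if err := syscall.Kill(pid, 0); errors.Is(err, syscall.ESRCH) {
		return nil // process already gone: also benign
	}
	fmt.Println("would move pid", pid, "into its own transient scope")
	return nil
}

func main() {
	_ = movePauseProcess("/run/user/1000/libpod/tmp/pause.pid")
}
```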
giuseppe commented

> Reopening. Test just failed in PR #12121, which (I checked) is based on main @ a05a445, which includes #12111.

ouch, another attempt at fixing it: #12126

mheon pushed a commit to mheon/libpod that referenced this issue Nov 12, 2021
check that the pause pid exists before trying to move it to a separate
scope.

Closes: containers#12065

Signed-off-by: Giuseppe Scrivano <[email protected]>
mheon pushed a commit to mheon/libpod that referenced this issue Nov 12, 2021
commit 6b3b0a1 introduced a check for
the PID file before attempting to move the PID to a new scope.

This is still vulnerable to a TOCTOU race condition though, since the
PID file or the PID can be removed/killed after the check was
successful but before it was used.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a CI flake

Signed-off-by: Giuseppe Scrivano <[email protected]>
edsantiago commented

Reopening due to failure in bodhi gating tests (rootless):

not ok 221 podman corrupt images - system reset
# (from function `is' in file ./helpers.bash, line 491,
#  from function `_corrupt_image_test' in file ./330-corrupt-images.bats, line 85,
#  in test file ./330-corrupt-images.bats, line 121)
#   `_corrupt_image_test "system reset -f"' failed
# umount: /tmp/podman_corrupt_test.pVOFEY/root/overlay: No such file or directory
# $ podman load -i /tmp/podman_corrupt_test.pVOFEY/img.tar
# Getting image source signatures
# Copying blob sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0
# Copying config sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
# Writing manifest to image destination
# Storing signatures
# Loaded image(s): sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
# $ podman tag 961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4 quay.io/libpod/alpine:test
# removed '/tmp/podman_corrupt_test.pVOFEY/root/vfs-images/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4/manifest'
# $ podman images
# Error: error retrieving label for image "961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4": you may need to remove the image to resolve the error: fallback error checking whether image is a manifest list: error locating item named "manifest" for image with ID "961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4": file does not exist: error locating item named "manifest" for image with ID "961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4": file does not exist
# [ rc=125 (expected) ]
# $ podman system reset -f
# time="2021-11-12T21:42:17Z" level=warning msg="error determining if an image is a parent: error locating item named \"manifest\" for image with ID \"961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4\": file does not exist, ignoring the error"
# time="2021-11-12T21:42:17Z" level=warning msg="error determining parent of image: error locating item named \"manifest\" for image with ID \"961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4\": file does not exist, ignoring the error"
# $ podman images -a --noheading
# time="2021-11-12T21:42:17Z" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: podman images -a, after system reset -f, is empty
# #| expected: '[no output]'
# #|   actual: 'time="2021-11-12T21:42:17Z" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."'

Packages:

  Kernel: 5.16.0-0.rc0.20211111gitdebe436e77c7.11.fc36.x86_64
  conmon-2.0.30-2.fc36.x86_64
  container-selinux-2.171.0-1.fc36.noarch
  crun-1.3-1.fc36.x86_64
  iptables-legacy-1.8.7-13.fc35.x86_64
  iptables-legacy-libs-1.8.7-13.fc35.x86_64
  iptables-libs-1.8.7-13.fc35.x86_64
  libslirp-4.6.1-2.fc35.x86_64
  podman-3.4.2-1.fc36.x86_64
  podman-debuginfo-3.4.2-1.fc36.x86_64
  podman-debugsource-3.4.2-1.fc36.x86_64
  podman-docker-3.4.2-1.fc36.noarch
  podman-gvproxy-3.4.2-1.fc36.x86_64
  podman-gvproxy-debuginfo-3.4.2-1.fc36.x86_64
  podman-plugins-3.4.2-1.fc36.x86_64
  podman-plugins-debuginfo-3.4.2-1.fc36.x86_64
  podman-remote-3.4.2-1.fc36.x86_64
  podman-remote-debuginfo-3.4.2-1.fc36.x86_64
  podman-tests-3.4.2-1.fc36.x86_64
  rpm-plugin-systemd-inhibit-4.17.0-1.fc36.1.x86_64
  slirp4netns-1.1.12-2.fc36.x86_64
  systemd-249.6-2.fc36.x86_64
  systemd-libs-249.6-2.fc36.x86_64
  systemd-networkd-249.6-2.fc36.x86_64
  systemd-oomd-defaults-249.6-2.fc36.noarch
  systemd-pam-249.6-2.fc36.x86_64
  systemd-resolved-249.6-2.fc36.x86_64
  systemd-udev-249.6-2.fc36.x86_64

edsantiago reopened this Nov 15, 2021
edsantiago commented

Closing again. I think this is another one of those "fixed a long long time ago" bugs that hasn't actually propagated into whatever is being built in f36.

edsantiago commented

...and reopening again. This has just triggered in PR #11795 (the monster f35 Cirrus one), in f35 rootless:

# $ podman images -a --noheading
# time="2021-11-16T12:22:20-06:00" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: podman images -a, after system reset -f, is empty
# #| expected: '[no output]'
# #|   actual: 'time="2021-11-16T12:22:20-06:00" level=warning msg="Failed to add pause process to systemd sandbox cgroup: Unit podman-pause.scope already exists."'
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(The PR is rebased on main @ f031bd2, which includes the purported fix, #12126.)

edsantiago reopened this Nov 16, 2021

cevich commented Nov 16, 2021

Oh, I guessed it was because of the 0.1.7 update PR. I rebased; let's see if it happens again.

edsantiago commented

Seen again in #11795.

edsantiago commented

...and again. At this point it is no longer a flake; it is a showstopper.

edsantiago commented

@giuseppe PTAL ASAP.

giuseppe commented

@edsantiago opened a PR: #12323

giuseppe added a commit to giuseppe/libpod that referenced this issue Nov 17, 2021
we try hard to re-use the existing podman-pause.scope name when it
already exists, causing all sorts of race errors when the already
existing scope is terminating.

There is no such requirement though, so just try with a random
name.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a race in the CI

Signed-off-by: Giuseppe Scrivano <[email protected]>
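
A minimal sketch of that idea (the name format and use of crypto/rand are this sketch's choices, not necessarily what the PR does): generate a fresh scope name per invocation, so a lingering, half-terminated podman-pause.scope can never collide with the new one.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomPauseScopeName returns a unique transient scope name such as
// "podman-pause-1a2b3c4d.scope", avoiding reuse of a fixed name that a
// previous, still-terminating scope might hold.
func randomPauseScopeName() (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf("podman-pause-%s.scope", hex.EncodeToString(buf)), nil
}

func main() {
	name, err := randomPauseScopeName()
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```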
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Nov 22, 2021
[Backport of containers#12323 into v3.4, to fix gating-test flakes]

we try hard to re-use the existing podman-pause.scope name when it
already exists, causing all sorts of race errors when the already
existing scope is terminating.

There is no such requirement though, so just try with a random
name.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a race in the CI

Signed-off-by: Giuseppe Scrivano <[email protected]>
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Nov 23, 2021
we try hard to re-use the existing podman-pause.scope name when it
already exists, causing all sorts of race errors when the already
existing scope is terminating.

There is no such requirement though, so just try with a random
name.

Closes: containers#12065

[NO NEW TESTS NEEDED] it fixes a race in the CI

Signed-off-by: Giuseppe Scrivano <[email protected]>
github-actions bot added the locked - please file new issue/PR label Sep 21, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023