Podman in systemd mode fails on non-systemd hosts #15647

Closed
LewisGaul opened this issue Sep 6, 2022 · 17 comments · Fixed by #15668
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@LewisGaul

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

On Alpine 3.15, trying to run systemd containers leads to an error from trying to stat the host's /sys/fs/cgroup/systemd/, which does not exist.

I realise Alpine may not be an officially supported distro, but this may be an issue worth fixing anyway?

Steps to reproduce the issue:

Run a container in systemd mode (either with --systemd=always or with /sbin/init as the entrypoint) on a cgroups v1 host that does not have /sys/fs/cgroup/systemd/.

Describe the results you received:

localhost:~# podman run --rm --systemd=always fedora
Error: error stat'ing file `/sys/fs/cgroup/systemd`: No such file or directory: OCI runtime attempted to invoke a command that was not found

Describe the results you expected:

No error.

Additional information you deem important (e.g. issue happens only occasionally):

Example above uses --systemd=always, but the default is for podman to detect whether the container is running systemd, so this issue can be seen even without the --systemd arg (and --systemd=false is a workaround).
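
The mode selection described above can be sketched roughly as follows. This is an illustrative sketch only, not Podman's actual code: wantsSystemdMode and initPaths are hypothetical names, and only /sbin/init is confirmed by this report; the other entrypoint paths and the "systemd" basename check are assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// initPaths lists entrypoints treated as "systemd-like". Only /sbin/init is
// confirmed by the report; the other entries are assumptions for illustration.
var initPaths = map[string]bool{
	"/sbin/init":           true,
	"/usr/sbin/init":       true,
	"/usr/local/sbin/init": true,
}

// wantsSystemdMode is a hypothetical helper that decides whether systemd mode
// applies for a given --systemd value and container command.
func wantsSystemdMode(mode string, command []string) bool {
	switch mode {
	case "always":
		return true // forced on, whatever the command is
	case "false":
		return false // forced off: the workaround mentioned in the report
	}
	// Any other value is treated as the default: auto-detect from the command.
	if len(command) == 0 {
		return false
	}
	return initPaths[command[0]] || filepath.Base(command[0]) == "systemd"
}

func main() {
	fmt.Println(wantsSystemdMode("always", []string{"bash"}))
	fmt.Println(wantsSystemdMode("true", []string{"/sbin/init"}))
	fmt.Println(wantsSystemdMode("false", []string{"/sbin/init"}))
}
```

This is why the failure can appear without any --systemd argument: an image whose entrypoint is /sbin/init takes the auto-detect path.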

Output of podman version:

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.17.4
Git Commit:   72df58eb05290e506c96069e0c5c8d0afab3041f
Built:        Sat Dec 11 13:04:57 2021
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: Unknown
    path: /usr/bin/conmon
    version: 'conmon version 2.0.30, commit: 6ef8a4d4d76656172a4b7a9d406bfc6c629c20db'
  cpus: 6
  distribution:
    distribution: alpine
    version: 3.15.0
  eventLogger: file
  hostname: localhost
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.15.5-0-lts
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 5813215232
  memTotal: 6227705856
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 1.3
      commit: 4f6c8e0583c679bfee6a899c05ac6b916022561b
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: Unknown
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 199225344
  swapTotal: 199225344
  uptime: 61h 19m 12.5s (Approximately 2.54 days)
plugins:
  log:
  - k8s-file
  - none
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 1639227897
  BuiltTime: Sat Dec 11 13:04:57 2021
  GitCommit: 72df58eb05290e506c96069e0c5c8d0afab3041f
  GoVersion: go1.17.4
  OsArch: linux/amd64
  Version: 3.4.4

Package info (output of apk info podman):

podman-3.4.4-r0 description:
Simple management tool for pods, containers and images

podman-3.4.4-r0 webpage:
https://podman.io/

podman-3.4.4-r0 installed size:
35 MiB

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

QEMU VM running Alpine 3.15 cloud image.

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 6, 2022
@mheon
Member

mheon commented Sep 6, 2022

@vrothberg PTAL

@vrothberg
Member

Thanks for reaching out, @LewisGaul!

Would systemd inside a container work without it being present on the host, @giuseppe @rhatdan? I do not know why it wouldn't but may be overlooking something.

@LewisGaul
Author

Would systemd inside a container work without it being present on the host, @giuseppe @rhatdan? I do not know why it wouldn't but may be overlooking something.

Yes, we have a container that uses systemd and it works fine on Alpine with --systemd=false (where we remount cgroups in our entrypoint script before execing systemd).

@giuseppe
Member

giuseppe commented Sep 6, 2022

/sys/fs/cgroup/systemd is managed by systemd on the host and it is expected for tracking processes in the container payload too.

You can manually mount it with:

# mkdir -p /sys/fs/cgroup/systemd && mount cgroup -t cgroup -o none,name=systemd /sys/fs/cgroup/systemd
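
To check whether a host already has this named hierarchy mounted, one option is to scan /proc/mounts for a cgroup entry carrying the name=systemd option. The following is a minimal sketch under that assumption; hasNamedSystemdCgroup is a hypothetical helper and not part of Podman.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hasNamedSystemdCgroup scans text in /proc/mounts format and reports whether
// a cgroup v1 hierarchy with the option name=systemd is mounted.
func hasNamedSystemdCgroup(mounts string) bool {
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		// Fields: device, mount point, fs type, options, dump, pass.
		f := strings.Fields(sc.Text())
		if len(f) < 4 || f[2] != "cgroup" {
			continue
		}
		for _, opt := range strings.Split(f[3], ",") {
			if opt == "name=systemd" {
				return true
			}
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		// Not fatal for this sketch: report the problem and stop.
		fmt.Println("cannot read /proc/mounts:", err)
		return
	}
	fmt.Println("name=systemd hierarchy mounted:", hasNamedSystemdCgroup(string(data)))
}
```

On a host where the mount command above has been run, the scan should find the new entry; on a stock Alpine cgroups v1 host it would be expected to report false.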

@giuseppe giuseppe closed this as completed Sep 6, 2022
@vrothberg
Member

@giuseppe should we add auto-detection to Podman for that?

@giuseppe
Member

giuseppe commented Sep 6, 2022

IMO, Podman should not change the system configuration. Maybe just a warning when systemd mode is used && the host is not using systemd && cgroupv1 && /sys/fs/cgroup/systemd doesn't exist?
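
The warning condition proposed here could be sketched as below. This is a hypothetical illustration, not Podman code: hostState, probeHost and shouldWarn are made-up names, and the probes are assumptions (a systemd host is detected by the presence of /run/systemd/system, and cgroup v2 by /sys/fs/cgroup/cgroup.controllers).

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// hostState is a hypothetical summary of the facts the warning depends on.
type hostState struct {
	SystemdMode         bool // the container is about to run in systemd mode
	HostIsSystemd       bool // the host itself was booted with systemd
	CgroupV2            bool // the host uses the unified cgroup hierarchy
	SystemdCgroupExists bool // /sys/fs/cgroup/systemd is present on the host
}

// pathExists reports false only when the path is definitely absent; other
// stat errors (for example, permission denied) count as "exists".
func pathExists(p string) bool {
	_, err := os.Stat(p)
	return !errors.Is(err, fs.ErrNotExist)
}

// probeHost fills hostState from the filesystem. The probe paths are
// assumptions chosen for this sketch.
func probeHost(systemdMode bool) hostState {
	return hostState{
		SystemdMode:         systemdMode,
		HostIsSystemd:       pathExists("/run/systemd/system"),
		CgroupV2:            pathExists("/sys/fs/cgroup/cgroup.controllers"),
		SystemdCgroupExists: pathExists("/sys/fs/cgroup/systemd"),
	}
}

// shouldWarn encodes the condition from the comment: systemd mode is used,
// the host is not using systemd, cgroup v1, and the mount point is missing.
func shouldWarn(h hostState) bool {
	return h.SystemdMode && !h.HostIsSystemd && !h.CgroupV2 && !h.SystemdCgroupExists
}

func main() {
	if shouldWarn(probeHost(true)) {
		fmt.Fprintln(os.Stderr, "warning: /sys/fs/cgroup/systemd is missing on this cgroup v1 host")
	}
}
```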

@LewisGaul
Author

You can manually mount it with:

# mkdir -p /sys/fs/cgroup/systemd && mount cgroup -t cgroup -o none,name=systemd /sys/fs/cgroup/systemd

I'm fully aware of this workaround. However, the host mount is not required in order to create the mount inside the container, and podman fails to handle this case.

IMO podman shouldn't require the /sys/fs/cgroup/systemd/ mount to exist on the host - it can be created in the container either way.

@giuseppe
Member

giuseppe commented Sep 6, 2022

Completely untested, as I have no access to a cgroup v1 system without systemd at the moment, but would something like the following patch work for you?

diff --git a/libpod/container_internal_linux.go b/libpod/container_internal_linux.go
index 5c5fd471b..c4a85bc64 100644
--- a/libpod/container_internal_linux.go
+++ b/libpod/container_internal_linux.go
@@ -1073,10 +1073,15 @@ func (c *Container) setupSystemd(mounts []spec.Mount, g generate.Generator) erro
 		g.AddMount(systemdMnt)
 	} else {
 		mountOptions := []string{"bind", "rprivate"}
-
+		typ := "bind"
 		var statfs unix.Statfs_t
 		if err := unix.Statfs("/sys/fs/cgroup/systemd", &statfs); err != nil {
-			mountOptions = append(mountOptions, "nodev", "noexec", "nosuid")
+			if os.IsNotExist(err) {
+				typ = "cgroup"
+				mountOptions = []string{"none", "name=systemd"}
+			} else {
+				mountOptions = append(mountOptions, "nodev", "noexec", "nosuid")
+			}
 		} else {
 			if statfs.Flags&unix.MS_NODEV == unix.MS_NODEV {
 				mountOptions = append(mountOptions, "nodev")
@@ -1094,7 +1099,7 @@ func (c *Container) setupSystemd(mounts []spec.Mount, g generate.Generator) erro
 
 		systemdMnt := spec.Mount{
 			Destination: "/sys/fs/cgroup/systemd",
-			Type:        "bind",
+			Type:        typ,
 			Source:      "/sys/fs/cgroup/systemd",
 			Options:     mountOptions,
 		}

@LewisGaul
Author

@giuseppe the suggested patch looks like a reasonable approach to me.

FWIW here's a minimal reproducer (e.g. on Alpine with cgroups v1), showing that systemd containers do work on hosts that don't have /sys/fs/cgroup/systemd/, and specifically don't work in systemd mode!

$ mkdir systemd-ctr && cd systemd-ctr
$ cat << EOF > Dockerfile
FROM fedora
RUN yum install -y systemd
ENTRYPOINT ["/sbin/init"]
EOF
$ podman build ./ --tag fedora:systemd
$ podman run -it --cap-add sys_admin -e container=podman fedora:systemd
Error: error stat'ing file `/sys/fs/cgroup/systemd`: No such file or directory: OCI runtime attempted to invoke a command that was not found
$ podman run -it --cap-add sys_admin -e container=podman --systemd=false fedora:systemd
systemd v250.8-1.fc36 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.
...

@giuseppe
Member

giuseppe commented Sep 6, 2022

Is there any special reason for not using cgroup v2? That would also solve the issue you are seeing.

@LewisGaul
Author

Is there any special reason for not using cgroup v2? That would also solve the issue you are seeing.

We provide the container image and users pick the host - we support cgroup v1 and cgroup v2. Using docker is also an option - there are quite a few alternatives.

If this is something that could be fixed in podman would it be possible to reopen the issue? :)

@rhatdan rhatdan reopened this Sep 6, 2022
@rhatdan
Member

rhatdan commented Sep 6, 2022

I think @giuseppe's fix is reasonable. Can it handle both cgroups v1 and v2, though?

@giuseppe
Member

giuseppe commented Sep 6, 2022

One issue with my patch above is that both crun and runc treat a mount of type "cgroup" as an entire cgroup hierarchy, so there is no way to mount the systemd named cgroup alone. Given this limitation, you will still need the mount on the host for podman to work with the existing OCI runtimes.

@LewisGaul
Author

One issue with my patch above is that both crun and runc treat a mount of type "cgroup" as an entire cgroup hierarchy, so there is no way to mount the systemd named cgroup alone. Given this limitation, you will still need the mount on the host for podman to work with the existing OCI runtimes.

I'm not sure I follow. It should be fine if the container runtime doesn't mount /sys/fs/cgroup/systemd at all - when systemd in the container starts it will create it.

@giuseppe
Member

giuseppe commented Sep 7, 2022

It will create that only if you grant CAP_SYS_ADMIN, which makes your container privileged, since it is almost equivalent to running as root on the host. That will also confuse systemd, as it might perform some privileged operations that are not supposed to be done in a container. For example, on my Fedora 36 there are 18 services that in one way or another depend on CAP_SYS_ADMIN:

$ grep CAP_SYS_ADMIN /usr/lib/systemd/system/*.service | wc -l
18

@LewisGaul
Author

As per https://systemd.io/CONTAINER_INTERFACE/, CAP_SYS_ADMIN is required for running systemd inside a container.

@giuseppe
Member

giuseppe commented Sep 7, 2022

Podman aims at the use case described later in that document, "Fully Unprivileged Container Payload", where CAP_SYS_ADMIN is missing. Your use case is even simpler to address, as we just have to ignore the /sys/fs/cgroup/systemd mount; it won't work from a user namespace.

Given that adding this change won't regress other use cases, I've opened a PR: #15668
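
The idea of ignoring the mount when the host path is absent can be sketched as follows. This is a simplified illustration and not the change in the PR: the mount type is a hypothetical stand-in for spec.Mount, systemdCgroupMount is a made-up helper, and only the standard library is used instead of unix.Statfs.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// mount is a simplified, hypothetical stand-in for the OCI spec.Mount type.
type mount struct {
	Destination string
	Type        string
	Source      string
	Options     []string
}

// systemdCgroupMount returns the bind mount for hostPath. When hostPath does
// not exist it returns ok=false and no error, so the caller can skip the
// mount instead of failing. Any other stat error is passed through.
func systemdCgroupMount(hostPath string) (m mount, ok bool, err error) {
	if _, statErr := os.Stat(hostPath); statErr != nil {
		if errors.Is(statErr, fs.ErrNotExist) {
			return mount{}, false, nil
		}
		return mount{}, false, statErr
	}
	return mount{
		Destination: "/sys/fs/cgroup/systemd",
		Type:        "bind",
		Source:      hostPath,
		Options:     []string{"bind", "rprivate"},
	}, true, nil
}

func main() {
	m, ok, err := systemdCgroupMount("/sys/fs/cgroup/systemd")
	switch {
	case err != nil:
		fmt.Fprintln(os.Stderr, "stat failed:", err)
	case !ok:
		fmt.Println("host has no /sys/fs/cgroup/systemd; skipping the mount")
	default:
		fmt.Printf("adding mount: %+v\n", m)
	}
}
```

Treating only "not found" as skippable keeps other failures, such as permission errors, visible to the caller.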

giuseppe added a commit to giuseppe/libpod that referenced this issue Sep 7, 2022
skip adding the /sys/fs/cgroup/systemd bind mount if it is not already
present on the host.

[NO NEW TESTS NEEDED] requires a system without systemd.

Closes: containers#15647

Signed-off-by: Giuseppe Scrivano <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 16, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 16, 2023