This issue was moved to a discussion.

support User= in systemd for running rootless services #12778

Closed
Gchbg opened this issue Jan 9, 2022 · 72 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Gchbg
Contributor

Gchbg commented Jan 9, 2022

Is this a BUG REPORT or FEATURE REQUEST?

/kind bug

Description

I want to have a systemd system service that runs a rootless container under an isolated user, but systemd rejects the sd_notify call and terminates the service.

Got notification message from PID 15150, but reception only permitted for main PID 14978

A similar problem was mentioned in #5572, which seems to have been closed without a resolution.

Happy to help track this down.

Steps to reproduce the issue:

  1. Start with a Debian testing system. Create a system user with an empty home dir, and enable lingering:
groupadd -g 200 nginx
useradd -r -s /usr/sbin/nologin -l -b /var/lib -M -g nginx -u 200 nginx
usermod -v 165536-231071 -w 165536-231071 nginx
mkdir -m 770 /var/lib/nginx
chown nginx:nginx /var/lib/nginx
loginctl enable-linger nginx
  2. Use this unit file, adapted from podman generate systemd --new:
❯ cat /etc/systemd/system/nginx.service
[Unit]
Description=Nginx
Wants=network-online.target
After=network-online.target

[Service]
WorkingDirectory=/var/lib/nginx
User=nginx
Group=nginx
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=no
TimeoutStopSec=70
Type=notify
NotifyAccess=all
ExecStartPre=/bin/rm -f %T/%N.ctr-id
ExecStart=/usr/bin/podman run --cidfile=%T/%N.ctr-id --replace --rm -d --sdnotify=conmon --cgroups=no-conmon --name nginx nginx:mainline
ExecStop=/usr/bin/podman stop --cidfile=%T/%N.ctr-id -i
ExecStopPost=/usr/bin/podman rm --cidfile=%T/%N.ctr-id -f -i
KillMode=none

[Install]
WantedBy=default.target

❯ sudo systemctl daemon-reload
  3. Start the unit:
❯ sudo systemctl start nginx

Describe the results you received:

Jan 09 14:54:00 Cubert systemd[1]: /etc/systemd/system/nginx.service:24: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Jan 09 14:54:00 Cubert systemd[1]: Starting Nginx...
Jan 09 14:54:00 Cubert systemd[14978]: Started podman-15150.scope.
Jan 09 14:54:00 Cubert podman[15150]: Resolving "nginx" using unqualified-search registries (/etc/containers/registries.conf)
Jan 09 14:54:00 Cubert podman[15150]: Trying to pull docker.io/library/nginx:mainline...
Jan 09 14:54:03 Cubert podman[15150]: Getting image source signatures
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a0bcbecc962ed2552e817f45127ffb3d14be31642ef3548997f58ae054deb5b2
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a2abf6c4d29d43a4bf9fbb769f524d0fb36a2edab49819c1bf3e76f409f953ea
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a9edb18cadd1336142d6567ebee31be2a03c0905eeefe26cb150de7b0fbc520b
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:589b7251471a3d5fe4daccdddfefa02bdc32ffcba0a6d6a2768bf2c401faf115
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:186b1aaa4aa6c480e92fbd982ee7c08037ef85114fbed73dbb62503f24c1dd7d
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:b4df32aa5a72e2a4316aad3414508ccd907d87b4ad177abd7cbd62fa4dab2a2f
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:589b7251471a3d5fe4daccdddfefa02bdc32ffcba0a6d6a2768bf2c401faf115
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a0bcbecc962ed2552e817f45127ffb3d14be31642ef3548997f58ae054deb5b2
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a9edb18cadd1336142d6567ebee31be2a03c0905eeefe26cb150de7b0fbc520b
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:b4df32aa5a72e2a4316aad3414508ccd907d87b4ad177abd7cbd62fa4dab2a2f
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a2abf6c4d29d43a4bf9fbb769f524d0fb36a2edab49819c1bf3e76f409f953ea
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:186b1aaa4aa6c480e92fbd982ee7c08037ef85114fbed73dbb62503f24c1dd7d
Jan 09 14:54:12 Cubert podman[15150]: Copying config sha256:605c77e624ddb75e6110f997c58876baa13f8754486b461117934b24a9dc3a85
Jan 09 14:54:12 Cubert podman[15150]: Writing manifest to image destination
Jan 09 14:54:12 Cubert podman[15150]: Storing signatures
Jan 09 14:54:12 Cubert podman[15150]:
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.101247642 +0200 EET m=+11.607938154 container create 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, maintainer=NGINX Docker Maintainers <[email protected]>, PODMAN_SYSTEMD_UNIT=nginx.service)
Jan 09 14:54:12 Cubert systemd[14978]: Started libcrun container.
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:00.536382139 +0200 EET m=+0.043073791 image pull  nginx:mainline
Jan 09 14:54:12 Cubert systemd[1]: [email protected]: Got notification message from PID 15150, but reception only permitted for main PID 14978
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.141137063 +0200 EET m=+11.647827815 container init 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <[email protected]>)
Jan 09 14:54:12 Cubert systemd[1]: [email protected]: Got notification message from PID 15150, but reception only permitted for main PID 14978
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.145611861 +0200 EET m=+11.652302766 container start 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <[email protected]>)
Jan 09 14:54:12 Cubert podman[15150]: 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
Jan 09 14:54:12 Cubert conmon[15215]: 10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
Jan 09 14:54:12 Cubert conmon[15215]: 10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Configuration complete; ready for start up
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: using the "epoll" event method
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: nginx/1.21.5
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: OS: Linux 5.15.0-2-amd64
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 524288:524288
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: start worker processes
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: start worker process 26
Jan 09 14:54:12 Cubert systemd[14978]: Started podman-15271.scope.
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: gracefully shutting down
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: exiting
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: exit
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: signal 17 (SIGCHLD) received from 26
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: worker process 26 exited with code 0
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: exit
Jan 09 14:54:12 Cubert podman[15299]: 2022-01-09 14:54:12.393064442 +0200 EET m=+0.052274069 container remove 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <[email protected]>)
Jan 09 14:54:12 Cubert podman[15271]: 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a
Jan 09 14:54:12 Cubert systemd[14978]: podman-15150.scope: Consumed 7.547s CPU time.
Jan 09 14:54:12 Cubert systemd[1]: nginx.service: Failed with result 'protocol'.
Jan 09 14:54:12 Cubert systemd[1]: Failed to start Nginx.

Describe the results you expected:

Nginx runs until the end of time.

Output of podman version:

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.17.5
Built:        Thu Jan  1 02:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: 'conmon: /usr/bin/conmon'
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: unknown'
  cpus: 1
  distribution:
    distribution: debian
    version: unknown
  eventLogger: journald
  hostname: Cubert
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 200
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 200
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.15.0-2-amd64
  linkmode: dynamic
  logDriver: journald
  memFree: 1015083008
  memTotal: 2041786368
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.17
      commit: 0e9229ae34caaebcb86f1fde18de3acaf18c6d9a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/200/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.0.1
      commit: 6a7b16babc95b6a3056b33fb45b74a6f62262dd4
      libslirp: 4.6.1
  swapFree: 0
  swapTotal: 0
  uptime: 8h 1m 8.23s (Approximately 0.33 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/lib/nginx/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/nginx/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 0
  runRoot: /run/user/200/containers
  volumePath: /var/lib/nginx/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 0
  BuiltTime: Thu Jan  1 02:00:00 1970
  GitCommit: ""
  GoVersion: go1.17.5
  OsArch: linux/amd64
  Version: 3.4.4

Package info (e.g. output of apt list podman):

podman/testing,now 3.4.4+ds1-1 amd64 [installed]

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes and yes.

Additional environment details (AWS, VirtualBox, physical, etc.):

Machine is a VM.

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 9, 2022
@mheon
Member

mheon commented Jan 10, 2022

This is a limitation on the systemd side. They will only accept notifications, or PID files, that are created by or sent by root, for security reasons - even if the User and Group of the unit file are explicitly set to start the process as a non-root user. Their recommendation was to start the container as a user service of the user in question via systemctl --user. There have been a few other issues about this, I'll try and dig them up.

@eriksjolund
Contributor

Previous discussion: #9642
It contains links to some issues.

@Gchbg
Contributor Author

Gchbg commented Jan 16, 2022

Thank you both. For now I've worked around it by managing the service under the user's systemd which is clunky to say the least. I don't understand systemd's security argument - if the process is run as a given user, why would systemd not allow that user's process to send sd_notify? Who else could? But I guess this is no flaw of podman.

#9642 mentions some code changes that need to happen to podman for sd_notify, what are those? And have they progressed since March?

I guess you could close this issue or use it to track progress.

@vrothberg
Member

#9642 mentions some code changes that need to happen to podman for sd_notify, what are those? And have they progressed since March?

Yes, there is some progress. The main PID is now communicated via sd notify but there are still some remaining issues. For instance, %t resolves to the root's runtime dir - even when User=foo is set.

@vrothberg
Member

I think the next big thing to tackle is finding a way how to lift the User= setting. While the process in ExecStart itself is run as the specified User/Group, the systemd specifiers (e.g., %t, %U, etc) remain to be root.

@Gchbg
Contributor Author

Gchbg commented Jan 17, 2022

[...] The main PID is now communicated via sd notify [...]

But even that is rejected by systemd, as seen in the logs above.

@vrothberg
Member

I fear there's not much Podman can do at the moment.

@wc7086

wc7086 commented Jan 26, 2022

Only after this problem is solved can Podman become truly rootless.

So I have to keep using the root account for now.

@svdHero

svdHero commented Jan 31, 2022

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

@wc7086

wc7086 commented Jan 31, 2022

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

use -e PUID=useruid -e PGID=usergid

use id username check UID and GID

@vrothberg
Member

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

The services need to be started and managed as the specific non-root user. Using the User= directive does not work yet.

@Gchbg
Contributor Author

Gchbg commented Jan 31, 2022

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

For the moment my workaround is to run such containers in a systemd --user. This means that for every system service I want to run as a rootless container, I need to create a separate system user, enable linger, and run a separate systemd --user instance for that user.

It works but it's clunky, e.g. restarting Nginx is sudo su -l nginx -s /bin/sh -c 'XDG_RUNTIME_DIR="/run/user/$(id -u)" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus" systemctl --user restart nginx' and running a command inside such a container might be something like sudo su -l nextcloud -s /bin/sh -c 'XDG_RUNTIME_DIR="/run/user/$(id -u)" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus" podman exec -u www-data -w /var/www/html nextcloud ./occ status'.

Inside these rootless containers root is mapped to the system user, which is a different uid for each service. If something inside the containers runs as non-root, that gets mapped to a high-numbered host uid by default. However with some magic on the host you can map a specific non-root uid in the container to a host uid of your choice, which can then be mapped to a different non-root uid in a different container running under a different user.
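
As a rough illustration of the default mapping, here is a minimal shell sketch. It assumes the idMappings from the podman info output in the original report (container ID 0 maps to host ID 200, container IDs 1 through 65536 map to host IDs starting at 165536). The function name host_uid is hypothetical and only does the arithmetic; it does not query podman.

```shell
# Hypothetical helper: translate a container UID into the host UID,
# using the two uidmap ranges from the podman info output in the report.
host_uid() {
    cuid=$1
    if [ "$cuid" -eq 0 ]; then
        # container_id 0, host_id 200, size 1
        echo 200
    elif [ "$cuid" -ge 1 ] && [ "$cuid" -le 65536 ]; then
        # container_id 1, host_id 165536, size 65536
        echo $((165536 + cuid - 1))
    else
        echo "container UID $cuid is not mapped" >&2
        return 1
    fi
}

host_uid 0     # prints 200: root in the container is the nginx user on the host
host_uid 33    # prints 165568: an example non-root UID lands in the subuid range
```

The highest mapped container UID, 65536, comes out as 231071, which matches the upper end of the range passed to usermod in the reproduction steps.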

I should probably document my setup one of these days...

@eriksjolund
Contributor

eriksjolund commented Jan 31, 2022

@Gchbg If you are running a recent systemd version (for instance by running Fedora 35), I think you could run

sudo systemd-run --machine=nginx@ --quiet --user --collect --pipe --wait systemctl --user restart nginx

No need to set DBUS_SESSION_BUS_ADDRESS and XDG_RUNTIME_DIR

@svdHero

svdHero commented Feb 1, 2022

@wc7086

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

use -e PUID=useruid -e PGID=usergid

use id username check UID and GID

Is that -e as in the podman run option --env for environment variables?

@vrothberg

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

The services need to be started and managed as the specific non-root user. Using the User= directive does not work yet.

How does that relate to what @Gchbg and @eriksjolund wrote above? Do I have to run several instances of systemd or is there another way?

For systemd beginners like me, it is quite difficult to understand the various layers of abstraction and user permission between systemd, host processes and containers.
It would be really helpful to have a complete example in the podman generate docs that shows how to start a container or pod under a specific user at boot time.

After all, I would assume that this is the use case for 80 % of the users: run some container service that gets restarted automatically when the machine boots and that is as restricted as possible (by means of user permissions).

@wc7086

wc7086 commented Feb 1, 2022

@wc7086

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

use -e PUID=useruid -e PGID=usergid
use id username check UID and GID

Is that -e as in the podman run option --env for environment variables?

I got it wrong: modifying UID and GID via env requires entrypoint.sh.

https://docs.docker.com/engine/security/userns-remap/
Most of the docker documentation applies to podman.

@grooverdan
Contributor

I think the next big thing to tackle is finding a way how to lift the User= setting. While the process in ExecStart itself is run as the specified User/Group, the systemd specifiers (e.g., %t, %U, etc) remain to be root.

With the %t cidfile removed in #13236, what are the remaining requirements? Does it matter if RequiresMountsFor=%t/containers uses the User %t rather than root?

@vrothberg
Member

So far #13236 is an issue. To be sure it's working, we need a pull request :)

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@benyaminl

[...] The main PID is now communicated via sd notify [...]

But even that is rejected by systemd, as seen in the logs above.

For now I use tmux to run the systemctl service from rootless podman. It works even after I detach or close ssh connection, because it kept the user logged in 🤣

@Gchbg
Contributor Author

Gchbg commented Apr 14, 2022

For now I use tmux to run the systemctl service from rootless podman. It works even after I detach or close ssh connection, because it kept the user logged in 🤣

Could you please describe this in more detail? I'm curious how it compares to my workaround.

@benyaminl

For now I use tmux to run the systemctl service from rootless podman. It works even after I detach or close ssh connection, because it kept the user logged in 🤣

Could you please describe this in more detail? I'm curious how it compares to my workaround.

It's just a simple workaround: since I keep tmux running, I'm always logged in, so the systemd user service keeps running. As simple as that. It's just a silly approach for me for now.

Anyway, loginctl should resolve this issue, I think. I talked with folks on /r/podman, but it requires root first to allow a user service to run in the background after boot.

@runiq

runiq commented Apr 21, 2022

Just a quick heads-up: The commandline from #12778 (comment):

sudo systemd-run --machine=nginx@ --quiet --user --collect --pipe --wait systemctl --user restart nginx

can be simplified to:

sudo systemctl --user -M nginx@ restart nginx

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@ghost

ghost commented Apr 15, 2023

Hello @vrothberg 👋

I tested the OpenFile directive without success.

As @hmoffatt and @eriksjolund mentioned, it seems to be possible to notify with very simple examples. This simple unit does work OK (systemd 253):

[Service]
User=nobody
ExecStart=sh -c "sleep 1 && systemd-notify --ready"
Type=notify

Can you be more specific on why it’s an issue with podman? Is it because of the forking?

@vrothberg
Member

Can you be more specific on why it’s an issue with podman? Is it because of the forking?

See #12778 (comment).

@eriksjolund
Contributor

Can you be more specific on why it’s an issue with podman? Is it because of the forking?

It seems part of the problem is setting the conmon PID as the MAINPID.

Quote from Git commit

Let's be more restrictive when validating PID files and MAINPID=
messages: don't accept PIDs that make no sense, and if the configuration
source is not trusted, don't accept out-of-cgroup PIDs. A configuration
source is considered trusted when the PID file is owned by root, or the
message was received from root.
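
Read as a decision rule, the quoted policy might look like the following shell sketch. It is a simplified model of the commit message, not systemd's code: the function name and its two arguments (the sender's UID, and a yes/no flag saying whether the new PID is inside the service's cgroup) are assumptions made for illustration, and the "PIDs that make no sense" check is left out.

```shell
# Simplified model of the quoted rule (NOT systemd's implementation).
# $1: UID of the process that sent the MAINPID= message
# $2: "yes" if the proposed main PID is inside the service's cgroup, else "no"
mainpid_change_accepted() {
    sender_uid=$1
    pid_in_cgroup=$2
    if [ "$sender_uid" -eq 0 ]; then
        # Trusted source (root): out-of-cgroup PIDs are accepted too.
        return 0
    fi
    # Untrusted source: only PIDs inside the service's cgroup are accepted.
    [ "$pid_in_cgroup" = "yes" ]
}

# A message from UID 200 naming an out-of-cgroup PID is rejected:
mainpid_change_accepted 200 no || echo "rejected"
```

Under this reading, a notification sent by a User= process can only change the main PID to something inside the unit's cgroup, which is why the tests in this comment send the message from a root-owned helper instead.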

I tried to use OpenFile= to set MAINPID in a test (without using Podman) but it didn't work.
(The logs made no mention of the MAINPID being read from the file.)
Some files related to the test:
https://github.com/eriksjolund/test-systemd-mainpid-openfile/

Then I tried another test (also without using Podman) where I managed to set the MAINPID by using ExecStartPost with a leading + before the path to the executable.
(Such a command is run as root)

ExecStartPost=+/usr/bin/mytest_notifymainpid

mytest_notifymainpid source code contains

    std::string msg = std::format("MAINPID={}\n", mainpid);
    sd_pid_notify(senderpid, 0, msg.c_str());

senderpid is here the PID of the program that I started with

ExecStart=/usr/bin/mytest_notifyready_and_then_sleep

An untested idea: Let Podman send the READY=1 and then wait for the program in ExecStartPost (/usr/bin/mytest_notifymainpid) to finish before continuing. (Waiting could maybe be achieved with some sort of trigger file)

Output from journalctl

Jun 06 08:18:32 localhost.localdomain systemd[1]: test2.service: Got notification message from PID 4820 (MAINPID=4839, READY=1)
Jun 06 08:18:32 localhost.localdomain systemd[1]: test2.service: New main PID 4839 does not belong to service, but we'll accept it as the request to change it came from a privileged process.
Jun 06 08:18:32 localhost.localdomain systemd[1]: test2.service: Supervising process 4839 which is not our child. We'll most likely not notice when it exits.

@eriksjolund
Contributor

eriksjolund commented Jun 11, 2023

It seems to work.

I tried out an echo server that listens on TCP port 908.

$ echo hello | socat  -t 60 - tcp4:127.0.0.1:908
hello
$

The echo server replied hello.

The file /etc/systemd/system/echo.socket contains:

[Unit]
Description=echo server

[Socket]
ListenStream=0.0.0.0:908

[Install]
WantedBy=default.target

The port number is smaller than 1024. An unprivileged user does not normally have the privileges to listen on such a port, and I didn't modify /proc/sys/net/ipv4/ip_unprivileged_port_start:

$ cat /proc/sys/net/ipv4/ip_unprivileged_port_start
1024
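
The check implied here can be sketched in shell. The function name needs_privilege is hypothetical; the optional second argument overrides the threshold so the sketch can be tried without reading /proc.

```shell
# Hypothetical helper: succeeds if binding to the port needs extra privileges,
# i.e. the port is below net.ipv4.ip_unprivileged_port_start.
# $1: port number
# $2: optional threshold; defaults to the value the kernel currently reports
needs_privilege() {
    port=$1
    start=${2:-$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start)}
    [ "$port" -lt "$start" ]
}

if needs_privilege 908 1024; then
    echo "port 908 is below the unprivileged range"
fi
```

With the default threshold of 1024, port 908 falls into the privileged range, so the socket in echo.socket has to be created by systemd rather than by the unprivileged user.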

The file /etc/systemd/system/echo.service contains:

[Unit]
Description=Podman container-echo.service
Wants=network-online.target
After=network-online.target
#RequiresMountsFor=%t/containers

[Service]
PAMName=login
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
User=test
ExecStart=/usr/bin/podman run \
        --cidfile=/var/tmp/%n.ctr-id \
        --conmon-pidfile /var/tmp/conmon-pidfile \
        --cgroups=no-conmon \
        --rm \
        --sdnotify=conmon \
        --replace \
        --name echo \
        --network none ghcr.io/eriksjolund/socket-activate-echo
ExecStartPost=+/var/tmp/notify-mainpid /var/tmp/conmon-pidfile
ExecStop=/usr/bin/podman stop \
        --ignore -t 10 \
        --cidfile=/var/tmp/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
        -f \
        --ignore -t 10 \
        --cidfile=/var/tmp/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

A summary of the proof-of-concept demo

  • as the user test first run podman unshare /bin/true to make sure the podman user namespace is created (otherwise the environment variable SYSTEMD_EXEC_PID will not match the normal podman process PID. See the related topic `podman PID != service MainPID` happens when catatonit is not already running #18842)
  • /usr/bin/podman run --sdnotify=conmon --conmon-pidfile /var/tmp/conmon-pidfile ...
  • modify libpod/container_internal.go to only send READY=1 (not MAINPID=), and to sleep 5 seconds afterwards.

As soon as libpod/container_internal.go has sent READY=1, systemd will start
the executable /var/tmp/notify-mainpid as root because the service was configured with

ExecStartPost=+/var/tmp/notify-mainpid /var/tmp/conmon-pidfile

notify-mainpid is a little program I wrote as a proof-of-concept

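#include <systemd/sd-daemon.h>
#include <cstdio>
#include <cstdlib>
#include <format>
#include <fstream>
#include <string>

int main(int argc, char *argv[]) {
  if (argc != 2) {
    fprintf(stderr, "error: incorrect number of arguments\n");
    return 1;
  }
  // Read the conmon PID from the pidfile given as the only argument.
  std::ifstream mainpid_stream(argv[1]);
  pid_t mainpid;
  if (!(mainpid_stream >> mainpid)) {
    fprintf(stderr, "error: could not read a PID from %s\n", argv[1]);
    return 1;
  }
  // SYSTEMD_EXEC_PID is used as the PID of the podman process.
  const char *podmanpidstr = getenv("SYSTEMD_EXEC_PID");
  if (podmanpidstr == nullptr) {
    fprintf(stderr, "error: SYSTEMD_EXEC_PID is not set\n");
    return 1;
  }
  pid_t podmanpid = atoi(podmanpidstr);
  // Send the notification on behalf of the podman process.
  std::string msg = std::format("MAINPID={}\nREADY=1", mainpid);
  sd_pid_notify(podmanpid, 0, msg.c_str());
  return 0;
}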

notify-mainpid sends a notification message on behalf of the podman process (the current MAINPID) and notifies systemd that MAINPID should be equal to the conmon PID.
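
To make the message format concrete, the pidfile-reading half of notify-mainpid can be sketched in shell. This only builds the payload; sending it on behalf of another PID is what sd_pid_notify does in the C++ program and is not reproduced here. The function name build_notify_msg is hypothetical.

```shell
# Hypothetical helper: read a PID from a pidfile and print the notification
# payload that notify-mainpid would send ("MAINPID=<pid>" then "READY=1").
build_notify_msg() {
    pidfile=$1
    mainpid=$(cat "$pidfile") || return 1
    case $mainpid in
        # Reject empty content or anything that is not purely digits.
        ''|*[!0-9]*) return 1 ;;
    esac
    printf 'MAINPID=%s\nREADY=1' "$mainpid"
}
```

For a pidfile containing 4839, the output is the two fields MAINPID=4839 and READY=1 separated by a newline, the same pair that appears in the journalctl output of the earlier test.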

I created a branch
https://github.com/eriksjolund/podman/tree/issue-12778-proof-of-concept-sdnotify-conmon

where I put the code. There is room for a lot of improvements, for example to replace the racy solution with the 5 seconds delay with something else.

This demo was for --sdnotify=conmon. Doing something similar for --sdnotify=container will be more complicated as more synchronization will have to take place. Maybe OpenFile= could be used to improve security and synchronization.
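
One way to picture replacing the fixed 5 second delay is the trigger-file idea floated earlier in the thread: the waiting side polls for a file that the other side creates when it is done. The shell sketch below only illustrates that idea; it is not taken from the proof-of-concept branch, and the function name and the one-second polling interval are assumptions.

```shell
# Hypothetical helper: wait until a trigger file exists, or give up after a timeout.
# $1: path of the trigger file
# $2: timeout in whole seconds
wait_for_file() {
    path=$1
    timeout=$2
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1    # timed out, the file never appeared
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Compared with an unconditional sleep, this returns as soon as the file appears and fails explicitly when it does not, though polling is still coarse and a real implementation would probably want a proper synchronization primitive.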

@quulah

quulah commented Sep 16, 2023

I've been following this issue for a while. And have had a few tries at it as well.

I think I've gotten it to work now, with User= on the unit, with Type=forking and PIDFile= as empty.

Admittedly, this is for a hobby Minecraft server, but I've been thinking of using the same pattern for actual production stuff as well. So, I wonder if I'm missing something here as this does seem to work. :) If it's only the notify things, then I assume I can go forward with this solution as stopping, starting and so on seem to be fine.

Logs also work via journald. With the conmon name and PID, but the container name is in the journald JSON output, so I can grab it from there if necessary.

I've generated the unit file with the containers.podman.podman_container Ansible module, but that should call the same generate as podman. Additionally, I've created the /run/user/<uid> directory for the user, but other than that I think this looks pretty much out-of-the-box.

@vrothberg
Member

Thanks for sharing, @quulah.

I think I've gotten it to work now, with User= on the unit, with Type=forking and PIDFile= as empty.

With Type=forking systemd may choose the wrong PID as the main PID. SDNotify policies won't work either. So I fear it's not a generic solution for other workloads.

@sjpb

sjpb commented Sep 19, 2023

I've kind of forgotten everything that's been tried, but what's wrong with using Type=simple (it doesn't appear in this thread)? That definitely works (for at least some versions of podman, etc).

@vrothberg
Member

The only supported ways of running Podman inside of systemd are via the units generated by podman generate systemd and via Quadlet. generate systemd is slowly being deprecated in favor of Quadlet.

Those generated units use Type=notify to achieve the best integration with systemd as possible. That, among other things, enables restart policies to work, auto updates and rollbacks, and to make use of custom notify policies that the container workloads may require.

Type=simple should work as well but only for simple use cases. But it's leaving supported terrain. Certainly if there is a Podman bug, it will be fixed.

@sjpb

sjpb commented Sep 19, 2023

But podman generate systemd already can't do User=, so this whole issue is unsupported then I guess.

@vrothberg
Member

At the moment, it is unsupported but this issue is meant to find means where it can be supported.

@ygalblum
Contributor

Maybe I'm wrong here (I didn't read the entire thread). But, since the user is already defined as lingering, what about running the systemd service as a user service (~/.config/systemd/user/ for generate or ~/.config/containers/systemd/ for Quadlet) instead of setting the User field?

@vrothberg
Member

@ygalblum, in most cases it's a UX issue. It's easier to manage rootless services as the root user when the services make use of User=.

@tomhughes

Specifically what I would like is to be able to use DynamicUser= so I don't even have to worry about pre-creating users for each service, never mind writing user units for them all.

@rhatdan
Member

rhatdan commented Sep 19, 2023

So do you want to run podman as a separate user, or do you want your containers all running with different users? You could use --userns=auto for the second option.

@tomhughes

Sure, that runs each container in a separate namespace on the container side, but they're still all running as the same user on the host side. So if any vulnerability allows breaking out of the container, there's no isolation left between the containers, and if that user is root then it's game over.

@runiq

runiq commented Sep 19, 2023

@tomhughes If you use SELinux and mount with :z, shouldn't that help alleviate issues like these?

@quulah

quulah commented Sep 19, 2023

@vrothberg Thanks, the wrong PID being selected would indeed be a source of confusion and weirdness. :)

I also realized that I'd be missing out on auto update since, to my limited understanding, that requires visibility of both the systemd unit and the container: the label is checked, the image pulled, and the unit then restarted. With a root unit and a user container that would probably require some workarounds.

I'm also mostly after the UX here, so that it's similar to how any other service is managed. DynamicUser= would indeed be nice as well.

For what it's worth, I realized that systemd now has better support for managing user units as root. It has had it for a while, but I'd missed it.

https://github.com/systemd/systemd/blob/28795f2c138203fb700fc394f0937708af886116/NEWS#LL2820C10-L2820C10

systemctl --user -M lennart@ start quux

While this is not quite systemctl start quux, it's probably close enough for my purposes. No fiddling about with machinectl or stuff like that, at least.
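A small wrapper can shorten that further. This is only a sketch: `userctl` is a made-up name, and it assumes it is run as root on a systemd version that accepts the `-M user@` syntax.

```shell
# Hypothetical helper: run systemctl against another user's service manager.
# Usage: userctl <user> <systemctl arguments...>
userctl() {
    user=$1
    shift
    # "-M <user>@" targets that user's systemd instance on the local host.
    systemctl --user -M "${user}@" "$@"
}

# Example calls (user and unit names are placeholders):
# userctl lennart start quux
# userctl lennart status quux
```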

And I can do the configuring with Quadlet which seems like the way to go now. Generating systemd units hasn't been a problem, since I can leverage Ansible for that, but Quadlet is a nice abstraction.

@markstos
Contributor

@quulah Are you suggesting that in a file at /etc/systemd/system/wrapper.service, you would have an ExecStart= line like:

ExecStart=systemctl --user -M lennart@ start other-service.service

@rhatdan
Member

rhatdan commented Sep 19, 2023

@tomhughes No, --userns=auto runs everything as a different user. There is no overlap with the root user or the user who ran the container: the UID running podman is not in the user namespace of the container processes. There is some risk in just running Podman, which could be mitigated by running it as a different user for each run. However, from an SELinux point of view, there is a chance that the container MCS ranges could overlap. If two different users run a podman command, there is no guarantee that the containers will not run with the same SELinux label. SELinux separation is only guaranteed within a single podman database.

@vrothberg
Member

And I can do the configuring with Quadlet which seems like the way to go now. Generating systemd units hasn't been a problem, since I can leverage Ansible for that, but Quadlet is a nice abstraction.

@quulah the latest version of the Ansible Role for Podman supports Quadlet as well.

@quulah

quulah commented Sep 20, 2023

@markstos Not really, just documenting the fact that you need to use --user -M when kicking these services.

@ppenguin

ppenguin commented Oct 12, 2023

Specifically what I would like is to be able to use DynamicUser= so I don't even have to worry about pre-creating users for each service, never mind writing user units for them all.

This is also what I need, and it seems to be a pretty valid use case. Additionally, for persistent state one can still use StateDirectory, User and Group in combination with DynamicUser. The advantage is that one doesn't have to take care of creating the StateDirectory and can use LoadCredential transparently.

I actually almost got it working but ran into a brick wall with this issue, i.e. podman(-compose) appears to choke on newuidmap, presumably because the DynamicUser environment has permissions that are limited in some unknown way.

Hacking away with things like AmbientCapabilities = "CAP_SETUID" and/or verifying the capabilities on newuidmap didn't make a difference.

(I got the normal systemd --user stuff working pretty well, but it's extremely cumbersome (even on a declarative system like NixOS), because you have to manually take care of ensuring the service users etc. and their respective home dirs, like @Gchbg and @tomhughes already mentioned).

@Visne

Visne commented Nov 1, 2023

I've kind of forgotten everything that's been tried, but what's wrong with using Type=simple

Type=simple should work as well but only for simple use cases. But it's leaving supported terrain.

Since I don't think it was mentioned yet, you should probably not do that since it can happen that the Podman process is killed while the container keeps running (see #9642 (reply in thread)).

@sjpb

sjpb commented Nov 2, 2023

I'm still confused why we're all having problems with this; clearly using User= is not the recommended/supported approach. So the recommended/supported approach really is to run containers as root? Am I missing something and people generally think that's ok? Non-containerised services wouldn't be running as root right? So why is it ok to run containerised services as root?

@mattventura

I'm still confused why we're all having problems with this; clearly using User= is not the recommended/supported approach. So the recommended/supported approach really is to run containers as root? Am I missing something and people generally think that's ok? Non-containerised services wouldn't be running as root right? So why is it ok to run containerised services as root?

Personally, I have taken to just running it as a user service with lingering enabled. It still lets me start it on boot, and manage/observe it via systemctl and journalctl.

@rhatdan
Member

rhatdan commented Nov 2, 2023

I think at this point we should change this to a discussion. User= causes lots of issues with running podman, and rootless support is fairly easy. I also recommend that people look at using rootful with --userns=auto, which will run each of your containers in a unique user namespace.

@containers containers locked and limited conversation to collaborators Nov 2, 2023
@rhatdan rhatdan converted this issue into discussion #20573 Nov 2, 2023
