pursuing conventional systemd+podman interaction #6400

Closed
andrewgdunn opened this issue May 27, 2020 · 63 comments · Fixed by #6666

Labels: locked - please file new issue/PR · stale-issue

@andrewgdunn

This is an RFE after talking with @mheon for a bit in IRC (thanks for that, sorry I kept you so late). In the shortest form I can think of, the enhancement would be: facilitate podman/conmon interacting with systemd in a way that provides console output for systemctl and journalctl. In bullet form (a rough sketch of the setup follows the list):

  • create a "system" user (e.g. UID/GID less than 1000 by convention), set shell to /sbin/nologin
  • create a sub-UID and sub-GID mapping for that user
  • create a "system" level unit file (e.g. /etc/systemd/system/<unit>.service) that specifies that "system" user in User=.
  • systemctl start <unit>.service and be able to see the console output of the container
  • journalctl -u <unit>.service and be able to see the historical console output of the container
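
A minimal sketch of that setup (the user name svc-mattermost and the subordinate ID range are placeholders, not from this issue):

# create a "system" user (UID < 1000) with no login shell
useradd --system --create-home --shell /sbin/nologin svc-mattermost
# grant it subordinate UID/GID ranges for rootless podman
echo "svc-mattermost:200000:65536" >> /etc/subuid
echo "svc-mattermost:200000:65536" >> /etc/subgid
# a system unit in /etc/systemd/system/<unit>.service would then set User=svc-mattermost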

My use case is that I want to use podman to run images that are essentially "system" services, but as "user" because I want the rootless isolation. I've been consuming podman for a bit now (starting with 1.8.2) and am likely stuck on that version because in newer versions my approach gets broken: I lose all logging from the container. I have tried --log-driver=journald but have no idea how to find a hand-hold for the console output (what -u should I be looking for? Because it's not .service, and it's not the container... and it's not podman-.scope). Basically podman doesn't provide the init system with a console hand-hold, so I'm rolling blind.

Here is an example of mattermost; under 1.8.2 this works how I'd like it to work (i.e. I'm getting console output). I'm doing some things differently from what podman generate systemd offers, because my explicit goals are to:

  • run the container rootless under a "system" user
  • see what the heck is going on inside the container with my init system (rather than having to sudo -u <user> -h <home> podman logs <container-name>)
[root@vault ~]# systemctl cat podman-mattermost.service 
# /etc/systemd/system/podman-mattermost.service
[Unit]
Description=Podman running mattermost
Wants=network.target
After=network-online.target
Requires=podman-mattermost-postgres.service

[Service]
WorkingDirectory=/app/gitlab
User=gitlab
Group=gitlab
Restart=no
ExecStartPre=/usr/bin/rm -f %T/%N.pid %T/%N.cid
ExecStartPre=/usr/bin/podman rm --ignore -f mattermost
ExecStart=/usr/bin/podman run --conmon-pidfile %T/%N.pid --cidfile %T/%N.cid --cgroups=no-conmon \
  --name=mattermost \
  --env-file /app/gitlab/mattermost/mattermost.env \
  --publish 127.0.0.1:8065:8065 \
  --security-opt label=disable \
  --health-cmd=none \
  --volume /app/gitlab/mattermost/data:/mattermost/data \
  --volume /app/gitlab/mattermost/logs:/mattermost/logs \
  --volume /app/gitlab/mattermost/config:/mattermost/config \
  --volume /app/gitlab/mattermost/plugins:/mattermost/client/plugins \
  docker.io/mattermost/mattermost-team-edition:release-5.24
ExecStop=/usr/bin/podman stop --ignore mattermost -t 30
ExecStopPost=/usr/bin/podman rm --ignore -f mattermost
ExecStopPost=/usr/bin/rm -f %T/%N.pid %T/%N.cid
KillMode=none
Type=simple

[Install]
WantedBy=multi-user.target default.target
[root@vault ~]# systemctl cat podman-mattermost-postgres.service 
# /etc/systemd/system/podman-mattermost-postgres.service
[Unit]
Description=Podman running postgres for mattermost
Wants=network.target
After=network-online.target podman-mattermost.service
PartOf=podman-mattermost.service

[Service]
WorkingDirectory=/app/gitlab
User=gitlab
Group=gitlab
Restart=no
ExecStartPre=/usr/bin/rm -f %T/%N.pid %T/%N.cid
ExecStartPre=/usr/bin/podman rm --ignore -f postgres
ExecStart=/usr/bin/podman run --conmon-pidfile %T/%N.pid --cidfile %T/%N.cid --cgroups=no-conmon \
  --name=postgres \
  --env-file /app/gitlab/mattermost/postgres.env \
  --net=container:mattermost \
  --volume /app/gitlab/mattermost/postgres:/var/lib/postgresql/data:Z \
  docker.io/postgres:12
ExecStop=/usr/bin/podman stop --ignore postgres -t 30
ExecStopPost=/usr/bin/podman rm --ignore -f postgres
ExecStopPost=/usr/bin/rm -f %T/%N.pid %T/%N.cid
KillMode=none
Type=simple

[Install]
WantedBy=multi-user.target default.target

With these units above I am able to:

  • run as rootless as the "system" user (in this case I'm running both gitlab and mattermost)
  • see the console output (notice the Type=simple and lack of -d)
    • in both systemctl <unit> and journalctl -u <unit>
  • have the container instance be ephemeral (excessive ExecPre and ExecStop)
  • have a shared networking namespace so that mattermost and postgres can talk
  • have container level dependencies represented through the init system
    • podman-mattermost.service requires the podman-mattermost-postgres.service (Requires=)
    • podman-mattermost-postgres.service will get a stop signal if I stop podman-mattermost.service (PartOf=)
    • there are challenges here where podman-mattermost.service closes out the networking namespace before podman-mattermost-postgres.service can finish up (I think), so it's not ideal... I'd be interested in suggestions.

Tagging @lsm5 as well, since I think for my use case I'm relegated to 1.8.2 in F32 for the time being... so I am wondering if that is going away anytime soon?

@andrewgdunn (Author)

In case I didn't state it clearly: I did try to adopt 1.9.2. It requires a couple of things (but ultimately does not work well; a rough sketch follows the list below). #6084 has some more information as well.

  • do the loginctl enable-linger on the "system" user
  • switch over to the more conventional things from podman generate systemd like -d, Type=forking
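
A rough sketch of those two steps (the service user and container name are placeholders):

loginctl enable-linger svc-mattermost
# as that user, emit a conventional unit for an existing container
podman generate systemd --files --name mattermost
# the generated unit is Type=forking and wraps podman start/stop
# (or podman run -d when --new is passed)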

Starting this up, you can only see console output from the container by doing sudo -u <user> -h <home> podman logs <containername> where systemctl/journalctl give you nothing.

The --log-driver=journald doesn't allow for anything better... because I can't figure out what the unit is to actually query logs from (I think it might be some composite of the container id?)... and when you do a sudo -u <user> -h <home> podman logs <containername> you get nothing.

@lsm5 (Member)

lsm5 commented May 27, 2020

you can get 1.8.2-2 from https://koji.fedoraproject.org/koji/buildinfo?buildID=1479547

I'll save it to my fedorapeople page as well and send you the URL later.

@giuseppe (Member)

If you enable linger mode and the user session is already running, is there any disadvantage in installing the .service file into ~/.config/systemd/user/?

@andrewgdunn (Author)

andrewgdunn commented May 27, 2020

@giuseppe for you to be able to do that you'd need a shell for that "system" account. Above I'm creating the user as root with a /sbin/nologin shell. To access the systemctl --user session you'd actually need to log in, or you'd need to set the XDG_RUNTIME_DIR variable... I think... (it could also be DBUS_SESSION_BUS_ADDRESS), like XDG_RUNTIME_DIR=/run/user/$UID systemctl --user status. It generally gets messy.
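
For illustration, assuming the account is named gitlab with UID 980 (both hypothetical), it looks roughly like:

# keep the per-user systemd instance running without a login session
loginctl enable-linger gitlab
# then point systemctl at that user's runtime directory
sudo -u gitlab XDG_RUNTIME_DIR=/run/user/980 systemctl --user status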

Also, this suggestion doesn't address what I'm primarily asking for above: getting console output from the running container to be seen by systemd/journald. Without being able to see the combination of:

  • systemd log output
  • podman log output
  • the console output of the running container

You have an extremely hard time figuring out what is going on with the system (you have to look in multiple places to piece together the state of errors).

@mheon (Member)

mheon commented May 27, 2020

@vrothberg The core ask here (viewing logs for systemd-managed Podman) seems to be a pretty valid one - our current Type=forking approach does break this, and podman logs becomes very inconvenient when the services are running rootless and you have to sudo into each of them to get logs.

I was thinking that it ought to be possible for the journald log driver to write straight to the logs for the unit file if we know it, and we did add something similar for auto-update?

@vrothberg (Member)

There was a very similar request by @lucab: coreos/fedora-coreos-docs#75 (comment)

I was also thinking about the log driver 👍

@rhatdan (Member)

rhatdan commented Jun 9, 2020

@ashley-cui Could you look into the --log-driver changes?

@goochjj (Contributor)

goochjj commented Jun 12, 2020

@storrgie I've been pursuing similar things recently.

Do -d, and keep the forking.
Enable --log-driver journald

That alone should take care of all container logs showing up in journald; you just need to run:
journalctl CONTAINER_NAME=mattermost

As conmon will be providing those keys - CONTAINER_ID and CONTAINER_NAME. I've been doing lots of testing; basically what I've been doing is: start a container, generate the output to journald, then use journalctl -n 10 to grab the last 10 lines and find a line it logged, tweaking for 20 or 30 lines or whatever it takes. Then journalctl -n 10 -o json-pretty or -o json to get the raw line and figure out what other metadata you have to work with.

You could use CONTAINER_TAG too... i.e. add --log-opt tag=WhateverYouWant and find it with
journalctl CONTAINER_TAG=WhateverYouWant

If you want it to show under the unit, like I do, I do this:
--cgroup-parent=/system.slice/%n --cgroup-manager cgroupfs

Note, my container is root, not rootless, and the host is running Flatcar. My guess is you can get similar results by possibly tweaking the cgroup-parent. By putting the processes under the cgroup, systemd finds that they're associated with a unit - but I'd expect conmon being in the correct cgroup SHOULD be all you need.

The added benefit of running all the processes in the systemd service's cgroup is that a bind-mounted /dev/log ALSO associates to the unit, automagically. You don't get the automagic CONTAINER_NAME from conmon journald records, but you DO get anything you put in the service file via LogExtraFields= - so you could use that to find your logs as well.
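
Putting those pieces together, a hedged sketch of such a unit (unit/container/tag names and the image are illustrative; the cgroupfs caveats raised later in this thread still apply):

# /etc/systemd/system/podman-mattermost.service (excerpt)
[Service]
Type=forking
ExecStart=/usr/bin/podman --cgroup-manager cgroupfs run -d \
  --name=mattermost \
  --log-driver journald \
  --log-opt tag=mattermost \
  --cgroup-parent=/system.slice/%n \
  docker.io/mattermost/mattermost-team-edition:release-5.24

# query the container's output afterwards with either of:
#   journalctl CONTAINER_NAME=mattermost
#   journalctl CONTAINER_TAG=mattermost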

@TravisBowers

I'm running rootless containers on Fedora Server. I'm able to see logs using --log-opt tag=<tag> and journalctl CONTAINER_TAG=<tag>. However, when I add --cgroup-parent=/system.slice/%n --cgroup-manager cgroupfs, my units fail with result 'exit-code'. @rhatdan, are they failing because they're rootless?

@mheon (Member)

mheon commented Jun 12, 2020

I really do not recommend running --cgroup-manager=cgroupfs with systemd-managed Podman - you end up with both systemd and Podman potentially altering the same cgroup, and I think there's the potential for them to trample each other. If you want to stay in the systemd cgroup, I'd recommend using the crun OCI runtime and passing --cgroups=disabled to prevent Podman from creating a container cgroup. We lose the ability to set resource limits, but you can just set them from within the systemd unit, so it's not a big loss.
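
A minimal sketch of that recommendation (names, image, and limits are illustrative; it assumes crun is installed and configured as a runtime):

[Service]
# resource limits live in the unit rather than in a container cgroup
MemoryMax=2G
CPUQuota=200%
ExecStart=/usr/bin/podman --runtime crun run --rm \
  --cgroups=disabled \
  --name=mattermost \
  docker.io/mattermost/mattermost-team-edition:release-5.24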

@mheon (Member)

mheon commented Jun 12, 2020

(There is also --cgroups=no-conmon to only place Conmon in the systemd cgroup - we use that by default in unit files from podman generate systemd)

@andrewgdunn (Author)

andrewgdunn commented Jun 12, 2020

I see traffic on the mailing list from @rhatdan about an FAQ... I'm feeling more and more as I learn about this project that the idea this can "replace" docker is basically gimmicky at this stage. There is no clear golden pathway for running containers as daemons on systems with podman+systemd. It seems fraught with edge cases. I'd really love to see this ticket be taken seriously as I think there are a LOT of people trying to depart docker land and systemd+podman is a way to rid yourself of the docker monolithic daemon.

@mheon (Member)

mheon commented Jun 12, 2020

I think we definitely need a single page containing everything we recommend about running containers inside units (best practices, and the reasons for them). I've probably explained why we made the choice for forking vs simple five times at this point; having a single page with a definitive answer on that would be greatly helpful to everyone. We'll need to hash some things out as part of this, especially the use of rootless Podman + root systemd as this issue asks, but even getting the basics written down would be a start.

@lucab (Member)

lucab commented Jun 13, 2020

@mheon that would indeed help, but I'm not sure that's going to solve much. For example, from the thread at coreos/fedora-coreos-docs#75, that content currently exists in the form of a blog post which unfortunately is:

  • already stale at this point (podman-generate does not generate that unit anymore)
  • not really integrating well with systemd service handling (e.g. journald, sd-notify, user setting, etc)
  • somewhat concerning/fragile (e.g. KillMode=none)

I think it would be better to first devise a podman mode which works well when integrated in the systemd ecosystem, and only then document it.

As a side note, many containerized services (e.g. etcd, haproxy, etc.) do use sd-notify in order to signal when they are actually initialized and ready to start serving requests. For that kind of autoscale-friendly logic to work, a Type=notify service unit would be required.
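
For illustration, such a unit might look roughly like this (hedged sketch: --sdnotify support was still being worked on around the time of this thread, and the name/image are placeholders):

[Service]
Type=notify
# allow READY=1 to come from the container payload rather than the main PID
NotifyAccess=all
ExecStart=/usr/bin/podman run --rm --sdnotify=container --name=etcd \
  quay.io/coreos/etcd:v3.4.9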

@mheon (Member)

mheon commented Jun 13, 2020 via email

@mheon (Member)

mheon commented Jun 13, 2020 via email

@goochjj (Contributor)

goochjj commented Jun 13, 2020

We got the User= setting working; it was mainly a problem with -d, no? Unless there's something else outstanding, I think that's solved. Similarly, the journald log driver works well for me... unless you try to log a tty, which would be a bad idea anyway, now that exec is fixed.

Systemd integration isn't great with docker either - docker's log driver is exactly analogous to what conmon does; docker's containers are launched by the daemon, which puts them in another cgroup unless you use cgroup-parent tricks; and sometimes getting the container to work right w.r.t. logging and cgroups requires hacks like systemd-docker, which throws a hacky shim around sd-notify. So are we really saying podman+systemd is somehow worse? Or just not better? Because it seems better to me. Doesn't seem like Docker has a golden pathway either.

I've run docker w/ cgroup-parent sharing the unit's cgroup and systemd-docker (even though it's unsupported) for over a year, and haven't had any problems with systemd and docker fighting. I'm not sure why podman would... but I defer to the experts.

The only thing I have with docker now that I don't have with podman is bind mounting /dev/log works - because I put the docker container in the same cgroup as the unit. Without that, I'd need some sort of syslog proxy, which would probably have to live in conmon, and is a whole other discussion and probably only relevant to me.

@vrothberg (Member)

@mheon that would indeed help, but I'm not sure that's going to solve much. For example, from the thread at coreos/fedora-coreos-docs#75, that content currently exists in the form of a blog post which unfortunately is:

* already stale at this point (podman-generate does not generate that unit anymore)

That's not accurate. We just updated the blog post last week and do that regularly. The units are still generated the same way. Once Podman v2 is out, we need to create some upstream docs as a living document and point the blog post there.

* not really integrating well with systemd service handling (e.g. journald, sd-notify, user setting, etc)

We only support Type=forking with podman generate systemd.

* somehow concerning/fragile (e.g. `KillMode=none`)

We've been discussing that already in depth. We want Podman to handle shutdown (and killing) and prevent signal races with systemd, which does not know the order in which all processes should be killed.

I think it would be better to first devise a podman mode which works well when integrated in the systemd ecosystem, and only then document it.

As a sidenote, many containerized services (eg. etcd, haproxy, etc.) do use sd-notify in order to signal when they are actually initialized and ready to start serving requests. For that kind of autoscale-friendly logic to work, a Type=notify service unit would be required.

Type=notify is supported but we don't generate them with podman generate systemd. I guess this could be part of an upstream doc?

@vrothberg (Member)

I think we definitely need a single page containing everything we recommend about running containers inside units (best practices, and the reasons for them). I've probably explained why we made the choice for forking vs simple five times at this point; having a single page with a definitive answer on that would be greatly helpful to everyone. We'll need to hash some things out as part of this, especially the use of rootless Podman + root systemd as this issue asks, but even getting the basics written down would be a start.

I agree, and reached a similar conclusion last week when working with support on some issues. Once v2 is out (and all fixes are in), I'd love us to create a living upstream document that the blog post can link to.

@vrothberg (Member)

I opened #6604 to break out the logging discussion.

@lucab (Member)

lucab commented Jun 15, 2020

@vrothberg thanks! I shouldn't have piled up more topics in here, sorry for that.
If you prefer, I can split the other ones (e.g. sd-notify) to their own tickets, so they can be incrementally closed as soon as we are done.

@vrothberg (Member)

No worries at all, @lucab! All input and feedback is much appreciated.

If you prefer, I can split the other ones (e.g. sd-notify) to their own tickets, so they can be incrementally closed as soon as we are done.

That would be great, sure. While we support sd-notify, we don't generate these types. Having a dedicated issue will help us agree on what such a unit should look like and eventually get that into upstream docs (and man pages). Thanks a lot!

@goochjj (Contributor)

goochjj commented Jun 17, 2020

Since we're having this discussion, and there's plenty of talk about KillMode, cgroups, and where things should reside - it makes sense to me that podman's integration with systemd already has a blueprint: systemd-nspawn. The systemd-nspawn@.service unit includes things like:

KillMode=mixed
Delegate=yes
Slice=machine.slice

This means (among other things) you end up with
/machine.slice/unit.service/supervisor - which contains the systemd-nspawn ("conmon"-esque) process, and
/machine.slice/unit.service/payload - which contains the contained processes

And systemd has no problem monitoring the supervisor PID, I'm guessing because Delegate is set and it's a sub-cgroup.

nspawn has options like --slice, --property, --register, and --keep-unit - probably all of which should be implemented similarly in podman... and the caveats are already spelled out in the documentation.

https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html

nspawn also has options for the journal - how it's bind mounted and supported, plus setting the machine ID properly for those logs... etc.

I'd imagine we'd want nspawn to be the template?
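
For comparison, a hedged sketch of a Podman unit borrowing those nspawn directives (names and image are placeholders; --cgroups=split is the mode being discussed in this thread, not an established flag at the time):

[Service]
Slice=machine.slice
Delegate=yes
KillMode=mixed
ExecStart=/usr/bin/podman run --rm --cgroups=split --name=mattermost \
  docker.io/mattermost/mattermost-team-edition:release-5.24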

@goochjj (Contributor)

goochjj commented Jun 17, 2020

And doing Delegate and sub-cgroups like that also means systemctl status knows the Main PID is the supervisor, but still shows the full process tree, including the payload, clearly in the status output; and the service type is notify (sd-notify), so I imagine it's talking back to systemd to let it know these things.

@goochjj (Contributor)

goochjj commented Jun 17, 2020

For that matter, I've wondered if it's possible to use/wrap/hack/mangle something into place to allow systemd-nspawn itself to be the OCI container runtime, instead of crun or runc. More of a thought experiment than anything else, but the key hangup seems to be that nspawn wants a specific mount to use, which podman can provide since it already did all the work to create the appropriate overlay bind mount.

Probably involves reading config.json and turning it into command line arguments? I'm unclear separation-wise which parts of the above fit into which parts of the execution lifecycle.

@mheon (Member)

mheon commented Jun 17, 2020

There was talk about making nspawn accept OCI specs, but even that may not be necessary. I don't know how well it would interface with Conmon though.

On the Delegate change - I'd have to think more about what this means for containers which forward host cgroups into the container (we'll need a way to guarantee that the entire unit cgroup isn't forwarded). I also think we'll need to ensure that the container remembers it was started with cgroupfs, so that other Podman commands launched from outside the unit file that require cgroups (e.g. podman stats) still work.

@giuseppe (Member)

To simulate what nspawn does, we'd need to tell the OCI runtime to use the cgroup already created by conmon instead of creating a new one.

The next crun version will automatically create a /container sub-cgroup in the same way nspawn does.

I think we can go a step further and get closer to what nspawn does by having a single cgroup for the conmon + container payload.

@goochjj (Contributor)

goochjj commented Jun 30, 2020

@jdoss If you're using SELinux, I suggest you compile and place the crun binary in /usr/local/bin, as that folder is recognized in the policy. If you're going to have a local podman or runc or crun it should be there, and chcon'd to match, i.e. chcon --reference=/usr/bin/crun /usr/local/bin/crun

In /etc/containers/containers.conf:

runtime = "crun"

[engine.runtimes]
crun = [ "/usr/local/bin/crun" ]

Or specify it on the command line as @giuseppe indicated.
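
For example, something along these lines (runtime path per the suggestion above; the image is just a placeholder):

podman --runtime /usr/local/bin/crun run --rm -it docker.io/library/alpine sh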

@jdoss (Contributor)

jdoss commented Jun 30, 2020

@goochjj and @giuseppe I just compiled crun from master, put it in /usr/local/bin/crun, and I'm still getting the same error:

# /usr/local/bin/crun --version
crun version 0.13.227-d38b
commit: d38b8c28fc50a14978a27fa6afc69a55bfdd2c11
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
Jun 30 15:48:12 mycool mycool-elasticsearch[65963]: [conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 15:48:12 mycool conmon[65963]: conmon c0cf8da55a1936150298 <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: attach sock path: /tmp/run-1001/libpod/tmp/socket/c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7/attach
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: addr{sun_family=AF_UNIX, sun_path=/tmp/run-1001/libpod/tmp/socket/c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7/attach}
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: terminal_ctrl_fd: 13
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: winsz read side: 15, winsz write side: 15
Jun 30 15:48:12 mycool conmon[65965]: conmon c0cf8da55a1936150298 <nwarn>: Failed to chown stdin
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <error>: Failed to create container: exit status 1
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="Received: -1"
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="Cleaning up container c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7"
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="unmounted container \"c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7\""
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="ExitCode msg: \"cannot set limits without cgroups: oci runtime error\""
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: Error: cannot set limits without cgroups: OCI runtime error
Jun 30 15:48:12 mycool systemd[1]: mycool-elasticsearch.service: Control process exited, code=exited, status=126/n/a

@goochjj (Contributor)

goochjj commented Jun 30, 2020

Add --pids-limit 0 to your run args

@goochjj (Contributor)

goochjj commented Jun 30, 2020

Wait, you're on cgroups v2 now? I don't have that problem under cgroups v2 rootless. What does cat /proc/self/cgroup show?

@jdoss (Contributor)

jdoss commented Jun 30, 2020

--pids-limit 0 does let the containers start, but yeah, I booted FCOS into cgroups v2 with rootless here. I have a non-root user mycool that is being used via systemd to launch these containers.

[core@mycool ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

@goochjj (Contributor)

goochjj commented Jun 30, 2020

I can't get the infra container to start, because you're binding to ports 80 and 443 as non-root...

@goochjj (Contributor)

goochjj commented Jun 30, 2020

Setting /proc/sys/net/ipv4/ip_unprivileged_port_start takes care of that.

@goochjj (Contributor)

goochjj commented Jun 30, 2020

Hmm and there it is

If I remove your --pod it works

@jdoss (Contributor)

jdoss commented Jun 30, 2020

- path: /etc/sysctl.d/90-ip-unprivileged-port-start.conf
  mode: 0644
  contents:
    inline: |
      net.ipv4.ip_unprivileged_port_start = 0

To allow the pod to bind to those ports.

@goochjj (Contributor)

goochjj commented Jun 30, 2020

I think it's because you're using a pod.

When I run this as the user, rootless, I get this:

Pod creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(infracid).scope/container
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-conmon-(infracid).scope

Container (without split) creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(escid).scope/container
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-conmon-(escid).scope

Through Systemd as the user, I get this:
Pod creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(infracid).scope/container
/system.slice/mycool-pod.service

Container (without split) creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(escid).scope/container
/system.slice/mycool-elasticsearch.service

@goochjj (Contributor)

goochjj commented Jun 30, 2020

TLDR, @giuseppe would have to modify/extend another PR to handle pods.

It looks like when a container is spawned in a pod, it assumes its parent slice will be the parent cgroup path (which is reasonable). Since pod create doesn't have a --cgroups split option, the pod's conmon is attached to the service cgroup, and the pod's slice is in the user slice, divorced from the service's cgroup.

You can't simultaneously have a service (i.e. elasticsearch) be part of the unit's service, and also the pod's slice. Nor can you have a second systemd unit muck around with the pod's cgroup - that's probably a bad idea.

What's your desired outcome here, @jdoss?

/system.slice/mycool-pod.service/supervisor -> pod conmon
/system.slice/mycool-pod.service/container -> infra container
/system.slice/mycool-elasticsearch.service/supervisor -> conmon
/system.slice/mycool-elasticsearch.service/container -> ES processes

Then ALL the pod services aren't contained in a slice.

Right now it's
/system.slice/mycool-pod.service -> pod conmon
/(user's systemd service)/user.slice/user-libpod_pod_(podid).slice/libpod-(cid).scope/container -> infra procs
/system.slice/mycool-elasticsearch.service -> conmon
/(user's systemd service)/user.slice/user-libpod_pod_(podid).slice/libpod-(cid).scope/container -> elasticsearch procs

Is this insufficient in some way?

@goochjj (Contributor)

goochjj commented Jun 30, 2020

Or maybe we should do this in a more systemd-like way?

i.e. Slice=machines-mycool_pod.slice

Pod
/machines.slice/machines-mycool_pod.slice/mycool-pod.service/supervisor -> pod conmon
/machines.slice/machines-mycool_pod.slice/mycool-pod.service/container -> infra container
/machines.slice/machines-mycool_pod.slice/mycool-elasticsearch.service/supervisor -> conmon
/machines.slice/machines-mycool_pod.slice/mycool-elasticsearch.service/container -> ES processes

Then everything is properly in a parent slice - is this what we'd want split to do with pods?

If so, the --cgroups split would have to be set at the pod create level, and child services would have to know if split is passed, to not inherit the cgroup-parent of the pod.

@goochjj (Contributor)

goochjj commented Jun 30, 2020

--pids-limit 0 does let the containers start, but yea, I booted FCOS into cgroups v2 with rootless here. I have a non-root user mycool that is being used via systemd to launch these containers.

[core@mycool ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

@giuseppe I don't know what's causing this - but there are times when I need to set --pids-limit 0. It seems like there's a default of pids-limit 2048 coming from somewhere, not the config file and not the command line, and then when crun sees it can't do cgroups with pids-limit, it throws the runtime error.

If you happen to get the cgroup right - i.e. it's something crun can modify and it has a pids controller, then the error isn't present.

@jdoss (Contributor)

jdoss commented Jun 30, 2020

@goochjj I am trying to set things up so I can have many pods running under a rootless user (or users) via systemd units with the User= directive, with each stack of applications running as rootless containers inside its pod. Having everything in its own pod namespace as a rootless user is pretty great, since I don't need to juggle ports for each application stack, just the pod ports. I also like the isolation pods give each application stack deployment.

Since FCOS doesn't support user systemd units via Ignition, I have to set them up as system units. Which is fine, since I prefer system units over user units anyway, to prevent them from being modified by non-root users.

@goochjj (Contributor)

goochjj commented Jun 30, 2020

Right, but all this works for you without --cgroups split, correct? Is there something you're hoping to gain with --cgroups split?

@mheon (Member)

mheon commented Jun 30, 2020

The pids-limit is probably Podman automatically trying to set the maximum available for that rlimit - we should code that to only happen if cgroups are present.

@jdoss (Contributor)

jdoss commented Jun 30, 2020

@goochjj I was running FCOS with cgroups v1 up until I saw this thread introducing --cgroups split, so I started down the road of trying it with cgroups v2. My old setup that works on FCOS with cgroups v1 doesn't work at all on cgroups v2 without setting --pids-limit 0.

I am not trying to gain anything specific by using --cgroups split. I thought it would help provide me with a better setup for my use case.

@goochjj (Contributor)

goochjj commented Jul 1, 2020

@mheon I'm unclear on why cgroups aren't present... let alone that default.

It's really annoying, and seems to be cgroups v1 specific. Should I create this as a separate issue?

@mheon (Member)

mheon commented Jul 1, 2020

I believe that's a requirement forced on us by cgroups v1 not being safe for rootless use, unless I'm greatly misunderstanding?

@goochjj (Contributor)

goochjj commented Jul 1, 2020

@mheon I'm fine with that, as long as it doesn't explicitly require me to --pids-limit 0 everything, which it's currently doing.

This code

118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 302)            // then ignore the settings.  If the caller asked for a
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 303)            // non-default, then try to use it.
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 304)            setPidLimit := true
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 305)            if rootless.IsRootless() {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 306)                    cgroup2, err := cgroups.IsCgroup2UnifiedMode()
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 307)                    if err != nil {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 308)                            return nil, err
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 309)                    }
4352d58549 (Daniel J Walsh    2020-03-27 10:13:51 -0400 310)                    if (!cgroup2 || (runtimeConfig != nil && runtimeConfig.Engine.CgroupManager != cconfig.SystemdCgroupsManager)) && config.Resources.PidsLimit == sysinfo.GetDefaultPidsLimit() {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 311)                            setPidLimit = false
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 312)                    }
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 313)            }
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 314)            if setPidLimit {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 315)                    g.SetLinuxResourcesPidsLimit(config.Resources.PidsLimit)
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 316)                    addedResources = true
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 317)            }

in pkg/spec/spec.go seems to indicate it should already be ignoring the default on cgroups v1. I'm digging.

@goochjj (Contributor)

goochjj commented Jul 1, 2020

Cuz this isn't great.

(focal)mrwizard@FocalCG1Dev:~/src/podman
$ podman run --rm -it alpine sh
Error: cannot set limits without cgroups: OCI runtime error

@mheon (Member)

mheon commented Jul 1, 2020

This is definitely a bug. Is this 2.0? pkg/spec is deprecated, we've moved to pkg/specgen/generate - so the offending code likely lives there.

@goochjj (Contributor)

goochjj commented Jul 1, 2020

2.1.0-dev. Actually, master, plus my sdnotify

So, sounds like I should create a new issue.
:-D

@goochjj (Contributor)

goochjj commented Jul 1, 2020

#6834

@github-actions

github-actions bot commented Aug 1, 2020

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member)

rhatdan commented Aug 4, 2020

Fixed in master.

@rhatdan rhatdan closed this as completed Aug 4, 2020
@github-actions github-actions bot added the locked - please file new issue/PR label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023