
simplify podman systemd generate - remove cidfile #13236

Closed
grooverdan opened this issue Feb 14, 2022 · 33 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@grooverdan
Contributor

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind feature

Description

podman generate systemd creates an ExecStartPre line that removes the cidfile, an ExecStart line that writes the cidfile, an ExecStop line that uses it, and an ExecStopPost line that removes it.

For a Type=notify service with the MAINPID of conmon pushed (per comment #12778 (comment) / #9642), none of these cidfile usages are needed.

With Type=notify and an accurately communicated PID, and with conmon acting on all shutdown signals to stop the container, all of the cidfile-related directives can be removed.
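For context, the "accurately communicated PID" is just a datagram sent on $NOTIFY_SOCKET, per sd_notify(3). A minimal Python sketch of that protocol (illustrative only, not podman/conmon code):

```python
import os
import socket

def sd_notify(message: str) -> bool:
    """Send a state notification to the systemd notify socket, if any.
    Implements the datagram protocol described in sd_notify(3)."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False  # not running under a Type=notify service
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.connect(addr)
        s.send(message.encode())
    return True

# e.g. conmon reports readiness and its PID so systemd tracks the MAINPID:
# sd_notify(f"MAINPID={os.getpid()}\nREADY=1")
```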

Steps to reproduce the issue:

  1. bin/podman generate systemd --name --new --template {container}

Describe the results you received:

$ bin/podman generate systemd --name --new --template quizzical_satoshi
# container-quizzical_satoshi@.service
# autogenerated by Podman 4.0.0-dev
# Tue Feb 15 10:04:40 AEDT 2022

[Unit]
Description=Podman container-quizzical_satoshi.service for %I
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n-%i
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm -f %t/%n.ctr-id
ExecStart=/home/dan/repos/podman/bin/podman run --name=container-quizzical_satoshi-%i --cidfile=%t/%n.ctr-id --cgroups=no-conmon --rm --sdnotify=conmon -d -ti fedora:35
ExecStop=/home/dan/repos/podman/bin/podman stop --ignore --cidfile=%t/%n.ctr-id
ExecStopPost=/home/dan/repos/podman/bin/podman rm -f --ignore --cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

Describe the results you expected:

$ bin/podman generate systemd --name --new --template quizzical_satoshi
# container-quizzical_satoshi@.service
# autogenerated by Podman 4.0.0-dev
# Tue Feb 15 10:04:40 AEDT 2022

[Unit]
Description=Podman container-quizzical_satoshi.service for %I
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n-%i
Restart=on-failure
TimeoutStopSec=70
ExecStart=/home/dan/repos/podman/bin/podman run --name=container-quizzical_satoshi-%i --cgroups=no-conmon --rm --sdnotify=conmon -d -ti fedora:35
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ bin/podman version
Client:       Podman Engine
Version:      4.0.0-dev
API Version:  4.0.0-dev
Go Version:   go1.16.13
Git Commit:   5977fd509582d6dc8727ce8f8a78011888a1dc17-dirty
Built:        Tue Feb 15 10:03:25 2022
OS/Arch:      linux/amd64

Package info (e.g. output of rpm -q podman or apt list podman):

built from source (today)

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

Discovered investigating @eriksjolund's good use of systemd, podman and socket activation examples eriksjolund/mariadb-podman-socket-activation#1

@openshift-ci openshift-ci bot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 14, 2022
@grooverdan
Contributor Author

With ExecStop removed, TimeoutStopSec=70 is probably not needed either. The default KillMode=control-group should give the container and conmon a fair chance to cleanly terminate. A systemd-aware entry point can also send EXTEND_TIMEOUT_USEC if it so desires.

@vrothberg
Member

Thanks for opening the issue, @grooverdan! Your suggestion sounds good to me. One thing we'd need is the --replace flag in podman run to be sure that name-conflicting containers are removed.

@Luap99
Member

Luap99 commented Feb 15, 2022

Please be aware of the consequences. We had this before and it did not work correctly, #11315

There are a number of problems when we do not use podman stop:

  1. we have to set the correct stop signal (can be fixed in the unit)
  2. we need to use the correct stop timeout (can be fixed in the unit)
  3. This is the biggest problem. If conmon exits, systemd will think the unit is done and start killing everything, including the cleanup process, so we will leak mounts, network interfaces, etc...
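For points 1 and 2, the unit could in principle carry the container's stop signal and timeout directly (values here are illustrative; see systemd.kill(5) and systemd.service(5)):

```ini
[Service]
# mirror the image's STOPSIGNAL instead of relying on podman stop
KillSignal=SIGTERM
# mirror the container's configured stop timeout
TimeoutStopSec=10
```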

@vrothberg
Member

This is the biggest problem. If conmon exits, systemd will think the unit is done and start killing everything, including the cleanup process, so we will leak mounts, network interfaces, etc...

Very fair point, I hadn't considered that yet. The Podman clean-up process created by conmon should probably run outside the unit's cgroup.

@Luap99
Member

Luap99 commented Feb 15, 2022

Also see #11304 (comment)

@grooverdan
Contributor Author

Right so, https://github.com/containers/conmon/blob/e2215a1c4c01c25f2fc1206ad4df012d10374b99/src/ctr_exit.c#L222 as the SIGTERM handler should reap_children? I don't think the sdnotify setting should alter the behaviour here.

So addressing the list:

  1. This should be the handler for all signals that by default terminate(?)
  2. If we wait for the children, the timeout is handled too(?).
  3. SendSIGKILL=no to prevent the undue tear-down of conmon.

@vrothberg
Member

The Podman clean-up process created by conmon should probably run outside the unit's cgroup.

I think this should be enough. @giuseppe WDYT?

@Luap99
Member

Luap99 commented Feb 15, 2022

I don't think so. If you have a process that does not respond to SIGTERM, systemd will wait for the timeout and then send SIGKILL to the main PID, conmon. This will cause conmon to exit, but the container process will keep running AFAICT. The cleanup process will never be started.

We also can't use SendSIGKILL=no, because otherwise the process will never be terminated if SIGTERM is ignored.
Just test with this unit file:

[Unit]
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=10
ExecStart=podman run --name=testcon --cgroups=no-conmon --rm --sdnotify=conmon -d alpine sleep inf
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

@vrothberg
Member

We can also not use SendSIGKILL=no because otherwise process will never be terminated if sigterm is ignored

I concur. We need systemd to be able to nuke when needed.

@giuseppe
Member

The Podman clean-up process created by conmon should probably run outside the unit's cgroup.

I think this should be enough. @giuseppe WDYT?

how can we do that? Create a new systemd scope?

@vrothberg
Member

how can we do that? Create a new systemd scope?

Do you think that could work?

@Luap99
Member

Luap99 commented Feb 15, 2022

But this would be too late, no? conmon has to spawn the cleanup process, but if conmon is killed with SIGKILL this is not possible.

@giuseppe
Member

could we move the cleanup to a ExecStopPost action?

@vrothberg
Member

But this would be too late, no? conmon has to spawn the cleanup process, but if conmon is killed with SIGKILL this is not possible.

If things go south, yes.

could we move the cleanup to a ExecStopPost action?

Then we need to communicate the ID of the container somewhere and would need the --cidfile again.

It seems that it's not as straight-forward as I'd wish it could be. In the end, it would "just" be a workaround for systemd. systemd would still reject the mainPID being sent by conmon.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Mar 19, 2022

@grooverdan @vrothberg What should we do with this issue?

@grooverdan
Contributor Author

But this would be too late, no? conmon has to spawn the cleanup process, but if conmon is killed with SIGKILL this is not possible.

If things go south, yes.

Is an early spawn of the cleanup process possible, one that activates when the parent process dies? I normally see cases of SIGCHLD but not the other way around.
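Linux does offer the reverse of SIGCHLD: a child can ask to be signalled when its parent dies, via prctl(PR_SET_PDEATHSIG). A hedged sketch of how an early-spawned cleanup process could arm this (shown via ctypes; whether conmon could actually use this here is an open question):

```python
import ctypes
import signal

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>
libc = ctypes.CDLL(None, use_errno=True)

def die_with_parent(sig: int = signal.SIGTERM) -> None:
    """Ask the kernel to deliver `sig` to this process when its parent
    dies (Linux-only; the setting is cleared on fork and on exec of
    privileged binaries, so it must be armed in the child itself)."""
    if libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
```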

could we move the cleanup to a ExecStopPost action?

Then we need to communicate the ID of the container somewhere and would need the --cidfile again.

It seems that it's not as straight-forward as I'd wish it could be. In the end, it would "just" be a workaround for systemd. systemd would still reject the mainPID being sent by conmon.

Doesn't NotifyAccess=all resolve this? If so, the first conmon process does the cleanup and its child sends MAINPID to advertise itself as sacrificial? I don't know how that would work if the container process sends MAINPID too.

But if this is all too hard/messy for little gain, I guess we can just close this.

@vrothberg
Member

Doesn't NotifyAccess=all resolve this? If so, the first conmon process does the cleanup and its child sends MAINPID to advertise itself as sacrificial? I don't know how that would work if the container process sends MAINPID too.

I don't think we should change the mainPID for the clean-up process as it will very likely cause hiccups.

The main challenge is to find a way to prevent the clean-up process from being killed by systemd. @msekletar, do you have suggestions? The problem in a nutshell:

  • We communicate the container ID via a file
  • The file is then used in ExecStopPost to podman rm the container
  • We'd love to find an alternative that doesn't need a file
  • conmon will spawn a podman container cleanup process but we're afraid of the cleanup process being killed by systemd

Is there a way to wait for a child process of conmon until systemd would nuke everything?

@giuseppe
Member

But if this is all to hard/messy for little gain I guess we can just close this.

would the gain just be to not use an external file?

If so, I agree we should probably close this issue since it seems there are no valid alternatives

@eriksjolund
Contributor

Another mechanism that might be useful is to store information with memfd_create() and sd_pid_notify_with_fds(). I don't know if that mechanism could be helpful for this issue, but I think it's worth mentioning.

I tested it with a minimal C program and saw that file descriptors that were stored from ExecStart, ExecStop and ExecStopPost are all available to the next instance of ExecStart in case the service was explicitly restarted (systemctl --user restart my.service) or if a restart was triggered by Restart=on-failure. See Table 2 in man systemd.service. A failing Watchdog would also trigger a restart. I noticed that sd_listen_fds_with_names() unfortunately returns zero when run from a program that is executed in ExecStop or ExecStopPost (otherwise it could have been a way to pass the container ID to them).

A sketchy idea: In case conmon would like to perform a cleanup, conmon could store its intention (and the container ID) in a memfd_create file and have it stored by systemd. If the service is restarted before the container cleanup has completed, the new ExecStart podman instance could retrieve the stored information (via sd_listen_fds_with_names()) and try to complete the cleanup.

@grooverdan
Contributor Author

would the gain just be to not use an external file?

The gain of moving to Type=notify, as @eriksjolund's socket activation example shows, is that services that are aware of systemd can use its notify/API controls and run transparently both in a container and as a systemd service.

@msekletar
Contributor

I think what @grooverdan suggests in the issue description is doable, but it requires changes in conmon. conmon can't exit immediately after starting the cleanup process in the container cgroup. Instead it needs to wait for the cleanup process to finish, and only exit after that, to signal systemd that the unit is no longer active. To avoid sending the kill signal to other processes running in the cgroup, podman should generate unit files with KillMode=mixed.

For bonus points, conmon could initiate the cleanup process, wait for it, and even extend the originally configured stop timeout using EXTEND_TIMEOUT_USEC= notifications via the sd_notify API.
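That bonus could look roughly like this (a hedged Python sketch of the idea, not conmon code; the notify helper implements the sd_notify(3) datagram protocol and is a no-op outside a notify service):

```python
import os
import socket
import time

def _notify(msg: str) -> None:
    # minimal sd_notify(3) datagram; silently a no-op if not under systemd
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
            s.connect(addr)
            s.send(msg.encode())
    except OSError:
        pass

def wait_for_cleanup(pid: int, poll: float = 0.5) -> int:
    """Wait for the cleanup child to finish, periodically asking systemd
    for more time so it does not kill the cgroup while cleanup runs."""
    while True:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return status
        # request (poll * 4) more seconds, expressed in microseconds
        _notify(f"EXTEND_TIMEOUT_USEC={int(poll * 4 * 1_000_000)}")
        time.sleep(poll)
```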

@vrothberg
Member

Thanks for taking a look, @msekletar, and for the chat off GitHub.

I agree that this is the only way. For this to work, we'd have to update both

  • conmon to wait for the cleanup process to finish
  • podman to not nuke conmon during cleanup which probably requires an additional flag to make the behavior conditional

@mheon, what are your thoughts?

@mheon
Member

mheon commented Apr 4, 2022

Podman killing Conmon is a safety measure to ensure that we have a clean slate for restarting the container - Conmon holds the container's ports open, and if it's not gone attempting a container restart via the cleanup process (as happens with containers with restart policy) is not possible. Of course, Conmon could potentially be written to clean up all open FDs instead to clear that conflict, but I somewhat suspect it's not the only one.

Might be easier to add this to conmon-rs, which was always intended to be a longer-living service.

@github-actions

github-actions bot commented May 6, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented May 6, 2022

@haircommander @saschagrunert WDYT?

@github-actions

github-actions bot commented Jun 6, 2022

A friendly reminder that this issue had no activity for 30 days.

@mbach04

mbach04 commented Dec 19, 2022

Bump on this one. The current state of podman-generated systemd unit files is problematic in Fedora 37, podman 4.3.1.

@vrothberg
Member

@mbach04, can you elaborate on what's problematic?

@mbach04

mbach04 commented Dec 20, 2022

Fedora 37, podman 4.3.1, clean install
Start a simple pod with a yaml file, podman play kube myfile.yaml
Generate the systemd unit file, move it to the appropriate location, and the unit file is flagged as bad.

@vrothberg
Member

@mbach04, generate systemd on containers created with kube play is not supported (see the docs). Instead, I recommend using the systemd template for running K8s YAML in systemd.

@mbach04

mbach04 commented Dec 21, 2022

If the result of running podman play kube something.yaml, where the file defines a pod, is a pod running in Podman, where is the limitation that prevents podman from generating a valid systemd unit file for that running pod?

I'm seeking to understand this as it appears like a very sensible thing to me. What's the delta between what gets created with just podman pod create and podman play kube... ?

To expand on that, I believe podman generate kube is a thing...which would sort of close the loop in the development cycle here when bouncing between Podman and proper Kube.

@rhatdan
Member

rhatdan commented Jul 30, 2023

We now recommend that users use quadlet for running pods under systemd using Kubernetes YAML. We do not support them directly.
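For the record, a minimal quadlet unit for the Kubernetes YAML case looks roughly like this (file name and YAML path are illustrative; see podman-systemd.unit(5) for the full syntax):

```ini
# ~/.config/containers/systemd/myapp.kube
[Kube]
Yaml=%h/myapp/myfile.yaml

[Install]
WantedBy=default.target
```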

@rhatdan rhatdan closed this as completed Jul 30, 2023
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 29, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 29, 2023