
Active podman process blocks system reboot/shutdown #14531

Closed
1player opened this issue Jun 8, 2022 · 34 comments · Fixed by #16785
Labels
locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

1player commented Jun 8, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

An active podman process is unable to be cleanly stopped by systemd reboot/shutdown, and thus has to be killed after the 2min grace period expires.

Steps to reproduce the issue:

  1. podman run -it docker.io/library/busybox
  2. Inside the container: sleep infinity
  3. Reboot the system

Describe the results you received:

Shutdown procedure hangs for ~2 minutes because podman can't be stopped. Then podman is killed and shutdown is complete.

Describe the results you expected:

The podman container to be cleanly terminated as the system shuts down.

Package info (e.g. output of rpm -q podman or apt list podman):

podman-4.1.0-1.fc36.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

Experienced this issue on Fedora Workstation 36 and Fedora Silverblue 36.

Downstream bug reports:

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 8, 2022
vrothberg (Member) commented:

Thanks for reaching out, @1player.

I don't think there is much Podman can do. sleep in busybox does not seem to respond to SIGTERM, so systemd has to wait for the grace period to end until it can kill the process.

Luap99 (Member) commented Jun 8, 2022

I agree; it is best to call podman stop before shutdown. podman stop uses a 10-second timeout before sending SIGKILL, which can be changed with -t.
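For illustration, the stop-signal escalation described here can be sketched in Python. This is a simplified model, not Podman's actual code: the child process stands in for a container that ignores its stop signal, and the 2-second wait stands in for the 10-second grace period.

```python
import signal
import subprocess
import sys
import time

# Spawn a child that ignores SIGTERM, mimicking a container whose main
# process does not handle the stop signal.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import signal, time; "
     "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
     "time.sleep(60)"])
time.sleep(1.0)  # give the child time to install its SIG_IGN handler

child.terminate()            # SIGTERM: the graceful stop request
try:
    child.wait(timeout=2)    # stand-in for the configurable grace period
    print("exited on SIGTERM")
except subprocess.TimeoutExpired:
    child.kill()             # SIGKILL: cannot be caught or ignored
    child.wait()
    print("had to SIGKILL")
```

Because the child ignores SIGTERM, the timeout expires and the SIGKILL branch runs, mirroring why shutdown stalls until the grace period ends.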

rhatdan (Member) commented Jun 8, 2022

I wrote this in bugzilla too:
https://bugzilla.redhat.com/show_bug.cgi?id=2084498

I believe that podman run/start should catch SIGTERM and then execute podman stop on its pods/containers. This would cause the containers to exit properly, or exit after 10 seconds.
This might be a slight deviation from Docker in some corner cases, but I believe this is the right behaviour, especially if a container is running with a STOP_SIGNAL that is different
than SIGTERM. In the common case where a container's stop signal is SIGTERM, there is no change except that the container gets killed after 10 seconds. Where STOP_SIGNAL is set,
the container has a chance to close cleanly (systemd-based containers, for example).
The only case that really changes is a corner case where the user expects a SIGTERM sent to Podman to be forwarded as SIGTERM to the container even though the container's STOP_SIGNAL is not SIGTERM. In that case
users could just call `podman kill --signal SIGTERM $CTR`

From a user point of view, I think this is the most user friendly way to handle this.
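The proposal above can be sketched as a small wrapper that traps SIGTERM and reacts by stopping its containers. This is a hedged illustration only: the recorded command string is a placeholder for the real stop logic, not an actual call into Podman.

```python
import os
import signal

actions = []

def on_sigterm(signum, frame):
    # In the proposal, `podman run` would react to SIGTERM by running the
    # equivalent of `podman stop` on its containers. A recorded string
    # stands in for that call here.
    actions.append("podman stop --time 10 <container>")

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate systemd delivering SIGTERM to the podman process at shutdown.
os.kill(os.getpid(), signal.SIGTERM)

print(actions)
```

The point of the design is that the container then receives its configured STOP_SIGNAL (via the stop path) instead of a blunt SIGTERM forwarded to PID 1 in the container.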

vrothberg (Member) commented Jun 8, 2022

@rhatdan, I don't think that would help in this scenario.

If there's a container running that does not adhere to SIGTERM/stop etc., then systemd is blocked on the process.

We could think of a podman-shutdown.service that is being called on shutdown though.

rhatdan (Member) commented Jun 8, 2022

Well, it would exit via SIGKILL after 10 seconds. Having a podman-shutdown.service might make some sense; it could do a
podman pod stop --all
podman stop --all
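A hedged sketch of what such a unit could look like (the unit name, timeout, and ordering are assumptions, not something Podman ships): the service stays active from boot, so systemd runs its ExecStop= commands during shutdown, before the global kill phase begins.

```ini
# Hypothetical podman-shutdown.service -- a sketch, not an official unit.
[Unit]
Description=Stop all Podman pods and containers at shutdown

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# These run when the unit is stopped, i.e. during shutdown:
ExecStop=/usr/bin/podman pod stop --all
ExecStop=/usr/bin/podman stop --all
TimeoutStopSec=30

[Install]
WantedBy=default.target
```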

mheon (Member) commented Jun 8, 2022

Stop timeout is also user-configurable, so someone could theoretically have a container with a stop timeout of 90 seconds to ensure their container always has time to perform its safe shutdown routine, but that would still stall the system for 90 seconds on shutdown, potentially.

1player (Author) commented Jun 8, 2022

I don't think there is much Podman can do. sleep in busybox does not seem to respond to SIGTERM, so systemd has to wait for the grace period to end until it can kill the process.

I think the sleep in my example is a red herring. I notice this problem every time I use toolbox on a Fedora machine. Whenever I reboot, systemd complains it's not able to stop podman.

I have switched to distrobox since, and I have the same problem. Should these two utilities pass some special option to podman to avoid this?

1player (Author) commented Jun 8, 2022

Sorry for double posting, but please also note this comment of mine from https://bugzilla.redhat.com/show_bug.cgi?id=2081664#c2

As additional details, journalctl suggests that the hung shutdown is caused by /usr/bin/conmon not responding to signals. It only seems to be stuck when running an interactive process.
Example:
If I run toolbox run sleep infinity, the toolbox container can be stopped immediately with podman stop or sending SIGINT to the conmon process.
If I run toolbox run /bin/sh, the toolbox container CANNOT be stopped by podman stop (fails with: "container has active exec sessions, refusing to clean up: container state improper"), and conmon doesn't respond to SIGINT.

Is this happening because of podman "refusing to clean up"?

vrothberg (Member) commented:

I have switched to distrobox since, and I have the same problem. Should these two utilities pass some special option to podman to avoid this?

I would expect the tools to manage the containers and call podman stop.

Is this happening because of podman "refusing to clean up"?

That may explain why the containers are still running: podman stop failed.

Meister1593 commented:

Is this issue still tracked? This is quite an annoying bug; is there a workaround, at least forcefully killing containers through scripts?

vrothberg (Member) commented:

Is issue still tracked? This is quite annoying bug, is there a workaround? Forcefully killing containers through scripts at least.

The issue in 89luca89/distrobox#340 looks different than the one discussed here:

container has active exec sessions, refusing to clean up: container state improper

I do not know what distrobox does, but it needs to exit from all exec sessions beforehand. At the moment, I don't see how this relates to the initial bug here, where the container ignores a signal and gets killed after a grace period.

1player (Author) commented Jun 28, 2022

This is not limited to distrobox; podman exhibits exactly the same behaviour. It seems that running some applications inside the container puts it in a state where podman/conmon refuses to stop it gracefully upon system shutdown.

I run emacs and pretty much all my dev tools inside a distrobox container, and most times it hangs on shutdown, but sometimes it doesn't.

I do not understand, as explained in #14531 (comment), why running toolbox run /bin/sh is reason enough for podman stop to quit working. I imagine the sh process would answer to a SIGTERM, and thus terminating a container should be possible.

Maybe it is caused by subshells spawned inside the container, which cause podman to refuse to terminate it, hence the delay until SIGKILL is sent.

Luap99 (Member) commented Jun 28, 2022

It is your container process that is not responding to the signal; AFAIK shells do not shut down on SIGTERM.

1player (Author) commented Jun 28, 2022

It is your container process that is not responding to the signal; AFAIK shells do not shut down on SIGTERM.

Are you saying that this is a toolbox and distrobox bug, and not podman?

Luap99 (Member) commented Jun 28, 2022

Yes. What are podman/systemd supposed to do when your container process does not shut down on a normal stop signal, i.e. SIGTERM? The only thing to do is to wait and send SIGKILL after a timeout. You can change the stop signal and timeout with --stop-signal and --stop-timeout, but I guess this only works when the container is stopped via podman, not when systemd tries to kill it.

1player (Author) commented Jun 28, 2022

Yes. What are podman/systemd supposed to do when your container process does not shut down on a normal stop signal, i.e. SIGTERM? The only thing to do is to wait and send SIGKILL after a timeout. You can change the stop signal and timeout with --stop-signal and --stop-timeout, but I guess this only works when the container is stopped via podman, not when systemd tries to kill it.

Sorry for being obtuse, but then why does podman just throw its hands in the air and say container has active exec sessions, refusing to clean up: container state improper when running podman stop? It looks like it's refusing to do anything, not that it has sent a signal and nothing responded.

Luap99 (Member) commented Jun 28, 2022

I think you have to stop all exec sessions first; I'm not sure if podman stop should do that. @mheon might know better?

mheon (Member) commented Jun 28, 2022

Podman stop should do it. This is probably a distinct issue. Open a new bug with the full template filled out, please.

1player (Author) commented Jun 28, 2022

Is it really a distinct issue? As I described above, this seems to be the cause of this problem. Podman refusing to stop a container because "it has active exec session", thus causing issues with toolbox, thus causing shutdown issues.

There are no particular logs to see, except that upon shutdown, journalctl points out that conmon had to be SIGKILL'd, as I mentioned above. I've provided a simple reproduction example; I'm not sure what more I can do.

Here's the gist of it: a podman container should always be able to be stopped. The only exception is an unresponsive process, in which case I would expect podman stop to send SIGKILL. Otherwise, podman stop should stop a container, not complain about "active exec sessions", which I'm not sure I understand concretely.

mheon (Member) commented Jun 29, 2022

Are you certain Podman is refusing to stop the container? That error message doesn't read as a stop error to me, but a cleanup error. The container should have exited at this point, Podman is just having trouble cleaning up after it.

mheon (Member) commented Jun 29, 2022

Given this, it definitely smells like a different issue. Podman is seemingly having trouble handling cleanup on containers as the system shuts down, which is distinct from this issue where Podman takes a long time to kill containers that refuse to gracefully exit, causing shutdown to hang.

github-actions bot commented:

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Jul 30, 2022

@vrothberg @giuseppe @mheon Do any of the fixups made recently to deadlocks address this issue?

mheon (Member) commented Jul 30, 2022 via email

1player (Author) commented Aug 1, 2022

BTW, Fedora is supposed to shorten the timeout before unresponsive processes are SIGKILLed from 2 minutes down to 15 seconds, so if this is still open when that change ships, users won't notice anything during shutdown but containers will still be killed forcefully.

As a big toolbox/distrobox user, I get this issue 4 out of every 5 times I reboot my workstation, and I don't keep any long running services inside the container.

github-actions bot commented Sep 1, 2022

A friendly reminder that this issue had no activity for 30 days.

1player (Author) commented Dec 5, 2022

This is still an issue and making life on Fedora Silverblue more painful than it needs to be.

@vrothberg vrothberg removed the kind/bug Categorizes issue or PR as related to a bug. label Dec 5, 2022
vrothberg (Member) commented:

@1player can you share the exact systemd unit that you run Podman in?

queeup commented Dec 6, 2022

@vrothberg, you can test it with this container service on Silverblue. It takes 2 min to reboot/shutdown.

  1. Create syncthing-test.service (unit file below)
  2. systemctl --user start syncthing-test.service
  3. Then reboot the system.

PS: This is the official syncthing container; I didn't add any volume or publish any port.

Dockerfile: https://github.com/syncthing/syncthing/blob/main/Dockerfile

The only way to reboot without waiting, with this systemd container service running, is to use --no-healthcheck in the podman args.

# autogenerated by Podman 4.3.1
# Tue Dec  6 16:27:12 +03 2022

[Unit]
Description=Podman syncthing-test.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=no
TimeoutStopSec=70
ExecStartPre=/bin/rm \
    -f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
    --cidfile=%t/%n.ctr-id \
    --cgroups=no-conmon \
    --rm \
    --sdnotify=conmon \
    --replace \
    --detach \
    --name syncthing-test docker.io/syncthing/syncthing
ExecStop=/usr/bin/podman stop \
    --ignore -t 10 \
    --cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
    -f \
    --ignore -t 10 \
    --cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

vrothberg (Member) commented:

Thanks for sharing, @queeup! I will take a look tomorrow. It's surprising to me as the stop-timeout is set to 10. So the container should - in theory - be killed after 10 seconds.

vrothberg (Member) commented:

I can reproduce it.

vrothberg (Member) commented:

The image ships a health check (see below), so Podman will run it on container start. But even a simple alpine top container with --health-cmd /bin/ls causes the shutdown to time out/hang.

"Healthcheck": {                      
  "Test": [                           
    "CMD-SHELL",                      
    "nc -z 127.0.0.1 8384 || exit 1"  
  ],                                  
  "Interval": 60000000000,            
  "Timeout": 10000000000              
},                                    
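As an aside, the Interval and Timeout fields in the inspect output are Go time.Duration values serialized in nanoseconds; a quick conversion shows they correspond to 60 s and 10 s:

```python
# Healthcheck durations in `podman inspect` output are in nanoseconds.
healthcheck = {"Interval": 60000000000, "Timeout": 10000000000}

NS_PER_SECOND = 1_000_000_000
interval_s = healthcheck["Interval"] / NS_PER_SECOND
timeout_s = healthcheck["Timeout"] / NS_PER_SECOND
print(interval_s, timeout_s)  # → 60.0 10.0
```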

vrothberg (Member) commented:

I wish I had found more time to work on this bug. One thing I noticed while debugging is that we're stuck on stopping the transient health-check timer.

I hope to find some time tomorrow.

vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 8, 2022
When stopping the transient systemd timer/unit which powers running
health checks, make sure to ignore its dependencies.  It turns out
that we're otherwise running into a timeout when running a container in
a systemd unit and rebooting.

An alternative may be to further tweak some attributes/options when
creating the timer/unit via systemd-run but it seems safe to just ignore
the dependencies and stop.

[NO NEW TESTS NEEDED] - we don't yet have means to test reboots.

Fixes: containers#14531
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg (Member) commented:

#16785 fixes the issue and will make it into Podman 4.4.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 8, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2023