Error forwarding signal 18 (sometimes 15) to container ... #8086
ah reproduced the deadlock again, here's the process tree when it's locked:
stracing that:

$ strace -p 11738
strace: Process 11738 attached
--- stopped by SIGTTIN --- |
SIGTTIN pauses the process it's sent to; if you hit Podman with that while we're holding a lock, then yes, I fully expect it will effectively deadlock until execution restarts. That snippet you provided seems to be very aggressively sending terminal flow control signals to Podman. The warning messages there are narrow race conditions with signals being sent after the container has exited but before Podman has finished cleaning up. |
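To make the SIGTTIN behavior described above concrete, here is a minimal, hypothetical Go demo (not Podman code; the Ignore call is only for illustration): the signal's default action is to stop the whole process, which is why a Podman process holding a lock appears deadlocked until a SIGCONT arrives.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// SIGTTIN's default disposition is to stop (pause) the process.
	// Ignoring it lets this demo keep running; comment the next line
	// out and the program will stop when it signals itself.
	signal.Ignore(syscall.SIGTTIN)

	fmt.Println("sending SIGTTIN to self...")
	_ = syscall.Kill(os.Getpid(), syscall.SIGTTIN)

	time.Sleep(100 * time.Millisecond)
	fmt.Println("still running; without the Ignore we would be stopped here")
}
```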
I have for instance a script which is just:

$ cat t.sh
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'

I'm not sure what could be sending that signal. The actual case where I encountered the error was with a much larger timeout value. |
Let me take a look and see if I can reproduce |
The "Error forwarding signal" bit definitely reproduces. First step is probably to drop that down to an Info level log if it's caused by the container being in a bad state - that will stop spamming the terminal with logs, at least. |
I can see a potential race where Podman will allow the Podman process to be killed, but the container to survive, if a signal arrives in a narrow window between signal proxying starting and the container starting (we start proxying before the container is fully started to ensure we don't miss sending anything). I don't think that precisely matches what's going on here, but I will see what can be done about fixing it. |
Partial fix in #8191 |
ok, adjusting the above script:

timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'

it looks like the container gets left behind in the Created state:

$ ./bin/podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
$ time timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
real 0m2.003s
user 0m0.533s
sys 0m1.194s
$ ./bin/podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b88107396029 ghcr.io/pre-commit-ci/runner-image python3 -c import... 4 seconds ago Created inspiring_merkle |
Oh, you're seeing that |
Can you confirm if this also happens when running as root? |
I've never run podman as root, hope I'm doing this right!

$ sudo ./bin/podman run --rm -ti ubuntu:focal echo hi
[sudo] password for asottile:
Trying to pull docker.io/library/ubuntu:focal...
Getting image source signatures
Copying blob 6a5697faee43 done
Copying blob a254829d9e55 done
Copying blob ba13d3bc422b done
Copying config d70eaf7277 done
Writing manifest to image destination
Storing signatures
hi
$ sudo ./bin/podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES |
Yes - that looks right. Rootless only, then. Thanks! |
A friendly reminder that this issue had no activity for 30 days. |
Is this still an issue with the current main branch? |
the deadlock is still reproducible:
$ pstree -halp 12047
timeout,12047 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c import time; time.sleep(5)
└─podman,12048 run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c import time; time.sleep(5)
├─(slirp4netns,12060)
├─{podman},12049
├─{podman},12050
├─{podman},12051
├─{podman},12052
├─{podman},12053
├─{podman},12055
├─{podman},12056
└─{podman},12058

$ strace -p 12048
strace: Process 12048 attached
--- stopped by SIGTTIN ---
^Cstrace: Process 12048 detached

I'm also still seeing the "Error forwarding signal 18" issue:

$ timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
ERRO[0000] container not running
container not running
ERRO[0004] Error forwarding signal 18 to container bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc: error sending signal to container bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc: `/usr/bin/runc kill bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc 18` failed: exit status 1

$ git rev-parse HEAD
9c5fe954cca8b4bcb8f552645e1f52a5d9824134 |
@mheon any more thoughts on this? |
On the deadlock: we're being sent a SIGTTIN and pausing execution at a very inconvenient time (in the middle of a critical section). Podman is behaving as expected in my view, and if we're never sent a signal to un-pause execution then the lock remains held and Podman will deadlock. We could look into doing something similar to the signal-inhibit logic we added to the create logic, to inhibit signals like SIGTTIN while in critical sections, I suppose, but I'm a bit leery of the performance implications of adding/removing a signal handler every time we take a lock. It also sounds like we still have some races in signal forwarding. |
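As a rough sketch of the signal-inhibit idea mentioned above (this is not Podman's actual locking code; the function name and the choice of signals are assumptions for illustration), the cost being worried about is the Ignore/Reset pair around every lock acquisition:

```go
package sketch

import (
	"os/signal"
	"sync"
	"syscall"
)

// withStopSignalsInhibited ignores job-control stop signals for the duration
// of a critical section and restores default handling afterwards, so a stray
// SIGTTIN cannot pause the process while the lock is held.
func withStopSignalsInhibited(lock sync.Locker, criticalSection func()) {
	signal.Ignore(syscall.SIGTTIN, syscall.SIGTTOU, syscall.SIGTSTP)
	defer signal.Reset(syscall.SIGTTIN, syscall.SIGTTOU, syscall.SIGTSTP)

	lock.Lock()
	defer lock.Unlock()
	criticalSection()
}
```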
@mheon it isn't really a rapid succession of signals -- just one SIGTERM is enough to reproduce this. I'm a little confused where the SIGTTIN is coming from at all -- if I'm reading correctly, the docs on this indicate that podman would have to attempt to read from stdin while backgrounded in order to receive that signal. |
When Podman is run attached with -i, we do read from stdin, so that could be the source. Do you have a reproducer with a single SIGTERM? The timing conditions here are extremely narrow, and you're the only person I've heard report them. Are you on a particularly slow VM or embedded system? |
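For reference, SIGTTIN is generated when a process in a background process group reads from its controlling terminal. A tiny, hypothetical Go program shows the effect when run as `./demo &` from an interactive shell (this only illustrates the mechanism; it is not a claim about what Podman is hitting here):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Run as `./demo &` from an interactive shell: the read from the
	// controlling terminal while backgrounded triggers SIGTTIN and the
	// shell reports the job as "Stopped (tty input)".
	fmt.Println("attempting to read from stdin...")
	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
	fmt.Printf("read %q (err=%v)\n", line, err)
}
```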
hmmm, but it's not backgrounded, so I'm confused why it gets SIGTTIN. I filled a shell script with this:
and when it doesn't deadlock I get the SIGTERM issue about ~5% of the time (which I showed here: #8086 (comment)). I'm in a VM, but I wouldn't consider it slow. |
also the timing conditions aren't narrow -- I'm seeing this with a 180 second timeout. |
Hitting this bug also, on Podman v2.2.1.
|
Using CTRL+C on … when the entrypoint traps SIG… and the container is not started with …. Using … |
Same problem here with Podman version 2.2.1. I have an ASP.NET web service running inside a container. When I do a …:
This happens every time. |
I am also experiencing this issue on Podman version 2.2.1, when pressing CTRL+C. I can confirm that the issue does not occur when the container is started via … |
I'm still on 3.2.3 (apt hasn't updated yet) but it's still an issue there:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
2021-09-01T00:36:14.000528871Z: open pidfd: No such process
ERRO[0002] Error forwarding signal 18 to container e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4: error sending signal to container e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4: `/usr/bin/crun kill e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4 18` failed: exit status 1
2021-09-01T00:36:14.000538067Z: open pidfd: No such process
ERRO[0002] Error forwarding signal 18 to container cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3: error sending signal to container cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3: `/usr/bin/crun kill cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3 18` failed: exit status 1
$ podman version
Version: 3.2.3
API Version: 3.2.3
Go Version: go1.15.2
Built: Wed Dec 31 19:00:00 1969
OS/Arch: linux/amd64 |
Ok, no, it still seems broken if I use …
|
@rhatdan this isn't fixed, @bitstrings edited their comment after posting |
It's not fixed. It just doesn't happen as often on my desktop on low load. It is however still an issue. |
I tried reproducing but did not manage to. @asottile, can you share the output of podman info --debug? |
here it is, basically the same as in the original post but with updated versions.

a note: this machine is running virtualized so it's significantly slower (and others have noted that it requires a loaded system otherwise), so that may help you to reproduce

$ podman info --debug
host:
arch: amd64
buildahVersion: 1.22.3
cgroupControllers: []
cgroupManager: cgroupfs
cgroupVersion: v1
conmon:
package: 'conmon: /usr/libexec/podman/conmon'
path: /usr/libexec/podman/conmon
version: 'conmon version 2.0.27, commit: '
cpus: 5
distribution:
distribution: ubuntu
version: "20.04"
eventLogger: journald
hostname: babibox
idMappings:
gidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
uidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
kernel: 5.11.0-34-generic
linkmode: dynamic
memFree: 6699016192
memTotal: 8345640960
ociRuntime:
name: crun
package: 'crun: /usr/bin/crun'
path: /usr/bin/crun
version: |-
crun version 0.20.1.5-925d-dirty
commit: 0d42f1109fd73548f44b01b3e84d04a279e99d2e
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
os: linux
remoteSocket:
path: /run/user/1000/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: false
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: 'slirp4netns: /usr/bin/slirp4netns'
version: |-
slirp4netns version 1.1.8
commit: unknown
libslirp: 4.3.1-git
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.4.3
swapFree: 1964396544
swapTotal: 1964396544
uptime: 1m 49.01s
registries:
search:
- docker.io
- quay.io
store:
configFile: /home/asottile/.config/containers/storage.conf
containerStore:
number: 0
paused: 0
running: 0
stopped: 0
graphDriverName: overlay
graphOptions:
overlay.mount_program:
Executable: /usr/bin/fuse-overlayfs
Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
Version: |-
fusermount3 version: 3.9.0
fuse-overlayfs: version 1.5
FUSE library version 3.9.0
using FUSE kernel interface version 7.31
graphRoot: /home/asottile/.local/share/containers/storage
graphStatus:
Backing Filesystem: extfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "false"
imageStore:
number: 7
runRoot: /run/user/1000/containers
volumePath: /home/asottile/.local/share/containers/storage/volumes
version:
APIVersion: 3.3.1
Built: 0
BuiltTime: Wed Dec 31 19:00:00 1969
GitCommit: ""
GoVersion: go1.16.6
OsArch: linux/amd64
Version: 3.3.1
|
Thanks! @giuseppe should we do some exponential backoff when forwarding the signal? |
Don't we risk sending the signal multiple times this way? Would we repeat it only when we get an error from the OCI runtime? |
@giuseppe, yes, we should only do the backoff when there's an OCI runtime error. |
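A minimal sketch of the backoff idea being discussed (not Podman code; `kill` here is a placeholder for the runtime kill invocation, and the retry count and delays are arbitrary):

```go
package sketch

import "time"

// forwardWithBackoff retries forwarding a signal only when the OCI runtime
// call fails, waiting exponentially longer between attempts.
func forwardWithBackoff(kill func() error) error {
	delay := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = kill(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2
	}
	return err
}
```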
One alternative is to check if the container is still running when we get an error. If the container is not running, then we ignore it and stop forwarding signals. There is a possibility that the container was still running when the signal was sent, but I think we can live with it. |
and I am not even sure this is a bug. If the user tries to send a signal to a container that is already terminated, maybe it is correct to report the problem? |
fwiw that's not what is happening here -- |
A friendly reminder that this issue had no activity for 30 days. |
@vrothberg @mheon @giuseppe any new thoughts on this one? I just tried it locally and ran 100 containers and had no issues. But my machine is not fully loaded. |
in case it helps, here's a |
The |
crun returns that error when the container has already exited. I think a fully loaded machine just makes it easier to trigger the race where Podman thinks the container is running and sends a signal. On such failures, Podman should probably ask the runtime whether the container is still alive, and ignore any error when it has already terminated. |
@giuseppe @vrothberg I guess in case of error we could make another call to check whether the container is still alive:

--- a/libpod/oci_conmon_linux.go
+++ b/libpod/oci_conmon_linux.go
@@ -407,6 +407,11 @@ func (r *ConmonOCIRuntime) KillContainer(ctr *Container, signal uint, all bool)
args = append(args, "kill", ctr.ID(), fmt.Sprintf("%d", signal))
}
if err := utils.ExecCmdWithStdStreams(os.Stdin, os.Stdout, os.Stderr, env, r.path, args...); err != nil {
+ // check if container is already dead, exit gracefully
+ if killErr := unix.Kill(ctr.state.PID, 0); killErr == unix.ESRCH {
+ return nil
+ }
return errors.Wrapf(err, "error sending signal to container %s", ctr.ID())
} |
I think that would not work well with VMs. It is better to go through the OCI runtime. |
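A sketch of what going through the OCI runtime could look like (hypothetical helper, not Podman's API; it assumes the standard `state` subcommand exposed by runc and crun):

```go
package sketch

import (
	"io"
	"os/exec"
)

// containerStillExists asks the OCI runtime for the container's state rather
// than probing a host PID, which would be meaningless for VM-based runtimes.
func containerStillExists(runtimePath, ctrID string) bool {
	cmd := exec.Command(runtimePath, "state", ctrID)
	cmd.Stdout = io.Discard
	cmd.Stderr = io.Discard
	// runc/crun "state" exits non-zero when the container no longer exists.
	return cmd.Run() == nil
}
```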
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
This is distilled down from a larger example, reproducing only with coreutils xargs / timeout + sleep.
This is a pretty reliable example:
occasionally I'll get outputs like this:
(runner-image is this one)

My actual use case ends like this however:
With -i, it occasionally deadlocks, breaking most podman commands (such as podman ps for that user). the only way I've been able to recover the deadlock is to kill -9 that podman process

(hung forever)

strace of podman ps shows what looks like a spinlock with a timeout:

the signals in question are:
15: SIGTERM
18: SIGCONT
Steps to reproduce the issue:
Describe the results you received:
See output above
Describe the results you expected:
I expect things to exit gracefully, and not deadlock
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:

Output of podman info --debug:

Package info (e.g. output of rpm -q podman or apt list podman):

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):
VirtualBox + aws