
Error forwarding signal 18 (sometimes 15) to container ... #8086

Closed
asottile opened this issue Oct 20, 2020 · 63 comments · Fixed by #12394
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@asottile
Contributor

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

This is distilled down from a larger example; it reproduces with only coreutils xargs / timeout + sleep.

This is a pretty reliable example:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
ERRO[0000] container not running                        
ERRO[0002] Error forwarding signal 18 to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: error sending signal to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: `/usr/lib/cri-o-runc/sbin/runc kill d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c 18` failed: exit status 1 
ERRO[0000] container not running                        
ERRO[0002] Error forwarding signal 18 to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: error sending signal to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: `/usr/lib/cri-o-runc/sbin/runc kill 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283 18` failed: exit status 1 

occasionally I'll get outputs like this:

+ timeout 2 podman run --rm runner-image:latest python3 -c 'import time; time.sleep(5)'
ERRO[0002] Error forwarding signal 18 to container 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11: can only kill running containers. 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11 is in state stopped: container state improper 

(runner-image is this one)

My actual use case ends like this however:

time="2020-10-19T15:07:21-07:00" level=error msg="container not running"
time="2020-10-19T15:07:21-07:00" level=error msg="Error forwarding signal 15 to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: error sending signal to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: `/usr/lib/cri-o-runc/sbin/runc kill 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb 15` failed: exit status 1"

With -i, it occasionally deadlocks, breaking most podman commands (such as podman ps for that user). The only way I've been able to recover from the deadlock is to kill -9 that podman process

+ timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'

(hung forever)

strace of podman ps shows what looks like a spinlock with a timeout:

$ strace podman ps

...

futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL

the signals in question are:

15: SIGTERM
18: SIGCONT

Steps to reproduce the issue:

  1. See commands above

Describe the results you received:

See output above

Describe the results you expected:

I expect things to exit gracefully, and not deadlock

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.15.2
Built:        Wed Dec 31 16:00:00 1969
OS/Arch:      linux/amd64

Output of podman info --debug:

$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.20, commit: '
  cpus: 5
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: babibox
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-51-generic
  linkmode: dynamic
  memFree: 5532704768
  memTotal: 8348299264
  ociRuntime:
    name: runc
    package: 'cri-o-runc: /usr/lib/cri-o-runc/sbin/runc'
    path: /usr/lib/cri-o-runc/sbin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.4
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
  swapFree: 1964396544
  swapTotal: 1964396544
  uptime: 5h 13m 0.85s (Approximately 0.21 days)
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/asottile/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 2
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.1.0
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/asottile/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 23
  runRoot: /run/user/1000/containers
  volumePath: /home/asottile/.local/share/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

$ apt list podman
Listing... Done
podman/unknown,now 2.1.1~2 amd64 [installed]
podman/unknown 2.1.1~2 arm64
podman/unknown 2.1.1~2 armhf
podman/unknown 2.1.1~2 s390x

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

VirtualBox + AWS

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 20, 2020
@asottile
Contributor Author

ah reproduced the deadlock again,

here's the process tree when it's locked:

podman,11738 run --rm -i runner-image:latest python3 -c import time; time.sleep(5)
  ├─(slirp4netns,11752)
  ├─{podman},11739
  ├─{podman},11740
  ├─{podman},11741
  ├─{podman},11742
  ├─{podman},11743
  ├─{podman},11745
  ├─{podman},11746
  └─{podman},11749

stracing that:

$ strace -p 11738
strace: Process 11738 attached
--- stopped by SIGTTIN ---

@mheon
Member

mheon commented Oct 20, 2020

SIGTTIN pauses the process it's sent to; if you hit Podman with that while we're holding a lock, then yes, I fully expect it will effectively deadlock until execution restarts. It's probably getting in before we get sig-proxy configured to proxy signals into the container (otherwise it'd be the container process getting the signal).

That snippet you provided seems to be very aggressively sending terminal flow control signals to Podman. The warning messages there are narrow race conditions with signals being sent after the container has exited but before podman run has exited, and we're getting enough signals to hit them consistently. The deadlock is more concerning - something sent us a SIGTTIN to pause execution but never sent a SIGCONT to resume it. Podman can't do anything about this - we received a stop signal and respected it.
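For anyone unfamiliar with the job-control signals involved, here is a minimal, Podman-independent illustration of the behaviour described above: SIGTTIN stops the receiving process much like SIGSTOP would, and nothing short of a SIGCONT (or SIGKILL) resumes it.

sleep 300 &
pid=$!
kill -TTIN "$pid"              # SIGTTIN's default disposition is to stop the process
ps -o pid,stat,cmd -p "$pid"   # the STAT column shows 'T' (stopped)
kill -CONT "$pid"              # only SIGCONT resumes execution
ps -o pid,stat,cmd -p "$pid"   # back to 'S' (sleeping)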

@asottile
Contributor Author

the xargs was just to make it more reproducible -- it reproduces on its own just very rarely

I have for instance a script which is just:

$ cat t.sh 
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'

I'm not sure what could be sending SIGTTIN -- timeout only seems to send SIGTERM (is it possible podman is accidentally signaling itself here?) -- I could only reproduce this deadlock with -i (maybe something related to setting tty flags?)

the actual case where I encountered the error was with a much larger timeout value (180s), which should be quite long enough for podman to be up and ready -- but I didn't have the patience to run that in a loop

@mheon mheon self-assigned this Oct 29, 2020
@mheon
Member

mheon commented Oct 29, 2020

Let me take a look and see if I can reproduce

@mheon
Member

mheon commented Oct 29, 2020

The "Error forwarding signal" bit definitely reproduces.

First step is probably to drop that down to an Info level log if it's caused by the container being in a bad state - that will stop spamming the terminal with logs, at least.

@mheon
Member

mheon commented Oct 29, 2020

I can see a potential race where Podman will allow the Podman process to be killed, but the container to survive, if a signal arrives in a narrow window between signal proxying starting and the container starting (we start proxying before the container is fully started to ensure we don't miss sending anything). I don't think that precisely matches what's going on here, but I will see what can be done about fixing it.

@mheon
Member

mheon commented Oct 29, 2020

Partial fix in #8191

@rhatdan
Member

rhatdan commented Oct 31, 2020

@asottile Could you try again with the code in the master branch? @mheon got a partial fix in; see if it helps your issue.

@asottile
Contributor Author

ok adjusting the above script:

timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'

it looks like --rm is broken so I can't really run this in a loop:

$ ./bin/podman ps -a
CONTAINER ID  IMAGE   COMMAND  CREATED  STATUS  PORTS   NAMES
$ time timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'

real	0m2.003s
user	0m0.533s
sys	0m1.194s
$ ./bin/podman ps -a
CONTAINER ID  IMAGE                               COMMAND               CREATED        STATUS   PORTS   NAMES
b88107396029  ghcr.io/pre-commit-ci/runner-image  python3 -c import...  4 seconds ago  Created          inspiring_merkle

@mheon
Member

mheon commented Oct 31, 2020

Oh, you're seeing that --rm is broken as well? I saw it locally on one of my machines, but other team members didn't. Interesting.

@mheon
Member

mheon commented Oct 31, 2020

Can you confirm if --rm is working as root?

@asottile
Contributor Author

I've never run podman as root, hope I'm doing this right!

$ sudo ./bin/podman run --rm -ti ubuntu:focal echo hi
[sudo] password for asottile: 
Trying to pull docker.io/library/ubuntu:focal...
Getting image source signatures
Copying blob 6a5697faee43 done  
Copying blob a254829d9e55 done  
Copying blob ba13d3bc422b done  
Copying config d70eaf7277 done  
Writing manifest to image destination
Storing signatures
hi
$ sudo ./bin/podman ps -a
CONTAINER ID  IMAGE   COMMAND  CREATED  STATUS  PORTS   NAMES

@mheon
Member

mheon commented Oct 31, 2020

Yes - that looks right. Rootless only, then. Thanks!

@github-actions

github-actions bot commented Dec 1, 2020

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Dec 1, 2020

Is this still an issue with the current main branch?

@asottile
Contributor Author

asottile commented Dec 2, 2020

the --rm issue appears to be fixed, the deadlock is not:

$ bash -x t.sh
+ timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
$ pstree -halp 12047
timeout,12047 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c import time; time.sleep(5)
  └─podman,12048 run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c import time; time.sleep(5)
      ├─(slirp4netns,12060)
      ├─{podman},12049
      ├─{podman},12050
      ├─{podman},12051
      ├─{podman},12052
      ├─{podman},12053
      ├─{podman},12055
      ├─{podman},12056
      └─{podman},12058
$ strace -p 12048
strace: Process 12048 attached
--- stopped by SIGTTIN ---
^Cstrace: Process 12048 detached

I'm also still seeing the "Error forwarding signal 18" issue:

$ timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
ERRO[0000] container not running                        
container not running
ERRO[0004] Error forwarding signal 18 to container bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc: error sending signal to container bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc: `/usr/bin/runc kill bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc 18` failed: exit status 1 
$ git rev-parse HEAD
9c5fe954cca8b4bcb8f552645e1f52a5d9824134

@rhatdan
Member

rhatdan commented Dec 24, 2020

@mheon any more thoughts on this?

@mheon
Member

mheon commented Dec 24, 2020

On the deadlock: we're being sent a SIGTTIN and pausing execution at a very inconvenient time (in the middle of a critical section). Podman is behaving as expected in my view, and if we're never sent a signal to un-pause execution then the lock remains held and Podman will deadlock. We could look into doing something similar to the signal-inhibit logic we added to the create logic, to inhibit signals like SIGTTIN while in critical sections, I suppose, but I'm a bit leery of the performance implications of adding/removing a signal handler every time we take a lock.
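As a rough sketch of what that inhibiting would mean (shell-level analogue only; the real change would be signal-handler work in libpod's Go code):

trap '' TTIN TTOU TSTP    # ignore the job-control stop signals...
# ...enter the critical section: take the lock, mutate state, release the lock...
trap - TTIN TTOU TSTP     # ...then restore the default (stop) disposition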

It also sounds like we still have some races in --sig-proxy but I have doubts we will ever fully eliminate those. Forwarding signals at this rate was never the intended purpose of --sig-proxy (it was only intended to forward user-sent SIGTERM and similar into the container, not handle any serious volume of signals). Given that it's just a warning message I hesitate to say we should put that much effort in there - might be worth a "Good First Issue" tag?

@asottile
Contributor Author

@mheon it isn't really a rapid succession of signals -- just one SIGTERM is enough to reproduce this. The original xargs command is just to make it happen more often (but it reproduces in serial)

I'm a little confused where the SIGTTIN is coming from at all -- if I'm reading correctly, the docs on this indicate that podman would have to attempt to read from stdin while backgrounded in order to receive that signal
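(For reference, the rule I mean reproduces without Podman: a background job that tries to read from its controlling terminal is stopped with SIGTTIN. The PID and formatting below are illustrative.)

$ cat &
[1] 12345
$ jobs -l
[1]+ 12345 Stopped (tty input)        cat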

@mheon
Member

mheon commented Dec 24, 2020

When Podman is run attached with -i we are always reading from STDIN to forward it into the container. So podman run -i ... will always be reading from STDIN.

Do you have a reproducer with a single SIGTERM? The timing conditions here are extremely narrow, and you're the only person I've heard report them. Are you on a particularly slow VM or embedded system?

@asottile
Contributor Author

When Podman is run attached with -i we are always reading from STDIN to forward it into the container. So podman run -i ... will always be reading from STDIN.

hmmm, but it's not backgrounded so I'm confused why it gets SIGTTIN

I filled a shell script with this:

timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'

and when it doesn't deadlock I get the SIGTERM issue about ~5% of the time (which I showed here: #8086 (comment))

I'm in a VM, but I wouldn't consider it slow

@asottile
Contributor Author

also the timing conditions aren't narrow -- I'm seeing this with a 180 second timeout command -- just the minimal reproducers I've been showing are much smaller because I don't think anyone has the patience to run a 3 minute timeout in a loop to debug (I wouldn't even call a 4 second timeout narrow either)

@bitstrings
Contributor

Podman v2.2.1

Hitting this bug also.

ERRO[0000] container not running                        
ERRO[0000] Error forwarding signal 15 to container b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a: error sending signal to container b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a: `/usr/bin/runc kill b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a 15` failed: exit status 1 

@bitstrings
Contributor

Using CTRL+C when the entrypoint traps signals and the container is not started with -it, we sometimes get the error. However, most of the time the container terminates before the clean shutdown completes (i.e. before the trap 'shutdown' SIGTERM SIGINT ... handler finishes).

Using -it seems to make things stable again.
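For context, the entrypoint pattern in question looks roughly like this (illustrative sketch, not the exact entrypoint from my image):

#!/bin/sh
shutdown() {
    echo "shutting down cleanly"
    # ...stop the real workload here...
    exit 0
}
trap shutdown TERM INT
while :; do    # stand-in for the real service process
    sleep 1
done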

@svdHero

svdHero commented Jan 29, 2021

Same problem here with Podman version 2.2.1.

I have an ASP.NET web service running inside a container. When I do a podman stop -a, the container and the web service inside are shut down gracefully. However, if I do a CTRL+C, then I get the output:

ERRO[0000] container not running                        
ERRO[0003] Error forwarding signal 15 to container 7a9d095b3287e18c8eb131a0101f64dd7105a0181580e74a295d7570c26cd110: error sending signal to container 7a9d095b3287e18c8eb131a0101f64dd7105a0181580e74a295d7570c26cd110: `/usr/bin/runc kill 7a9d095b3287e18c8eb131a0101f64dd7105a0181580e74a295d7570c26cd110 15` failed: exit status 1

This happens every time.

@realJayNay

I am also experiencing this issue on Podman version 2.2.1, when pressing Ctrl-C to shut down a container that was started via podman run. Consistently reproducible.

I can confirm that the issue does not occur when the container is started via podman run -it.

@asottile
Contributor Author

asottile commented Sep 1, 2021

I'm still on 3.2.3 (apt hasn't updated yet) but it's still an issue there:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
2021-09-01T00:36:14.000528871Z: open pidfd: No such process
ERRO[0002] Error forwarding signal 18 to container e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4: error sending signal to container e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4: `/usr/bin/crun kill e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4 18` failed: exit status 1 
2021-09-01T00:36:14.000538067Z: open pidfd: No such process
ERRO[0002] Error forwarding signal 18 to container cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3: error sending signal to container cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3: `/usr/bin/crun kill cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3 18` failed: exit status 1 
$ podman version
Version:      3.2.3
API Version:  3.2.3
Go Version:   go1.15.2
Built:        Wed Dec 31 19:00:00 1969
OS/Arch:      linux/amd64

@bitstrings
Contributor

bitstrings commented Sep 1, 2021

Ok, no, it still seems broken: if I use stress-ng -c24 in the background, then yes, I get the same errors.

> podman version
Version:      3.3.0
API Version:  3.3.0
Go Version:   go1.16.6
Built:        Fri Aug 20 15:36:14 2021
OS/Arch:      linux/amd64

@rhatdan rhatdan closed this as completed Sep 1, 2021
@asottile
Contributor Author

asottile commented Sep 1, 2021

@rhatdan this isn't fixed, @bitstrings edited their comment after posting

@bitstrings
Contributor

bitstrings commented Sep 1, 2021

It's not fixed. It just doesn't happen as often on my desktop on low load. It is however still an issue.

@rhatdan rhatdan reopened this Sep 1, 2021
@vrothberg
Member

I tried reproducing but did not manage to.

@asottile, can you share podman info --debug?

@asottile
Contributor Author

here it is, basically the same as in the original post but with updated versions

a note: this machine is running virtualized, so it's significantly slower (others have noted that otherwise it requires a loaded system); that may help you reproduce

$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.22.3
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.27, commit: '
  cpus: 5
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: babibox
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.11.0-34-generic
  linkmode: dynamic
  memFree: 6699016192
  memTotal: 8345640960
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.20.1.5-925d-dirty
      commit: 0d42f1109fd73548f44b01b3e84d04a279e99d2e
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.8
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.4.3
  swapFree: 1964396544
  swapTotal: 1964396544
  uptime: 1m 49.01s
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/asottile/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.5
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/asottile/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 7
  runRoot: /run/user/1000/containers
  volumePath: /home/asottile/.local/share/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 0
  BuiltTime: Wed Dec 31 19:00:00 1969
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/amd64
  Version: 3.3.1

@vrothberg
Member

Thanks!

@giuseppe should we do some exponential back off when forwarding the signal?

@giuseppe
Member

giuseppe commented Oct 1, 2021

@giuseppe should we do some exponential back off when forwarding the signal?

don't we risk sending the signal multiple times this way? Would we repeat it only when we get an error from the OCI runtime?

@vrothberg
Member

@giuseppe, yes, we should only do the back off when there's an OCI runtime error.
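As an illustrative shell sketch of that idea ($CTR stands for a container ID; the real logic would live in Podman's signal-proxy code): retry only when the runtime call fails, doubling the pause each attempt, and give up after a few tries.

delay=0.1
for attempt in 1 2 3 4 5; do
    runc kill "$CTR" 18 && break                         # forwarded successfully: done
    sleep "$delay"                                       # runtime error: back off and retry
    delay=$(awk -v d="$delay" 'BEGIN { print d * 2 }')
done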

@giuseppe
Member

giuseppe commented Oct 1, 2021

the error 2021-09-01T00:36:14.000528871Z: open pidfd: No such process happens when the container process has already exited.

One alternative is to check if the container is still running when we get an error. If the container is not running, then we ignore it and stop forwarding signals.

There is a possibility that the container was running when the signal was sent but I think we can live with it.

@giuseppe
Member

giuseppe commented Oct 1, 2021

and I am not even sure this is a bug. If the user tries to send a signal to a container that is already terminated, maybe it is correct to report the problem?

@asottile
Contributor Author

asottile commented Oct 1, 2021

and I am not even sure this is a bug. If the user tries to send a signal to a container that is already terminated, maybe it is correct to report the problem?

fwiw that's not what is happening here -- timeout only sends one signal to the container and there's some race in tearing down the container and reporting the exit code when terminated by that signal

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Nov 22, 2021

@vrothberg @mheon @giuseppe any new thoughts on this one? I just tried it locally and ran 100 containers and had no issues. But my machine is not fully loaded.

@asottile
Contributor Author

in case it helps, here's a --log-level=debug output for a container which hits this

log.30148.txt

@rhatdan
Member

rhatdan commented Nov 22, 2021

The open pidfd error is coming from crun.

@bitstrings
Contributor

bitstrings commented Nov 22, 2021

Nothing changed. When the system is under some load I do get

[screenshot]

I say "under some load" because it seems to be much less frequent otherwise.

@vrothberg
Member

@giuseppe @flouthoc could you eyeball the pidfd logic in crun?

@giuseppe
Member

crun returns that error when the container has already exited.

I think a fully loaded machine just makes it easier to trigger the race where Podman thinks the container is running and sends a signal.

On such failures, Podman should probably ask the runtime if the container is still alive, and ignore any error when it is already terminated.

@flouthoc
Collaborator

@giuseppe @vrothberg I guess in case of error we could make another call to kill and, if it returns unix.ESRCH, the PID is gone and we can terminate gracefully; we already do this for SIGKILL. I could open a PR but want to check if everybody agrees with this.

--- a/libpod/oci_conmon_linux.go
+++ b/libpod/oci_conmon_linux.go
@@ -407,6 +407,11 @@ func (r *ConmonOCIRuntime) KillContainer(ctr *Container, signal uint, all bool)
                args = append(args, "kill", ctr.ID(), fmt.Sprintf("%d", signal))
        }
        if err := utils.ExecCmdWithStdStreams(os.Stdin, os.Stdout, os.Stderr, env, r.path, args...); err != nil {
+               // check whether the container process is already dead; if so,
+               // treat the failed kill as a no-op instead of an error
+               if killErr := unix.Kill(ctr.state.PID, 0); killErr == unix.ESRCH {
+                       return nil
+               }
                return errors.Wrapf(err, "error sending signal to container %s", ctr.ID())
        }

@giuseppe
Member

I think that would not work well with VMs. It is better to go through the OCI runtime ($OCI-RUNTIME state $CTR).
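Roughly, in shell terms (illustrative only; $CTR is a placeholder and the real check belongs in libpod's KillContainer path):

if ! runc kill "$CTR" 15; then
    # ask the runtime, not the kernel, whether the container is still alive;
    # `runc state` prints JSON with a "status" field and fails outright if
    # the container has already been deleted
    status=$(runc state "$CTR" 2>/dev/null | grep -o '"status": *"[a-z]*"')
    case "$status" in
        *running*|*paused*) echo "error sending signal to container $CTR" >&2 ;;
        *) : ;;   # stopped or gone entirely: ignore the failed kill
    esac
fi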

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023