
race condition yielding "Cannot get exit code: died not found: unable to find event" #11633

Closed
dustymabe opened this issue Sep 17, 2021 · 20 comments · Fixed by #11681

@dustymabe (Contributor)

/kind bug

Description

xref: sister report against FCOS: coreos/fedora-coreos-tracker#966

In Fedora CoreOS we have a test that exercises various invocations of podman with different options. It runs in a tight loop, and I believe it has exposed a race condition where the following error is displayed:

ERRO[0000] Cannot get exit code: died not found: unable to find event

and the exit code of the podman run is 127.

Steps to reproduce the issue:

Currently we're only seeing this on AWS aarch64 FCOS instances. Once you have access to an instance:

  • Build the echo container:
tmpdir=$(mktemp -d); cd $tmpdir; echo -e "FROM scratch\nCOPY . /" > Dockerfile;
b=$(which echo); libs=$(sudo ldd $b | grep -o /lib'[^ ]*' | sort -u);
sudo rsync -av --relative --copy-links $b $libs ./;
sudo podman build --network host --layers=false -t localhost/echo .
  • Run this script, which loops until one of the commands fails:
$ cat /tmp/script.sh 
#!/bin/bash
set -eux -o pipefail
while true; do
    sudo podman run --net=none --rm --memory=128m --memory-swap=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory-reservation=10m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-shares=100 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-period=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-cpus=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-mems=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-quota=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --blkio-weight=10 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --shm-size=1m echo echo 1 > /tmp/output.txt
done

Describe the results you received:
The error above, and an exit code of 127 from podman run.

Describe the results you expected:
No error.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      3.3.1
API Version:  3.3.1
Go Version:   go1.16.6
Built:        Mon Aug 30 20:45:47 2021
OS/Arch:      linux/arm64

Output of podman info --debug:

$ sudo podman info --debug                                                                                                                                                                                                  
host:
  arch: arm64
  buildahVersion: 1.22.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.29-2.fc34.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: '
  cpus: 4
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: ip-172-31-81-157
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.13.16-200.fc34.aarch64
  linkmode: dynamic
  memFree: 7625285632
  memTotal: 8154873856
  ociRuntime:
    name: crun
    package: crun-1.0-1.fc34.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.0
      commit: 139dc6971e2f1d931af520188763e984d6cdfbf8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc34.aarch64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 0
  swapTotal: 0
  uptime: 38m 7.93s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 1630356347
  BuiltTime: Mon Aug 30 20:45:47 2021
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/arm64
  Version: 3.3.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.3.1-1.fc34.aarch64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

AWS aarch64 ami-0d04187158a93719f Fedora CoreOS 34.20210917.20.0 on c6g.xlarge instance type.

@dustymabe (Contributor, Author)

Note that adding --log-level=debug seemed to make the problem go away (hence the race-condition suspicion).

@dustymabe (Contributor, Author)

Example run of the script:

$ bash /tmp/script.sh 
+ true
+ sudo podman run --net=none --rm --memory=128m --memory-swap=128m echo echo 1
+ sudo podman run --net=none --rm --memory-reservation=10m echo echo 1
+ sudo podman run --net=none --rm --cpu-shares=100 echo echo 1
+ sudo podman run --net=none --rm --cpu-period=1000 echo echo 1
ERRO[0000] Cannot get exit code: died not found: unable to find event

@rhatdan (Member) commented Sep 17, 2021

@mheon PTAL

@rhatdan (Member) commented Sep 17, 2021

@mheon, a thought: this might be the podman container cleanup code throwing the error. Since podman run --rm cleans up the container itself, would it make more sense to not launch the container cleanup code from conmon at all?

@mheon (Member) commented Sep 17, 2021

There's no guarantee that the user doesn't cause Podman to exit before the container stops - it could be a SIGTERM to Podman, or the detach keys being used to disconnect from the container. We need the cleanup process.

The error here is very strange, because this should only be happening on an external podman rm -f of the container in question. In that case, we can't retrieve the exit code from the DB, so we have to look back in the events. I think the logic for that might not be working properly.
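
For illustration, a hedged Go sketch of the fallback just described, using hypothetical stand-in types and helper names rather than the actual libpod code: the container record is gone from the DB, so the exit code must be recovered from the container's "died" event, and a missing event surfaces as exactly the error in this report.

package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the libpod types involved.
type Event struct{ ContainerExitCode int }

type Runtime struct{}

// GetLastContainerEvent sketches a lookup of the newest "died" event
// for a container in the configured events backend (journald here).
func (r *Runtime) GetLastContainerEvent(ctx context.Context, ctrID, status string) (*Event, error) {
	// ...scan the journal for the newest matching entry...
	return nil, errors.New("died not found: unable to find event")
}

// exitCodeFromEvents is the fallback path: the container is gone from
// the DB, so recover the exit code from its most recent "died" event.
func exitCodeFromEvents(ctx context.Context, rt *Runtime, ctrID string) (int, error) {
	ev, err := rt.GetLastContainerEvent(ctx, ctrID, "died")
	if err != nil {
		// Surfaces as "Cannot get exit code: ..."; podman exits 127.
		return 127, fmt.Errorf("cannot get exit code: %w", err)
	}
	return ev.ContainerExitCode, nil
}

func main() {
	code, err := exitCodeFromEvents(context.Background(), &Runtime{}, "36474026d77b")
	fmt.Println(code, err)
}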

@mheon (Member) commented Sep 17, 2021

Ah, it's probably the script - podman run is no longer running at interactive priority, so the cleanup process has an even chance of getting in first.

The question becomes - why isn't the event being written?

@mheon (Member) commented Sep 17, 2021

The script for building the echo container doesn't work. I am going to substitute busybox in an attempt to reproduce.

@mheon (Member) commented Sep 17, 2021

Naturally, that was just enough to make it not reproduce...

@mheon (Member) commented Sep 17, 2021

I did follow the code flow here, and I don't see how we can get into a position where the container is removed without the event having been written - there's only one codepath to take a container from Running to Stopped, and it forces a read of the exit file and creation of an Exited event. There's no possibility of a race, given that the container is locked for the duration of the operation.
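
A simplified Go sketch of that flow, with illustrative names only (not the actual libpod source): the container lock is held for the whole Running-to-Stopped transition, which reads conmon's exit file and emits the Exited ("died") event before the lock is released, so the event should exist before any removal can proceed.

package container

import (
	"os"
	"strconv"
	"strings"
	"sync"
)

type Container struct {
	mu       sync.Mutex
	state    string
	exitCode int
}

// readConmonExitFile parses the exit code conmon writes when the
// container process exits.
func readConmonExitFile(path string) (int, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return -1, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}

// handleStopped is the single Running -> Stopped codepath: the lock is
// held across reading the exit file, updating state, and emitting the
// "died" event, so there is no window for a racing removal.
func (c *Container) handleStopped(exitFilePath string, emitDiedEvent func(code int)) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	code, err := readConmonExitFile(exitFilePath)
	if err != nil {
		return err
	}
	c.exitCode = code
	c.state = "stopped"
	emitDiedEvent(code) // the "died" event carries the exit code
	return nil
}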

@mheon (Member) commented Sep 17, 2021

Reproduced on a VM provided by @dustymabe

I can confirm that the event does exist (irrelevant bits snipped):

podman[10247]: 2021-09-17 18:00:10.937100544 +0000 UTC m=+0.403416945 container init 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)
podman[10247]: 2021-09-17 18:00:10.943747205 +0000 UTC m=+0.410063615 container start 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)
podman[10247]: 2021-09-17 18:00:10.943893511 +0000 UTC m=+0.410209945 container attach 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)
podman[10288]: 2021-09-17 18:00:11.000245718 +0000 UTC m=+0.046318191 container died 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap)
podman[10288]: 2021-09-17 18:00:11.020850194 +0000 UTC m=+0.066922667 container remove 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)

@mheon (Member) commented Sep 17, 2021

From what I'm seeing, every container (including the ones that do not fail) is being removed by the cleanup process.

I think the race is somehow around the journal.

@mheon (Member) commented Sep 17, 2021

I swapped the events backend with --events-backend=file and it no longer reproduces.

Current theory: either we're writing the event to the journal too late (from the timestamps in the example, that doesn't seem likely?), or the way we're reading the journal can sometimes miss lines.

@mheon (Member) commented Sep 17, 2021

Suspicion: we had a similar problem with logs, fixed by @nalind in 1411fa5

@mheon (Member) commented Sep 17, 2021

Alright, here's my current working theory: the journald events code unconditionally calls Next() on the cursor, so we can never return the first entry in the journal. If the first entry happens to be the died event (I don't see how it could be, given there should always be a remove event afterwards, but this is the best I have...), then we miss it.
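
A hedged sketch of the suspect pattern, using github.com/coreos/go-systemd/v22/sdjournal; the loop is illustrative, not the actual libpod reading code.

package events

import "github.com/coreos/go-systemd/v22/sdjournal"

func readFromCursor(j *sdjournal.Journal, cursor string) error {
	// Position the read pointer at a previously saved cursor.
	if err := j.SeekCursor(cursor); err != nil {
		return err
	}
	for {
		// Theory: Next() runs unconditionally at the top of the loop,
		// so the entry the cursor points at - the first entry - is
		// skipped before it can ever be returned.
		n, err := j.Next()
		if err != nil {
			return err
		}
		if n == 0 {
			break // caught up with the end of the journal
		}
		entry, err := j.GetEntry()
		if err != nil {
			return err
		}
		_ = entry // ...parse the entry into a libpod event...
	}
	return nil
}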

@mheon (Member) commented Sep 17, 2021

Update: I added a check to ensure that Next() did not run the first time through the loop, but it did not resolve the issue.

@Luap99 (Member) commented Sep 20, 2021

@dustymabe If you want to fix this race for now in your tests, this should work:

...
sudo podman run --net=none --memory=128m --memory-swap=128m echo echo 1 > /tmp/output.txt
sudo podman rm -f -l
...

@mheon (Member) commented Sep 20, 2021

New approach: if we fail to grab the event, retry up to N times (probably 3-5) at some interval (a quarter-second?) to give the journal time to update.
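
A minimal Go sketch of that retry loop, assuming the parameters the eventual fix used (4 attempts, 0.25s apart, per the #11681 commit message below); the lookup function and error value are hypothetical stand-ins.

package events

import (
	"errors"
	"time"
)

var errEventNotFound = errors.New("unable to find event")

// getDiedEventWithRetry retries a failing event lookup to give journald
// time to flush the entry before we give up.
func getDiedEventWithRetry(lookup func() (int, error)) (int, error) {
	const (
		maxTries = 4
		delay    = 250 * time.Millisecond
	)
	var (
		code int
		err  error
	)
	for i := 0; i < maxTries; i++ {
		code, err = lookup()
		if err == nil {
			return code, nil
		}
		if !errors.Is(err, errEventNotFound) {
			return 0, err // only retry when the event is missing
		}
		time.Sleep(delay) // the journal may not have been flushed yet
	}
	return 0, err
}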

@rhatdan (Member) commented Sep 20, 2021

SGTM

@vrothberg (Member)

SGTM as well. I took a look and really cannot spot an issue in the code.

@mheon (Member) commented Sep 21, 2021

#11681 seems to fix this.

mheon added a commit to mheon/libpod that referenced this issue Sep 21, 2021
There's a potential race around extremely short-running
containers and events with journald. Events may not be written
for some time (small, but appreciable) after they are received,
and as such we can fail to retrieve it if there is a sufficiently
short time between us writing the event and trying to read it.

Work around this by just retrying, with a 0.25 second delay
between retries, up to 4 times.

[NO TESTS NEEDED] because I have no idea how to reproduce this
race in CI.

Fixes containers#11633

Signed-off-by: Matthew Heon <[email protected]>
mheon added a commit to mheon/libpod that referenced this issue Sep 22, 2021