
race condition yielding "Cannot get exit code: died not found: unable to find event" #11633

Closed
dustymabe opened this issue Sep 17, 2021 · 20 comments · Fixed by #11681

@dustymabe (Contributor)

/kind bug

Description

xref: sister report against FCOS: coreos/fedora-coreos-tracker#966

In Fedora CoreOS we have a test that exercises various invocations of podman with different options. It runs in a tight loop, and I believe it has exposed a race condition where the following error is displayed:

ERRO[0000] Cannot get exit code: died not found: unable to find event

and the exit code of the podman run is 127.

Steps to reproduce the issue:

Currently we're only seeing this on AWS aarch64 FCOS instances. Once you have access to an instance:

  • Build the echo container:
tmpdir=$(mktemp -d); cd $tmpdir; echo -e "FROM scratch\nCOPY . /" > Dockerfile;
b=$(which echo); libs=$(sudo ldd $b | grep -o /lib'[^ ]*' | sort -u);
sudo rsync -av --relative --copy-links $b $libs ./;
sudo podman build --network host --layers=false -t localhost/echo .
  • Run this script, which loops until one of the commands fails:
$ cat /tmp/script.sh 
#!/bin/bash
set -eux -o pipefail
while true; do
    sudo podman run --net=none --rm --memory=128m --memory-swap=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory-reservation=10m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-shares=100 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-period=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-cpus=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-mems=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-quota=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --blkio-weight=10 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --shm-size=1m echo echo 1 > /tmp/output.txt
done

Describe the results you received:
The error above, and an exit code of 127 from podman run.

Describe the results you expected:
No error.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      3.3.1
API Version:  3.3.1
Go Version:   go1.16.6
Built:        Mon Aug 30 20:45:47 2021
OS/Arch:      linux/arm64

Output of podman info --debug:

$ sudo podman info --debug                                                                                                                                                                                                  
host:
  arch: arm64
  buildahVersion: 1.22.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.29-2.fc34.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: '
  cpus: 4
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: ip-172-31-81-157
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.13.16-200.fc34.aarch64
  linkmode: dynamic
  memFree: 7625285632
  memTotal: 8154873856
  ociRuntime:
    name: crun
    package: crun-1.0-1.fc34.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.0
      commit: 139dc6971e2f1d931af520188763e984d6cdfbf8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc34.aarch64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 0
  swapTotal: 0
  uptime: 38m 7.93s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 1630356347
  BuiltTime: Mon Aug 30 20:45:47 2021
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/arm64
  Version: 3.3.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.3.1-1.fc34.aarch64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

AWS aarch64 ami-0d04187158a93719f Fedora CoreOS 34.20210917.20.0 on c6g.xlarge instance type.

@dustymabe (Contributor, Author)

Note that adding --log-level=debug seemed to make the problem go away (hence the race-condition suspicion).

@dustymabe (Contributor, Author)

Example run of the script:

$ bash /tmp/script.sh 
+ true
+ sudo podman run --net=none --rm --memory=128m --memory-swap=128m echo echo 1
+ sudo podman run --net=none --rm --memory-reservation=10m echo echo 1
+ sudo podman run --net=none --rm --cpu-shares=100 echo echo 1
+ sudo podman run --net=none --rm --cpu-period=1000 echo echo 1
ERRO[0000] Cannot get exit code: died not found: unable to find event

@rhatdan (Member) commented Sep 17, 2021

@mheon PTAL

@rhatdan (Member) commented Sep 17, 2021

@mheon, a thought: this might be the podman container cleanup code throwing the error. Since podman run --rm cleans up the container itself, would it make more sense to not launch the container cleanup code from conmon at all?

@mheon (Member) commented Sep 17, 2021

There's no guarantee that the user doesn't cause Podman to exit before the container stops - it could be a SIGTERM to Podman, or the detach keys being used to disconnect from the container. We need the cleanup process.

The error here is very strange, because this should only be happening on an external podman rm -f of the container in question. In that case, we can't retrieve the exit code from the DB, so we have to look back in the events. I think the logic for that might not be working properly.
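
For illustration, a hedged Go sketch of the fallback just described, using hypothetical stand-in types and helper names rather than the actual libpod code: the container record is gone from the DB, so the exit code must be recovered from the container's "died" event, and a missing event surfaces as exactly the error in this report.

package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the libpod types involved.
type Event struct{ ContainerExitCode int }

type Runtime struct{}

// GetLastContainerEvent sketches a lookup of the newest "died" event
// for a container in the configured events backend (journald here).
func (r *Runtime) GetLastContainerEvent(ctx context.Context, ctrID, status string) (*Event, error) {
	// ...scan the journal for the newest matching entry...
	return nil, errors.New("died not found: unable to find event")
}

// exitCodeFromEvents is the fallback path: the container is gone from
// the DB, so recover the exit code from its most recent "died" event.
func exitCodeFromEvents(ctx context.Context, rt *Runtime, ctrID string) (int, error) {
	ev, err := rt.GetLastContainerEvent(ctx, ctrID, "died")
	if err != nil {
		// Surfaces as "Cannot get exit code: ..."; podman exits 127.
		return 127, fmt.Errorf("cannot get exit code: %w", err)
	}
	return ev.ContainerExitCode, nil
}

func main() {
	code, err := exitCodeFromEvents(context.Background(), &Runtime{}, "36474026d77b")
	fmt.Println(code, err)
}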

@mheon (Member) commented Sep 17, 2021

Ah, it's probably the script - podman run is no longer running at interactive priority, so the cleanup process has an even chance of getting in first.

The question becomes - why isn't the event being written?

@mheon (Member) commented Sep 17, 2021

The script for building the echo container doesn't work. I am going to substitute busybox in an attempt to reproduce.

@mheon (Member) commented Sep 17, 2021

Naturally, that was just enough to make it not reproduce...

@mheon (Member) commented Sep 17, 2021

I did follow the code flow here, and I don't see how we can get into a position where the container is removed without the event having been written - there's only one codepath to take a container from Running to Stopped, and it forces a read of the exit file and creation of an Exited event. There's no possibility of a race, given that the container is locked for the duration of the operation.
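
A simplified Go sketch of that flow, with illustrative names only (not the actual libpod source): the container lock is held for the whole Running-to-Stopped transition, which reads conmon's exit file and emits the Exited ("died") event before the lock is released, so the event should exist before any removal can proceed.

package container

import (
	"os"
	"strconv"
	"strings"
	"sync"
)

type Container struct {
	mu       sync.Mutex
	state    string
	exitCode int
}

// readConmonExitFile parses the exit code conmon writes when the
// container process exits.
func readConmonExitFile(path string) (int, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return -1, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}

// handleStopped is the single Running -> Stopped codepath: the lock is
// held across reading the exit file, updating state, and emitting the
// "died" event, so there is no window for a racing removal.
func (c *Container) handleStopped(exitFilePath string, emitDiedEvent func(code int)) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	code, err := readConmonExitFile(exitFilePath)
	if err != nil {
		return err
	}
	c.exitCode = code
	c.state = "stopped"
	emitDiedEvent(code) // the "died" event carries the exit code
	return nil
}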

@mheon (Member) commented Sep 17, 2021

Reproduced on a VM provided by @dustymabe

I can confirm that the event does exist (irrelevant bits snipped):

podman[10247]: 2021-09-17 18:00:10.937100544 +0000 UTC m=+0.403416945 container init 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)
podman[10247]: 2021-09-17 18:00:10.943747205 +0000 UTC m=+0.410063615 container start 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)
podman[10247]: 2021-09-17 18:00:10.943893511 +0000 UTC m=+0.410209945 container attach 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)
podman[10288]: 2021-09-17 18:00:11.000245718 +0000 UTC m=+0.046318191 container died 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap)
podman[10288]: 2021-09-17 18:00:11.020850194 +0000 UTC m=+0.066922667 container remove 36474026d77b008641e37da5567b38ed30c9c8af47f96cb024a0b1147327e8f8 (image=localhost/echo:latest, name=swap, io.buildah.version=1.22.3)

@mheon (Member) commented Sep 17, 2021

From what I'm seeing, every container (including the ones that do not fail) is being removed by the cleanup process.

I think the race is somehow around the journal.

@mheon (Member) commented Sep 17, 2021

I swapped the events backend with --events-backend=file and it no longer reproduces.

Current theory: either we're writing the event to the journal too late (from the timestamps in the example, that doesn't seem likely?), or the way we're reading the journal can sometimes miss lines.

@mheon (Member) commented Sep 17, 2021

Suspicion: we had a similar problem with logs, fixed by @nalind in 1411fa5

@mheon (Member) commented Sep 17, 2021

Alright, here's my current working theory: the journald events code unconditionally calls Next() on the cursor, so we can never return the first entry in the journal. If the first entry happens to be the died event (I don't see how it could be, given there should always be a remove event afterwards, but this is the best I have...), then we miss it.
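
A hedged sketch of the suspect pattern, using github.com/coreos/go-systemd/v22/sdjournal; the loop is illustrative, not the actual libpod reading code.

package events

import "github.com/coreos/go-systemd/v22/sdjournal"

func readFromCursor(j *sdjournal.Journal, cursor string) error {
	// Position the read pointer at a previously saved cursor.
	if err := j.SeekCursor(cursor); err != nil {
		return err
	}
	for {
		// Theory: Next() runs unconditionally at the top of the loop,
		// so the entry the cursor points at - the first entry - is
		// skipped before it can ever be returned.
		n, err := j.Next()
		if err != nil {
			return err
		}
		if n == 0 {
			break // caught up with the end of the journal
		}
		entry, err := j.GetEntry()
		if err != nil {
			return err
		}
		_ = entry // ...parse the entry into a libpod event...
	}
	return nil
}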

@mheon (Member) commented Sep 17, 2021

Update: I added a check to ensure that Next() did not run the first time through the loop, but it did not resolve the issue.

@Luap99 (Member) commented Sep 20, 2021

@dustymabe If you want to fix this race for now in your tests, this should work:

...
sudo podman run --net=none --memory=128m --memory-swap=128m echo echo 1 > /tmp/output.txt
sudo podman rm -f -l
...

@mheon (Member) commented Sep 20, 2021

New approach: if we fail to grab the event, retry up to N times (probably 3-5) at some interval (a quarter-second?) to give the journal time to update.
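
A minimal Go sketch of that retry loop, assuming the parameters the eventual fix used (4 attempts, 0.25s apart, per the #11681 commit message below); the lookup function and error value are hypothetical stand-ins.

package events

import (
	"errors"
	"time"
)

var errEventNotFound = errors.New("unable to find event")

// getDiedEventWithRetry retries a failing event lookup to give journald
// time to flush the entry before we give up.
func getDiedEventWithRetry(lookup func() (int, error)) (int, error) {
	const (
		maxTries = 4
		delay    = 250 * time.Millisecond
	)
	var (
		code int
		err  error
	)
	for i := 0; i < maxTries; i++ {
		code, err = lookup()
		if err == nil {
			return code, nil
		}
		if !errors.Is(err, errEventNotFound) {
			return 0, err // only retry when the event is missing
		}
		time.Sleep(delay) // the journal may not have been flushed yet
	}
	return 0, err
}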

@rhatdan (Member) commented Sep 20, 2021

SGTM

@vrothberg (Member)

SGTM as well. I took a look and really cannot spot an issue in the code.

@mheon (Member) commented Sep 21, 2021

#11681 seems to fix this.

mheon added a commit to mheon/libpod that referenced this issue Sep 21, 2021
There's a potential race around extremely short-running
containers and events with journald. Events may not be written
for some time (small, but appreciable) after they are received,
and as such we can fail to retrieve it if there is a sufficiently
short time between us writing the event and trying to read it.

Work around this by just retrying, with a 0.25 second delay
between retries, up to 4 times.

[NO TESTS NEEDED] because I have no idea how to reproduce this
race in CI.

Fixes containers#11633

Signed-off-by: Matthew Heon <[email protected]>
mheon added a commit to mheon/libpod that referenced this issue Sep 22, 2021