race condition yielding "Cannot get exit code: died not found: unable to find event" #11633
Comments
Note that adding |
example run of script:
|
@mheon PTAL
@mheon A thought: this might be the podman container cleanup code throwing the error. Since podman run --rm will clean up the container itself, would it make more sense not to launch the container cleanup code in conmon at all?
No guarantee that the user doesn't cause Podman to exit before the container stops - could be a SIGTERM to Podman, could just be using the disconnect keys to detach from the containers. We need the cleanup process. The error here is very strange, because this should only be happening on an external |
Ah, it's probably the script - The question becomes - why isn't the event being written? |
The script for building the echo container doesn't work. I am going to substitute busybox in an attempt to reproduce.
Naturally, that was just enough to make it not reproduce...
I did follow the code flow here, and I don't see how we can get into a position where the container is removed without having written the event. There's only one codepath to get a container from Running into Stopped, and it forces a read of the exit file and creation of an Exited event. There is no possibility of races, given that the container is locked for the duration of the operation.
Reproduced on a VM provided by @dustymabe. I can confirm that the event does exist (irrelevant bits snipped):
|
From what I'm seeing, every container (including the ones that do not fail) is being removed by the cleanup process. I think the race is somehow around the journal.
I swapped events backend around with Current theory: Either we're writing the event to the journal too late (from the timestamps in the example, that doesn't seem likely?) or the way we're reading the journal can sometimes miss lines? |
Alright, here's my current working theory: the journald events code unconditionally calls |
Update: I added a check to ensure that |
@dustymabe If you want to fix this race for now in your tests, this should work:
|
New theory: If we fail to grab the event, retry up to N times (probably 3-5) at some rate (quarter-second?) to give the journal time to update.
SGTM
SGTM as well. I took a look and really cannot spot an issue in the code.
#11681 seems to fix |
There's a potential race around extremely short-running containers and events with journald. Events may not be written for some time (small, but appreciable) after they are received, and as such we can fail to retrieve it if there is a sufficiently short time between us writing the event and trying to read it. Work around this by just retrying, with a 0.25 second delay between retries, up to 4 times. [NO TESTS NEEDED] because I have no idea how to reproduce this race in CI. Fixes containers#11633 Signed-off-by: Matthew Heon <[email protected]>
/kind bug
Description
xref: sister report against FCOS: coreos/fedora-coreos-tracker#966
In Fedora CoreOS we have a test that tests various invocations of podman with different options. It runs in a tight loop and I believe it has exposed a race condition where the following error gets displayed:
and the exit code of the podman run is 127.
Steps to reproduce the issue:
Currently we're only seeing this on AWS aarch64 FCOS instances. Once you have access to an instance:
echo container
Describe the results you received:
Error
Describe the results you expected:
No Error
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
Output of podman info --debug:
Package info (e.g. output of rpm -q podman or apt list podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)
No
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS aarch64 ami-0d04187158a93719f Fedora CoreOS 34.20210917.20.0 on c6g.xlarge instance type.