Deadlock when container exits while being killed by podman #15492
Comments
libpod.Kill locks the container's mutex. I tried to create a PR with a fix but, to be honest, it looks a little difficult. The simplest way is to unlock the mutex at the beginning of the libpod.Kill function (as in the Attach function), but a race condition can occur in that case.
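As an illustration of the pattern described above, here is a minimal, hypothetical Go sketch; the `container`, `kill`, and `cleanup` names are invented for this example and are not Podman's actual code. The kill path takes the container lock and then waits for an exit code that can only be recorded by a cleanup path that needs the same lock, so neither side can make progress.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical types for illustration only; not Podman's internals.
type container struct {
	mu       sync.Mutex
	exitCode *int
}

// cleanup stands in for the separate cleanup path that records the
// container's exit code; it needs the lock that kill() is holding.
func (c *container) cleanup(code int) {
	c.mu.Lock() // blocks forever: kill() already holds c.mu
	defer c.mu.Unlock()
	c.exitCode = &code
}

// kill takes the lock and then waits for the exit code to show up,
// which can never happen because cleanup() cannot acquire the lock.
func (c *container) kill() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.exitCode == nil { // waits forever -> effective deadlock
		time.Sleep(10 * time.Millisecond)
	}
	fmt.Println("container exited with", *c.exitCode)
}

func main() {
	c := &container{}
	go func() {
		time.Sleep(50 * time.Millisecond)
		c.cleanup(137) // the container dies while kill() holds the lock
	}()
	c.kill() // never returns; this program hangs by design
}
```

Running this sketch hangs on purpose; it only shows why waiting for the exit code while holding the lock cannot finish once the container exits concurrently.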
@tyler92 could you send SIGHUP to the deadlocked process and paste the stacktrace here?
In real life this can happen when we launch
Can you elaborate on what's deadlocking after the session is closed?
Can you point out that code path? I don't think that's possible.
OK, found it: https://github.com/containers/podman/blob/main/libpod/oci_conmon_common.go#L283
For some reason I can't get the stack through SIGHUP, but I reproduced it in the IDE:
[stack trace attachment]
Excellent, thanks @tyler92. I am sure we'll find a solution.
The same thing as in the description, but in the SSH case SIGTERM is sent by the sshd daemon. Steps to reproduce:
But FYI: I tried to reproduce this on an amd64 machine and had no success. On ARMv7, however, about 50% of attempts lead to the issue. My ARMv7 host is significantly slower; maybe that's the reason why I ran into the deadlocks last time =)
Commit 30e7cbc accidentally added a deadlock as Podman was waiting for the exit code to show up when the container transitioned to stopped. Code paths that require the exit code to be written (by the cleanup process) should already be using `(*Container).Wait()` in a deadlock-free way.

[NO NEW TESTS NEEDED] as I did not manage to find a reproducer that would work in CI. Ultimately, it's a race condition.

Fixes: containers#15492
Signed-off-by: Valentin Rothberg <[email protected]>
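Following the direction described in the commit message (never wait for the exit code while holding the lock, in the spirit of a deadlock-free `(*Container).Wait()`), the earlier hypothetical sketch could be reworked as below. Again, the names are invented for illustration and do not mirror Podman's real implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Same hypothetical types as in the earlier sketch; not Podman code.
type container struct {
	mu       sync.Mutex
	exitCode *int
}

// cleanup records the exit code under the lock, as before.
func (c *container) cleanup(code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.exitCode = &code
}

// waitForExit polls for the exit code, taking the lock only briefly for
// each check, so the cleanup path can always acquire it in between.
func (c *container) waitForExit(poll time.Duration) int {
	for {
		c.mu.Lock()
		code := c.exitCode
		c.mu.Unlock()
		if code != nil {
			return *code
		}
		time.Sleep(poll)
	}
}

// kill signals the container while holding the lock, then releases the
// lock before waiting, so a concurrent exit no longer deadlocks it.
func (c *container) kill() {
	c.mu.Lock()
	// ... send the signal to the container process here ...
	c.mu.Unlock()

	fmt.Println("container exited with", c.waitForExit(10*time.Millisecond))
}

func main() {
	c := &container{}
	go func() {
		time.Sleep(50 * time.Millisecond)
		c.cleanup(137) // container exits while kill() is waiting
	}()
	c.kill() // completes instead of deadlocking
}
```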
Sure, give me some time.
After several iterations, I don't see deadlocks, but it looks a little bit strange.
This process list keeps growing until the error from (1) occurs. And even if I stop my script, the processes still exist.
I don't have a good explanation for that observation. @mheon WDYT?
The hung crun instances are concerning: those are OCI runtimes that have escaped the supervision of a Conmon. They could be completely hung in their setup code, but
Ok, thanks for the fix. I'll investigate other problems and create separate issues with details if necessary.
Thanks a lot, @tyler92!
Commit 30e7cbc accidentally added a deadlock as Podman was waiting for the exit code to show up when the container transitioned to stopped. Code paths that require the exit code to be written (by the cleanup process) should already be using `(*Container).Wait()` in a deadlock-free way.

[NO NEW TESTS NEEDED] as I did not manage to find a reproducer that would work in CI. Ultimately, it's a race condition.

Backport-for: containers#15492
Signed-off-by: Valentin Rothberg <[email protected]>
I'm not ready to create a separate issue right now, but I gathered logs: podman-run.log (here is one line from the test script). It looks like
If Podman is ending up in a bad state (
Does it look like a bad state?
Yep. The container is stuck in the Created state when it should be Initialized. It doesn't have a PID registered for Conmon. It looks like Podman died before it could register that Conmon had started.
Commit 30e7cbc accidentally added a deadlock as Podman was waiting for the exit code to show up when the container transitioned to stopped. Code paths that require the exit code to be written (by the cleanup process) should already be using `(*Container).Wait()` in a deadlock-free way.

[NO NEW TESTS NEEDED] as I did not manage to find a reproducer that would work in CI. Ultimately, it's a race condition.

Backport-for: containers#15492
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2124716
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2125647
Signed-off-by: Valentin Rothberg <[email protected]>
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
There is a situation where podman tries to lock the same mutex twice and deadlocks. After that, all podman commands stop working because they wait forever for this mutex to be released. There are two ways to reproduce this issue: a realistic one and an artificial one. The following steps describe the artificial but simple way:
Steps to reproduce the issue:
Describe the results you received:
After some time the script will hang.
Describe the results you expected:
The script should run indefinitely and keep doing its work.
Additional information you deem important (e.g. issue happens only occasionally):
The issue happens only occasionally. In my case it is not reproducible on the amd64 machine, but it reproduces very often on my weak ARMv7 device. When the issue happens I get the following output:
and the process list looks like:
Output of `podman version`:

Output of `podman info`:

Package info (e.g. output of `rpm -q podman` or `apt list podman`):

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):