Podman lock contention when attempting to restart multiple containers #11940
This doesn't seem like a deadlock - it more seems like Podman is constantly attempting to restart containers, resulting in at least one container having its lock taken at all times, so other Podman commands end up waiting on those locks. Is this a particularly slow system you're testing on? It could explain why things appear to deadlock. I'm fairly convinced there's no actual deadlock here, just a severely taxed system.
Thanks for looking into this @mheon. Yeah, it's a dedicated server with 48 cores, and the deadlocking is somewhat inconsistent for me. I tried it again and couldn't get it to deadlock, but other times it deadlocks within the first couple of restart cycles on 8 containers. I would let podman commands hang for 5-10 minutes before removing the lock file and killing processes. I'm using
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
FWIW, I think that
I'll take a stab at it.
Scratch that ... these operations are batched.
I was seeing this earlier this week in a slightly different context. One thought I have is to print results as they come, instead of all at once when the command is done. This isn't perfect, but it would make it a lot clearer to the user what is happening (at least, it will be obvious that the commands are not deadlocked).
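A minimal Go sketch of that "print results as they arrive" idea, not Podman's actual code; the `result` type and `restartContainer` are invented for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// result is a hypothetical per-container outcome; real Podman batch
// operations return richer types.
type result struct {
	id  string
	err error
}

// restartContainer is a stand-in for whatever per-container work is batched.
func restartContainer(id string) error { return nil }

func main() {
	ids := []string{"ctr1", "ctr2", "ctr3"}
	results := make(chan result)

	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			results <- result{id: id, err: restartContainer(id)}
		}(id)
	}
	// Close the channel once all workers are done so the printer loop ends.
	go func() { wg.Wait(); close(results) }()

	// Print each result as soon as it arrives instead of collecting
	// everything and printing at the end, so one slow container doesn't
	// make the whole command look hung.
	for r := range results {
		if r.err != nil {
			fmt.Printf("%s: %v\n", r.id, r.err)
			continue
		}
		fmt.Println(r.id)
	}
}
```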
Other possible thought: randomize the order in which we act on containers.
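A minimal sketch of that randomization idea, nothing Podman-specific; it just shuffles a hypothetical list of container IDs before acting on them:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	ctrs := []string{"ctr1", "ctr2", "ctr3", "ctr4"}

	// Shuffle the slice in place so each invocation walks the containers
	// in a different order; with many parallel callers this only lowers,
	// not eliminates, the chance of queuing up on the same lock.
	rand.Shuffle(len(ctrs), func(i, j int) {
		ctrs[i], ctrs[j] = ctrs[j], ctrs[i]
	})

	for _, c := range ctrs {
		fmt.Println("processing", c)
	}
}
```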
I added a bit of randomization to the ordering, but it wasn't enough - no appreciable increase in performance; there are still too many collisions (25 parallel jobs over 200 test containers meant
@mheon, that is a great trail you're on. Maybe we should think in terms of a work pool rather than in terms of workers per caller. Could we have a global shared semaphore to limit the number of parallel batch workers? That would limit lock contention etc. AFAIK the locks are already fair.
We do have a semaphore right now, but it's per-process, not global. Making it global is potentially interesting, if we can get a MP-safe shared-memory semaphore.
Shared semaphore looks viable. My only concern is making sure that crashes and SIGKILL don't affect us - if, say,
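For illustration, a per-process limit of the kind described above is typically a weighted semaphore. This is only a sketch using `golang.org/x/sync/semaphore`, not libpod's actual code, and it shows why the cap does not help across separate podman processes:

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

// jobs caps how many container operations run at once -- but only within
// this one process. Two podman invocations each get their own limit, so
// total contention on the shared container locks is still unbounded.
var jobs = semaphore.NewWeighted(4)

func doContainerOp(id string) { fmt.Println("working on", id) }

func main() {
	ctx := context.Background()
	var wg sync.WaitGroup
	for _, id := range []string{"a", "b", "c", "d", "e", "f"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := jobs.Acquire(ctx, 1); err != nil {
				return
			}
			defer jobs.Release(1)
			doContainerOp(id)
		}(id)
	}
	wg.Wait()
}
```

A truly global variant would have to live in shared memory and somehow release its slot when a holder is SIGKILLed, which is exactly the concern raised above.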
@mheon Any movement on this?
Negative. Might be worth discussing at the cabal if we have time? I don't have a solid feel on how to fix this.
I have investigated this issue (it reproduces in my case too). A simple program based on the shm_lock code shows the following picture:

462207 is the process that is started when the restart occurs. These processes are deadlocked because they are waiting on each other (a lock-ordering problem).

So it looks like we should lock a container's pod before locking the container. Is that a good idea?
That is definitely a separate issue, please file a new bug for it
No problem: #14921
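To make the lock-ordering point concrete, here is a minimal Go sketch (not libpod code, all names invented) of the pod-before-container ordering proposed above; if every path that needs both locks takes them in this one order, two processes can no longer end up waiting on each other in a cycle:

```go
package main

import (
	"fmt"
	"sync"
)

type pod struct {
	mu   sync.Mutex
	name string
}

type container struct {
	mu   sync.Mutex
	name string
	pod  *pod
}

// withContainerLocked acquires locks in a single fixed order: pod first,
// then container. If one caller did container->pod while another did
// pod->container, each could hold one lock and wait forever on the other --
// the classic lock-ordering deadlock described above.
func withContainerLocked(c *container, fn func()) {
	if c.pod != nil {
		c.pod.mu.Lock()
		defer c.pod.mu.Unlock()
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	fn()
}

func main() {
	p := &pod{name: "pod1"}
	c := &container{name: "ctr1", pod: p}
	withContainerLocked(c, func() { fmt.Println("restarting", c.name) })
}
```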
A friendly reminder that this issue had no activity for 30 days.
@mheon Any progress on this?
Negative. I don't think we have a good solution yet.
A friendly reminder that this issue had no activity for 30 days.
@gcs278, running
FWIW, I had another look at the issue. I couldn't see any deadlocks and
It's funny that we just had a discussion with a BU student about where restart always might come in handy. Imagine you have two or more containers in a pod, or multiple pods, that require services from each other. In Compose you can set which containers need to come up first before a second container starts. In podman we sequentially start the containers, and if Container A requires Container B, then when Container A fails we fail without ever starting Container B. If they all started simultaneously, then Container A could fail, Container B would succeed, and when Container A restarted, Container B would be running, and we would get to a good state. I think the current design is that Container A keeps restarting and Container B never gets a chance. I think if we fix this simultaneous start, then restart always will make some sense.
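A rough sketch of that "start everything at once and let restarts converge" idea; `startContainer`, `dependencyReady`, and the backoff are all invented for illustration, and this is not how podman currently behaves:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// dependencyReady and startContainer are stand-ins for real checks; the
// names are made up for this sketch.
func dependencyReady(dep string, ready map[string]bool, mu *sync.Mutex) bool {
	mu.Lock()
	defer mu.Unlock()
	return dep == "" || ready[dep]
}

func startContainer(name, dep string, ready map[string]bool, mu *sync.Mutex) error {
	if !dependencyReady(dep, ready, mu) {
		return errors.New(name + ": dependency not ready")
	}
	mu.Lock()
	ready[name] = true
	mu.Unlock()
	return nil
}

func main() {
	ready := map[string]bool{}
	var mu sync.Mutex
	var wg sync.WaitGroup

	// Start A and B at the same time. A depends on B, so its first attempts
	// fail, but the restart loop keeps retrying until B is up and the set
	// converges to a good state instead of never starting B at all.
	for _, c := range []struct{ name, dep string }{{"B", ""}, {"A", "B"}} {
		wg.Add(1)
		go func(name, dep string) {
			defer wg.Done()
			for {
				if err := startContainer(name, dep, ready, &mu); err == nil {
					fmt.Println(name, "started")
					return
				}
				time.Sleep(50 * time.Millisecond) // restart-policy backoff
			}
		}(c.name, c.dep)
	}
	wg.Wait()
}
```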
I'll take a stab and close the issue. As mentioned in #11940 (comment), things have improved considerably since the initial report in Oct '21. Feel free to drop a comment or reopen if you think otherwise.
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
With a restart policy as
always
oron-failed
, podman seems to really struggle and potentially deadlock when it is restarting multiple containers that are constantly exiting. I first noticed this problem with usingpodman play kube
where a couple containers wereconstantly dying and the restart policy wasalways
. I then added an script with justexit 1
as the entrypoint and watched podman commands being to hang longer.I started 8 instances of
exit 1
and--restart=always
containers viapodman run
and podman commands took around 60 seconds to return. After about a minute, podman seemed to deadlock. Podman commands weren't returning and I couldn't stop any of the dying containers. Irm -f /dev/shm/libpod_lock
and did apkill podman
to release the deadlock.This is a big problem for us, as we can't trust podman to restart containers without deadlocking. This seems related to #11589, but I thought it would be better to separately track since it's a different situation.
Steps to reproduce the issue:

1. Start several containers that immediately `exit 1` with `--restart=always`.
2. Run `podman ps`. See if podman deadlocks.

Describe the results you received:
Podman gets extremely sluggish and then deadlocks
Describe the results you expected:
Podman wouldn't deadlock
Additional information you deem important (e.g. issue happens only occasionally):
Output of `podman version`:

Output of `podman info --debug`:

Package info (e.g. output of `rpm -q podman` or `apt list podman`):

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):