Deadlocks using the gitlab runner docker executor on the docker-compatible socket API #10090
Comments
Thanks for reaching out and apologies for the long silence, we've been very busy in the past weeks. @mheon, do you have a suspicion on this one? |
No. We need further debug information - full output of |
OK, here we go: this is a log while running exactly one build in the GitLab pipeline. The pipeline eventually finished. However, the containers |
Problem persists with |
Apologies for not getting back to this issue earlier. The archive seems to be empty. I unpacked it, but there's just an empty directory. Do you still have the logs? |
It's a single packed file. This worked for me:
|
Thanks! I'll look into it. |
What I see in the logs below is that the containers in question are not getting deleted. At least, the handlers do not finish the deletion within the two minutes.
|
It also seems that SIGKILLs sent to the container are failing:
Before the deadlock, I see 5 consecutive requests to
But only two of these requests return:
The others do not return. A lot to unpack. |
That corresponds to the observation that a bunch of temporary containers created by the CI are not cleaned up (not even after the two minutes) and have to be removed manually. |
Only shows up 3 times but I expect to see 5. |
It seems like some logs are not displayed. The "shares network namespace, retrieving network info of container" messages suggest that all 5 network listings ran through.
Yet, looking at the logs, the only theory I came up with so far is that the network listings ran into some deadlock. @mheon @Luap99 is there anything in the networking code that would support this theory? Some functions look to be recursive (e.g., getContainerNetworkInfo()). |
There is definitely a lot of runtime locking, but I do not see anything obvious where it could deadlock. (podman/libpod/networking_linux.go, lines 885 to 898 at ed511d2)
According to the comment, syncContainer() should only be called when the container is locked. @mheon Could this be the cause? (podman/libpod/container_internal.go, lines 325 to 329 at ed511d2)
|
That probably won't deadlock, but it is definitely undefined behavior - could cause really weird results if 2 operations were happening to the container at the same time. Since our API handlers are multithreaded, that could well be happening. |
I think that's the best shot we have so far. This code is executed concurrently (see logs) and can yield undefined behavior, which would match the picture I got from staring at the logs. Maybe the deadlock is a symptom of that undefined behavior/state? @Luap99, mind opening a fix for that? |
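For readers following along: the convention referenced above is that any code which re-syncs container state must hold the per-container lock first. The sketch below is a simplified, self-contained illustration (the type and method names are stand-ins, not libpod's actual implementation) of that lock-then-sync pattern and why it matters once API handlers run in parallel.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a simplified stand-in for a container object with mutable state.
type container struct {
	lock  sync.Mutex
	state string // e.g. "created", "running", "stopped"
}

// syncContainer refreshes the in-memory state. By convention it must only be
// called while c.lock is held; otherwise two concurrent API handlers can
// interleave their reads and writes of c.state.
func (c *container) syncContainer(latest string) {
	c.state = latest
}

// inspect follows the correct pattern: take the container lock, then sync.
func (c *container) inspect(latest string) string {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.syncContainer(latest)
	return c.state
}

func main() {
	c := &container{state: "created"}
	var wg sync.WaitGroup
	// Simulate multithreaded API handlers hitting the same container.
	for _, s := range []string{"running", "stopped"} {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			fmt.Println(c.inspect(s))
		}(s)
	}
	wg.Wait()
}
```

Without the lock around the sync, the same two goroutines would race on c.state, which is the kind of undefined behavior discussed above.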
Copied this static binary over
|
A friendly reminder that this issue had no activity for 30 days. |
@mheon @vrothberg @Luap99 Looks like this is still an issue. |
A friendly reminder that this issue had no activity for 30 days. |
@mheon @vrothberg @Luap99 any progress?
Be advised that this bug is a blocker to getting Podman supported by the GitLab Pipeline runner. Is there at least a work-around that could be suggested? |
Can you get us the stack trace? BTW, we have had a release in the last month; has anyone tried this on podman 3.4? |
A friendly reminder that this issue had no activity for 30 days. |
I hope I'll be able to try this today at work. |
I observed such deadlock-like behavior once recently, but couldn't reproduce it. However, I also saw two other issues that might be linked to this while I was using Podman inside a GitLab Runner Custom Executor:
|
A friendly reminder that this issue had no activity for 30 days. |
A friendly reminder that this issue had no activity for 30 days. |
@rhatdan I haven't seen the deadlock again, but I am also now very aware of the two issues I mentioned above that cause the storage to become too small. So if my hypothesis is right that the freeze came from too little storage space for me, I haven't yet had an opportunity to see it again, since I'm now frequently running
Now that I think about this, the GitLab Docker executor @thmo is using probably never calls |
Yes, at least with
Attached is a SIGQUIT dump from the
(NB: I am pretty sure I am not having a disk-full problem.) |
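As background on the attached dump: a Go process prints the stacks of all its goroutines when it receives SIGQUIT, and the same information can be produced in-process via runtime/pprof. A minimal, standalone sketch (not podman code) that demonstrates this:

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Block until SIGQUIT (Ctrl-\ or `kill -QUIT <pid>`), then write all
	// goroutine stacks to stderr — roughly what the attached dump contains.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGQUIT)
	<-sig

	// Verbosity level 2 includes the full stack of every goroutine, which
	// is what reveals handlers blocked while waiting on a lock.
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```

In a dump like this, goroutines stuck in a lock acquisition show up with `sync.(*Mutex).Lock` (or the equivalent) at the top of their stacks.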
If this was really an issue with the network endpoints, then I am confident this is fixed in 4.0. However, if I interpret your dump correctly, this part is suspicious: it hangs in container_inspect because it is waiting for the runtime lock.
Because it is already inside inspect, it should hold a container lock. Maybe an ABBA deadlock between a container lock and the runtime lock? |
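To make the ABBA suspicion concrete, here is a minimal, self-contained sketch (not libpod code) of two goroutines taking a container-style lock and a runtime-style lock in opposite orders; each ends up waiting on the lock the other holds:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var containerLock, runtimeLock sync.Mutex

	// Path 1: e.g. an inspect handler — container lock first, then runtime lock.
	go func() {
		containerLock.Lock()
		defer containerLock.Unlock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		runtimeLock.Lock()
		defer runtimeLock.Unlock()
		fmt.Println("path 1 done")
	}()

	// Path 2: e.g. a runtime-level operation — runtime lock first, then container lock.
	go func() {
		runtimeLock.Lock()
		defer runtimeLock.Unlock()
		time.Sleep(10 * time.Millisecond)
		containerLock.Lock()
		defer containerLock.Unlock()
		fmt.Println("path 2 done")
	}()

	// With the sleeps in place, both goroutines block on each other's lock:
	// a classic ABBA deadlock. Neither "done" line is ever printed.
	time.Sleep(time.Second)
	fmt.Println("main exiting; the two goroutines are still deadlocked")
}
```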
Honestly, the runtime in-memory lock doesn't really protect us against anything, given that it's purely a per-process lock and so much of what we want to protect against is multiprocess. I think we could tear it out completely without negative consequence. |
Then tear it out. |
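For readers unfamiliar with the distinction drawn above: an in-memory mutex only serializes goroutines inside a single Podman process, whereas protection against a second process touching the same resources needs something visible outside the process. A rough illustration of that difference, using a plain flock as the cross-process stand-in (this is not how libpod's actual lock implementation works, and the lock file path is made up for the example):

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"syscall"
)

// inProcessLock only protects goroutines within this single process.
var inProcessLock sync.Mutex

// withFileLock takes an exclusive flock on path, which is visible to every
// process on the machine, not just this one.
func withFileLock(path string, fn func()) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	fn()
	return nil
}

func main() {
	// Serializes goroutines in this process only; a second podman process
	// would not see or respect this mutex at all.
	inProcessLock.Lock()
	fmt.Println("holding the in-memory lock")
	inProcessLock.Unlock()

	// Serializes against any other process locking the same file.
	err := withFileLock("/tmp/example-runtime.lock", func() {
		fmt.Println("holding the cross-process file lock")
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "flock failed:", err)
	}
}
```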
A friendly reminder that this issue had no activity for 30 days. |
In which release is this going to be? |
Definitely 4.1 |
/kind bug
Description
Trying to use gitlab-ci with the docker executor and the podman 3.x docker-compatible API socket very frequently leads to deadlock-like situations. These manifest in the GitLab CI pipelines hanging for a while, and in addition, `podman ps` and other CLI commands hanging indefinitely. When this happens, only a `systemctl restart podman` helps.

The setup uses podman 3.1.0 from the `container-tools` module stream of CentOS 8. The GitLab runner itself is run as a podman container, using the latest `docker.io/gitlab/gitlab-runner:alpine` image, with `/run/docker.sock` mounted inside the runner. The socket is also mounted into a traefik container on the same machine.

As a side note, intermediate containers created by the CI are stopped, but not cleaned up.
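For reference, a client such as the GitLab runner talks to this socket over plain HTTP tunneled through the unix socket. The sketch below is a minimal, hypothetical Go client (not part of the reported setup) that lists containers through the Docker-compatible endpoint; during the described deadlock such a request simply never returns, which is why a client-side timeout is included here:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	// The Docker-compatible unix socket that podman exposes; this is the
	// path mounted into the runner in the setup above.
	const socketPath = "/run/docker.sock"

	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socketPath)
			},
		},
		// A timeout makes a hung daemon visible as an error instead of the
		// request blocking forever, as it does during the deadlock.
		Timeout: 30 * time.Second,
	}

	// GET /containers/json is the standard "list containers" endpoint of the
	// Docker Engine API, which podman's compat service implements.
	resp, err := client.Get("http://unix/containers/json")
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```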
Looking at the podman journal, I repeatedly see some messages like this:
Not sure if these are really related or not.
Output of `podman version`:
Output of `podman info --debug`:
Package info (e.g. output of `rpm -q podman` or `apt list podman`):