podman run -d: hangs when $NOTIFY_SOCKET is set #7316

edsantiago · 2020-08-13T20:54:13Z

# export NOTIFY_SOCKET=/tmp/mypodmansocket
# socat unix-recvfrom:"$NOTIFY_SOCKET",fork system:"(cat;echo)" &
[1] 104770
# podman run -d --sdnotify=container alpine sh -c 'sleep 10'
2020/08/13 16:50:03 socat[104857] E sendto(8, 0x55de50ff26b0, 14, 0, AF=1 "<anon>", 0): Transport endpoint is not connected
95fb97931bbf3c3b773dddff07f9fea80b341293fb39e6d63e3becbd3937099f
2020/08/13 16:50:13 socat[104864] E sendto(8, 0x55de50ff26b0, 14, 0, AF=1 "<anon>", 0): Transport endpoint is not connected

Note the ten-second difference in the socat timestamps: that is because the container runs sleep 10. Change it to sleep 30 and you get a 30-second delay.

What I expected: since this is run -d (detached), I expected podman to detach immediately and let the container deal with sdnotify.

(How I found this: trying to run systemd-notify in a fedora:latest container. Ha ha, silly me, they removed systemd-notify from that image.) Ergo, I think this is counterintuitive behavior if a user has a container that never makes it to sdnotify. I really don't expect podman run -d to hang forever.

edsantiago · 2020-08-13T21:28:18Z

Oooh! Try running podman ps or even podman info while there's a hung podman run -d. It too will hang. (podman images is fine).

mheon · 2020-08-13T22:41:26Z

Is this the same issue as #6688?

edsantiago · 2020-08-13T23:15:02Z

Yes, it looks like a different manifestation of the same problem. I'd say "feel free to close", but given how badly #6688 has been neglected, I'm going to leave it open as a reminder that this really is an unpleasant and unacceptable bug.

Oops. PR containers#6693 (sdnotify) added tests, but they were disabled due to broken crun on f31. I tried for three weeks to get a magic CI:IMG PR to update crun on the CI VMs ... but in that time I forgot to actually enable those new tests.

This PR removes a 'skip', replacing it with a check that systemd is running plus one more to make sure our runtime is crun. It looks like sdnotify just doesn't work on Ubuntu (it hangs), and my guess is that it's a crun/runc issue.

I also changed the test image from fedora:latest to :31, because, sigh, fedora:latest removed the systemd-notify tool.

WARNING WARNING WARNING: the symptom of a missing systemd-notify is that podman will hang forever, not even stopped by the timeout command in podman_run! (Filed: containers#7316). This means that if the sdnotify-in-container test ever fails, the symptom will be that Cirrus itself will time out (2 hours?). This is horrible. I don't know what to do about it other than push for a fix for 7316.

Signed-off-by: Ed Santiago <[email protected]>

giuseppe · 2020-08-17T10:56:07Z

I don't think it is a bug. podman run waits for the container to notify when it is ready. If the container is never ready, what should we do? It is not even Podman's fault at this point; the OCI runtime is handling the NOTIFY_SOCKET.
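
For readers unfamiliar with the handshake being discussed: under the sd_notify protocol, a process signals readiness by sending the datagram READY=1 to the Unix socket named by $NOTIFY_SOCKET. Below is a minimal Python sketch of the sending side, roughly what a process in the container could do when systemd-notify is not installed. The function name notify_ready is hypothetical, and this is not code from Podman, conmon, or any OCI runtime.

```python
import os
import socket


def notify_ready(state: bytes = b"READY=1") -> bool:
    """Send an sd_notify-style datagram to the socket in $NOTIFY_SOCKET.

    Returns False if NOTIFY_SOCKET is unset or empty, True once the
    datagram has been sent. (Hypothetical helper, for illustration only.)
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract socket, whose real name
    # starts with a NUL byte instead of '@'.
    if path.startswith("@"):
        addr = b"\0" + path[1:].encode()
    else:
        addr = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state, addr)
    return True
```

If nothing in the container ever sends this datagram, whoever is waiting on the socket has no way to tell "not ready yet" from "will never be ready", which is the situation reported above.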

edsantiago · 2020-08-17T11:55:58Z

I'm fine with the container hanging. I'm not fine with podman ps or podman info or other podman commands hanging.

rhatdan · 2020-08-17T17:55:56Z

Yes, I think we need a way to work around the lock.
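
A toy illustration of why a blocked run can stall unrelated commands, assuming (as the comments here suggest, but not taken from Podman's source) that the waiting command holds a shared lock until the readiness notification arrives. The names container_lock, ready, run_detached, and ps are all hypothetical stand-ins.

```python
import threading

# Stand-in for a lock shared by every command that touches the container.
container_lock = threading.Lock()
# Stand-in for the READY=1 notification arriving.
ready = threading.Event()


def run_detached() -> None:
    """Hold the lock for the entire wait, as the hang suggests happens."""
    with container_lock:
        ready.wait()


def ps(timeout: float = 0.2) -> bool:
    """Try to take the lock; return True only if it was acquired in time."""
    acquired = container_lock.acquire(timeout=timeout)
    if acquired:
        container_lock.release()
    return acquired
```

While run_detached is blocked in another thread, ps() returns False; once ready.set() is called and the thread exits, ps() returns True. Releasing the lock before waiting, or bounding the wait, would avoid the stall in this toy model.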

mheon · 2020-08-17T18:09:31Z

We have a way forward via containers/conmon#182 if we can get it landed

giuseppe · 2020-08-18T07:25:27Z

> We have a way forward via containers/conmon#182 if we can get it landed

I am fine with the solution proposed there, but we would still have the locking issue when --sdnotify=podman is used.

Some CI tests are hanging, timing out in 60 or 120 minutes. I wonder if it's containers#7316, the bug where all podman commands hang forever if NOTIFY_SOCKET is set? Signed-off-by: Ed Santiago <[email protected]>

rhatdan · 2020-09-11T10:42:32Z

Looks like containers/conmon#182 is ready to go in, but was not looked at for 20 days.

github-actions · 2020-10-12T00:19:24Z