podman exec into a "-it" container: container create failed (no logs from conmon): EOF #10927
Podman exec [It] podman exec terminal doesn't hang

Podman exec [It] podman exec terminal doesn't hang
Hmmm, I wonder if this is the same problem, in a different test? Looks suspiciously close.
Podman network connect and disconnect [It] podman network connect
Another one, in yet another test. Looks like this is happening more often than I thought, because it happens in multiple tests: Podman exec [It] podman exec --detach
A friendly reminder that this issue had no activity for 30 days.
Podman exec [It] podman exec terminal doesn't hang
Podman network connect and disconnect [It] podman network connect when not running
Podman network connect and disconnect [It] podman network disconnect and run with network ID
Podman exec [It] podman exec terminal doesn't hang

Still seeing this. int remote fedora-35 root
I'll take a stab at it. Thanks for assembling the data, @edsantiago!
```sh
while true; do
    ./bin/podman run --name=test --replace -dti quay.io/libpod/fedora-minimal:latest sleep +Inf
    ./bin/podman exec test true
    ./bin/podman rm -f -t0 test
done
```

Ran over 30 minutes but no failure. I'll have a look at the code; maybe I can come up with a theory, but a reproducer would be great.
I can't reproduce on my laptop either, but on a 1minutetip f34 VM it fails instantly, on the very first try:

```console
# podman run -dti --name=test quay.io/libpod/fedora-minimal:latest sleep 20;podman exec -it test true
8ed6f60c9a8e38d2081ece7a5471cc1a931f402170a9b0ff8f149bffb434994b
Error: container create failed (no logs from conmon): EOF
```

After that first time it still fails, but only once in 4-5 times. Note that it fails even without [...]. podman-3.4.1-1.fc34.x86_64
One more note: I think the [...]
@mheon PTAL
One would think this is a race between podman run creating the container and launching conmon, with podman exec getting to talk to conmon before it knows there is a container, causing some issues.
Well, except that it's not always the first [...]
Very difficult to track this down without a repro - we need to know what's going on with Conmon such that it's blowing up (personally I think Conmon is probably either segfaulting or just printing the error to the journal and exiting without reporting the real error to Podman). Might be logs in the journal that will help us? |
@rhatdan It's not actually container create that's failing, that's a bad error message. We're trying to make a Conmon for the exec session but Conmon is failing with no logs as to why. |
@mheon see my [...]
Here's one in the brand-new ubuntu-2110:

Podman network connect and disconnect [It] podman network disconnect when not running
Podman network connect and disconnect [It] podman network disconnect

Podman exec [It] podman exec terminal doesn't hang
Podman network connect and disconnect [It] podman network disconnect
Fresh one in ubuntu 2110 root. Curious thing: once it happens one time, it seems to happen on a bunch more tests afterward.
@edsantiago Did you ever see this outside of podman exec? I am looking at the code path where the error is returned, and it could also affect podman run, but I don't see that in the flake log, so it must be something that only exec triggers.
Interesting. No, I've only seen it in podman exec.
Does this still reproduce on a 1minutetip VM? What we should do is run the exec command with --log-level=debug.
Yep, just reproduced on 1mt f37:

```console
# while :;do podman --log-level=debug exec -it test true || break;done
...took a while...
DEBU[0000] running conmon: /usr/bin/conmon args="[--api-version 1 -c 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd -u 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -p /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec_pid -n test --exit-dir /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exit --full-attach -s -l none --log-level debug --syslog -t -i -e --exec-attach --exec-process-spec /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec-process-2394717713 --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --exec --exit-command-arg 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c --exit-command-arg 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd]"
INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd.scope
DEBU[0000] Sending resize events to exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c
DEBU[0000] Attaching to container 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c
DEBU[0000] Received: 0
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines
```

I think, but am not sure, that this might be the relevant journal log: [...]
Ok great, I can see where to go from there. I can see in the conmon code that the error message causes an early exit without writing a status back to us, so it would explain the podman error.
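To make that failure mode concrete, here is a minimal, self-contained sketch — a generic parent/child illustration, not podman's or conmon's actual code, and the `syncpipe` name is made up — of why a child that exits before writing its status leaves the caller with nothing to report but EOF:

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int syncpipe[2]; /* hypothetical stand-in for the status channel */

	if (pipe(syncpipe) != 0) {
		perror("pipe");
		return 1;
	}

	pid_t pid = fork();
	if (pid == 0) {
		/* "conmon" side: hits an error and exits early,
		 * never writing a status record. */
		close(syncpipe[0]);
		close(syncpipe[1]);
		_exit(1);
	}

	/* "podman" side: expects a status message, but read() returns 0. */
	close(syncpipe[1]);
	char buf[256];
	ssize_t n = read(syncpipe[0], buf, sizeof(buf));
	if (n == 0)
		fprintf(stderr, "no status received: EOF is the only error we can surface\n");

	waitpid(pid, NULL, 0);
	return 0;
}
```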
No wait there's a lot more that might be relevant. |
The [...]
So here is what I see: this patch can be used to reproduce the failure 100% of the time. Of course the patch is invalid, but it shows where the problem seems to be.

```diff
diff --git a/src/conmon.c b/src/conmon.c
index 71f4d49..0fe7a3e 100644
--- a/src/conmon.c
+++ b/src/conmon.c
@@ -338,7 +338,7 @@ int main(int argc, char *argv[])
 		g_unix_fd_add(seccomp_socket_fd, G_IO_IN, seccomp_accept_cb, csname);

 	if (csname != NULL) {
-		g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
+		//g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
 		/* Process any SIGCHLD we may have missed before the signal handler was in place. */
 		if (!opt_exec || !opt_terminal || container_status < 0) {
 			GHashTable *exit_status_cache = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
```

I have a hard time understanding how these g_unix_fd_add calls actually work. I just am going to assume that this is async code. The callback [...] So how can we make sure the callback is run before that `if (!opt_exec || !opt_terminal || container_status < 0)` check? In any case, this is how I understand the situation. I could be totally wrong, but the fact is that [...] is causing the error that we see in the journal.
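For anyone else puzzling over the g_unix_fd_add semantics: it only registers a watch with the GLib main loop, and the callback fires when the loop polls the fd, not at the call site. A minimal standalone sketch (plain GLib, not conmon code) showing that a check placed right after `g_unix_fd_add()` — before `g_main_loop_run()` — cannot see anything the callback sets:

```c
#include <glib.h>
#include <glib-unix.h>
#include <unistd.h>

static gboolean fd_was_handled = FALSE;

static gboolean on_fd_ready(gint fd, GIOCondition condition, gpointer user_data)
{
	char buf[16];

	(void)read(fd, buf, sizeof(buf)); /* drain the pending byte */
	fd_was_handled = TRUE;
	g_main_loop_quit((GMainLoop *)user_data);
	return G_SOURCE_REMOVE;
}

int main(void)
{
	int pipefd[2];
	GMainLoop *loop;

	if (pipe(pipefd) != 0)
		return 1;
	(void)write(pipefd[1], "x", 1); /* the fd is readable right away */

	loop = g_main_loop_new(NULL, FALSE);
	g_unix_fd_add(pipefd[0], G_IO_IN, on_fd_ready, loop);

	/* The callback has NOT run yet, even though data is waiting:
	 * a check placed here (like the one after g_unix_fd_add in the
	 * hunk above) cannot depend on state the callback would set. */
	g_print("before main loop: fd_was_handled=%d\n", fd_was_handled);

	g_main_loop_run(loop); /* the callback only fires from here */
	g_print("after main loop:  fd_was_handled=%d\n", fd_was_handled);

	g_main_loop_unref(loop);
	return 0;
}
```

(Compiles with `gcc demo.c $(pkg-config --cflags --libs glib-2.0)`.)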
great analysis! I've looked into that code in conmon and I've found some issues that could be related: containers/conmon#411
The OCI runtime blocks until the receiver (conmon in this case) receives the terminal fd, so I'd guess there is no race. What I've observed is that a failure in the [...] I've opened a PR for conmon; not sure if it has any impact on the current issue though: containers/conmon#411
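For context on that hand-off: the terminal fd travels from the OCI runtime to conmon over the console socket as an SCM_RIGHTS control message. Here is a minimal sketch of the receiving side — a general illustration of the mechanism, not conmon's actual terminal_accept_cb; the `recv_fd` helper is made up:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive a single file descriptor from a connected UNIX-domain
 * socket.  Returns the fd, or -1 on error. */
int recv_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl.buf,
		.msg_controllen = sizeof(ctrl.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd; /* in the console-socket case, the terminal's master fd */
}
```

The sending side (the runtime) does the symmetric sendmsg() with an SCM_RIGHTS control message over the same socket.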
Thanks @giuseppe. Given we have a somewhat solid reproducer on 1mt VMs, I compiled your patch and am currently testing it there. Previously it failed within ~15 minutes for me; now it has already been running for 15 minutes. If it doesn't fail in the next hour, I would be willing to call this fixed.
Update: I was unable to reproduce with either the patched or unpatched version today. So I cannot say if this fixed anything. I guess I need @edsantiago's special hands to prep the VM in order to reproduce.
Hi. Sorry for neglecting this. Problem reproduces on f38, with podman @ 0357881 and conmon @ d5564f8c3693d2e54b45555fd8f3b1596b0e6d77 (which includes PR 411). Also with conmon-2.1.7-2.fc38.x86_64, but that's expected.

```console
# bin/podman run -d --name test quay.io/libpod/testimage:20221018 top
# while :;do bin/podman --log-level=debug exec -it test true || break;done
....takes a while, but eventually stops with the usual error
```

With @Luap99's patch above (comment out the `g_unix_fd_add(console_socket_fd, ...)` call), [...]. Each of my different-conmon attempts involved first [...]
By changing [...]:

```diff
INFO[0000] Running conmon under slice ...
DEBU[0000] Sending resize events to exec session
DEBU[0000] Attaching to container
DEBU[0000] Received: 0
- Error: container create failed (no logs from conmon
+ DEBU[0000] Received: 352094
+ DEBU[0000] Successfully started exec session ... in container ....
+ Thu May 25 19:09:42 UTC 2023
+ [conmon:d]: exec with attach is waiting for start message from parent
+ [indented because of \r] [conmon:d]: exec with attach got start message from parent
+ [indented more] DEBU[0000] Container ... exec session ... completed with exit code 0
... lots more ...
```

This suggests to me that the error happens when setting up fds, not in closing them. (This may have already been obvious to you all.)
Quick reminder that this is still happening, even if we don't see it in normal CI (because of flake retries). Last 60 days:
Seen in: podman/remote fedora-37/fedora-38/fedora-39/rawhide root/rootless container/host boltdb/sqlite
ARGH! I added [...] This is a horrible flake. It is going to start hitting us once a day or more. Is there any chance at all that someone can fix it? Pretty please??? The only other choice is to remove the [...]
Here's a surprise: same thing, but on run, not exec. f39 root container:
I don't think run/exec matter much; also, this error can be triggered in many ways. It is just a symptom of conmon exiting or crashing without sending us the proper data we want, AFAICT.
Observation, possibly meaningless: this is triggering often in parallel bats tests. That's unusual: this has almost always been an e2e flake.
I would assume it correlates with system load: given you run in parallel now, scheduling delays will most likely be higher between the individual threads/processes, thus detecting more race conditions. Looking in the journal for the container name, I found this: [...]
and also this: [...]
I wonder if this is related: containers/crun#1524
Common thread seems to be:
Podman exec [It] podman exec terminal doesn't hang
And also just now in a still-live PR (my flake-xref does not handle live PRs): int podman ubuntu-2104 root host
Note: the March and April logs above have been garbage-collected, so I can't confirm that the error is the same one. I'm leaving them in the report deliberately, in case it helps to have a timestamp for the start of this flake (i.e. it might not be new in June).
Edit: this is podman, not podman-remote, so it's unlikely to be the same as #7360