
podman exec into a "-it" container: container create failed (no logs from conmon): EOF #10927

Open
edsantiago opened this issue Jul 13, 2021 · 77 comments
Labels
bugweek · flakes (Flakes from Continuous Integration) · kind/bug (Categorizes issue or PR as related to a bug)

Comments

@edsantiago (Member) commented Jul 13, 2021

Common thread seems to be:

Running: podman [options] run -dti --name test1 quay.io/libpod/fedora-minimal:latest sleep +Inf
time="2021-06-16T19:33:53-05:00" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
99e3b419a97aa408a4d0d3072bbd00579d5edd7c97790aa06d61f233cfdc1b4c
Running: podman [options] exec -ti test1 true
Running: podman [options] exec -ti test1 true       ! sometimes it fails on the first, sometimes on the third
Error: container create failed (no logs from conmon): EOF

Podman exec [It] podman exec terminal doesn't hang

And also just now in a still-live PR (my flake-xref does not handle live PRs): int podman ubuntu-2104 root host

Note: the March and April logs above have been garbage-collected, so I can't confirm that the error is the same one. I'm leaving them in the report deliberately, in case it helps to have a timestamp for the start of this flake (i.e. it might not be new in June).

Edit: this is podman, not podman-remote, so it's unlikely to be the same as #7360

@edsantiago added the flakes (Flakes from Continuous Integration) and kind/bug (Categorizes issue or PR as related to a bug) labels Jul 13, 2021
@edsantiago (Member Author)

Podman exec [It] podman exec terminal doesn't hang

@edsantiago (Member Author)

Podman exec [It] podman exec terminal doesn't hang

@edsantiago (Member Author)

Hmmm, I wonder if this is the same problem, in a different test? Looks suspiciously close.

  podman network connect
Running: podman [options] exec -it test ip addr show eth1
Error: container create failed (no logs from conmon): EOF

Podman network connect and disconnect [It] podman network connect

@edsantiago (Member Author)

Another one, in yet another test. Looks like this is happening more often than I thought, because it happens in multiple tests:

Podman exec [It] podman exec --detach

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member Author)

Podman exec [It] podman exec terminal doesn't hang

Podman network connect and disconnect [It] podman network connect when not running

@edsantiago (Member Author)

Podman network connect and disconnect [It] podman network disconnect and run with network ID

@edsantiago (Member Author)

Podman exec [It] podman exec terminal doesn't hang

@edsantiago (Member Author)

Still seeing this. int remote fedora-35 root

@vrothberg vrothberg self-assigned this Nov 10, 2021
@vrothberg (Member)

I'll take a stab at it. Thanks for assembling the data, @edsantiago!

@vrothberg (Member)

while true; do                                                                                        
        ./bin/podman run --name=test --replace -dti quay.io/libpod/fedora-minimal:latest sleep +Inf   
        ./bin/podman exec test true                                                                   
        ./bin/podman rm -f -t0 test                                                                   
done                                                                                                  

It ran for over 30 minutes with no failure. I'll have a look at the code; maybe I can come up with a theory, but a reproducer would be great.

@edsantiago (Member Author)

I can't reproduce on my laptop either, but on a 1minutetip f34 VM it fails instantly, on the very first try:

# podman run -dti --name=test quay.io/libpod/fedora-minimal:latest sleep 20;podman exec -it test true
8ed6f60c9a8e38d2081ece7a5471cc1a931f402170a9b0ff8f149bffb434994b
Error: container create failed (no logs from conmon): EOF

After that first time it still fails, but only once in 4-5 times. Note that it fails even without < /dev/null on either podman command.

podman-3.4.1-1.fc34.x86_64
conmon-2.0.30-2.fc34.x86_64

@edsantiago (Member Author)

One more note: I think the -it is needed on exec. Without it, I can't reproduce the failure.

@rhatdan (Member) commented Nov 10, 2021

@mheon PTAL

@rhatdan (Member) commented Nov 10, 2021

One would think this is a race between podman run creating the container and launching conmon, and podman exec getting to talk to conmon before it knows there is a container, causing some issues.

@edsantiago (Member Author)

Well, except that it's not always the first exec. This log shows the first three execs working, then it fails on the fourth.

@mheon (Member) commented Nov 10, 2021

Very difficult to track this down without a repro; we need to know what's going on with Conmon such that it's blowing up (personally I think Conmon is probably either segfaulting or just printing the error to the journal and exiting without reporting the real error to Podman). There might be logs in the journal that would help us?

@mheon (Member) commented Nov 10, 2021

@rhatdan It's not actually container create that's failing; that's a misleading error message. We're trying to start a Conmon for the exec session, but Conmon is failing with no logs as to why.

@edsantiago (Member Author)

@mheon see my 1minutetip f34 VM comment above. It reproduces reliably.

@edsantiago (Member Author)

Here's one in the brand-new ubuntu-2110

@edsantiago (Member Author)

Podman network connect and disconnect [It] podman network disconnect when not running

Podman network connect and disconnect [It] podman network disconnect

@edsantiago (Member Author)

Podman exec [It] podman exec terminal doesn't hang

Podman network connect and disconnect [It] podman network disconnect

@edsantiago (Member Author)

Fresh one in ubuntu 2110 root. Curious thing: once it happens, it seems to happen in a bunch more tests afterward.

@vrothberg vrothberg removed their assignment Dec 9, 2021
@Luap99 (Member) commented Apr 27, 2023

@edsantiago Did you ever see this outside of podman exec? I am looking at the code path where the error is returned, and it could also affect podman run, but I don't see that in the flake log, so it must be something that only exec triggers.

@edsantiago (Member Author)

Interesting. No, I've only seen it in exec. And I'm seeing it multiple times per day, too, and have developed a strong feel for the failure pattern. I would notice if it were anything different than usual.

@Luap99 (Member) commented Apr 27, 2023

Does this still reproduce on a 1minutetip VM? What we should do is run the exec command with --log-level=debug, which tells conmon to log to the journal. In that case, if conmon really errors out, we should see an error message there.

@edsantiago (Member Author)

Yep, just reproduced on 1mt f37:

#  while :;do podman --log-level=debug exec -it test true || break;done
...took a while...
DEBU[0000] running conmon: /usr/bin/conmon               args="[--api-version 1 -c 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd -u 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -p /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec_pid -n test --exit-dir /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exit --full-attach -s -l none --log-level debug --syslog -t -i -e --exec-attach --exec-process-spec /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec-process-2394717713 --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --exec --exit-command-arg 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c --exit-command-arg 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd]"
INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd.scope 
DEBU[0000] Sending resize events to exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c 
DEBU[0000] Attaching to container 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c 
DEBU[0000] Received: 0                                  
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines

I think, but am not sure, that this might be the relevant journal log:

... conmon[13913]: conmon 40f7406aeb2b58854ae0 <error>: Runtime did not set up terminal
... /usr/bin/podman[13900]: time="2023-04-27T09:35:06-04:00" level=error msg="Running container cleanup: container 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd has no exec session with ID 4a64c746e7dc1a13e63d8292d271a2e6cc1745a64c5cd23e62c61d3e3c3b2559: no such exec session"

@Luap99 (Member) commented Apr 27, 2023

OK, great, I can see where to go from there. I can see in the conmon code that this error message causes an early exit without writing a status back to us, which would explain the podman error.

@edsantiago (Member Author)

No, wait: there's a lot more that might be relevant.
journal-exec-ti.log

@edsantiago (Member Author)

The resize entries make me wonder if this is related to #10710

@Luap99 (Member) commented Apr 27, 2023

So here is what I see: this patch can be used to reproduce the problem 100% of the time. Of course the patch is invalid, but it shows where the problem seems to be.

diff --git a/src/conmon.c b/src/conmon.c
index 71f4d49..0fe7a3e 100644
--- a/src/conmon.c
+++ b/src/conmon.c
@@ -338,7 +338,7 @@ int main(int argc, char *argv[])
                g_unix_fd_add(seccomp_socket_fd, G_IO_IN, seccomp_accept_cb, csname);
 
        if (csname != NULL) {
-               g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
+               //g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
                /* Process any SIGCHLD we may have missed before the signal handler was in place.  */
                if (!opt_exec || !opt_terminal || container_status < 0) {
                        GHashTable *exit_status_cache = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);

I have a hard time understanding how these g_unix_fd_add calls actually work; I'm just going to assume that this is async code. The callback terminal_accept_cb() sets the mainfd_stdout variable to the actual fd, and the check here (which is the one failing) errors out.
So in my understanding there is a simple race between the callback running and the code below continuing and making the check against the mainfd_stdout fd (see the GLib sketch at the end of this comment).

So how can we make sure the callback has run before that check?

In any case, this is how I understand the situation. I could be totally wrong, but the fact is that

if (opt_terminal && mainfd_stdout == -1)
	nexit("Runtime did not set up terminal");

is causing the error that we see in the journal.
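
(For reference, here is a minimal standalone GLib sketch; it is not conmon code, and names such as terminal_ready_cb are made up. It shows the behaviour in question: a callback registered with g_unix_fd_add() is only dispatched once the main context is iterated, so code that runs before the loop gets a chance to dispatch it still sees mainfd_stdout == -1, even if the fd was already readable at registration time.)

/* Minimal standalone sketch, NOT conmon code (names are illustrative).
 * Build: gcc fdadd-demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <glib-unix.h>
#include <unistd.h>

static int mainfd_stdout = -1;   /* stand-in for conmon's global */

static gboolean terminal_ready_cb(gint fd, GIOCondition cond, gpointer data)
{
        (void)cond;
        (void)data;
        mainfd_stdout = fd;      /* roughly what terminal_accept_cb() does */
        return G_SOURCE_REMOVE;
}

int main(void)
{
        int pipefd[2];

        g_assert(pipe(pipefd) == 0);
        g_assert(write(pipefd[1], "x", 1) == 1);   /* fd is readable up front */

        g_unix_fd_add(pipefd[0], G_IO_IN, terminal_ready_cb, NULL);

        /* A check made here, before the main context runs, still sees -1... */
        g_print("before iteration: mainfd_stdout = %d\n", mainfd_stdout);

        /* ...and only after an iteration of the default main context does
         * the callback run and fill in the fd. */
        g_main_context_iteration(NULL, TRUE);
        g_print("after iteration:  mainfd_stdout = %d\n", mainfd_stdout);

        return 0;
}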

@giuseppe (Member)

great analysis!

I've looked into that code in conmon and I've found some issues that could be related: containers/conmon#411

g_unix_fd_add should report when the FD is ready, even if it is already ready when we register the callback. I've done some tests adding an artificial sleep() before g_unix_fd_add, and it seems to be handled correctly.

The OCI runtime blocks until the receiver (conmon in this case) receives the terminal fd, so I'd guess there is no race. What I've observed is that a failure in the terminal_accept_cb is not handled and could lead to the same failure.

I've opened a PR for conmon, though I'm not sure whether it has any impact on the current issue: containers/conmon#411
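
(For context, the terminal fd hand-off that terminal_accept_cb() deals with is ordinary SCM_RIGHTS fd passing over the console socket: the runtime connects and sends the pty master fd as ancillary data. A minimal receiver sketch, illustrative only and not conmon's actual code, could look roughly like this; the point is that a message with no fd attached is a failure case that needs explicit handling.)

/* Illustrative sketch, NOT conmon's actual code: receive a terminal fd
 * sent over a console socket as SCM_RIGHTS ancillary data.
 * Returns the received fd, or -1 on failure.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int recv_terminal_fd(int console_conn_fd)
{
        char data = 0;
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];
        } ctrl = { 0 };
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = ctrl.buf,
                .msg_controllen = sizeof(ctrl.buf),
        };
        struct cmsghdr *cmsg;
        int fd = -1;

        if (recvmsg(console_conn_fd, &msg, 0) < 0)
                return -1;

        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
                return -1;   /* no fd attached: the unhandled failure case */

        memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
        return fd;           /* in conmon this would end up in mainfd_stdout */
}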

@Luap99 (Member) commented Apr 28, 2023

Thanks @giuseppe. Given that we have a somewhat solid reproducer on 1mt VMs, I compiled your patch and am currently testing it there. Previously it failed within ~15 minutes for me; now it has already been running for 15 minutes. If it doesn't fail in the next hour, I would be willing to call this fixed.

@Luap99 (Member) commented Apr 28, 2023

Update: I was unable to reproduce with either the patched or unpatched version today, so I cannot say whether this fixed anything. I guess I need @edsantiago's special hands to prep the VM in order to reproduce.

@edsantiago (Member Author)

Hi. Sorry for neglecting this.

Problem reproduces on f38, with podman @ 0357881 and conmon @ d5564f8c3693d2e54b45555fd8f3b1596b0e6d77 (which includes PR 411). Also with conmon-2.1.7-2.fc38.x86_64 but that's expected.

# bin/podman run -d --name test quay.io/libpod/testimage:20221018 top
# while :;do bin/podman --log-level=debug exec -it test true || break;done
....takes a while, but eventually stops with the usual error

With @Luap99's patch above (commenting out the g_unix_fd_add call) it reproduces 100% of the time (or at least five out of five tries).

Each of my different-conmon attempts involved first podman rm -f -a -t0 to stop the running container.

@edsantiago (Member Author)

By changing podman exec ... true to ... date, the debug logs change as follows:

  INFO[0000] Running conmon under slice ...
  DEBU[0000] Sending resize events to exec session
  DEBU[0000] Attaching to container
  DEBU[0000] Received: 0
- Error: container create failed (no logs from conmon
+ DEBU[0000] Received: 352094
+ DEBU[0000] Successfully started exec session ... in container ....
+ Thu May 25 19:09:42 UTC 2023
+ [conmon:d]: exec with attach is waiting for start message from parent
+    [indented because of \r]   [conmon:d]: exec with attach got start message from parent
+     [indented more]      DEBU[0000] Container ... exec session ... completed with exit code 0
... lots more ...

This suggests to me that the error happens when setting up fds, not when closing them. (This may already have been obvious to you all.)

@edsantiago (Member Author)

Quick reminder that this is still happening, even if we don't see it in normal CI (because of flake retries). Last 60 days:

  • fedora-37 : int podman fedora-37 root container sqlite
    • 09-22 12:15 in Podman run podman run --volumes-from flag mount conflicts with image volume
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 09-14 13:52 in Podman run podman run --volumes-from flag mount conflicts with image volume
    • 08-24-2023 11:21 in Podman run podman run --volumes-from flag mount conflicts with image volume
    • 08-23-2023 17:34 in Podman run podman run --volumes-from flag mount conflicts with image volume
    • 08-14-2023 13:04 in Podman run podman run --volumes-from flag mount conflicts with image volume
  • fedora-38 : podman fedora-38 root host sqlite
    • 08-11-2023 09:31 in Podman run podman run --volumes-from flag mount conflicts with image volume
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 08-30-2023 17:07 in Podman healthcheck run podman disable healthcheck with --no-healthcheck must not show starting on status
  • fedora-38 : int remote fedora-38 root host boltdb [remote]
    • 08-28-2023 11:36 in Podman run podman run --volumes-from flag mount conflicts with image volume
    • 08-14-2023 18:34 in Podman run podman run --volumes-from flag mount conflicts with image volume
  • fedora-39 : int podman fedora-39 rootless host boltdb
    • 09-28 11:54 in Podman run podman run --volumes-from flag mount conflicts with image volume
  • rawhide : int podman rawhide root host sqlite
    • 09-27 21:53 in Podman run podman run --volumes-from flag mount conflicts with image volume

Seen in: podman/remote fedora-37/fedora-38/fedora-39/rawhide root/rootless container/host boltdb/sqlite

@edsantiago (Member Author)

ARGH! I added exec -t tests in system tests -- which do not have FlakeRetry -- and, blam.

This is a horrible flake. It is going to start hitting us once a day or more. Is there any chance at all that someone can fix it? Pretty please??? The only other choice is to remove the exec -t system test.

@edsantiago (Member Author)

Here's a surprise: same thing, but on run, not exec (f39 root container):

# podman [options] run -dt quay.io/libpod/alpine:latest
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

@Luap99 (Member) commented May 3, 2024

> Here's a surprise: same thing, but on run, not exec (f39 root container):
>
> # podman [options] run -dt quay.io/libpod/alpine:latest
> Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

I don't think run vs. exec matters much; this error can be triggered in many ways. It is just a symptom of conmon exiting or crashing without sending us the proper data we want, AFAICT:

$ podman --conmon /usr/bin/false  run --rm quay.io/libpod/testimage:20240123 true
Error: conmon failed: exit status 1
$ podman --conmon /usr/bin/true  run --rm quay.io/libpod/testimage:20240123 true
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

@edsantiago (Member Author)

Observation, possibly meaningless: this is triggering often in parallel bats tests. That's unusual: this has almost always been an e2e flake.

x x x x x x
int(4) podman(7) debian-13(4) root(7) host(7) sqlite(6)
sys(4) remote(1) fedora-39(3) rootless(1) container(1) boltdb(2)
fedora-40(1)

@Luap99 (Member) commented Aug 29, 2024

I would assume it correlates with system load: given that you now run things in parallel, scheduling delays between the individual threads/processes are most likely higher, so more race conditions get exposed.

Looking in the journal for the container name, I found this:

Aug 27 15:37:07 cirrus-task-6657029832441856 podman[135965]: 2024-08-27 15:37:07.90776318 -0500 CDT m=+0.096462576 container exec f079d2c2b5726194b21cfbf446052e3ec8cece8f28f0723652bc196d4c0759c5 (image=quay.io/libpod/testimage:20240123, name=c--explicit_RUN_term-t148-3n5hvd07, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to open cgroups file: /proc/136012/cgroup
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to get cgroup path. Container may have exited
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to open cgroups file:
Aug 27 15:37:07 cirrus-task-6657029832441856 podman[135965]: 2024-08-27 15:37:07.93562878 -0500 CDT m=+0.124328176 container exec_died f079d2c2b5726194b21cfbf446052e3ec8cece8f28f0723652bc196d4c0759c5 (image=quay.io/libpod/testimage:20240123, name=c--explicit_RUN_term-t148-3n5hvd07, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)

and also this:

Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135892]: conmon f079d2c2b5726194b21c <error>: Runtime did not set up terminal

I wonder if this is related: containers/crun#1524
