flake: "timed out waiting for file" #5339

Closed
edsantiago opened this issue Feb 27, 2020 · 18 comments · Fixed by containers/conmon#128
Closed

flake: "timed out waiting for file" #5339

edsantiago opened this issue Feb 27, 2020 · 18 comments · Fixed by containers/conmon#128
Labels
locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago (Member) commented Feb 27, 2020

Seeing this flake in CI periodically; so far, it always seems to be connected to 'podman exec':

timed out waiting for file /tmp/podman_test276144922/crio/vfs-containers/a5affd794b77bd57a0b5e950b5884175320d9c6360ba65bdad9ef72ce5b2979b/userdata/9b0a41da6f1d5d50c61f75ad57c7a531d5688446a80bccaf61852b4fad8a0451/exit/a5affd794b77bd57a0b5e950b5884175320d9c6360ba65bdad9ef72ce5b2979b: internal libpod error"

This is a placeholder so we can track the problem and gather info.

@mheon (Member) commented Feb 27, 2020

Always the same test, seemingly - two execs, one after the other.
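
For anyone trying to reproduce locally, a minimal sketch of that back-to-back exec pattern (the container name flaketest and the alpine image are placeholders, not the actual CI test):

#!/usr/bin/env bash
# Start a long-running container, then issue two execs back to back
# in a loop until one of them fails.
podman run -d --name flaketest alpine top

i=0
while podman exec flaketest true && podman exec flaketest true; do
    i=$((i + 1))
    echo "iteration $i OK"
done
echo "an exec failed after $i clean iterations"
podman rm -f flaketest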

@edsantiago (Member, author)

And another one: https://api.cirrus-ci.com/v1/task/6533719171268608/logs/system_test.log

This one is "special_testing_rootless" (I don't know if that's Fedora or Ubuntu) and, more interestingly, it's in the BATS tests instead of ginkgo. Common factor is, as @mheon pointed out, two execs in quick succession.

@edsantiago (Member, author)

Extra info: yesterday, at @cevich's suggestion, I tried switching to vfs and also some scheduler magic for compatibility with CI:

# grep ^driver /etc/containers/storage.conf
driver = "vfs"

# echo "mq-deadline" > /sys/block/vda/queue/scheduler

Tried running the networking BATS test in an infinite loop. No failures. But this is f31, and so far it's looking like all the flakes are happening on f30...?
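
The loop was roughly of this shape (assuming a podman checkout where the networking system test lives at test/system/500-networking.bats and bats is installed; adjust the path if your layout differs):

# Re-run the networking BATS test until it fails.
while bats test/system/500-networking.bats; do
    echo "pass, running again"
done
echo "hit a failure"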

@cevich (Member) commented Feb 27, 2020

Could this be related to the runc/crun issue I'm trying to fix in #5342?

(We're using crun on F30 in CI, whereas earlier we were using runc, IIRC.)

@baude (Member) commented Feb 27, 2020

@cevich yes, are we ready to merge that?

@cevich (Member) commented Feb 28, 2020

This task: https://cirrus-ci.com/task/5474270998429696

@edsantiago more data is almost never a bummer, it means we're guessing less 😄

So mostly F30, but possibly F31 too, then. That task shows:

$SCRIPT_BASE/logcollector.sh packages
conmon-2.0.10-2.fc31-x86_64
containernetworking-plugins-0.8.5-1.fc31-x86_64
containers-common-0.1.41-1.fc31-x86_64
container-selinux-2.124.0-3.fc31-noarch
criu-3.13-5.fc31-x86_64
crun-0.12.2.1-1.fc31-x86_64
golang-1.13.6-1.fc31-x86_64
package runc is not installed
podman-1.8.0-2.fc31-x86_64
skopeo-0.1.41-1.fc31-x86_64
slirp4netns-0.4.0-20.1.dev.gitbbd6f25.fc31-x86_64

@cevich (Member) commented Feb 28, 2020

An example of this on F30, using the new images from #5342 (with the crun -> runc issue fixed):

https://cirrus-ci.com/task/4714742136700928

@cevich (Member) commented Feb 28, 2020

(The implication being: the problem does not appear to be affected by anything changed or fixed in that PR.)

@edsantiago (Member, author)

We are wasting an unbelievable amount of time because of this bug. I just did a pass through submitted PRs, and a number of them are in red-X state because of it. (I restarted the tasks.)

I have been unable to reproduce it in any test environment: f30, f31, overlay, vfs, --root /tmp/xx. I'm starting to think it might be something in how podman is compiled in the CI environment.

Here is a summary of the flakes and retries on one of my PRs today:

special_testing_bindings fedora-31

2020-03-04T12:18:00 integration_test

testing fedora-30 fedora-30

2020-03-04T12:37:06 integration_test
Podman healthcheck run [It] podman healthcheck good check results in healthy even in start-period
[same]
Podman healthcheck run [It] podman healthcheck single healthy result changes failed to healthy
Podman healthcheck run [It] podman healthcheck good check results in healthy even in start-period
Podman run networking [It] podman run --net container: copies hosts and resolv
[same]
[same]
2020-03-04T12:39:14 integration_test
Podman run networking [It] podman run --net container: copies hosts and resolv
[same]
[same]
Podman healthcheck run [It] podman healthcheck good check results in healthy even in start-period
[same]
Podman healthcheck run [It] podman healthcheck single healthy result changes failed to healthy
Podman healthcheck run [It] podman healthcheck good check results in healthy even in start-period
2020-03-04T13:08:40 integration_test
Podman healthcheck run [It] podman healthcheck good check results in healthy even in start-period
[same]
Podman run networking [It] podman run --net container: copies hosts and resolv
[same]
[same]
Podman network [It] podman network rm
2020-03-04T13:25:32 integration_test
Podman healthcheck run [It] podman healthcheck single healthy result changes failed to healthy
[same]
[same]
Podman run networking [It] podman run --net container: copies hosts and resolv
[same]
2020-03-04T13:27:47 integration_test
Podman network [It] podman network rm
Podman run networking [It] podman run --net container: copies hosts and resolv
[same]
[same]

@edsantiago (Member, author)

Oh wait, there's a new error now:

time="2020-03-04T15:01:37-05:00" level=error msg="container create failed (no logs from conmon): EOF"
Error: non zero exit code: -2147483649: OCI runtime error

Happening in the same place the "timed out waiting for file" error happens. It, too, is a flake (goes away on retry). Any ideas?

@rhatdan (Member) commented Mar 4, 2020

@haircommander @giuseppe Conmon? crun?

@haircommander (Collaborator)

Hm, this is from #5373, which was supposed to fix this flake. Maybe I missed a case.

@edsantiago (Member, author)

Oh, thank you. It's a huge relief to know that this is getting attention.

@haircommander (Collaborator)

Unfortunately, the change above only shifted how podman fails here. There still seems to be an issue where conmon crashes when consecutive execs happen on a system under load. That's all the detail I have right now. I'll try to give this some love in the next couple of days, but there's a lot on my plate this week 😕
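
To illustrate that scenario (not a confirmed reproducer), one way to combine artificial load with consecutive execs; the busy-loop load generation and the container name loadtest are assumptions:

#!/usr/bin/env bash
# Generate CPU load with plain bash busy-loops (one per CPU), then
# hammer a running container with pairs of back-to-back execs.
for _ in $(seq "$(nproc)"); do
    while :; do :; done &
done
trap 'kill $(jobs -p) 2>/dev/null' EXIT

podman run -d --name loadtest alpine top

for i in $(seq 1000); do
    podman exec loadtest true || { echo "exec pair $i: first exec failed"; break; }
    podman exec loadtest true || { echo "exec pair $i: second exec failed"; break; }
done

podman rm -f loadtest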

siretart pushed a commit to siretart/libpod that referenced this issue Nov 16, 2021
Versions earlier than 2.0.13 break `podman exec` due to containers#5339
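
As a quick sanity check on a host hitting this, something like the following compares the installed conmon against 2.0.13 (the version-string parsing is a best-effort assumption about conmon --version output):

# Warn if the installed conmon predates the fixed release.
ver=$(conmon --version | awk '/version/ {print $NF}')
if printf '%s\n2.0.13\n' "$ver" | sort -V | head -n1 | grep -qx '2.0.13'; then
    echo "conmon $ver is >= 2.0.13"
else
    echo "conmon $ver is older than 2.0.13; podman exec may break"
fi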