CI: cgroup.freeze flake is back #7148

Closed
edsantiago opened this issue Jul 29, 2020 · 10 comments · Fixed by containers/crun#474
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)

Comments

@edsantiago (Member)

The nightmarish cgroup.freeze flake is back:

# STEP 6: RUN ln -s /no/such/nonesuch /a/b/c/badsymlink
# error opening file `/sys/fs/cgroup//system.slice/crun-buildah-buildah113061327.scope/container/cgroup.freeze`: No such file or directory
# kill container: No such process

This is on Fedora 31, and yes, it has crun-0.14.1-1.

@giuseppe PTAL. I know this probably needs to be fixed in crun, but I'm filing it here as a way to track future flakes.

@giuseppe (Member)

We need to revendor buildah to get containers/buildah#2434.

The race is in Buildah, which tries to kill a container that has already exited.
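
For illustration, a minimal C sketch (hypothetical, not buildah or crun code) of why killing an already-exited container surfaces as "No such process": kill(2) fails with ESRCH once the process is gone, so the caller has to treat that case as "already stopped" rather than as an error.

#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: SIGKILL a container process, treating an
   already-exited process (ESRCH) as success instead of an error. */
static int
kill_container_process (pid_t pid)
{
  if (kill (pid, SIGKILL) < 0)
    {
      if (errno == ESRCH)
        return 0;          /* process already gone: nothing to do */
      return -errno;       /* any other failure is a real error   */
    }
  return 0;
}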

@rhatdan (Member) commented Jul 30, 2020

@TomSweeneyRedHat is working on a rebase for podman-1.15 to fix 2.0.*, but we could separately declare a new master branch to vendor it in.

@mheon added the flakes label Aug 7, 2020
@edsantiago (Member, Author)

Ping, what is the status of this? I see #7203 bringing in a new buildah v1.15.1-0.20200731151214-29f4d01c621c, but the flake continues to happen:

46 podman build - stdin test

42 podman build - basic test

@edsantiago (Member, Author)

Another one today:

39 podman build - basic test

@edsantiago (Member, Author)

Someone, pretty please, help. This is flaking at least once per day and causing a lot of lost effort, most recently in the v2.0.5 build PR. @TomSweeneyRedHat @rhatdan @giuseppe @mheon, anyone, please?

@giuseppe (Member)

I've been trying to reproduce this locally without any luck so far. Let's improve the buildah error message and hope it shows something more useful: containers/buildah#2559

@edsantiago (Member, Author)

Woot, I just reproduced it with podman @ master @ 8fdc116:

2020-08-25T20:28:03.000008576Z: error opening file `/sys/fs/cgroup//user.slice/user-0.slice/[email protected]/crun-buildah-buildah230253409.scope/container/cgroup.freeze`: No such file or directory
2020-08-25T20:28:03.000014193Z: kill container: No such process
error running container: error reading container state: exit status 1
Error: error building at STEP "RUN apk add nginx": error while running runtime: exit status 1

Unfortunately I don't see the 'got output' string.

@edsantiago (Member, Author)

@giuseppe I hand-patched buildah per your 2559 above. The output is not helpful:

2020-08-25T21:19:21.000764686Z: error opening file `/sys/fs/cgroup//user.slice/user-0.slice/[email protected]/crun-buildah-buildah368188798.scope/container/cgroup.freeze`: No such file or directory
2020-08-25T21:19:21.000768720Z: kill container: No such process
error running container: error reading container state (got output: ""): exit status 1

What else do you suggest that I could try?

@edsantiago (Member, Author)

My reproducer (please don't judge me):

# cat foo.sh
#!/bin/bash

set -e

timeout --foreground -v --kill=10 60 ../bin/podman build -t foo - <<EOF
FROM quay.io/libpod/alpine_labels:latest
RUN mkdir /workdir
WORKDIR /workdir
RUN /bin/echo hello
RUN apk add nginx
RUN echo jnycVMjJfiJTIZtnvhgIta9v359dTI7fiYFrWmeNbzg1zu5e6M > /v4LZDGa4woigKduvCVsX
EOF

../bin/podman rmi foo

# chmod 755 foo.sh
# while ./foo.sh;do echo;done

On a 1minutetip f32 VM, seems to fail within 15-25 minutes.

giuseppe added a commit to giuseppe/crun that referenced this issue Aug 27, 2020
By the time crun attempts to read from the cgroup, systemd might have
already cleaned it up.  When using systemd, on ENOENT report the
container as "stopped" instead of returning an error.

Closes: containers/podman#7148

Signed-off-by: Giuseppe Scrivano <[email protected]>
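
The idea of the fix, as a rough C sketch (hypothetical names, not the actual crun code): if the cgroup.freeze file has already disappeared under the systemd cgroup manager, report the container as stopped instead of propagating the ENOENT.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch of the ENOENT-means-stopped behavior described
   in the commit message above. */
static int
read_freezer_state (const char *freeze_path, const char **state)
{
  char buf[8];
  ssize_t n;
  int fd = open (freeze_path, O_RDONLY | O_CLOEXEC);
  if (fd < 0)
    {
      if (errno == ENOENT)
        {
          /* systemd already removed the scope and its cgroup:
             the container is gone, so it is simply "stopped".  */
          *state = "stopped";
          return 0;
        }
      return -errno;
    }
  n = read (fd, buf, sizeof (buf) - 1);
  close (fd);
  if (n < 0)
    return -errno;
  buf[n] = '\0';
  *state = (buf[0] == '1') ? "paused" : "running";
  return 0;
}
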
@giuseppe (Member)

Thanks for the hint. I still haven't been able to reproduce it on my machine (or in a VM), but after inspecting the code further, I've found the root cause of the race: containers/crun#474

It can be easily reproduced by adding a sleep just before crun tries to read the cgroup:

diff --git a/src/libcrun/container.c b/src/libcrun/container.c
index b4ffa7b..3efaf93 100644
--- a/src/libcrun/container.c
+++ b/src/libcrun/container.c
@@ -2192,19 +2192,10 @@ libcrun_get_container_state_string (const char *id, libcrun_container_status_t *
       if (cgroup_mode < 0)
         return cgroup_mode;
 
+      sleep (1);
       ret = libcrun_cgroup_is_container_paused (status->cgroup_path, cgroup_mode, &paused, err);
       if (UNLIKELY (ret < 0))

Then I get:

...
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
(1/2) Installing pcre (8.42-r0)
(2/2) Installing nginx (1.14.2-r2)
Executing nginx-1.14.2-r2.pre-install
Executing busybox-1.28.4-r1.trigger
OK: 6 MiB in 15 packages
2020-08-27T10:13:49.000527147Z: error opening file `/sys/fs/cgroup//system.slice/crun-buildah-buildah129670298.scope/container/cgroup.freeze`: No such file or directory
2020-08-27T10:13:49.000536741Z: open pidfd: No such process

The different error (open pidfd instead of kill) is because crun now supports pidfd for killing processes.
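
For reference, a small standalone C example (an illustration of the mechanism, not crun's code) showing that opening a pidfd for a process that has already exited and been reaped fails with ESRCH, i.e. "No such process":

#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Requires a kernel (>= 5.3) and headers that define SYS_pidfd_open. */
int
main (void)
{
  pid_t pid = fork ();
  if (pid == 0)
    _exit (0);                  /* child exits immediately */

  waitpid (pid, NULL, 0);       /* reap it, so the pid no longer exists */

  if (syscall (SYS_pidfd_open, pid, 0) < 0)
    perror ("open pidfd");      /* prints: open pidfd: No such process */

  return 0;
}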

I'll backport the patch to the Fedora package as soon as it is merged.

@github-actions bot added the locked - please file new issue/PR label Sep 22, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023