CI: cgroup.freeze flake is back #7148

edsantiago · 2020-07-29T18:26:58Z

The nightmareish cgroup.freeze flake is back:

# STEP 6: RUN ln -s /no/such/nonesuch /a/b/c/badsymlink
# error opening file `/sys/fs/cgroup//system.slice/crun-buildah-buildah113061327.scope/container/cgroup.freeze`: No such file or directory
# kill container: No such process

This is on Fedora 31, and yes, it has crun-0.14.1-1

@giuseppe PTAL. I know this is probably going to need to be fixed in crun but I'm filing here as a way to track future flakes.

The text was updated successfully, but these errors were encountered:

giuseppe · 2020-07-30T08:01:04Z

we need to revendor buildah to get: containers/buildah#2434

The race is in Buildah that is trying to kill a container that is already exited.

rhatdan · 2020-07-30T13:54:01Z

@TomSweeneyRedHat is working on a rebase for podman-1.15 to fix 2.0.* but we could separately declare a new master branch to vendor it in.

edsantiago · 2020-08-19T19:51:49Z

Ping, what is the status of this? I see #7203 bringing in a new buildah v1.15.1-0.20200731151214-29f4d01c621c, but the flake continues to happen:

46 podman build - stdin test

fedora-31 : test fedora-31
- PR Bump github.com/containers/image/v5 from 5.5.1 to 5.5.2 #7369
  - 08-19 06:48
- PR Fix handling of working dir #7239
  - 08-10 13:42
fedora-32 : test fedora-32
- PR Default .Repository and .Tag values to <none> #7240
  - 08-10 11:51

42 podman build - basic test

fedora-32 : test fedora-32
- PR Makefile: use full path for ginkgo #7296
  - 08-11 15:43

edsantiago · 2020-08-20T14:28:53Z

Another one today

39 podman build - basic test

fedora-31 : test fedora-31
- PR [2.0] vendor c/image v5.5.2 #7356
  - 08-20 05:05

edsantiago · 2020-08-20T19:50:32Z

Someone pretty please, help. This is flaking at least once per day, causing much lost effort. Most recently in the v2.0.5 build PR. @TomSweeneyRedHat @rhatdan @giuseppe @mheon anyone, please?

giuseppe · 2020-08-21T12:29:09Z

I've been trying to reproduce locally without any luck so far, let's improve the buildah error message and hope it can show something more useful: containers/buildah#2559

edsantiago · 2020-08-25T20:35:56Z

Woot, I just reproduced it with podman @ master @ 8fdc116:

2020-08-25T20:28:03.000008576Z: error opening file `/sys/fs/cgroup//user.slice/user-0.slice/[email protected]/crun-buildah-buildah230253409.scope/container/cgroup.freeze`: No such file or directory
2020-08-25T20:28:03.000014193Z: kill container: No such process
error running container: error reading container state: exit status 1
Error: error building at STEP "RUN apk add nginx": error while running runtime: exit status 1

Unfortunately I don't see the 'got output' string.

edsantiago · 2020-08-25T21:24:15Z

@giuseppe I hand-patched buildah per your 2559 above. The output is not helpful:

2020-08-25T21:19:21.000764686Z: error opening file `/sys/fs/cgroup//user.slice/user-0.slice/[email protected]/crun-buildah-buildah368188798.scope/container/cgroup.freeze`: No such file or directory
2020-08-25T21:19:21.000768720Z: kill container: No such process
error running container: error reading container state (got output: ""): exit status 1

What else do you suggest that I could try?

edsantiago · 2020-08-26T15:50:03Z

My reproducer (please don't judge me):

# cat foo.sh
#!/bin/bash

set -e

timeout --foreground -v --kill=10 60 ../bin/podman build -t foo - <<EOF
FROM quay.io/libpod/alpine_labels:latest
RUN mkdir /workdir
WORKDIR /workdir
RUN /bin/echo hello
RUN apk add nginx
RUN echo jnycVMjJfiJTIZtnvhgIta9v359dTI7fiYFrWmeNbzg1zu5e6M > /v4LZDGa4woigKduvCVsX
EOF

../bin/podman rmi foo

# chmod 755 foo.sh
# while ./foo.sh;do echo;done

On a 1minutetip f32 VM, seems to fail within 15-25 minutes.

by the time crun attempts to read from the cgroup, systemd might have already cleaned it up. When using systemd, on ENOENT state reports the container as "stopped" instead of an error. Closes: containers/podman#7148 Signed-off-by: Giuseppe Scrivano <[email protected]>

giuseppe · 2020-08-27T10:14:59Z

Thanks for the hint. I wasn't yet able to reproduce it on my machine (and in a VM), but inspecting more the code, I've found the root issue of the race: containers/crun#474

It can be easily reproduced by adding a sleep just before crun tries to read the cgroup:

diff --git a/src/libcrun/container.c b/src/libcrun/container.c
index b4ffa7b..3efaf93 100644
--- a/src/libcrun/container.c
+++ b/src/libcrun/container.c
@@ -2192,19 +2192,10 @@ libcrun_get_container_state_string (const char *id, libcrun_container_status_t *
       if (cgroup_mode < 0)
         return cgroup_mode;
 
+      sleep (1);
       ret = libcrun_cgroup_is_container_paused (status->cgroup_path, cgroup_mode, &paused, err);
       if (UNLIKELY (ret < 0))

then I get:

...
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
(1/2) Installing pcre (8.42-r0)
(2/2) Installing nginx (1.14.2-r2)
Executing nginx-1.14.2-r2.pre-install
Executing busybox-1.28.4-r1.trigger
OK: 6 MiB in 15 packages
2020-08-27T10:13:49.000527147Z: error opening file `/sys/fs/cgroup//system.slice/crun-buildah-buildah129670298.scope/container/cgroup.freeze`: No such file or directory
2020-08-27T10:13:49.000536741Z: open pidfd: No such process

the different error open pidfd instead of kill is related to crun supporting pidfd now for killing processes.

I'll backport the patch to the Fedora package as soon as it is merged

edsantiago mentioned this issue Jul 29, 2020

bindings: skip flaky pause/unpause test #7143

Merged

edsantiago mentioned this issue Aug 5, 2020

system tests: podman-remote, image tree #7238

Merged

mheon added the flakes Flakes from Continuous Integration label Aug 7, 2020

edsantiago mentioned this issue Aug 11, 2020

Makefile: use full path for ginkgo #7296

Merged

edsantiago mentioned this issue Aug 20, 2020

Lets try this again: v2.0.5 backports, round 2 #7363

Merged

This was referenced Aug 24, 2020

Final v2.0.5 backports #7402

Merged

podman images: panic #7444

Closed

giuseppe mentioned this issue Aug 27, 2020

state: fix race condition when reading cgroup containers/crun#474

Merged

giuseppe mentioned this issue Aug 27, 2020

build: honor --runtime setting #7473

Merged

giuseppe closed this as completed in containers/crun#474 Aug 27, 2020

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: cgroup.freeze flake is back #7148

CI: cgroup.freeze flake is back #7148

edsantiago commented Jul 29, 2020

giuseppe commented Jul 30, 2020

rhatdan commented Jul 30, 2020

edsantiago commented Aug 19, 2020

edsantiago commented Aug 20, 2020

edsantiago commented Aug 20, 2020

giuseppe commented Aug 21, 2020

edsantiago commented Aug 25, 2020

edsantiago commented Aug 25, 2020

edsantiago commented Aug 26, 2020

giuseppe commented Aug 27, 2020

CI: cgroup.freeze flake is back #7148

CI: cgroup.freeze flake is back #7148

Comments

edsantiago commented Jul 29, 2020

giuseppe commented Jul 30, 2020

rhatdan commented Jul 30, 2020

edsantiago commented Aug 19, 2020

edsantiago commented Aug 20, 2020

edsantiago commented Aug 20, 2020

giuseppe commented Aug 21, 2020

edsantiago commented Aug 25, 2020

edsantiago commented Aug 25, 2020

edsantiago commented Aug 26, 2020

giuseppe commented Aug 27, 2020