Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: podman stop: timeout #23685

Closed
edsantiago opened this issue Aug 20, 2024 · 11 comments · Fixed by #24024
Closed

CI: podman stop: timeout #23685

edsantiago opened this issue Aug 20, 2024 · 11 comments · Fixed by #24024
Labels
flakes Flakes from Continuous Integration

Comments

@edsantiago
Copy link
Member

One-off. I think I've seen this one before, but have no way to find it because I didn't file it.

[+0242s] • [FAILED] [92.098 seconds]
         TOP-LEVEL [AfterEach]
           Podman UserNS support
             podman --userns=container:CTR
         
....
           $ podman [options] run --rm --userns=container:userns-ctr --net=container:userns-ctr alpine cat /proc/self/uid_map
                    0          0          1
                    1          2       4998
           ← Exit  [It] podman --userns=container:CTR - /var/tmp/go/src/github.com[/containers/podman/test/e2e/run_userns_test.go:324](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/test/e2e/run_userns_test.go#L324) @ 08/19/24 15:33:35.557 (1.09s)
           → Enter [AfterEach] TOP-LEVEL - /var/tmp/go/src/github.com[/containers/podman/test/e2e/common_test.go:122](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/test/e2e/common_test.go#L122) @ 08/19/24 15:33:35.558
           $ podman [options] stop --all -t 0
           SIGABRT: abort
           PC=0x4820e1 m=0 sigcode=0
         
           goroutine 0 gp=0x2d7c440 m=0 mp=0x2d7f5e0 [idle]:
           runtime.futex(0x2d7f720, 0x80, 0x0, 0x0, 0x0, 0x0)
           	/usr/lib/go-1.22/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0x7ffd4f5dc8a0 sp=0x7ffd4f5dc898 pc=0x4820e1
           runtime.futexsleep(0x7ffd4f5dc918?, 0x417030?, 0x455607?)
           	/usr/lib/go-1.22/src/runtime/os_linux.go:69 +0x30 fp=0x7ffd4f5dc8f0 sp=0x7ffd4f5dc8a0 pc=0x441610
           runtime.notesleep(0x2d7f720)
           	/usr/lib/go-1.22/src/runtime/lock_futex.go:170 +0x87 fp=0x7ffd4f5dc928 sp=0x7ffd4f5dc8f0 pc=0x4171c7
           runtime.mPark(...)
           	/usr/lib/go-1.22/src/runtime/proc.go:1761
           runtime.stopm()

In debian rootless, and not in any of my test PRs

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Aug 20, 2024
@Luap99
Copy link
Member

Luap99 commented Aug 22, 2024

seen again

There is no panic, see #23551

This is an intentionally send SIGABRT on timeout, see the red error message. If it doesn't say panic or SIGSEGV it is not a panic but rather us sending the signal (see SIGABRT: abort as first line)

@Luap99 Luap99 changed the title CI: podman stop: panic CI: podman stop: timeout Aug 22, 2024
@Luap99
Copy link
Member

Luap99 commented Aug 22, 2024

           goroutine 53 gp=0xc000500540 m=nil [sleep, locked to thread]:
           runtime.gopark(0x51f51a04b2?, 0xc0005cd840?, 0x25?, 0x61?, 0x10?)
           	/usr/lib/go-1.22/src/runtime/proc.go:402 +0xce fp=0xc0005cd810 sp=0xc0005cd7f0 pc=0x447c2e
           time.Sleep(0x989680)
           	/usr/lib/go-1.22/src/runtime/time.go:195 +0x115 fp=0xc0005cd850 sp=0xc0005cd810 pc=0x47d275
           github.com/containers/common/pkg/netns.UnmountNS({0xc0002bb4c0, 0x3f})
           	/var/tmp/go/src/github.com/containers/podman/vendor/github.com[/containers/common/pkg/netns/netns_linux.go:199](https://github.com/containers/common/blob/fc30620cdb11b011ee396222795af415feda8b03/pkg/netns/netns_linux.go#L199) +0x1cf fp=0xc0005cd8f8 sp=0xc0005cd850 pc=0x12331ef
           github.com/containers/podman/v5/libpod.(*Runtime).teardownNetNS(0xc000403340, 0xc00070d040)
           	/var/tmp/go/src/github.com[/containers/podman/libpod/networking_linux.go:156](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/libpod/networking_linux.go#L156) +0xd4 fp=0xc0005cd9a0 sp=0xc0005cd8f8 pc=0x14769b4
           github.com/containers/podman/v5/libpod.(*Container).cleanupNetwork(0xc00070d040)
           	/var/tmp/go/src/github.com[/containers/podman/libpod/container_internal_linux.go:190](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/libpod/container_internal_linux.go#L190) +0x6f fp=0xc0005cda28 sp=0xc0005cd9a0 pc=0x144df2f
           github.com/containers/podman/v5/libpod.(*Container).cleanup(0xc00070d040, {0x1fd2b50, 0x2de4c00})
           	/var/tmp/go/src/github.com[/containers/podman/libpod/container_internal.go:2074](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/libpod/container_internal.go#L2074) +0x20a fp=0xc0005cdcd8 sp=0xc0005cda28 pc=0x142d70a
           github.com/containers/podman/v5/libpod.(*Container).Cleanup(0xc00070d040, {0x1fd2b50, 0x2de4c00})
           	/var/tmp/go/src/github.com[/containers/podman/libpod/container_api.go:841](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/libpod/container_api.go#L841) +0x6c6 fp=0xc0005cde50 sp=0xc0005cdcd8 pc=0x1406f06
           github.com/containers/podman/v5/pkg/domain/infra/abi.(*ContainerEngine).ContainerStop.func1(0xc00070d040)
           	/var/tmp/go/src/github.com[/containers/podman/pkg/domain/infra/abi/containers.go:294](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/pkg/domain/infra/abi/containers.go#L294) +0x23c fp=0xc0005cdf08 sp=0xc0005cde50 pc=0x15475bc
           github.com/containers/podman/v5/pkg/parallel/ctr.ContainerOp.func1()
           	/var/tmp/go/src/github.com[/containers/podman/pkg/parallel/ctr/ctr.go:30](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/pkg/parallel/ctr/ctr.go#L30) +0x1b fp=0xc0005cdf20 sp=0xc0005cdf08 pc=0x152595b
           github.com/containers/podman/v5/pkg/parallel.Enqueue.func1()
           	/var/tmp/go/src/github.com[/containers/podman/pkg/parallel/parallel.go:67](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/pkg/parallel/parallel.go#L67) +0x198 fp=0xc0005cdfe0 sp=0xc0005cdf20 pc=0x1323a98
           runtime.goexit({})
           	/usr/lib/go-1.22/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005cdfe8 sp=0xc0005cdfe0 pc=0x4802e1
           created by github.com/containers/podman/v5/pkg/parallel.Enqueue in goroutine 1
           	/var/tmp/go/src/github.com[/containers/podman/pkg/parallel/parallel.go:56](https://github.com/containers/podman/blob/fc30620cdb11b011ee396222795af415feda8b03/pkg/parallel/parallel.go#L56) +0xa5

This seems to be the relevant hang, it is failing to unmount the netns fail (there is a infinity loop in there wait for the netns fail to be unmounted by the kernel as we called unmount with MNT_DETACH)

@Luap99
Copy link
Member

Luap99 commented Aug 27, 2024

FYI: I run ginkgo with --until-it-fails for over 600 tries without failure and I see this is still the only logged instance in your flake log so I am not spend more time on this right now until it starts failing more often...

@edsantiago
Copy link
Member Author

Fine with me. I opened the issue because some flakes are easy to keep track of (when they all fail in the same test) and some are not (non-unique failure message across multiple tests).

@liangdy678
Copy link

Hi sir,

I have also encountered a similar situation.

When the tasks in the container are completed, the container cannot exit.

image

If no parameters (--rm) are used, it can exit normally.

image

thanks

@Luap99
Copy link
Member

Luap99 commented Sep 18, 2024

@edsantiago Another case in https://api.cirrus-ci.com/v1/artifact/task/6443976285224960/html/int-podman-rawhide-rootless-host-sqlite.log.html but more importantly as the process was killed the netns file was leaked so the new netns leak seems to work as expected I would say.

@Luap99
Copy link
Member

Luap99 commented Sep 19, 2024

Ok maybe we are getting somewhere here, QE hit this in RHEL 9.5 testing and I have access to that VM.

[test@kvm-08-guest26 ~]$ podman unshare cat /proc/self/mountinfo | grep netns
490 707 0:4 net:[4026532338] /run/user/1000/netns/netns-7eb79636-16b9-8180-5836-e672a21ebc1e rw - nsfs nsfs rw
487 707 0:45 /netns /run/user/1000/netns rw,nosuid,nodev,relatime shared:252 master:240 - tmpfs tmpfs rw,seclabel,size=374644k,nr_inodes=93661,mode=700,uid=1000,gid=1000,inode64
[test@kvm-08-guest26 ~]$ podman unshare umount /run/user/1000/netns/netns-7eb79636-16b9-8180-5836-e672a21ebc1e
umount: /run/user/1000/netns/netns-7eb79636-16b9-8180-5836-e672a21ebc1e: not mounted.
[test@kvm-08-guest26 ~]$ podman unshare rm /run/user/1000/netns/netns-7eb79636-16b9-8180-5836-e672a21ebc1e
rm: cannot remove '/run/user/1000/netns/netns-7eb79636-16b9-8180-5836-e672a21ebc1e': Device or resource busy

This doesn't make any sense to me, mount says it is mounted, umount says it is not mounted and rm is failing with EBUSY (which is why podman keeps looping until we remove successfully or a different error is hit, I am starting to regret that I didn't add a max timeout before we give up trying to remove the file)

@edsantiago
Copy link
Member Author

Weird. In case it helps, here's the list so far. All rootless, and all in the same test:

x x x x x x
int(4) podman(4) debian-13(2) rootless(4) host(4) sqlite(4)
rawhide(1)
fedora-40(1)

@Luap99
Copy link
Member

Luap99 commented Sep 19, 2024

strace umount says EINVAL

umount2("/run/user/1000/netns/netns-7eb79636-16b9-8180-5836-e672a21ebc1e", MNT_DETACH) = -1 EINVAL (Invalid argument)

Luap99 added a commit to Luap99/libpod that referenced this issue Sep 23, 2024
The netns dir has a special logic to bind mout itself and make itslef
shared. This code here didn't which lead to catastrophic bug during
netns unmounting as we were unable to unmount the netns as the mount got
duplicated and had the wrong parent mount. This caused us to loop forever
trying to remove the file.

Fixes https://issues.redhat.com/browse/RHEL-59620
Fixes containers#23685

Signed-off-by: Paul Holzinger <[email protected]>
Luap99 added a commit to Luap99/libpod that referenced this issue Sep 23, 2024
The netns dir has a special logic to bind mout itself and make itslef
shared. This code here didn't which lead to catastrophic bug during
netns unmounting as we were unable to unmount the netns as the mount got
duplicated and had the wrong parent mount. This caused us to loop forever
trying to remove the file.

Fixes https://issues.redhat.com/browse/RHEL-59620
Fixes containers#23685
Fixes https://issues.redhat.com/browse/RHEL-59703 (backport)

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Copy link
Member Author

Seen yesterday in a v5.3 PR, debian root. Is it possible that this fix didn't get backported to v5.3?

@Luap99
Copy link
Member

Luap99 commented Nov 12, 2024

No that is most likely something else. It is the remote hang so we do not see the server stack trace which would be the interesting thing here. I really should get #23631 in a good state so we can capture this info on all tests as this has been needed several times by now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flakes Flakes from Continuous Integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants