
update to ginkgo v2 #18163

Merged
merged 23 commits into from
May 2, 2023
Conversation

Luap99
Member

@Luap99 Luap99 commented Apr 12, 2023

Does this PR introduce a user-facing change?

none

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), release-note-none, and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels Apr 12, 2023
@vrothberg
Member

Looks like the issue persists with rootless hanging/not completing.

@edsantiago
Member

Not just rootless: debian-12 and fedora-37, root, are at 54 minutes right now. I expect they will time out.

@edsantiago
Member

There's something horribly weird and broken in the new ginkgo: it does not seem to show command stderr. Example: f37 root log, search in-page for "--filter takes no arg". It shows up jumbled together in a block labeled "Timeline", but does not show up next to the actual command that generated it. This makes it impossible to diagnose actual errors. The relevant code, in ginkgo/v2/reporters/default_reporter.go, suggests that this is triggered when --nodes > 1 but at first glance I don't see a way to fix it.

Obviously this renders the entire package useless, so there must be a way to override it, and I'm sure one of you lovely people will find the solution much more quickly than I can. I'll keep looking anyway though.

@Luap99
Member Author

Luap99 commented Apr 12, 2023

@edsantiago There are changes in how output works now; I am aware of this. Benchmarks are also broken.
This is not mergeable as is, hence it's marked WIP. For now I want to figure out the hang. I can reproduce it locally: the problem is that we are leaking podman processes, and the new ginkgo waits for them to complete.

Example processes that hang:

pholzing  704177  0.1  0.1 1789984 41924 pts/4   Sl+  18:54   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman --storage-opt vfs.imagestore=/tmp/imagecachedir --root /tmp/podman_test3608744664/root --runroot /tmp/podman_test3608744664/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /home/pholzing/.config/cni/net.d --network-backend cni --cgroup-manager cgroupfs --tmpdir /tmp/podman_test3608744664 --events-backend file --db-backend boltdb --storage-driver vfs run --uts private quay.io/libpod/alpine:latest ls -l /proc/self/ns/uts
pholzing  704463  0.1  0.1 1863716 41976 pts/4   Sl+  18:54   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman --storage-opt vfs.imagestore=/tmp/imagecachedir --root /tmp/podman_test2981800605/root --runroot /tmp/podman_test2981800605/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /home/pholzing/.config/cni/net.d --network-backend cni --cgroup-manager cgroupfs --tmpdir /tmp/podman_test2981800605 --events-backend file --db-backend boltdb --storage-driver vfs run --pid private quay.io/libpod/alpine:latest ls -l /proc/self/ns/pid
pholzing  704662  0.1  0.1 1863460 40512 pts/4   Sl+  18:54   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman --storage-opt vfs.imagestore=/tmp/imagecachedir --root /tmp/podman_test3588093402/root --runroot /tmp/podman_test3588093402/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /home/pholzing/.config/cni/net.d --network-backend cni --cgroup-manager cgroupfs --tmpdir /tmp/podman_test3588093402 --events-backend file --db-backend boltdb --storage-driver vfs run --cgroupns private quay.io/libpod/alpine:latest ls -l /proc/self/ns/cgroup

As soon as I kill them manually ginkgo returns.
So now I need to figure out why this happens.

@edsantiago
Member

diff --git a/test/e2e/containers_conf_test.go b/test/e2e/containers_conf_test.go
index 125a7bff1..72095d109 100644
--- a/test/e2e/containers_conf_test.go
+++ b/test/e2e/containers_conf_test.go
@@ -233,6 +233,7 @@ var _ = Describe("Verify podman containers.conf usage", func() {
 		Expect(hostNS).To(Equal(ctrNS))
 
 		session = podmanTest.Podman([]string{"run", option, "private", ALPINE, "ls", "-l", nspath})
+		Expect(session).Should(Exit(0))
 		fields = strings.Split(session.OutputToString(), " ")
 		ctrNS = fields[len(fields)-1]
 		Expect(hostNS).ToNot(Equal(ctrNS))

???
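The one-line diff above probes a bug class this PR kept running into: reading a session's buffered output before waiting on the process returns incomplete data and leaves the process running. A minimal stdlib sketch of the same mistake (the command and function names are illustrative, not podman's actual test code):

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// readEarlyAndLate launches a short-lived command and samples its buffered
// stdout once before Wait() and once after -- the stdlib analogue of calling
// session.OutputToString() without session.WaitWithDefaultTimeout() first.
func readEarlyAndLate() (early, late string) {
	cmd := exec.Command("sh", "-c", "sleep 0.2; echo done")
	var buf bytes.Buffer
	cmd.Stdout = &buf
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	early = buf.String() // read too soon: the process has not written yet
	if err := cmd.Wait(); err != nil {
		panic(err)
	}
	late = buf.String() // after Wait(): the complete output
	return early, late
}

func main() {
	early, late := readEarlyAndLate()
	fmt.Printf("before wait: %q, after wait: %q\n", early, late)
}
```

Without the wait, the test inspects an empty buffer and moves on while the child keeps running.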

@Luap99
Member Author

Luap99 commented Apr 12, 2023

That is a good catch, but I don't think it is the problem: the podman process is completely stuck and does not respond to any signal (well, SIGKILL works, obviously), and it is not listed as a zombie.

I can attach with a debugger and it looks like we are deadlocked on a container lock

(dlv) bt
0  0x00007f977e506fd3 in ???
   at ?:-1
1  0x000000000047c2e0 in runtime.systemstack_switch
   at /usr/lib/golang/src/runtime/asm_amd64.s:459
2  0x0000000000415a05 in runtime.cgocall
   at /usr/lib/golang/src/runtime/cgocall.go:168
3  0x00000000010aedac in github.com/containers/podman/v4/libpod/lock/shm._Cfunc_lock_semaphore
   at _cgo_gotypes.go:227
4  0x00000000010affd8 in github.com/containers/podman/v4/libpod/lock/shm.(*SHMLocks).LockSemaphore
   at ./go/src/github.com/containers/podman/libpod/lock/shm/shm_lock.go:229
5  0x00000000010b0ea5 in github.com/containers/podman/v4/libpod/lock.(*SHMLock).Lock
   at ./go/src/github.com/containers/podman/libpod/lock/shm_lock_manager_linux.go:114
6  0x000000000123af8e in github.com/containers/podman/v4/libpod.(*Container).Kill
   at ./go/src/github.com/containers/podman/libpod/container_api.go:271
7  0x0000000001375b32 in github.com/containers/podman/v4/pkg/domain/infra/abi/terminal.ProxySignals.func1
   at ./go/src/github.com/containers/podman/pkg/domain/infra/abi/terminal/sigproxy_commn.go:36
8  0x000000000047e501 in runtime.goexit
   at /usr/lib/golang/src/runtime/asm_amd64.s:1594

But the thing is: how the hell can it deadlock here? I have zero other podman processes running on the system that could be holding a lock, so what is holding it?

@edsantiago
Member

Yeah... I noticed. Even with WaitWithDefaultTimeout it still hangs for me, unkillable except with -9.

zero other podman processes

Same here. But this is rootless... there's probably no pause process needed when running private namespaces, but could there be something that doesn't know that, and is waiting to join the pause process ns? (Ignore if this is a stupid comment. No need even to respond. Just thinking out loud.)

@Luap99
Member Author

Luap99 commented Apr 12, 2023

Actually adding

session.WaitWithDefaultTimeout()
Expect(session).Should(Exit(0))

seems to fix it for me now. I am only running containers_conf_test.go right now; before this change it hung every time, and now it has passed three times in a row.

@edsantiago
Member

Weird: I have WaitWithDefaultTimeout() also, and Should(), just like you cite above .... and I still see the hang. But I'll move on to other things.

@edsantiago
Member

Quick suggestion for next time you push: try changing the .cirrus.yml timeouts from 90m to 50 or even 45. Maybe not permanently, but it sure would save a lot of time in catching hangs.

@vrothberg
Member

Thanks for looking into this, folks!

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Ok, the hang is definitely fixed. The root cause is podman rm --fa (#18180) not working in all circumstances. It therefore starts leaking processes, and ginkgo v2 doesn't like that. I assume the processes keep some std streams open which it tries to read.

@edsantiago
Member

Remote tests failing to compile: undefined: GinkgoWriter. But @Luap99 please don't re-push yet, I want some sample logs to download so I can fix logformatter. Should just be a few more minutes.

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Remote tests failing to compile: undefined: GinkgoWriter. But @Luap99 please don't re-push yet, I want some sample logs to download so I can fix logformatter. Should just be a few more minutes.

https://api.cirrus-ci.com/v1/artifact/task/5080120319279104/html/int-podman-fedora-37-root-container-boltdb.log.html

@edsantiago
Member

Got em. (I need raw logs, the Cirrus ones. The logformatted ones are utter garbage right now, that's what I'm trying to fix)

@Luap99
Member Author

Luap99 commented Apr 13, 2023

With the added -p instead of --nodes 3, make localintegration only takes 4-5 minutes on my laptop (12 threads).

@edsantiago
Member

Hey, are we expecting PANICs? In the process of updating logformatter, I see a bunch of these panics in podman unshare tests:

S [SKIPPED] [0.001 seconds]
         Podman unshare [BeforeEach]
           podman unshare
           Enter [AfterEach] Podman unshare - /var/tmp/go/src/github.com/containers/podman/test/e2e/unshare_test.go:37 @ 04/13/23 13:04:41.586

           [PANICKED] Test Panicked
           In [AfterEach] at: /usr/lib/go-1.19/src/runtime/panic.go:260 @ 04/13/23 13:04:41.586
         
           runtime error: invalid memory address or nil pointer dereference

See debian log, search in-page for "PANIC". Compare to current ginkgo log for a PR that merged today.

The very weird thing is that the test shows as SKIPPED, and there is no link (at bottom) to the panic. No indication anywhere that there was a panic.

Anyhow, not my highest priority right now but someone should maybe look at it.

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Anyhow, not my highest priority right now but someone should maybe look at it.

I'll take a look.

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Looks like ginkgo now still runs AfterEach() when we call Skip() in BeforeEach()
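Since AfterEach now fires even when BeforeEach skipped, cleanup code can see a fixture that was never created — which explains the nil pointer panics above. A runnable stdlib sketch of the nil-guard pattern (the type and method names are illustrative stand-ins, not podman's actual fixture):

```go
package main

import "fmt"

// PodmanTest is a stand-in for the e2e test fixture.
type PodmanTest struct{ dir string }

// Cleanup tears the fixture down and reports what it did.
func (p *PodmanTest) Cleanup() string { return "cleaned " + p.dir }

// safeCleanup is the shape an AfterEach body needs under ginkgo v2: when
// BeforeEach called Skip() before creating the fixture, p is still nil.
func safeCleanup(p *PodmanTest) string {
	if p == nil {
		return "skipped: fixture never created"
	}
	return p.Cleanup()
}

func main() {
	fmt.Println(safeCleanup(nil))                        // BeforeEach skipped before setup
	fmt.Println(safeCleanup(&PodmanTest{dir: "/tmp/x"})) // normal run
}
```

Under ginkgo v1 the nil branch was unreachable, so the guard was never needed; with v2 it is.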

edsantiago and others added 5 commits May 2, 2023 11:27
Super-complicated and ugly code merely to make in-page anchors
link three lines above the subtest name, to the subtest status
and timing line.

Signed-off-by: Ed Santiago <[email protected]>
This never worked when ginkgo runs with more than one thread, we use 3
in CI. The problem is that the SynchronizedAfterSuite() function accepts
two functions. The first one is run for each ginkgo node while the
second one is only run once for the whole suite.

Because the timings are stored as a slice, thus in memory, we lose all
timings from the other nodes, as they were only reported on node 1.
Moving the printing into the first function solves this but causes the
problem that the result is no longer sorted. To fix this we let each
node write its result to a tmp file and only then let the final after
suite function collect the timings from all these files, sort them, and
print the output like we did before.

Signed-off-by: Paul Holzinger <[email protected]>
Add a workaround for containers#18180 so the ginkgo work can be merged without
being blocked by the issue. Please revert this commit when the issue
is fixed.

Signed-off-by: Paul Holzinger <[email protected]>
Directly writing to stdout/stderr is not safe when running in parallel.
Ginkgo v2 fixes this by buffering the output and syncing it so it is
not mangled between tests.

This is the same as for the podman integration tests.

Signed-off-by: Paul Holzinger <[email protected]>
While reworking ginkgo to use -p by default we also forced the machine
tests to run in parallel. Right now this does not work at all
(something that should be fixed).

Using -p is easier because it lets ginkgo decide how many parallel
nodes to use, so it is much faster on high core counts.

So use some makefile magic to replace `GINKGONODES` with
`GINKGO_PARALLEL`, set to `y` by default. The machine tests will
then use that var to disable parallelism.

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member

All int tests passed, with times of 26:44 to 38:24, but one timed out at 50m. Would presumably also time out at 90m. I don't understand why only one test would time out.

@Luap99
Member Author

Luap99 commented May 2, 2023

I restarted it, my best guess is some flake leaving processes running...

@Luap99
Member Author

Luap99 commented May 2, 2023

[+0490s]   Running: /var/tmp/go/src/github.com/containers/podman/bin/podman --root /tmp/podman_test1335685339/root --runroot /tmp/podman_test1335685339/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /tmp/podman_test1335685339/root/etc/networks --network-backend netavark --cgroup-manager systemd --tmpdir /tmp/podman_test1335685339 --events-backend file --db-backend sqlite --storage-driver vfs ps -q
[+0490s]   Error: beginning transaction: database is locked

This is the only failure logged by ginkgo, and it passed on the retry. However, the AfterEach cleanup still fired correctly and looks successful, so I don't think it leaked a process.

@edsantiago
Member

OMG.... CI is green. I'm still uncomfortable about that one timeout, and expect it to recur... but I am 100% OK with fixing that later if needed. @containers/podman-maintainers PTAL, this will potentially fix many flakes, and make debugging easier (better logs), and much more. (@Luap99 I'm still doing my final review, so, almost-lgtm but not quite yet).

@@ -558,8 +557,8 @@ test: localunit localintegration remoteintegration localsystem remotesystem ##

.PHONY: ginkgo-run
ginkgo-run: .install.ginkgo
ACK_GINKGO_RC=true $(GINKGO) version
ACK_GINKGO_RC=true $(GINKGO) -v $(TESTFLAGS) -tags "$(TAGS) remote" $(GINKGOTIMEOUT) -cover -flakeAttempts 3 -progress -trace -noColor -nodes $(GINKGONODES) -debug $(GINKGOWHAT) $(HACK)
Member

Removing -cover looks correct to me: it's handled in the localunit target above. I've traced the presence of -cover in this line all the way back to 2018, and believe it to be a copy/paste from the 2017 CRI-O import. But since there's a slight possibility that I'm wrong, I'm calling attention to it anyway. If anyone sees a reason to keep this, please speak now.

@Luap99
Member Author

Luap99 commented May 2, 2023

I'm still uncomfortable about that one timeout, and expect it to recur...

Yeah, the unfortunate thing is that this is really hard to track down. I am currently testing locally whether the ginkgo timeout helps here; right now cirrus will lose some output on timeout. However, if I can set a slightly lower timeout for ginkgo and it exits (not confirmed yet), then we can add a ps auxww to the cirrus script to check for running processes, which should help debug what is hanging.

@edsantiago
Member

we can add a ps auxww to the cirrus script

AIUI that's not the way Cirrus works: the Cirrus timeout is a hard stop, kill everything, no grace period or chance for cleanup or handler. @cevich do you know if there's a soft timeout option, where an emergency handler could be run?

@Luap99
Member Author

Luap99 commented May 2, 2023

AIUI that's not the way Cirrus works: the Cirrus timeout is a hard stop, kill everything, no grace period or chance for cleanup or handler. @cevich do you know if there's a soft timeout option, where an emergency handler could be run?

I am well aware; that is why I am talking about the ginkgo timeout: ginkgo --timeout 45m .... In this case ginkgo should exit after that time, assuming it is not affected by the hang as well. If the ginkgo timeout is 45 min and the cirrus timeout is 50 min, then I have a small window afterwards to run debug commands.

@edsantiago
Member

Oops! My bad, I didn't properly understand your use of "ginkgo timeout". Yes, using ginkgo --timeout SHORTERVAL sounds perfect (but for a later PR). Thanks for the clarification.

func() {
sort.Sort(testResultsSortedLength{testResults})
testTimings := make(testResultsSorted, 0, 2000)
Member

Thank you! I hope this finally fixes #8358

Member

@edsantiago edsantiago left a comment

Final LGTM. Thanks so much for persevering.

@rhatdan
Member

rhatdan commented May 2, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2023
@cevich
Member

cevich commented May 2, 2023

cleanup or handler. @cevich do you know if there's a soft timeout option

No, it's always hard. If you want soft, you have to do it via soft-ware 😁 (like ginkgo or timeout or something).

@openshift-merge-robot openshift-merge-robot merged commit f173db7 into containers:main May 2, 2023
@@ -571,7 +570,7 @@ ginkgo-remote:

.PHONY: testbindings
testbindings: .install.ginkgo
ACK_GINKGO_RC=true $(GINKGO) -v $(TESTFLAGS) -tags "$(TAGS) remote" $(GINKGOTIMEOUT) -progress -trace -noColor -debug -timeout 30m -v -r ./pkg/bindings/test
$(GINKGO) -v $(TESTFLAGS) --tags "$(TAGS) remote" $(GINKGOTIMEOUT) --trace --no-color --timeout 30m -v -r ./pkg/bindings/test
Member

There's a $(GINKGOTIMEOUT) specified here in addition to a --timeout 30m, not sure if that was intentional / doing some tricky thing with ginkgo or not. Consider removing one or the other or adding a comment detailing why there are two timeout options.

@@ -126,8 +126,7 @@ LIBSECCOMP_COMMIT := v2.3.3
GINKGOTIMEOUT ?= -timeout=90m
Member

I would suggest setting this timeout LOWER than the one in .cirrus.yml. As you've seen, when the cirrus-ci one hits, it just kills the VM. We get no logs or anything. It would be good to allow ginkgo's timeout to fire first so potentially helpful output and logs can be collected.

Member Author

Yes, I am working on it, but the hang bug still happens: ginkgo thinks it is done, but some goroutine causes the process to hang and never exit.

Member

@cevich cevich May 2, 2023

Previously when there was a ginkgo hang, you could find out what it was by examining the ginkgo-node log files collected as CI-artifacts IIRC. However, if Cirrus times out the task, the logs won't be collected.

Member

@cevich cevich left a comment

LGTM - Two non-blocking timeout related comments that can be fixed later.

@openshift-ci
Contributor

openshift-ci bot commented May 2, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Luap99 Luap99 deleted the ginkgo-v2 branch May 2, 2023 14:36
@github-actions github-actions bot added the locked - please file new issue/PR label (assist humans wanting to comment on an old issue or PR with locked comments) Aug 25, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 25, 2023