
update to ginkgo v2 #18163

Merged
merged 23 commits into from
May 2, 2023
Conversation

Luap99
Member

@Luap99 Luap99 commented Apr 12, 2023

Does this PR introduce a user-facing change?

none

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), release-note-none, and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels Apr 12, 2023
@vrothberg
Member

Looks like the issue persists with rootless hanging/not completing.

@edsantiago
Member

Not just rootless: debian-12 and fedora-37, root, are at 54 minutes right now. I expect they will time out.

@edsantiago
Member

There's something horribly weird and broken in the new ginkgo: it does not seem to show command stderr. Example: f37 root log, search in-page for "--filter takes no arg". It shows up jumbled together in a block labeled "Timeline", but does not show up next to the actual command that generated it. This makes it impossible to diagnose actual errors. The relevant code, in ginkgo/v2/reporters/default_reporter.go, suggests that this is triggered when --nodes > 1 but at first glance I don't see a way to fix it.

Obviously this renders the entire package useless, so there must be a way to override it, and I'm sure one of you lovely people will find the solution much more quickly than I can. I'll keep looking anyway though.

@Luap99
Member Author

Luap99 commented Apr 12, 2023

@edsantiago There are changes in how output works now; I am aware of this. Benchmarks are also broken.
This is not mergeable as is, hence it's marked WIP. For now I want to figure out the hang. I can reproduce it locally: the problem is that we are leaking podman processes, and the new ginkgo waits for them to complete.

Example processes that hang:

pholzing  704177  0.1  0.1 1789984 41924 pts/4   Sl+  18:54   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman --storage-opt vfs.imagestore=/tmp/imagecachedir --root /tmp/podman_test3608744664/root --runroot /tmp/podman_test3608744664/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /home/pholzing/.config/cni/net.d --network-backend cni --cgroup-manager cgroupfs --tmpdir /tmp/podman_test3608744664 --events-backend file --db-backend boltdb --storage-driver vfs run --uts private quay.io/libpod/alpine:latest ls -l /proc/self/ns/uts
pholzing  704463  0.1  0.1 1863716 41976 pts/4   Sl+  18:54   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman --storage-opt vfs.imagestore=/tmp/imagecachedir --root /tmp/podman_test2981800605/root --runroot /tmp/podman_test2981800605/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /home/pholzing/.config/cni/net.d --network-backend cni --cgroup-manager cgroupfs --tmpdir /tmp/podman_test2981800605 --events-backend file --db-backend boltdb --storage-driver vfs run --pid private quay.io/libpod/alpine:latest ls -l /proc/self/ns/pid
pholzing  704662  0.1  0.1 1863460 40512 pts/4   Sl+  18:54   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman --storage-opt vfs.imagestore=/tmp/imagecachedir --root /tmp/podman_test3588093402/root --runroot /tmp/podman_test3588093402/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /home/pholzing/.config/cni/net.d --network-backend cni --cgroup-manager cgroupfs --tmpdir /tmp/podman_test3588093402 --events-backend file --db-backend boltdb --storage-driver vfs run --cgroupns private quay.io/libpod/alpine:latest ls -l /proc/self/ns/cgroup

As soon as I kill them manually ginkgo returns.
So now I need to figure out why this happens.

@edsantiago
Member

diff --git a/test/e2e/containers_conf_test.go b/test/e2e/containers_conf_test.go
index 125a7bff1..72095d109 100644
--- a/test/e2e/containers_conf_test.go
+++ b/test/e2e/containers_conf_test.go
@@ -233,6 +233,7 @@ var _ = Describe("Verify podman containers.conf usage", func() {
 		Expect(hostNS).To(Equal(ctrNS))
 
 		session = podmanTest.Podman([]string{"run", option, "private", ALPINE, "ls", "-l", nspath})
+		Expect(session).Should(Exit(0))
 		fields = strings.Split(session.OutputToString(), " ")
 		ctrNS = fields[len(fields)-1]
 		Expect(hostNS).ToNot(Equal(ctrNS))

???
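The one-line diff above probes a bug class this PR kept running into: reading a session's buffered output before waiting on the process returns incomplete data and leaves the process running. A minimal stdlib sketch of the same mistake (the command and function names are illustrative, not podman's actual test code):

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// readEarlyAndLate launches a short-lived command and samples its buffered
// stdout once before Wait() and once after -- the stdlib analogue of calling
// session.OutputToString() without session.WaitWithDefaultTimeout() first.
func readEarlyAndLate() (early, late string) {
	cmd := exec.Command("sh", "-c", "sleep 0.2; echo done")
	var buf bytes.Buffer
	cmd.Stdout = &buf
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	early = buf.String() // read too soon: the process has not written yet
	if err := cmd.Wait(); err != nil {
		panic(err)
	}
	late = buf.String() // after Wait(): the complete output
	return early, late
}

func main() {
	early, late := readEarlyAndLate()
	fmt.Printf("before wait: %q, after wait: %q\n", early, late)
}
```

Without the wait, the test inspects an empty buffer and moves on while the child keeps running.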

@Luap99
Member Author

Luap99 commented Apr 12, 2023

That is a good catch, but I don't think it is the problem: the podman process is completely stuck and does not respond to any signal (well, SIGKILL works, obviously), and it is not listed as a zombie.

I can attach with a debugger and it looks like we are deadlocked on a container lock

(dlv) bt
0  0x00007f977e506fd3 in ???
   at ?:-1
1  0x000000000047c2e0 in runtime.systemstack_switch
   at /usr/lib/golang/src/runtime/asm_amd64.s:459
2  0x0000000000415a05 in runtime.cgocall
   at /usr/lib/golang/src/runtime/cgocall.go:168
3  0x00000000010aedac in github.com/containers/podman/v4/libpod/lock/shm._Cfunc_lock_semaphore
   at _cgo_gotypes.go:227
4  0x00000000010affd8 in github.com/containers/podman/v4/libpod/lock/shm.(*SHMLocks).LockSemaphore
   at ./go/src/github.com/containers/podman/libpod/lock/shm/shm_lock.go:229
5  0x00000000010b0ea5 in github.com/containers/podman/v4/libpod/lock.(*SHMLock).Lock
   at ./go/src/github.com/containers/podman/libpod/lock/shm_lock_manager_linux.go:114
6  0x000000000123af8e in github.com/containers/podman/v4/libpod.(*Container).Kill
   at ./go/src/github.com/containers/podman/libpod/container_api.go:271
7  0x0000000001375b32 in github.com/containers/podman/v4/pkg/domain/infra/abi/terminal.ProxySignals.func1
   at ./go/src/github.com/containers/podman/pkg/domain/infra/abi/terminal/sigproxy_commn.go:36
8  0x000000000047e501 in runtime.goexit
   at /usr/lib/golang/src/runtime/asm_amd64.s:1594

But the thing is: how the hell can it deadlock here? I have zero other podman processes running on the system that could be holding a lock, so what is holding it?

@edsantiago
Member

Yeah... I noticed. Even with WaitWithDefaultTimeout it still hangs for me, unkillable except with -9.

zero other podman processes

Same here. But this is rootless... there's probably no pause process needed when running private namespaces, but could there be something that doesn't know that, and is waiting to join the pause process ns? (Ignore if this is a stupid comment. No need even to respond. Just thinking out loud.)

@Luap99
Member Author

Luap99 commented Apr 12, 2023

Actually adding

session.WaitWithDefaultTimeout()
Expect(session).Should(Exit(0))

seems to fix it for me now. I am only running containers_conf_test.go right now; before this change it hung every time, and now it has passed three times in a row.

@edsantiago
Member

Weird: I have WaitWithDefaultTimeout() also, and Should(), just like you cite above .... and I still see the hang. But I'll move on to other things.

@edsantiago
Member

Quick suggestion for next time you push: try changing the .cirrus.yml timeouts from 90m to 50 or even 45. Maybe not permanently, but it sure would save a lot of time in catching hangs.

@vrothberg
Member

Thanks for looking into this, folks!

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Ok, the hang is definitely fixed. The root cause is podman rm --fa (#18180) not working in all circumstances. It therefore starts leaking processes, and ginkgo v2 doesn't like that. I assume the processes keep some std streams open which it tries to read.

@edsantiago
Member

Remote tests failing to compile: undefined: GinkgoWriter. But @Luap99 please don't re-push yet, I want some sample logs to download so I can fix logformatter. Should just be a few more minutes.

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Remote tests failing to compile: undefined: GinkgoWriter. But @Luap99 please don't re-push yet, I want some sample logs to download so I can fix logformatter. Should just be a few more minutes.

https://api.cirrus-ci.com/v1/artifact/task/5080120319279104/html/int-podman-fedora-37-root-container-boltdb.log.html

@edsantiago
Member

Got em. (I need raw logs, the Cirrus ones. The logformatted ones are utter garbage right now, that's what I'm trying to fix)

@Luap99
Member Author

Luap99 commented Apr 13, 2023

With the added -p instead of --nodes 3, make localintegration only takes 4-5 minutes on my laptop (12 threads).

@edsantiago
Member

Hey, are we expecting PANICs? In the process of updating logformatter, I see a bunch of these panics in podman unshare tests:

S [SKIPPED] [0.001 seconds]
         Podman unshare [BeforeEach]
           podman unshare
           Enter [AfterEach] Podman unshare - /var/tmp/go/src/github.com/containers/podman/test/e2e/unshare_test.go:37 @ 04/13/23 13:04:41.586

           [PANICKED] Test Panicked
           In [AfterEach] at: /usr/lib/go-1.19/src/runtime/panic.go:260 @ 04/13/23 13:04:41.586
         
           runtime error: invalid memory address or nil pointer dereference

See debian log, search in-page for "PANIC". Compare to current ginkgo log for a PR that merged today.

The very weird thing is that the test shows as SKIPPED, and there is no link (at bottom) to the panic. No indication anywhere that there was a panic.

Anyhow, not my highest priority right now but someone should maybe look at it.

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Anyhow, not my highest priority right now but someone should maybe look at it.

I'll take a look.

@Luap99
Member Author

Luap99 commented Apr 13, 2023

Looks like ginkgo now still runs AfterEach() when we call Skip() in BeforeEach()
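Since AfterEach now fires even when BeforeEach skipped, cleanup code can see a fixture that was never created — which explains the nil pointer panics above. A runnable stdlib sketch of the nil-guard pattern (the type and method names are illustrative stand-ins, not podman's actual fixture):

```go
package main

import "fmt"

// PodmanTest is a stand-in for the e2e test fixture.
type PodmanTest struct{ dir string }

// Cleanup tears the fixture down and reports what it did.
func (p *PodmanTest) Cleanup() string { return "cleaned " + p.dir }

// safeCleanup is the shape an AfterEach body needs under ginkgo v2: when
// BeforeEach called Skip() before creating the fixture, p is still nil.
func safeCleanup(p *PodmanTest) string {
	if p == nil {
		return "skipped: fixture never created"
	}
	return p.Cleanup()
}

func main() {
	fmt.Println(safeCleanup(nil))                        // BeforeEach skipped before setup
	fmt.Println(safeCleanup(&PodmanTest{dir: "/tmp/x"})) // normal run
}
```

Under ginkgo v1 the nil branch was unreachable, so the guard was never needed; with v2 it is.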

edsantiago and others added 5 commits May 2, 2023 11:27
Super-complicated and ugly code merely to make in-page anchors
link three lines above the subtest name, to the subtest status
and timing line.

Signed-off-by: Ed Santiago <[email protected]>
This never worked when ginkgo runs with more than one thread, we use 3
in CI. The problem is that the SynchronizedAfterSuite() function accepts
two functions. The first one is run for each ginkgo node while the
second one is only run once for the whole suite.

Because the timings are stored as a slice, thus in memory, we lose all
timings from the other nodes, as they were only reported on node 1.
Moving the printing into the first function solves this but causes the
problem that the result is no longer sorted. To fix this we let each
node write its result to a tmp file and only then let the final after
suite function collect the timings from all these files, sort them, and
print the output like we did before.

Signed-off-by: Paul Holzinger <[email protected]>
Add a workaround for containers#18180 so the ginkgo work can be merged without
being blocked by the issue. Please revert this commit when the issue
is fixed.

Signed-off-by: Paul Holzinger <[email protected]>
Directly writing to stdout/stderr is not safe when running in parallel.
Ginkgo v2 fixes this by buffering the output and syncing it so it is
not mangled between tests.

This is the same as for the podman integration tests.

Signed-off-by: Paul Holzinger <[email protected]>
While reworking ginkgo to use -p by default we also forced the machine
tests to run in parallel. Right now this does not work at all
(something that should be fixed).

Using -p is easier because it lets ginkgo decide how many parallel
nodes to use, so it is much faster on high core counts.

So use some makefile magic to replace `GINKGONODES` with
`GINKGO_PARALLEL`, set to `y` by default. The machine tests will
then use that var to disable parallelism.

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member

All int tests passed, with times of 26:44 to 38:24, but one timed out at 50m. Would presumably also time out at 90m. I don't understand why only one test would time out.

@Luap99
Member Author

Luap99 commented May 2, 2023

I restarted it, my best guess is some flake leaving processes running...

@Luap99
Member Author

Luap99 commented May 2, 2023

[+0490s]   Running: /var/tmp/go/src/github.com/containers/podman/bin/podman --root /tmp/podman_test1335685339/root --runroot /tmp/podman_test1335685339/runroot --runtime crun --conmon /usr/bin/conmon --network-config-dir /tmp/podman_test1335685339/root/etc/networks --network-backend netavark --cgroup-manager systemd --tmpdir /tmp/podman_test1335685339 --events-backend file --db-backend sqlite --storage-driver vfs ps -q
[+0490s]   Error: beginning transaction: database is locked

This is the only failure logged by ginkgo, and it passed on the retry. However, the AfterEach cleanup still fired correctly and looks successful, so I don't think it leaked a process.

@edsantiago
Member

OMG.... CI is green. I'm still uncomfortable about that one timeout, and expect it to recur... but I am 100% OK with fixing that later if needed. @containers/podman-maintainers PTAL, this will potentially fix many flakes, and make debugging easier (better logs), and much more. (@Luap99 I'm still doing my final review, so, almost-lgtm but not quite yet).

@@ -558,8 +557,8 @@ test: localunit localintegration remoteintegration localsystem remotesystem ##

.PHONY: ginkgo-run
ginkgo-run: .install.ginkgo
ACK_GINKGO_RC=true $(GINKGO) version
ACK_GINKGO_RC=true $(GINKGO) -v $(TESTFLAGS) -tags "$(TAGS) remote" $(GINKGOTIMEOUT) -cover -flakeAttempts 3 -progress -trace -noColor -nodes $(GINKGONODES) -debug $(GINKGOWHAT) $(HACK)
Member

Removing -cover looks correct to me: it's handled in the localunit target above. I've traced the presence of -cover in this line all the way back to 2018, and believe it to be a copy/paste from the 2017 CRI-O import. But since there's a slight possibility that I'm wrong, I'm calling attention to it anyway. If anyone sees a reason to keep this, please speak now.

@Luap99
Member Author

Luap99 commented May 2, 2023

I'm still uncomfortable about that one timeout, and expect it to recur...

Yeah, the unfortunate thing is that this is really hard to track down. I am currently testing locally whether the ginkgo timeout helps here; right now cirrus will lose some output on timeout. However, if I can set a slightly lower timeout for ginkgo and it exits (not confirmed yet), then we can add a ps auxww to the cirrus script to check for running processes, which should help debug what is hanging.

@edsantiago
Member

we can add a ps auxww to the cirrus script

AIUI that's not the way Cirrus works: the Cirrus timeout is a hard stop, kill everything, no grace period or chance for cleanup or handler. @cevich do you know if there's a soft timeout option, where an emergency handler could be run?

@Luap99
Member Author

Luap99 commented May 2, 2023

AIUI that's not the way Cirrus works: the Cirrus timeout is a hard stop, kill everything, no grace period or chance for cleanup or handler. @cevich do you know if there's a soft timeout option, where an emergency handler could be run?

I am well aware; that is why I am talking about the ginkgo timeout: ginkgo --timeout 45m .... In this case ginkgo should exit after that time, assuming it is not affected by the hang as well. If the ginkgo timeout is 45 min and the cirrus timeout is 50 min, then I have a small window afterwards to run debug commands.

@edsantiago
Member

Oops! My bad, I didn't properly understand your use of "ginkgo timeout". Yes, using ginkgo --timeout SHORTERVAL sounds perfect (but for a later PR). Thanks for the clarification.

func() {
sort.Sort(testResultsSortedLength{testResults})
testTimings := make(testResultsSorted, 0, 2000)
Member

Thank you! I hope this finally fixes #8358

Member

@edsantiago edsantiago left a comment

Final LGTM. Thanks so much for persevering.

@rhatdan
Member

rhatdan commented May 2, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2023
@cevich
Member

cevich commented May 2, 2023

cleanup or handler. @cevich do you know if there's a soft timeout option

No, it's always hard. If you want soft, you have to do it via soft-ware 😁 (like ginkgo or timeout or something).

@openshift-merge-robot openshift-merge-robot merged commit f173db7 into containers:main May 2, 2023
@@ -571,7 +570,7 @@ ginkgo-remote:

.PHONY: testbindings
testbindings: .install.ginkgo
ACK_GINKGO_RC=true $(GINKGO) -v $(TESTFLAGS) -tags "$(TAGS) remote" $(GINKGOTIMEOUT) -progress -trace -noColor -debug -timeout 30m -v -r ./pkg/bindings/test
$(GINKGO) -v $(TESTFLAGS) --tags "$(TAGS) remote" $(GINKGOTIMEOUT) --trace --no-color --timeout 30m -v -r ./pkg/bindings/test
Member

There's a $(GINKGOTIMEOUT) specified here in addition to a --timeout 30m, not sure if that was intentional / doing some tricky thing with ginkgo or not. Consider removing one or the other or adding a comment detailing why there are two timeout options.

@@ -126,8 +126,7 @@ LIBSECCOMP_COMMIT := v2.3.3
GINKGOTIMEOUT ?= -timeout=90m
Member

I would suggest setting this timeout LOWER than the one in .cirrus.yml. As you've seen, when the cirrus-ci one hits, it just kills the VM. We get no logs or anything. It would be good to allow ginkgo's timeout to fire first so potentially helpful output and logs can be collected.

Member Author

Yes, I am working on it, but the hang bug still happens: ginkgo thinks it is done, but some goroutine causes the process to hang and never exit.

Member

@cevich cevich May 2, 2023

Previously when there was a ginkgo hang, you could find out what it was by examining the ginkgo-node log files collected as CI-artifacts IIRC. However, if Cirrus times out the task, the logs won't be collected.

Member

@cevich cevich left a comment

LGTM - Two non-blocking timeout related comments that can be fixed later.

@openshift-ci
Contributor

openshift-ci bot commented May 2, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Luap99 Luap99 deleted the ginkgo-v2 branch May 2, 2023 14:36
@github-actions github-actions bot added the locked - please file new issue/PR label (assist humans wanting to comment on an old issue or PR with locked comments) Aug 25, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 25, 2023