update to ginkgo v2 #18163
Conversation
Looks like the issue persists with rootless hanging/not completing.
Not just rootless: debian-12 and fedora-37, root, are at 54 minutes right now. I expect they will time out.
There's something horribly weird and broken in the new ginkgo: it does not seem to show command stderr. Example: f37 root log, search in-page for " Obviously this renders the entire package useless, so there must be a way to override it, and I'm sure one of you lovely people will find the solution much more quickly than I can. I'll keep looking anyway though.
@edsantiago There are changes to how output works now; I am aware of this. Benchmarks are also broken. Example processes that hang:
As soon as I kill them manually, ginkgo returns.
diff --git a/test/e2e/containers_conf_test.go b/test/e2e/containers_conf_test.go
index 125a7bff1..72095d109 100644
--- a/test/e2e/containers_conf_test.go
+++ b/test/e2e/containers_conf_test.go
@@ -233,6 +233,7 @@ var _ = Describe("Verify podman containers.conf usage", func() {
Expect(hostNS).To(Equal(ctrNS))
session = podmanTest.Podman([]string{"run", option, "private", ALPINE, "ls", "-l", nspath})
+ Expect(session).Should(Exit(0))
fields = strings.Split(session.OutputToString(), " ")
ctrNS = fields[len(fields)-1]
Expect(hostNS).ToNot(Equal(ctrNS))

???
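(For context, a hedged sketch of the conventional pattern elsewhere in the podman e2e suite, assuming the suite's `WaitWithDefaultTimeout` helper and gomega's `Exit` matcher: run the command, wait for it, assert the exit code, and only then parse its output.)

```go
// Sketch only, not part of the diff above; the surrounding variables
// (podmanTest, option, ALPINE, nspath, hostNS, fields, ctrNS) come from
// the enclosing test in containers_conf_test.go.
session := podmanTest.Podman([]string{"run", option, "private", ALPINE, "ls", "-l", nspath})
session.WaitWithDefaultTimeout()          // wait for the command to finish
Expect(session).Should(Exit(0))           // fail loudly on a non-zero exit
fields = strings.Split(session.OutputToString(), " ")
ctrNS = fields[len(fields)-1]
Expect(hostNS).ToNot(Equal(ctrNS))
```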
That is a good catch, but I don't think it is the problem; the podman process is completely stuck and does not respond to any signal (well, kill works, obviously), and it is not listed as a zombie. I can attach with a debugger and it looks like we are deadlocked on a container lock.
But the thing is, how the hell can it deadlock here? I have zero other podman processes running on the system which could be holding a lock, so what is holding the lock?
Yeah... I noticed. Even with
Same here. But this is rootless... there's probably no pause process needed when running private namespaces, but could there be something that doesn't know that, and is waiting to join the pause process ns? (Ignore if this is a stupid comment. No need even to respond. Just thinking out loud.)
Actually adding
seems to fix it for me now. I only run containers_conf_test.go right now; before this change it hung every time, and now it has passed three times in a row.
Weird: I have
Quick suggestion for next time you push: try changing the
Thanks for looking into this, folks!
2e40a52 to 6dbb60f (compare)
Ok, hang is definitely fixed. Root cause is
Remote tests failing to compile:
Got em. (I need raw logs, the Cirrus ones. The logformatted ones are utter garbage right now, that's what I'm trying to fix)
with the added
Hey, are we expecting PANICs? In the process of updating logformatter, I see a bunch of these panics in
See debian log, search in-page for "PANIC". Compare to current ginkgo log for a PR that merged today. The very weird thing is that the test shows as SKIPPED, and there is no link (at bottom) to the panic. No indication anywhere that there was a panic. Anyhow, not my highest priority right now but someone should maybe look at it.
I'll take a look.
Looks like ginkgo now still runs AfterEach() when we call Skip() in BeforeEach()
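A minimal sketch of that ordering and a defensive pattern for it (the haveFeature gate and tmpDir cleanup are hypothetical, only meant to show that the AfterEach still fires after a Skip in BeforeEach):

```go
package e2e_test

import (
	"os"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// hypothetical gate, only to illustrate the ordering
var haveFeature = false

var _ = Describe("Skip inside BeforeEach", func() {
	var tmpDir string

	BeforeEach(func() {
		if !haveFeature {
			Skip("feature not available") // spec is reported as skipped ...
		}
		var err error
		tmpDir, err = os.MkdirTemp("", "example") // ... and this setup never runs
		Expect(err).ToNot(HaveOccurred())
	})

	AfterEach(func() {
		// In ginkgo v2 this still executes even though BeforeEach skipped,
		// so cleanup must tolerate the setup never having happened.
		if tmpDir == "" {
			return
		}
		os.RemoveAll(tmpDir)
	})

	It("uses the feature", func() {})
})
```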
Super-complicated and ugly code merely to make in-page anchors link three lines above the subtest name, to the subtest status and timing line.

Signed-off-by: Ed Santiago <[email protected]>
This never worked when ginkgo runs with more than one thread; we use 3 in CI. The problem is that the SynchronizedAfterSuite() function accepts two functions. The first one is run for each ginkgo node, while the second one is only run once for the whole suite. Because the timings are stored as a slice, thus in memory, we lose all timings from the other nodes, as they were only reported on node 1.

Moving the printing into the first function solves this, but causes the problem that the result is no longer sorted. To fix this we let each node write its result to a tmp file, and only then let the final after-suite function collect the timings from all these files, sort them, and print the output like we did before.

Signed-off-by: Paul Holzinger <[email protected]>
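A minimal sketch of that scheme, assuming ginkgo v2's SynchronizedAfterSuite and GinkgoParallelProcess; nodeTimings and timingsDir are placeholders, not the actual podman code:

```go
package e2e_test

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"

	. "github.com/onsi/ginkgo/v2"
)

// Placeholders for illustration; the real suite stores richer timing structs.
var nodeTimings []string
var timingsDir = os.TempDir()

var _ = SynchronizedAfterSuite(
	// First function: runs on every parallel process, where each process
	// still has its own in-memory timings, so persist them to a per-process file.
	func() {
		path := filepath.Join(timingsDir, fmt.Sprintf("timings-%d", GinkgoParallelProcess()))
		_ = os.WriteFile(path, []byte(strings.Join(nodeTimings, "\n")), 0o644)
	},
	// Second function: runs exactly once (on process 1) after all processes
	// finish; collect every per-process file, merge, sort, and print.
	func() {
		files, _ := filepath.Glob(filepath.Join(timingsDir, "timings-*"))
		var all []string
		for _, f := range files {
			data, err := os.ReadFile(f)
			if err != nil {
				continue
			}
			all = append(all, strings.Split(strings.TrimSpace(string(data)), "\n")...)
		}
		sort.Strings(all)
		fmt.Fprintln(GinkgoWriter, strings.Join(all, "\n"))
	},
)
```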
Add a workaround for containers#18180 so the ginkgo work can be merged without being blocked by the issue. Please revert this commit when the issue is fixed.

Signed-off-by: Paul Holzinger <[email protected]>
Directly writing to stdout/stderr is not safe when tests run in parallel. Ginkgo v2 fixes this by buffering the output and syncing it so it is not mangled between tests. This is the same as for the podman integration tests.

Signed-off-by: Paul Holzinger <[email protected]>
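A small sketch of the ginkgo v2 mechanism this refers to: write through GinkgoWriter instead of os.Stdout so output stays attached to its spec (the It block below is illustrative, not podman code):

```go
package e2e_test

import (
	"fmt"

	. "github.com/onsi/ginkgo/v2"
)

var _ = It("logs safely when specs run in parallel", func() {
	// Output written to GinkgoWriter is buffered per spec and emitted in one
	// piece (immediately with -v, or on failure otherwise), so lines from
	// different parallel processes do not get interleaved.
	fmt.Fprintln(GinkgoWriter, "running podman with some args ...")
})
```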
While reworking ginkgo to use -p by default, we also forced the machine tests to be run in parallel. Right now this does not work at all (something that should be fixed). Using -p is easier because it lets ginkgo decide how many parallel nodes to use, so it is much faster on high core counts. So use some makefile magic: instead of using `GINKGONODES`, use `GINKGO_PARALLEL` and set it to `y` by default. The machine tests will then use that var to disable it.

Signed-off-by: Paul Holzinger <[email protected]>
All
I restarted it, my best guess is some flake leaving processes running...
This is the only failure logged by ginkgo, passed on the retry. However the AfterEach cleanup still fired correctly and looks successful so I don't think it leaked a process.
OMG.... CI is green. I'm still uncomfortable about that one timeout, and expect it to recur... but I am 100% OK with fixing that later if needed. @containers/podman-maintainers PTAL, this will potentially fix many flakes, and make debugging easier (better logs), and much more. (@Luap99 I'm still doing my final review, so, almost-lgtm but not quite yet.)
@@ -558,8 +557,8 @@ test: localunit localintegration remoteintegration localsystem remotesystem ##

.PHONY: ginkgo-run
ginkgo-run: .install.ginkgo
	ACK_GINKGO_RC=true $(GINKGO) version
	ACK_GINKGO_RC=true $(GINKGO) -v $(TESTFLAGS) -tags "$(TAGS) remote" $(GINKGOTIMEOUT) -cover -flakeAttempts 3 -progress -trace -noColor -nodes $(GINKGONODES) -debug $(GINKGOWHAT) $(HACK)
Removing `-cover` looks correct to me: it's handled in the `localunit` target above. I've traced the presence of `-cover` in this line all the way back to 2018, and believe it to be a copy/paste from the 2017 CRI-O import. But since there's a slight possibility that I'm wrong, I'm calling attention to it anyway. If anyone sees a reason to keep this, please speak now.
Yeah, the unfortunate thing is this is really hard to track down. I am currently testing locally whether the ginkgo timeout is helpful here; right now cirrus will lose some output on timeout. However, if I can set a slightly lower timeout for ginkgo and it exits (not confirmed yet), then we can add a
AIUI that's not the way Cirrus works: the Cirrus timeout is a hard stop, kill everything, no grace period or chance for cleanup or handler. @cevich do you know if there's a soft timeout option, where an emergency handler could be run?
I am well aware; that is why I am talking about the ginkgo timeout:
Oops! My bad, I didn't properly understand your use of "ginkgo timeout". Yes, using
func() {
	sort.Sort(testResultsSortedLength{testResults})
	testTimings := make(testResultsSorted, 0, 2000)
Thank you! I hope this finally fixes #8358
Final LGTM. Thanks so much for persevering.
/lgtm
No, it's always hard. If you want soft, you have to do it via soft-ware 😁 (like ginkgo or
@@ -571,7 +570,7 @@ ginkgo-remote:

.PHONY: testbindings
testbindings: .install.ginkgo
	ACK_GINKGO_RC=true $(GINKGO) -v $(TESTFLAGS) -tags "$(TAGS) remote" $(GINKGOTIMEOUT) -progress -trace -noColor -debug -timeout 30m -v -r ./pkg/bindings/test
	$(GINKGO) -v $(TESTFLAGS) --tags "$(TAGS) remote" $(GINKGOTIMEOUT) --trace --no-color --timeout 30m -v -r ./pkg/bindings/test
There's a `$(GINKGOTIMEOUT)` specified here in addition to a `--timeout 30m`; not sure if that was intentional or doing some tricky thing with ginkgo. Consider removing one or the other, or adding a comment detailing why there are two timeout options.
@@ -126,8 +126,7 @@ LIBSECCOMP_COMMIT := v2.3.3
GINKGOTIMEOUT ?= -timeout=90m
I would suggest setting this timeout LOWER than the one in `.cirrus.yml`. As you've seen, when the cirrus-ci one hits, it just kills the VM. We get no logs or anything. It would be good to allow ginkgo's timeout to fire first so potentially helpful output and logs can be collected.
yes I am working on it but the hang bug still happens, i.e. ginkgo thinks it is done but some goroutine causes the process to hang and never exit.
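A hedged aside, not from this PR: a generic way to see which leftover goroutine keeps a Go process alive is to dump all goroutine stacks (SIGQUIT does this for free; the same dump can be produced in code via runtime/pprof):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Print a stack trace for every live goroutine; debug level 2 mimics the
	// output of an unrecovered panic, showing what each goroutine is blocked on.
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```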
Previously when there was a ginkgo hang, you could find out what it was by examining the ginkgo-node log files collected as CI-artifacts IIRC. However, if Cirrus times out the task, the logs won't be collected.
LGTM - two non-blocking timeout-related comments that can be fixed later.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, edsantiago, Luap99

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Does this PR introduce a user-facing change?