-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
misc/cgo/testsanitizers: occasional hangs in TestTSAN/tsan12 #52998
Comments
Unfortunately in the above trace the test is blocked on running a subprocess in a table-driven test: We probably need to send SIGQUIT to that subprocess to figure out what's going on (#50436), although just having information about which subtest stalled would help somewhat (#39038). It's odd that the test itself doesn't time out, though. |
cmd/dist runs the test by first running I will send a CL for that. |
Change https://go.dev/cl/407374 mentions this issue: |
Thanks Ian. For the record, my upcoming coordinator change was going to include a 5 minute timeout on the x/build side via // DistTestsExecTimeout returns how long the coordinator should wait
// for a cmd/dist test execution to run the provided dist test names.
func (c *BuildConfig) DistTestsExecTimeout(distTests []string) time.Duration {
+ if len(distTests) == 1 && distTests[0] == "testsanitizers" {
+ // This test is known to complete in under a minute normally,
+ // but can sometimes stall indefinitely; see go.dev/issue/52998.
+ // This cap may help a bit until that issue is resolved.
+ return 5 * time.Minute
+ }
+
... Since CL 407374 does the same in the test, that won't be necessary anymore (I had it just in case progress on this issue wasn't going to be as quick). |
For a host test we build the test using "go test -c" and then run the test binary. A test binary run in this way has no default timeout. This CL gives it a timeout of 5 minutes, scaled for the target. We can adjust the timeout if necessary. For #52998 Change-Id: Ib759142f3e71cbb37ec858182998fc5d4fba7ab6 Reviewed-on: https://go-review.googlesource.com/c/go/+/407374 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]>
Change https://go.dev/cl/407954 mentions this issue: |
I modified the test itself to use the
That test program is the regression test for https://go.dev/cl/50251. It calls I don't see any runnable goroutines in the stack dump, so this smells like a |
TSAN intercepts signals and invokes the signal handler from TSAN at some point(?). Maybe that never happens? Not sure if we actively trigger TSAN to deliver the signal in this case. |
Generally in TSAN mode we call |
Possible.
This stack is weird. I don't think |
If the test hangs due to a deadlock in a subprocess, we want a goroutine dump of that process to figure out the nature of the deadlock. SIGQUIT causes the Go runtime to produce exactly such a dump (unless the runtime itself is badly deadlocked). For #52998. Change-Id: Id9b3ba89d8f705e14f6cd789353fc2b7f4774ad3 Reviewed-on: https://go-review.googlesource.com/c/go/+/407954 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Bryan Mills <[email protected]> Auto-Submit: Bryan Mills <[email protected]> TryBot-Result: Gopher Robot <[email protected]>
I upped the timeout and ran more iterations of
|
This failure reproduces fairly readily (well within a day's CPU time) on However, given that we have evidence that it may have existed for a long time (masked by #42699), I think it would be ok to unblock the release by adding a skip to the
|
I'll look into it. |
Change https://go.dev/cl/407888 mentions this issue: |
@gopherbot, please backport to Go 1.18 and 1.17. This is a test-only fix for a test that occasionally deadlocks, which can cause TryBot stalls and spurious failures on release branches. |
Backport issue(s) opened: #53042 (for 1.17), #53043 (for 1.18). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/408114 mentions this issue: |
Change https://go.dev/cl/408115 mentions this issue: |
… in tsan12.go os/signal.Notify requires that "the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate" as it does a nonblocking send when it receives a signal. The test currently using a unbuffered channel, which means it may miss the signal if the signal arrives before the channel receive operation. Fixes #53043. Updates #52998. Change-Id: Icdcab9396d735506480ef880fb45a4669fa7cc8f Reviewed-on: https://go-review.googlesource.com/c/go/+/407888 Reviewed-by: Ian Lance Taylor <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Cuong Manh Le <[email protected]> TryBot-Result: Gopher Robot <[email protected]> (cherry picked from commit 62e1302) Reviewed-on: https://go-review.googlesource.com/c/go/+/408114
… in tsan12.go os/signal.Notify requires that "the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate" as it does a nonblocking send when it receives a signal. The test currently using a unbuffered channel, which means it may miss the signal if the signal arrives before the channel receive operation. Fixes #53042. Updates #52998. Change-Id: Icdcab9396d735506480ef880fb45a4669fa7cc8f Reviewed-on: https://go-review.googlesource.com/c/go/+/407888 Reviewed-by: Ian Lance Taylor <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Cuong Manh Le <[email protected]> TryBot-Result: Gopher Robot <[email protected]> (cherry picked from commit 62e1302) Reviewed-on: https://go-review.googlesource.com/c/go/+/408115
(This is mostly a continuation of #33249, with some new information.)
There's evidence of a problem where this test can sometimes hang indefinitely, though in some environments there is an external timeout, causing it to be eventually retried and then succeed. The hang delays success by however long the timeout for the retry was.
Additional background
I've observed an occurrence this week in a SlowBot run in https://go.dev/406897 (linux-amd64-longtest took almost 2 hrs), another one last week. There are mentions of it in #42699 (comment) and in #36629 (comment). So I think it's been around for a while.
Issue #33249 was closed because there weren't failures since late 2019, but that is likely due to an unintentional behavior change in the coordinator in Nov 2019 leading to #42699, rather than due to a change in testsanitizers test.
I was able to reproduce the problem locally (using tip commit c6965ad) by running the test in a loop up to a 1000 times (it reproduced on iteration 189 in the first run, on iteration 42 in another, on iteration 53 in yet another):
Sending a SIQQUIT during the hang produced the following output:
Output
This is the tracking issue for improving the test not to hang indefinitely, if possible.
The text was updated successfully, but these errors were encountered: