-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/coverage: TestCoverageApis failures #56197
Comments
Found new dashboard test flakes for:
2022-10-06 05:21 openbsd-arm64-jsing go@af668c68 runtime/coverage.TestCoverageApis (log)
|
Found new dashboard test flakes for:
2022-10-14 12:48 netbsd-amd64-9_0 go@a4b4717f runtime/coverage.TestCoverageApis (log)
|
For this last Oct 14 flake, it looks as though it timed out just building the harness, which is not especially far into the test (hmm). I'll send another CL to skip all the API tests in short mode. |
Change https://go.dev/cl/443376 mentions this issue: |
Add more skips if short mode testing, since some of these tests still seem to be timing out on smaller and more underpowered builders. Updates #56197. Change-Id: I469d9fd3a6be5602243234562fa3fe6263968b56 Reviewed-on: https://go-review.googlesource.com/c/go/+/443376 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: David Chase <[email protected]> Run-TryBot: Than McIntosh <[email protected]>
Found new dashboard test flakes for:
2022-10-17 15:26 linux-ppc64-sid-buildlet go@c83d4fba runtime/coverage.TestCoverageApis (log)
|
Add more skips if short mode testing, since some of these tests still seem to be timing out on smaller and more underpowered builders. Updates golang#56197. Change-Id: I469d9fd3a6be5602243234562fa3fe6263968b56 Reviewed-on: https://go-review.googlesource.com/c/go/+/443376 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: David Chase <[email protected]> Run-TryBot: Than McIntosh <[email protected]>
@thanm, is this fixed, or is there more to do here? That last failure on 2022-10-17 doesn't particularly look like a timeout to me... |
I agree that last failure does not look like a timeout. I will see if I can reproduce it. |
Well, so far no luck reproducing it:
Looking at the code that triggers the panic, it looks like a corrupt coverage counter section would do it, assuming that is what happened. I will wait and see if any other similar flakes turn up. |
OK looks like I just needed to be patient and wait a bit longer (I upped the -test.count to 10000 and eventually hit it), albeit with slightly different slice index violation. This should be interesting to work on :-/. |
As near as I can tell this looks to be a relaxed memory model problem, in which two stores to the coverage counter array are being performed out of program order, and this winds up being exposed to the code that is trying to read the counter array while the program is running. In the 3 or 4 instances of the bug that I've seen, the "number of counters" slot is always very large, e.g. 4294969076, etc. This value looks suspiciously like one of the hard-coded package IDs used by runtime packages. What this says to me is that in the bad scenario we have two threads, A and B. Thread A is walking through the counter array reading values, looking for a non-zero "number of counters" slot. Thread B happens to be executing a function in the runtime that up until this point hasn't been executed, and it's in the prolog section of the function where it is writing the summary values for the function, e.g. numcounters + pkgid + funcid. The program issues three stores for these values, but the processor for whatever reason decides to reorder the stores and it finished the pkgid store first, before the numcounters store. Since it's a runtime package, this explains the bogus value (since runtime packages use pkg IDs that are encoded as negative numbers). Also, we're seeing runtime packages since harness is built with -coverpkg=all. For the coverage APIs we already restrict the use of coverage.ClearCounters() to atomic mode, but what this bug says to me is that we need to extend this restriction to the other APIs that deal with counter data. |
Change https://go.dev/cl/463695 mentions this issue: |
Found new dashboard test flakes for:
2023-03-08 21:46 linux-arm64-longtest go@4df95d51 runtime/coverage.TestCoverageApis (log)
|
Found new dashboard test flakes for:
2023-03-25 10:16 linux-amd64-longtest go@9768f736 runtime/coverage.TestCoverageApis (log)
|
Found new dashboard test flakes for:
2023-04-12 20:52 windows-amd64-longtest go@12e65dbd runtime/coverage.TestCoverageApis (log)
|
Found new dashboard test flakes for:
2023-04-21 21:05 darwin-amd64-longtest go@cedf5008 runtime/coverage.TestCoverageApis (log)
|
Found new dashboard test flakes for:
2023-07-10 19:19 darwin-amd64-longtest go@07ede7a5 runtime/coverage.TestCoverageApis (log)
|
Found new dashboard test flakes for:
2024-03-23 04:07 linux-arm64-longtest go@83a6c13e runtime/coverage.TestCoverageApis (log)
|
Change https://go.dev/cl/588235 mentions this issue: |
Real programs can call os.Exit concurrently from multiple goroutines. Make internal/runtime/exithook not crash in that case. The throw on panic also now runs in the deferred context, so that we will see the full stack trace that led to the panic. That should give us more visibility into the flaky failures on bugs #55167 and #56197 as well. Fixes #67631. Change-Id: Iefdf71b3a3b52a793ca88d89a9c270eb50ece094 Reviewed-on: https://go-review.googlesource.com/c/go/+/588235 Reviewed-by: Than McIntosh <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Auto-Submit: Russ Cox <[email protected]>
Issue created automatically to collect these failures.
Example (log):
— watchflakes
The text was updated successfully, but these errors were encountered: