Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/cmd/relui: advisory TryBots can handle or avoid timeouts better #60443

Open
dmitshur opened this issue May 25, 2023 · 5 comments
Open
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@dmitshur
Copy link
Contributor

dmitshur commented May 25, 2023

Advisory TryBots are causing much less friction compared to previously now that #57725 has been resolved. Filing this for some followup improvements that we can apply to how they handle timeouts.

Consider the following logs from linux-amd64-longtest-race and darwin-amd64-12_0 advisory trybot runs that timed out after all 3 tries:

Timeout after one slow package test
2023/04/26 15:32:06 ======== Trybot Attempt 1 of 3 ========
2023/04/26 15:32:06 Creating buildlet linux-amd64-longtest-race.
2023/04/26 15:35:26 Buildlet ready.
2023/04/26 15:35:26 Pushing source to buildlet.
2023/04/26 15:36:09 Installing go1.4.
2023/04/26 15:36:20 Testing
2023/04/26 15:36:46 
Building Go cmd/dist using /workdir/go1.4. (go1.17.13 linux/amd64)
Building Go toolchain1 using /workdir/go1.4.
2023/04/26 15:36:56 
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
2023/04/26 15:37:06 
Building Go toolchain2 using go_bootstrap and Go toolchain1.
2023/04/26 15:37:26 
Building Go toolchain3 using go_bootstrap and Go toolchain2.
2023/04/26 15:37:56 
Building packages and commands for linux/amd64.
2023/04/26 15:38:46 

##### Test execution environment.
# GOARCH: amd64
# CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
# GOOS: linux
# OS Version: Linux 5.15.89+ #1 SMP Fri Apr 14 09:13:06 UTC 2023 x86_64

##### Testing packages.
2023/04/26 15:38:56 
ok  	archive/tar	1.717s
2023/04/26 15:42:26 
ok  	archive/zip	218.491s
ok  	bufio	0.497s
ok  	bytes	18.052s
ok  	compress/bzip2	0.551s
2023/04/26 15:44:16 
ok  	compress/flate	319.931s
ok  	compress/gzip	24.677s
ok  	compress/lzw	1.146s
ok  	compress/zlib	17.343s
ok  	container/heap	0.052s
ok  	container/list	0.068s
ok  	container/ring	0.068s
ok  	context	0.077s
ok  	crypto	0.057s
ok  	crypto/aes	0.108s
ok  	crypto/cipher	2.234s
ok  	crypto/des	0.094s
ok  	crypto/dsa	15.913s
ok  	crypto/ecdh	1.304s
ok  	crypto/ecdsa	7.967s
ok  	crypto/ed25519	2.225s
ok  	crypto/elliptic	3.493s
ok  	crypto/hmac	0.102s
ok  	crypto/internal/alias	0.076s
ok  	crypto/internal/bigmod	4.823s
ok  	crypto/internal/boring	0.105s
ok  	crypto/internal/boring/bcache	0.916s
ok  	crypto/internal/edwards25519	76.131s
ok  	crypto/internal/edwards25519/field	28.774s
ok  	crypto/internal/nistec	53.465s
ok  	crypto/internal/nistec/fiat	0.049s [no tests to run]
ok  	crypto/md5	0.099s
ok  	crypto/rand	7.026s
ok  	crypto/rc4	1.032s
ok  	crypto/rsa	15.508s
ok  	crypto/sha1	0.151s
ok  	crypto/sha256	0.110s
ok  	crypto/sha512	0.124s
ok  	crypto/subtle	44.809s
ok  	crypto/tls	9.110s
ok  	crypto/x509	4.257s
ok  	database/sql	1.804s
ok  	database/sql/driver	0.086s
ok  	debug/buildinfo	11.564s
ok  	debug/dwarf	0.185s
ok  	debug/elf	3.727s
ok  	debug/gosym	1.404s
ok  	debug/macho	0.162s
ok  	debug/pe	0.183s
ok  	debug/plan9obj	0.050s
ok  	embed	0.092s [no tests to run]
ok  	embed/internal/embedtest	0.074s
ok  	encoding/ascii85	0.088s
ok  	encoding/asn1	0.087s
ok  	encoding/base32	0.252s
ok  	encoding/base64	0.097s
ok  	encoding/binary	0.064s
ok  	encoding/csv	0.118s
ok  	encoding/gob	35.954s
ok  	encoding/hex	0.061s
ok  	encoding/json	20.127s
ok  	encoding/pem	4.550s
ok  	encoding/xml	81.810s
ok  	errors	0.131s
ok  	expvar	0.123s
ok  	flag	0.428s
ok  	fmt	1.247s
ok  	go/ast	0.054s
ok  	go/build	9.312s
ok  	go/build/constraint	0.072s
ok  	go/constant	0.083s
ok  	go/doc	0.522s
ok  	go/doc/comment	3.843s
ok  	go/format	0.089s
ok  	go/importer	0.947s
ok  	go/internal/gccgoimporter	0.114s
ok  	go/internal/gcimporter	84.599s
ok  	go/internal/srcimporter	31.265s
ok  	go/parser	93.094s
ok  	go/printer	3.377s
ok  	go/scanner	0.080s
ok  	go/token	0.317s
ok  	go/types	254.439s
ok  	hash	0.043s
ok  	hash/adler32	0.159s
ok  	hash/crc32	0.164s
ok  	hash/crc64	0.085s
ok  	hash/fnv	0.064s
ok  	hash/maphash	0.053s
ok  	html	0.073s
ok  	html/template	1.341s
ok  	image	3.077s
ok  	image/color	0.112s
ok  	image/draw	1.355s
ok  	image/gif	197.721s
ok  	image/jpeg	47.669s
ok  	image/png	11.062s
2023/04/26 15:54:16 testing failed: Command go/src/race.bash [] failed: rpc error: code = Canceled desc = context canceled
2023/04/26 15:54:16 ======== Trybot Attempt 2 of 3 ========
2023/04/26 15:54:16 Creating buildlet linux-amd64-longtest-race.
2023/04/26 15:54:16 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled
2023/04/26 15:54:17 ======== Trybot Attempt 3 of 3 ========
2023/04/26 15:54:17 Creating buildlet linux-amd64-longtest-race.
2023/04/26 15:54:17 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled
2023/04/26 15:54:17 Advisory TryBot timed out or was canceled
{"Name": "linux-amd64-longtest-race", "Passed": false}
Timeout after one slow buildlet creation
2023/04/26 15:32:06 ======== Trybot Attempt 1 of 3 ========
2023/04/26 15:32:06 Creating buildlet darwin-amd64-12_0.
2023/04/26 15:42:06 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled
2023/04/26 15:42:06 ======== Trybot Attempt 2 of 3 ========
2023/04/26 15:42:06 Creating buildlet darwin-amd64-12_0.
2023/04/26 15:42:06 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled
2023/04/26 15:42:07 ======== Trybot Attempt 3 of 3 ========
2023/04/26 15:42:07 Creating buildlet darwin-amd64-12_0.
2023/04/26 15:42:07 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled
2023/04/26 15:42:07 Advisory TryBot timed out or was canceled
{"Name": "darwin-amd64-12_0", "Passed": false}

There are at least two areas for improvement visible.

First, it timed out because the single "index/suffixarray" tests happen to sometimes take longer than the default watchdog delay (currently 10 minutes), and so the watchdog mechanism stopped it. This can be avoided by injecting our own watchdog reset every 5 minutes as we do in some other tasks that run for a long time. (Increasing the watchdog delay is also a viable workaround.)

Second, the trybot attempts 2 and 3 don't end up actually trying because they start out with a canceled context. So there's effectively only one attempt. Issue #60444 may be relevant to this.

Filing for tracking purposes; it's not a high priority. CC @golang/release.

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 25, 2023
@dmitshur dmitshur added this to the Unreleased milestone May 25, 2023
@dmitshur dmitshur moved this to Planned in Go Release May 30, 2023
@dmitshur
Copy link
Contributor Author

CC @prattmic.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/503516 mentions this issue: internal/releasetargets: revert "set LongTestBuilder for darwin-amd64"

gopherbot pushed a commit to golang/build that referenced this issue Jun 15, 2023
This reverts CL 503758.

This newly added longtest builder isn't as fast as the others, and ends
up running into the same problem as advisory trybots did in issue 60443,
that there may be periods without logging that exceed the watchdog limit
of 10 mins.

Rather than bumping the watchdog timer to a value we can't predict would
be enough, revert this and try it again later.

For golang/go#60443.

Change-Id: I42e9eaee8643cc2eb2e077bc3010663b62b94f2b
Reviewed-on: https://go-review.googlesource.com/c/build/+/503516
Reviewed-by: Heschi Kreinick <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/504523 mentions this issue: internal/{task,workflow}: scale watchdog delay by GO_TEST_TIMEOUT_SCALE

gopherbot pushed a commit to golang/build that referenced this issue Jun 22, 2023
Longtest builders are known to apply a scale factor on the default
go test timeout (10 minutes). Do the same in relui.

For golang/go#60443.

Change-Id: If0e7c21fdf7916745ce65db0c73a730091a506f7
Reviewed-on: https://go-review.googlesource.com/c/build/+/504523
Auto-Submit: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
@dmitshur
Copy link
Contributor Author

dmitshur commented Nov 9, 2023

First, it timed out because the single "index/suffixarray" tests happen to sometimes take longer [...] (Increasing the watchdog delay is also a viable workaround.)

It's interesting to check if this case still happens with any regularity by now. It might not be since we started scaling the watchdog delay on longtest builders (CL 504523).

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/541380 mentions this issue: internal/relui: simplify advisory tests and don't accept timeouts

gopherbot pushed a commit to golang/build that referenced this issue Nov 16, 2023
Give up on distinguishing between infrastructure failures and test
failures. At the end of the day someone has to look at the errors either
way.

At the same time, don't treat context errors as successes. We can't
afford to assume that timeouts and stuff aren't related to the tests
we're running. Even if we could, we'd be missing coverage.

For golang/go#63147.
For golang/go#60443.

Change-Id: I5fe78616b4c8f85b6b1e901b509d291884cf9a27
Reviewed-on: https://go-review.googlesource.com/c/build/+/541380
Auto-Submit: Heschi Kreinick <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Planned
Development

No branches or pull requests

2 participants