x/build/cmd/coordinator: transparent TryBot retries mask stalls that should be fixed #42699
Comments
See previously #13026.
@bcmills Is the issue here that there's an unbounded number of transparent retries, or that even a bounded number of retries (say, 3) is considered problematic in this context? (Or is it not known whether these are bounded, and that needs investigation?) If they're unbounded, we should definitely cap them at a fixed number and promote to a fatal error. That should be comparatively easy and high value. If they're already bounded, then improving this situation is more involved, especially if there isn't a good way to differentiate a network communication issue from a build stall.
I would consider even a bounded number of retries to be problematic for the GCE builders. If we've tuned the builders properly, no retries should ever be needed, and even a single retry doubles the latency of its shard. (If there is a network communication issue to the GCE builders that occurs frequently enough to matter, we should diagnose and fix that.) I'm more sympathetic to having a few bounded retries to the reverse builders, since those could reasonably have less-reliable consumer network connections. But even then, I think it would be a better experience to have an explicit SlowBot failure (for which I can remove the missing SlowBot) than to have a whole SlowBot run stalled on some particular builder that has gone AWOL.
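To make the distinction above concrete, here is a minimal Go sketch of the policy being discussed: retry only transient communication errors (e.g. to a reverse builder), cap the retries at a small fixed number, and surface a timed-out shard as a failure instead of silently retrying it. This is not the coordinator's actual code; `runShard`, `errCommFailure`, and the specific cap are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errCommFailure stands in for a transient network error to a reverse builder.
var errCommFailure = errors.New("communication failure with reverse builder")

// runShard is a placeholder for executing one test shard on a builder.
func runShard(ctx context.Context) error {
	// Dial the builder, run the shard, and return its result here.
	return nil
}

// runWithBoundedRetries retries only communication failures, at most maxRetries
// times, and treats a deadline (a stalled shard) as a hard failure immediately.
func runWithBoundedRetries(ctx context.Context, maxRetries int) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = runShard(ctx)
		switch {
		case err == nil:
			return nil
		case errors.Is(err, context.DeadlineExceeded):
			// A stall: report it loudly rather than masking it with a retry.
			return fmt.Errorf("shard stalled (timed out): %w", err)
		case errors.Is(err, errCommFailure):
			// Transient network trouble: a small, bounded retry is tolerable.
			continue
		default:
			return err
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 45*time.Minute)
	defer cancel()
	if err := runWithBoundedRetries(ctx, 2); err != nil {
		fmt.Println("build failed:", err)
	}
}
```

The key design point is that the kind of error decides the response: a deadline means a stall worth surfacing explicitly, while a communication error may justify a capped retry.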
Now that the GopherBot leaves more detailed status comments, the problematic retry behavior is becoming clearer. For example, on https://go.dev/cl/381734 I requested two
This caused a 20-minute SlowBot delay on https://go.dev/cl/391805 PS15. The
I noticed a 1-minute delay on https://go.dev/cl/360354 PS9 due to an early failure on the
I didn't catch the error message in the logs before the builder restarted.
Change https://go.dev/cl/406216 mentions this issue:
We had a lot of flexibility in timeouts, which made them harder to maintain. Consolidate them into a single timeout in the pool package, and change it from 45 minutes to 2 hours. There's room for improvement in how we maintain this timeout, but I'm leaving that for future work (with a tracking issue).

Fixes golang/go#52591.
Updates golang/go#52929.
Updates golang/go#49666.
Updates golang/go#42699.

Change-Id: I2ad92648d89a714397bd8b0e1ec490fc9f6d6790
Reviewed-on: https://go-review.googlesource.com/c/build/+/406216
Run-TryBot: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
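As a rough illustration of what that consolidation means (hypothetical package path, constant, and helper names, not the actual x/build code), the idea is that every caller derives its deadline from one shared value instead of carrying its own timeout:

```go
package pool

import (
	"context"
	"time"
)

// MaxBuildletTimeout is the single, shared cap on how long a build may run.
// The 2-hour value mirrors the figure in the commit message above; the
// constant and helper names are invented for illustration.
const MaxBuildletTimeout = 2 * time.Hour

// withBuildletTimeout derives a context deadline from the shared constant,
// so callers stop maintaining their own, slightly different timeout values.
func withBuildletTimeout(ctx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(ctx, MaxBuildletTimeout)
}
```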
I ran into this most recently while sending a trivial fix in https://go.dev/cl/406897, where a linux-amd64-longtest SlowBot request took almost 2 hours instead of the usual 10-15 minutes, and I spent that waiting time making progress on investigating this issue. I found that the new behavior was introduced during a cmd/coordinator refactor in late 2019, and that it was a bug rather than an intentional change. We also filed issue #36629 in 2020 to investigate what happened, which I've now closed in favor of this issue. I've also opened #52998 for tracking progress on the

Moving this issue to NeedsFix because I don't believe further investigation is needed here: I know where the problem is. I plan to discuss this with the team and will update this issue soon after that.
Change https://go.dev/cl/407555 mentions this issue:
The coordinator relies on Exec reporting that the given timeout was exceeded in order to mark a build as failed instead of retrying it. A refactor resulted in Exec no longer doing that, despite what its documentation promises, so fix that. Also add a test, since evidence shows that catching a regression can be helpful.

For golang/go#42699.
Updates golang/go#35707.

Change-Id: Iacef90b83e7b81fad88a33baa6489d5157e3528f
Reviewed-on: https://go-review.googlesource.com/c/build/+/407555
Reviewed-by: Carlos Amedee <[email protected]>
Reviewed-by: Bryan Mills <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
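For illustration, here is a minimal standard-library sketch (not x/build's actual Exec API; `execWithTimeout` is an invented name) of the property this commit restores: the caller can tell "the command exceeded its timeout" apart from other failures, and can therefore mark the build as failed rather than retry it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// execWithTimeout runs the command with the given timeout and reports whether
// the timeout was the reason the command ended.
func execWithTimeout(timeout time.Duration, name string, args ...string) (timedOut bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err = exec.CommandContext(ctx, name, args...).Run()
	// When the deadline fires, CommandContext kills the process and ctx.Err()
	// reports context.DeadlineExceeded, which the caller can check separately.
	return errors.Is(ctx.Err(), context.DeadlineExceeded), err
}

func main() {
	timedOut, err := execWithTimeout(1*time.Second, "sleep", "5")
	switch {
	case timedOut:
		fmt.Println("command exceeded its timeout: treat as a stalled build, not a retryable error")
	case err != nil:
		fmt.Println("command failed for another reason:", err)
	default:
		fmt.Println("command succeeded")
	}
}
```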
I believe the change in CL 407555 should be sufficient to resolve this issue. If there's any evidence to the contrary, it'd be helpful to know. There were some additional timeout efficiency improvements I was considering and discussed with the team, and though there's agreement that we can make those improvements, they're not necessary for resolving this problem.
Hmm. As far as I can tell, CL 407555 fixes the main source of the failures that led to retries, but it doesn't change the fact that retries would still be attempted (and would still mask stalls) for other issues, such as communication errors with reverse SlowBots.
At the very least, though, I would say that with that CL in, we can probably deprioritize further work on this issue. Reverse-builder SlowBot failures are much rarer than
The LUCI infrastructure doesn't have this problem. For a recent instance and comparison, see here. At this point in the LUCI migration, I don't expect we'll resolve this issue in the old coordinator infrastructure.
Change https://go.dev/cl/586077 mentions this issue:
I noticed a stall on a SlowBot (#39665 (comment)), and when I went to check on it again I noticed that the counter had reset and the builder had restarted.

This behavior closely mimics the js-wasm stall-and-restart behavior reported in #35170, and I wonder if it explains other observed stalls too (probably #39349; perhaps #36482 and/or #32632 as well).

This timeout-and-retry behavior seems to paper over a systemic problem to make it less noticeable, but if it were instead more noticeable we might have a better picture of how and why it occurs, and might be better equipped to fix the underlying problem (and improve the perceived TryBot speed by eliminating the spurious timeouts entirely).
CC @golang/release