x/build/cmd/coordinator: transparent TryBot retries mask stalls that should be fixed #42699
Comments
See previously #13026.
@bcmills Is the issue here that there's an unbounded number of transparent retries, or that even a bounded number of retries (say, 3) is considered problematic in this context? (Or is it not known whether these are bounded, and that needs investigation?) If they're unbounded, we should definitely cap them at a fixed number and promote to a fatal error. That should be comparatively easy and high value. If they're already bounded, then improving this situation is more involved, especially if there isn't a good way to differentiate a network communication issue from a build stall.
I would consider even a bounded number of retries to be problematic for the GCE builders. If we've tuned the builders properly, no retries should ever be needed, and even a single retry doubles the latency of its shard. (If there is a network communication issue to the GCE builders that occurs frequently enough to matter, we should diagnose and fix that.) I'm more sympathetic to having a few bounded retries to the reverse builders, since those could reasonably have less-reliable consumer network connections. But even then, I think it would be a better experience to have an explicit SlowBot failure (for which I can remove the missing SlowBot) than to have a whole SlowBot run stalled on some particular builder that has gone AWOL.
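To make the distinction above concrete, here is a minimal Go sketch of the policy being discussed: retry only transient communication errors (e.g. to a reverse builder), cap the retries at a small fixed number, and surface a timed-out shard as a failure instead of silently retrying it. This is not the coordinator's actual code; `runShard`, `errCommFailure`, and the specific cap are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errCommFailure stands in for a transient network error to a reverse builder.
var errCommFailure = errors.New("communication failure with reverse builder")

// runShard is a placeholder for executing one test shard on a builder.
func runShard(ctx context.Context) error {
	// Dial the builder, run the shard, and return its result here.
	return nil
}

// runWithBoundedRetries retries only communication failures, at most maxRetries
// times, and treats a deadline (a stalled shard) as a hard failure immediately.
func runWithBoundedRetries(ctx context.Context, maxRetries int) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = runShard(ctx)
		switch {
		case err == nil:
			return nil
		case errors.Is(err, context.DeadlineExceeded):
			// A stall: report it loudly rather than masking it with a retry.
			return fmt.Errorf("shard stalled (timed out): %w", err)
		case errors.Is(err, errCommFailure):
			// Transient network trouble: a small, bounded retry is tolerable.
			continue
		default:
			return err
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 45*time.Minute)
	defer cancel()
	if err := runWithBoundedRetries(ctx, 2); err != nil {
		fmt.Println("build failed:", err)
	}
}
```

The key design point is that the kind of error decides the response: a deadline means a stall worth surfacing explicitly, while a communication error may justify a capped retry.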
Now that the GopherBot leaves more detailed status comments, the problematic retry behavior is becoming clearer. For example, on https://go.dev/cl/381734 I requested two
This caused a 20-minute SlowBot delay on https://go.dev/cl/391805 PS15. The
I noticed a 1-minute delay on https://go.dev/cl/360354 PS9 due to an early failure on the
I didn't catch the error message in the logs before the builder restarted.
Change https://go.dev/cl/406216 mentions this issue:
We had a lot of flexibility in timeouts, which made them harder to maintain. Consolidate them into a single timeout in the pool package, and change it from 45 minutes to 2 hours. There's room for improvement in how we maintain this timeout, but I'm leaving that for future work (with a tracking issue).

Fixes golang/go#52591.
Updates golang/go#52929.
Updates golang/go#49666.
Updates golang/go#42699.

Change-Id: I2ad92648d89a714397bd8b0e1ec490fc9f6d6790
Reviewed-on: https://go-review.googlesource.com/c/build/+/406216
Run-TryBot: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
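As a rough illustration of what that consolidation means (hypothetical package path, constant, and helper names, not the actual x/build code), the idea is that every caller derives its deadline from one shared value instead of carrying its own timeout:

```go
package pool

import (
	"context"
	"time"
)

// MaxBuildletTimeout is the single, shared cap on how long a build may run.
// The 2-hour value mirrors the figure in the commit message above; the
// constant and helper names are invented for illustration.
const MaxBuildletTimeout = 2 * time.Hour

// withBuildletTimeout derives a context deadline from the shared constant,
// so callers stop maintaining their own, slightly different timeout values.
func withBuildletTimeout(ctx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(ctx, MaxBuildletTimeout)
}
```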
I ran into this most recently while sending a trivial fix in https://go.dev/cl/406897, where a linux-amd64-longtest SlowBot request took almost 2 hours instead of the usual 10-15 minutes, and I spent that waiting time making progress on investigating this issue. I found that the new behavior was introduced during a cmd/coordinator refactor in late 2019, and that it was a bug rather than an intentional change. We also filed issue #36629 in 2020 to investigate what happened, which I've now closed in favor of this issue. I've also opened #52998 for tracking progress on the

Moving this issue to NeedsFix because I don't believe further investigation is needed here: I know where the problem is. I plan to discuss this with the team and will update this issue soon after that.
Change https://go.dev/cl/407555 mentions this issue:
The coordinator relies on Exec reporting that the given timeout was exceeded in order to mark a build as failed instead of retrying it. A refactor resulted in Exec no longer doing that, despite what its documentation promises, so fix that. Also add a test, since evidence shows that catching a regression can be helpful.

For golang/go#42699.
Updates golang/go#35707.

Change-Id: Iacef90b83e7b81fad88a33baa6489d5157e3528f
Reviewed-on: https://go-review.googlesource.com/c/build/+/407555
Reviewed-by: Carlos Amedee <[email protected]>
Reviewed-by: Bryan Mills <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
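For illustration, here is a minimal standard-library sketch (not x/build's actual Exec API; `execWithTimeout` is an invented name) of the property this commit restores: the caller can tell "the command exceeded its timeout" apart from other failures, and can therefore mark the build as failed rather than retry it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// execWithTimeout runs the command with the given timeout and reports whether
// the timeout was the reason the command ended.
func execWithTimeout(timeout time.Duration, name string, args ...string) (timedOut bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err = exec.CommandContext(ctx, name, args...).Run()
	// When the deadline fires, CommandContext kills the process and ctx.Err()
	// reports context.DeadlineExceeded, which the caller can check separately.
	return errors.Is(ctx.Err(), context.DeadlineExceeded), err
}

func main() {
	timedOut, err := execWithTimeout(1*time.Second, "sleep", "5")
	switch {
	case timedOut:
		fmt.Println("command exceeded its timeout: treat as a stalled build, not a retryable error")
	case err != nil:
		fmt.Println("command failed for another reason:", err)
	default:
		fmt.Println("command succeeded")
	}
}
```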
I believe the change in CL 407555 should be sufficient to resolve this issue. If there's any evidence to the contrary, it'd be helpful to know. There were some additional timeout efficiency improvements I was considering and discussed with the team, and though there's agreement that we can make those improvements, they're not necessary for resolving this problem.
Hmm. As far as I can tell, CL 407555 fixes the main source of the failures that led to retries, but it doesn't change the fact that retries would still be attempted (and would still mask stalls) for other issues, such as communication errors with reverse SlowBots.
At the very least, though, I would say that with that CL in, we can probably deprioritize further work on this issue. Reverse-builder SlowBot failures are much rarer than
The LUCI infrastructure doesn't have this problem. For a recent instance and comparison, see here. At this point in the LUCI migration, I don't expect we'll resolve this issue in the old coordinator infrastructure.
Change https://go.dev/cl/586077 mentions this issue:
I noticed a stall on a SlowBot (#39665 (comment)), and when I went to check on it again I noticed that the counter had reset and the builder had restarted.

This behavior closely mimics the js-wasm stall-and-restart behavior reported in #35170, and I wonder if it explains other observed stalls too (probably #39349; perhaps #36482 and/or #32632 as well).

This timeout-and-retry behavior seems to paper over a systemic problem to make it less noticeable, but if it were instead more noticeable we might have a better picture of how and why it occurs, and might be better equipped to fix the underlying problem (and improve the perceived TryBot speed by eliminating the spurious timeouts entirely).
CC @golang/release