x/build: frequent "communication error to buildlet" failures on plan9-arm
#52677
Comments
I've reconfigured the plan9-arm cluster with a filesystem version which I hope is more stable. Looking at the local log for the most recent of these failures, I see it ends this way:
It seems that a Wouldn't it be better if the buildlet sent back an explicit error to the coordinator in such cases, so the test run could be retried?
Hrm. Is there a way to set that timeout to something longer?
IMO it would not be appropriate to retry the test: if it times out on one run, what's to stop it from timing out on the next one? Moreover, if we do 20 minutes of work and then time out and retry, we've just wasted 20 minutes of builder time that could have been put to more productive use. (#42699 is closely related.)
Bad idea, I think. The test of {fmt, go/ast, go/build} on plan9-arm normally takes about 50 seconds. If it's timing out after 20 minutes, that's not just slow, it's stalled. Lengthening the timeout would just waste more time.
Whenever I do a manual retry using the I will set up a process on my local builders to monitor progress on the log output file. If nothing is emitted for, say, 15 minutes, it will send an alert so I can go in with the debugger and try to find out what's stalled.
I've found a likely cause: an assertion failure in the Plan 9 filesystem. While working on diagnosing that, I can try making it retry instead of halting, so it won't stall the builder.
Is the retry in place? The builder seems a little more stable, but there's still a recent one of these.
There was another failure mode: one of the Raspberry Pi builders had only 1GB of RAM and no swap configured. I've added some swap space so it should be more stable now.
One more of these after the swap change:
I'd like to propose excusing the plan9-arm builders from the But the plan9-arm builders don't run on a virtual resource, but on real hardware. That hardware is not particularly enterprise-quality: a cluster of Raspberry Pi boards sharing a power supply. Sometimes a glitch of the hardware or the Plan 9 filesystem causes a builder to crash or reboot, leading to this "communication error" failure. In my experience, retrying a test after such a failure will invariably succeed. Therefore this check is not saving resources by preventing a "retry forever", but just acting as a nuisance preventing a successful automatic retry. If this is acceptable, I'll submit a CL to x/build/cmd/coordinator to remove plan9-arm from this check.
Found new dashboard test flakes for:
2022-12-01 21:00 plan9-arm go@93587d35 (log)
2022-12-08 18:29 plan9-arm go@7973b0e5 (log)
Found new dashboard test flakes for:
2023-01-17 18:21 plan9-arm go@9088c691 (log)
2023-01-17 19:53 plan9-arm go@526b8956 (log)
Found new dashboard test flakes for:
2023-02-01 21:30 plan9-arm go@cda461bb (log)
Change https://go.dev/cl/470355 mentions this issue:
Found new dashboard test flakes for:
2023-02-22 21:40 plan9-arm go@06b67591 (log)
2023-02-22 23:19 plan9-arm go@e7cfcda6 (log)
This is probably pending a redeploy of
Found new dashboard test flakes for:
2023-02-28 01:11 plan9-arm go@7a0799b2 (log)
Has the
Found new dashboard test flakes for:
2023-01-31 19:45 plan9-arm go@780db9a6 (log)
I think so, yes.
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
greplogs --dashboard -md -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-01-01
2022-05-02T14:54:05-349cc83/plan9-arm
2022-04-27T14:23:28-f0c0e0f/plan9-arm
2022-04-26T02:28:58-17d7983/plan9-arm
2022-04-11T16:31:53-0179331/plan9-arm
2022-04-07T23:06:24-c451a02/plan9-arm
2022-04-05T14:15:59-62bceae/plan9-arm
2022-03-31T05:34:15-2b8178c/plan9-arm
2022-03-31T00:27:01-0a6ddcc/plan9-arm
2022-03-31T00:26:58-0775730/plan9-arm
2022-03-30T01:12:57-8fefeab/plan9-arm
2022-03-21T19:10:16-efbff6e/plan9-arm
2022-03-07T18:17:40-dcb6547/plan9-arm
2022-03-03T21:19:37-87a345c/plan9-arm
2022-03-01T19:32:51-44e92e1/plan9-arm
2022-02-25T00:25:34-b8b3196/plan9-arm
2022-02-01T18:15:07-125c5a3/plan9-arm
2022-01-27T21:25:18-ad345c2/plan9-arm
2022-01-19T16:33:11-985d97e/plan9-arm
2022-01-10T22:49:07-4ceb5a9/plan9-arm
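For readers unfamiliar with the query, the expression passed to `-e` above can be tried on its own in Go. This is a rough sketch that assumes greplogs hands the pattern to Go's `regexp` package unchanged. The sample log texts and the helper name `isBuildletCommError` are made up for illustration and are not taken from any real builder log.

```go
package main

import (
	"fmt"
	"regexp"
)

// failurePattern is the expression from the greplogs command above:
// the log must begin with "plan9-arm", and the phrase
// "communication error to buildlet" must appear on that line or a later one.
var failurePattern = regexp.MustCompile(`\Aplan9-arm.*(\n.*)*communication error to buildlet`)

// isBuildletCommError reports whether a whole log matches the pattern.
func isBuildletCommError(log string) bool {
	return failurePattern.MatchString(log)
}

func main() {
	// Hypothetical log contents, for illustration only.
	logs := []string{
		"plan9-arm at 0000000\nbuilding...\ncommunication error to buildlet\n",
		"linux-amd64 at 0000000\ncommunication error to buildlet\n",
		"plan9-arm at 0000000\nok\n",
	}
	for _, l := range logs {
		fmt.Println(isBuildletCommError(l))
	}
}
```

The `\A` anchor ties the builder name to the very start of the log, and `(\n.*)*` lets the match cross line boundaries, since `.` does not match a newline by default in Go regular expressions.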
@millerresearch, can something be done to prevent this builder from getting wedged?
(Compare #49756.)