x/build: add LUCI openbsd-ppc64 builder #63480

n2vi · 2023-10-10T03:56:30Z

Following the instructions at Dashboard builders:

hostname openbsd-ppc64-n2vi

The CSR is attached after renaming, since GitHub doesn't seem to allow attaching a file with the name openbsd-ppc64-n2vi.csr you asked for.

gopherbot · 2023-10-12T17:00:38Z

Change https://go.dev/cl/534976 mentions this issue: main.star: add openbsd-ppc64, linux-riscv64, freebsd-riscv64 builders

dmitshur · 2023-10-12T17:00:40Z

Thanks. Here's the resulting certificate: openbsd-ppc64-n2vi-1697128325.cert.txt.

I've mailed CLs to define your new builder in LUCI and will comment once that's done.

n2vi · 2023-10-12T18:44:41Z

Thank you; I confirm that using the cert I get a plausible looking luci_machine_tokend/token.json.

Since the list of BUILDER_TYPES is nearly sorted, keep that up, and sort (using 'Sort Lines' in $EDITOR) two of Linux run mods.

For golang/go#63480.
For golang/go#63481.
For golang/go#63482.

Change-Id: Icef633ab7a0d53b5807c2ab4a076d74c291dc0ea
Reviewed-on: https://go-review.googlesource.com/c/build/+/534976
TryBot-Bypass: Dmitri Shuralyov <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>

dmitshur · 2023-10-12T21:25:03Z

Glad to hear!

Step 3 is complete (CL 534976 is submitted), so you should be able to proceed with the next steps. Please feel free to comment here if you see something unexpected, or run into a problem that the documentation doesn't cover. Thanks.

n2vi · 2023-10-13T18:39:45Z

I have not read the code yet to diagnose this; leaving assigned to me.

2023/10/13 18:29:39 Bootstrapping the swarming bot with certificate authentication
2023/10/13 18:29:39 retrieving the luci-machine-token from the token file
2023/10/13 18:29:39 Downloading the swarming bot
2023/10/13 18:29:39 Starting the swarming bot /home/swarming/.swarming/swarming_bot.zip
72354 2023-10-13 18:29:47.331 E: ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
72354 2023-10-13 18:29:48.890 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
72354 2023-10-13 18:29:48.891 E: Failed to contact for handshake, retrying in 0 sec...

n2vi · 2023-10-13T20:35:08Z

I don't see anything in the code or logs here that helps me diagnose this. It just looks like the server didn't like the token.json that had been refreshed just a minute before.

Maybe someone there can check server-side luci logs? Unable to reassign to dmitshur; hope someone there sees this.

dmitshur · 2023-10-13T20:44:36Z

Thanks for the update.

I recall there was a similar looking error in #61666 (comment). We'll take a look.

n2vi · 2023-10-13T20:57:59Z

In case it helps... I set both -token-file-path on the bootstrapswarm command line and also LUCI_MACHINE_TOKEN in the environment. The logs don't indicate any trouble reading the token.json file, though they're not very explicit.

I appreciate that there have been serious security flaws in the past from too-detailed error messages. But I'd venture that it is safe for luci to say more than "403".

I recognize I'm a guinea pig for the Go LUCI stuff, so happy to give you a login on t.n2vi.com if you would find it easier to debug directly or hop on a video call with screen sharing.

Finally, I recognize I'm a newcomer to Go Builders. So it could well be user error here.

dmitshur · 2023-10-13T21:50:40Z

Thanks for your patience as we work through this and smooth out the builder onboarding process.

I set both -token-file-path on the bootstrapswarm command line and also LUCI_MACHINE_TOKEN in the environment.

To confirm, are both of them set to the same value, which is the file path location of the token.json file? If you don't mind experimenting on your side, you can check if anything is different if you leave LUCI_MACHINE_TOKEN unset and instead rely on the default location for your OS (/var/lib/luci_machine_tokend/token.json I believe).

We'll keep looking into this on our side. Though next week we might be somewhat occupied by a team event, so please expect some delays. Thanks again.

n2vi · 2023-10-14T15:40:30Z

Yes, both are set to the same value /home/luci/luci_machine_tokend/token.json. (My OS doesn't have /var/lib and anyway not a fan of leaving cleartext credentials in obscure corners of the filesystem.)

This morning I've retried the same invocation of bootstrapswarm as before and don't get the 403 Client Error. So maybe there was just a transient issue.

Happy to set this effort on the shelf for a week or two; enjoy the team event!

dmitshur · 2023-10-24T20:27:41Z

CC @golang/release.

n2vi · 2023-11-03T00:59:39Z

Over the last week I tried swarm a few more times with no problems, so whatever issue I saw before indeed seems transient. I never saw swarm do any actual work, presumably because some server-side table is still pointing to my machine as in the old-builder state rather than new-builder. Fine by me.

I'll have limited ability to work on it from November 8 - 20, but happy to work on it during the next few days if you're waiting on me.

dmitshur · 2023-12-07T22:54:01Z

The builder is currently in a "Quarantined—Had 6 consecutive BOT_DIED tasks" state. @n2vi Can you please restart the swarming bot on your side and see if that's enough to get it out of that state?

We've applied changes on our side (e.g., CL 546715) that should help avoid this repeating, but it's possible more work will be needed. Let's see what happens after you restart it next time. Thanks.

dmitshur · 2024-05-13T17:44:12Z

Thanks. I think you should let the LUCI version of the builder run for some time, and when it seems stable, feel free to stop the coordinator instance on your side to free up the resources. The only reason to keep the coordinator instance is if you're not quite ready to switch yet, but it needs to happen at some point since the coordinator will be going away.

I'll update CL 585217 to give it a timeout scale for now, especially since it's running builds for both LUCI and coordinator, and we can adjust it later on as it becomes more clear what the optimal value is.

n2vi · 2024-05-13T18:16:10Z

As of 18:10 UTC, rebooted openbsd-ppc64-n2vi with datasize-max=8192M for swarming. If 8GB of RAM is not enough we have other problems.
Did not restart gopher buildlet yet. Let's see how high it ramps up with nothing but swarming.

n2vi · 2024-05-13T20:25:39Z

This eventually panic'd the kernel with an allocation failure.
Restarting now (20:22 UTC) to see how reproducible this is.

{But the tests are not automatically restarting. "Retry Build" button on the Builder Dashboard is gray'd out for me; perhaps someone there can kick it?}

The port wasn't added until Go 1.22, so no need to test it with Go 1.21.

Also set a timeout scale factor of 2 for now, while the LUCI builder is running alongside the coordinator builder on the same hardware. This is fine to adjust later as it becomes more clear what the optimal value is.

For golang/go#63480.
For golang/go#56001.

Change-Id: I707ffe7d15afa6a70d6d8789f959a5835259df3f
Reviewed-on: https://go-review.googlesource.com/c/build/+/585217
Reviewed-by: Cherry Mui <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>

n2vi · 2024-05-15T16:31:12Z

Still wasn't seeing anything running, so killed off the python and bootstrapswarm processes and restarted.
They just report status code 401 and immediately exit. I'm not illuminated by looking at the Ended Builds list either.

dmitshur · 2024-05-15T18:22:22Z

Thanks for working on this.

I'm not illuminated by looking at the Ended Builds list either.

I failed to realize this sooner, but our configuration intends to make it possible for you to see the machine pool (see "poolViewer" granted to group "all" here).

I believe it currently requires you to sign in (any account will work), then you can view the contents of the "Machine Pool" links such as https://chromium-swarm.appspot.com/botlist?f=cipd_platform%3Aopenbsd-ppc64&f=pool%3Aluci.golang.shared-workers. You should see something like this:

And clicking on the bot name will take you to https://chromium-swarm.appspot.com/bot?id=openbsd-ppc64-n2vi where you'll find more information about its current state from LUCI's perspective. Apologies about the additional overhead at this time to get to this information.

Since you've done some restarts, it might help to confirm that the luci_machine_tokend process is still working as described in step 4 of https://go.dev/wiki/DashboardBuilders#how-to-set-up-a-builder-1, and that the token file it writes to has new content, which is propagated to bootstrapswarm.

If that isn't where the problem is, is there more information included in the status code 401 message, beyond "Downloading the swarming bot" and "status code 401"? Also, is there more useful context in the local swarming bot log?
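One way to run the token-file check suggested above is to look at the file's size and modification time. This is only a minimal sketch: the helper name `token_looks_fresh` and the one-hour threshold are assumptions for illustration, not values documented by LUCI, and the path is the one mentioned earlier in this thread.

```python
import os
import time


def token_looks_fresh(path, max_age_seconds=3600, now=None):
    """Report whether the file exists, is non-empty, and was rewritten recently.

    max_age_seconds is an assumed threshold, not a documented LUCI value.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    if st.st_size == 0:
        return False
    if now is None:
        now = time.time()
    return (now - st.st_mtime) <= max_age_seconds


if __name__ == "__main__":
    # Path taken from this thread; adjust for your own setup.
    path = "/home/luci/luci_machine_tokend/token.json"
    print(path, "fresh" if token_looks_fresh(path) else "stale or missing")
```

A stale or missing result would point at luci_machine_tokend rather than at bootstrapswarm. A fresh file does not prove the server accepts the token, only that it is being rewritten.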

n2vi · 2024-05-15T19:17:28Z

sign in (any account will work)
Thanks, that was a crucial clue. I'd tried signing in before, but was put off by the "grant write access to all your git repositories" warning. Doing it with a less powerful account is fine.

Now that the bot is getting work again we'll see if we can reproduce the pagedaemon kernel panic. Not that I'm a kernel developer by any means, but gotta learn sometime! I recognize that this is a sufficiently unusual platform and workload that it is not inconceivable that we step on a new corner case.

n2vi · 2024-05-15T22:08:13Z

No kernel crashes yet, just running all the way to Failure. :)

I'm still trying to understand more about the build output, in particular the details of what "resource temporarily unavailable" means specifically. Is it running into a user process limit for forking? The login.conf here sets maxproc-max=256, maxproc-cur=128. Do the tests need more processes than that?
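For reference, "resource temporarily unavailable" is the message text for EAGAIN, which fork(2) returns when the per-user process limit is reached. The sketch below, offered as an illustration rather than a diagnosis, prints the limits the current process actually inherited. As far as I understand, the login.conf maxproc-cur and maxproc-max values correspond to the soft and hard RLIMIT_NPROC values shown here. The helper names are hypothetical.

```python
import errno
import os
import resource


def describe_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def nproc_limits():
    """Return the (soft, hard) per-user process limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("RLIMIT_NPROC soft:", describe_limit(soft), "hard:", describe_limit(hard))
    # fork(2) fails with EAGAIN when the process limit is hit.
    print("EAGAIN:", errno.EAGAIN, os.strerror(errno.EAGAIN))
```

Running this as the swarming user, from the same environment that launches the bot, shows whether the login class limits were actually applied to that session.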

One probably unrelated item caught my eye: /var/log/secure reports

May 15 20:48:40 t doas: command not permitted for swarming: chmod 0777 /home/swarming/.swarming/w/ir/x/t/go-build3552889220

All those files are already owned by user "swarming" so why would the software be trying to become root?
I do recall seeing (and being horrified by) all.bash trying to become root. That's when I switched to only running make.bash on most of my machines. It is ok here on t.n2vi.net=openbsd-ppc64-n2vi for you to be root if you have to; I'm assuming arbitrarily bad stuff may happen when running a builder machine. Just let me know if you really need it.

n2vi · 2024-05-16T13:59:23Z

Overnight, we captured another kernel panic that closely resembles the earlier one. I'll get back to you when I make progress on this; may be quite a while. LUCI appropriately marks me as offline for the duration.

n2vi · 2024-05-17T23:47:52Z

status update; no need to respond...

Found a recent patch to openbsd powerpc64 pagedaemon pmac.c that may be relevant, so upgraded t.n2vi.net from -stable to -snapshot.

Now the previously-ok luci_machine_tokend dumps core with a pinsyscalls error on the console, so rebuilt with the nineteen-line install sequence from https://pkg.go.dev/go.chromium.org/luci and a freshly compiled go1.22.3. This now seems to be generating a new token.json ok.

Rebuilt and restarted bootstrapswarm. The LUCI Builders dashboard shows the machine now as Idle; based on past experience, in an hour or two it will actually start delivering work without further attention. I'll periodically monitor to be sure that happens, and then over the next couple days we'll see if the kernel panic re-occurs.

n2vi · 2024-05-18T23:58:09Z

I do suspect we're stepping on a pagedaemon bug that occasionally crashes the machine, but it is getting enough LUCI work done that perhaps Gophers can make their own independent progress while I pursue the OpenBSD issue.


n2vi · 2024-05-20T20:05:39Z

Restarted swarm with twice the process ulimit. Let's see if that reduces the number of fork/exec failures.
[for the record: maxproc-max=512, maxproc-cur=256 suffices]

No recent kernel crashes.

n2vi · 2024-05-23T16:02:00Z

My builder machine is fine, no crashes, but I see that the dashboard thinks it is offline. Here is a tail -50 nohup.out. I believe the ball is back in your court...

Traceback (most recent call last):
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1831, in rbe_poll
    self._rbe_session = remote_client.RBESession(
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 901, in __init__
    resp = remote.rbe_create_session(dimensions, bot_version,
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 587, in rbe_create_session
    raise RBEServerError('Failed to create RBE session, see bot logs')
bot_code.remote_client_errors.RBEServerError: Failed to create RBE session, see bot logs
38455 2024-05-22 18:26:05.889 E: Unable to open given url, https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create, after 1 attempts or 240 timeout.
429 Client Error: Too Many Requests for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create
----------
Alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Content-length: 284
Content-type: text/plain; charset=utf-8
Date: Wed, 22 May 2024 18:26:05 GMT
Server: Google Frontend

rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Create Bot Session requests per project' and limit 'Create Bot Session requests per project per minute per region' of service 'remotebuildexecution.googleapis.com' for consumer 'project_number:575346572923'.

----------
38455 2024-05-22 18:26:05.889 E: Failed to open RBE Session: Failed to create RBE session, see bot logs
Traceback (most recent call last):
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1831, in rbe_poll
    self._rbe_session = remote_client.RBESession(
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 901, in __init__
    resp = remote.rbe_create_session(dimensions, bot_version,
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 587, in rbe_create_session
    raise RBEServerError('Failed to create RBE session, see bot logs')
bot_code.remote_client_errors.RBEServerError: Failed to create RBE session, see bot logs
38455 2024-05-22 18:26:09.744 E: Unable to open given url, https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create, after 1 attempts or 240 timeout.
429 Client Error: Too Many Requests for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create
----------
Alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Content-length: 284
Content-type: text/plain; charset=utf-8
Date: Wed, 22 May 2024 18:26:09 GMT
Server: Google Frontend

rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Create Bot Session requests per project' and limit 'Create Bot Session requests per project per minute per region' of service 'remotebuildexecution.googleapis.com' for consumer 'project_number:575346572923'.

----------
38455 2024-05-22 18:26:09.744 E: Failed to open RBE Session: Failed to create RBE session, see bot logs
Traceback (most recent call last):
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1831, in rbe_poll
    self._rbe_session = remote_client.RBESession(
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 901, in __init__
    resp = remote.rbe_create_session(dimensions, bot_version,
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 587, in rbe_create_session
    raise RBEServerError('Failed to create RBE session, see bot logs')
bot_code.remote_client_errors.RBEServerError: Failed to create RBE session, see bot logs
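The log above shows the session-create request being retried within a few seconds of each 429. For a quota error like this, a client would usually wait longer between attempts. The following is a generic sketch of capped exponential backoff with jitter, not the swarming bot's actual retry logic. All names and default values are assumptions for illustration.

```python
import random
import time


def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0):
    """Return the un-jittered delay before each retry, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_backoff(fn, is_retryable, attempts=5, sleep=time.sleep, rng=random.random):
    """Call fn(); on a retryable exception, wait with jitter and try again.

    Makes up to `attempts` retries after the first call, then lets the
    final exception propagate.
    """
    for delay in backoff_delays(attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc):
                raise
            # Full jitter: sleep a random fraction of the scheduled delay.
            sleep(delay * rng())
    return fn()
```

Passing `sleep` and `rng` as parameters keeps the helper testable without real waiting. In real use `is_retryable` would check for an HTTP 429 or a ResourceExhausted status.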

dmitshur · 2024-05-30T03:06:35Z

The error message above includes "quota exceeded". It seems to have been temporary. Looking at https://ci.chromium.org/ui/p/golang/g/port-openbsd-ppc64/builders, the builder seems to be stable and passing in the main Go repo and all golang.org/x repos. Congratulations on reaching this point!

Would you like to remove its known issue as the next step?

n2vi · 2024-05-30T16:45:59Z

We got another pager daemon kernel crash last night. I'm glad we're getting substantial test runs done, but we're not out of the woods yet.

n2vi · 2024-06-05T20:14:32Z

I see a "context deadline exceeded" failure in the latest build. Not sure how to interpret that, but FYI as part of debugging the kernel crashes I've changed some kernel memory barriers that possibly slow page mapping changes a bit. I don't expect any large impact on system speed overall, but I'm unsure.

n2vi · 2024-06-10T18:30:22Z

I've been able to reproduce a kernel panic without anything involving Go, so will be pursuing that and temporarily not running swarm. I'll update here when we've made progress with the kernel.

gopherbot · 2024-06-20T19:06:32Z

Change https://go.dev/cl/593736 mentions this issue: main.star: set openbsd-ppc64 timeout scale to 3

Move the timeout scale closer to what's used by openbsd-riscv64 now. This was suggested by Eric who looked at their relative performance.

For golang/go#63480.

Change-Id: I1f28dd183c20b9b41c807296b5624ba0dcb10bee
Co-authored-by: Eric Grosse <[email protected]>
Reviewed-on: https://go-review.googlesource.com/c/build/+/593736
Auto-Submit: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Eric Grosse <[email protected]>

n2vi · 2024-07-06T15:23:42Z

Although the kernel issue is not fully solved, I'm satisfied that it is sufficiently understood and being worked on in Mac Studio locking bug. I now regard the LUCI migration as complete for openbsd-ppc64 builder and am no longer running the buildlet there. As long as we keep the machine load at a reasonable level, we're rarely triggering the kernel lock issue.

@dmitshur Thanks again for all your help with this. You may remove knownissue for me if you like. I would submit a CL myself except that the machine with my GitHub login is too locked down to import all the luciconfig toolchain. I'll think about how to get around that eventually.

gopherbot · 2024-07-07T03:49:34Z

Change https://go.dev/cl/596817 mentions this issue: main.star: unset known issue for openbsd/ppc64 builder type

n2vi · 2024-07-08T00:58:57Z

I regret to say that my comment seems to have jinxed things. After the change, openbsd-ppc64 builder is crashing more frequently again.

Anyway, let's leave things be for a couple weeks while y'all are at GopherCon and OpenBSD works on locks.

dmitshur · 2024-08-01T14:36:01Z

Following up here as some time has passed and the builder appears to have been doing well recently. @n2vi Okay to submit CL 596817 to mark this builder as complete? (If desired, it's always possible to open a new, more narrow known issue.)

n2vi · 2024-08-01T15:45:01Z

Sure.

As my system kernel friends say, multicore MMU is an art. I believe there are remaining bugs encountered when under high load, but I reboot the server when needed.

The builder has reached a point where it's considered added.

Fixes golang/go#63480.

Change-Id: I82985686fa1ac0f00d46c2b49fd8e2fc187fc5fa
Reviewed-on: https://go-review.googlesource.com/c/build/+/596817
LUCI-TryBot-Result: Go LUCI <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
Reviewed-by: Eric Grosse <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>

gopherbot · 2024-08-01T16:11:29Z

Closed by merging CL 596817 (commit golang/build@4a73433) to luci-config.

gopherbot added the Builders x/build issues (builders, bots, dashboards) label Oct 10, 2023

gopherbot added this to the Unreleased milestone Oct 10, 2023

n2vi mentioned this issue Oct 10, 2023

x/build: Migrate builders to LUCI #63471


dmitshur added this to Go Release Oct 10, 2023

dmitshur added the new-builder label Oct 10, 2023

dmitshur moved this to In Progress in Go Release Oct 10, 2023

dmitshur self-assigned this Oct 11, 2023

dmitshur assigned n2vi and unassigned dmitshur Oct 12, 2023

dmitshur unassigned n2vi Oct 13, 2023

cagedmantis added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 16, 2023

cagedmantis assigned dr2chase Oct 24, 2023

dmitshur assigned dmitshur and unassigned dr2chase Nov 28, 2023

dmitshur assigned n2vi and unassigned dmitshur Dec 7, 2023

dmitshur added the FixPending Issues that have a fix which has not yet been reviewed or submitted. label Jul 7, 2024

dmitshur assigned n2vi Jul 7, 2024

gopherbot closed this as completed Aug 1, 2024

github-project-automation bot moved this from In Progress to Done in Go Release Aug 1, 2024
