x/build: LUCI support for plan9 #62025
According to the person who did the python2.7 Plan 9 port, python3 has more dependencies on GCC or Clang (both unavailable on Plan 9), so getting that working would not be trivial. As for virtualisation, I can only speak for plan9-arm; maybe @0intro can comment about other platforms. The plan9-arm builders are real hardware (currently a cluster of Raspberry Pi 4 boards); to my knowledge Plan 9 hasn't been ported to qemu for ARM. In any case, I don't think testing Go in a virtual machine is the same as testing on real hardware. The most interesting bugs are usually the ones that depend on timing or caches or other nuances which are likely to be simplified away by emulation.

Do the tasks run by the LUCI bot need to interact with each other (apart from through the shared file system)? If not, it should be feasible to add a Linux machine to the plan9-arm cluster to run LUCI bots, which would execute tasks remotely on a Plan 9 machine (sharing the bot's file system) in place of local execution as a subprocess.

How is testing going to be done for things like Android and iOS, which presumably aren't able to run a python3 bot?
Adding a Linux machine that sends remote commands to a Plan 9 instance sounds like a reasonable architecture to me.
Yeah, that sounds viable if you're willing to maintain the extra hop. We'll have to add some kind of interceptor logic to our build script, but that doesn't seem prohibitive.
I think the easiest way would indeed be to execute the commands in a remote Plan 9 instance, running in either a virtual machine (386 and amd64) or on real hardware (arm). @heschi Could you indicate where the build scripts are located? I'd like to take a look and see how this could be achieved. I think the idea would be to execute the commands remotely (think
Gentle reminder -- we'd like to proceed with moving Plan 9 builders to LUCI, but we're somewhat in the dark. I've found https://chromium.googlesource.com/infra/luci/luci-py -- is this the relevant source, or does golang have its own fork? The documentation in that repository refers to "device bots", which I think is what we need, because the Plan 9 tests will run on devices or qemu instances separate from the (Linux) platform where the swarming bot runs. It would be helpful to see documentation on setting up a device bot, and -- as @0intro asked a few months ago -- to see examples of how existing build scripts work. I'm guessing Android will be the most similar.

Also, in #63599 a problem launching the swarming bot was reported; we are stalled waiting for a response to that too.
Thank you for working on this and your patience. I'd like to try to help move this forward. Apologies for not commenting here sooner.

We don't quite have an ideal existing builder that we can point to as the canonical example of how a builder like this can be implemented. Our work in progress on the Android emulator builder (#61097) and iOS Simulator (#66360) is the closest, in that both of those systems have distinct host and target OSes. An advantage they have is that the host OS runs the emulator/simulator locally, whereas here it's likely networking to a Plan 9 system will be needed.

For both iOS and Android builders, we're relying on the approach of having a go_{goos}_{goarch}_exec wrapper in $PATH and setting GOOS + GOARCH to the target OS/arch as well as GOHOSTOS + GOHOSTARCH to the host OS/arch in the environment. (See "If the -exec flag is not given, ..." at https://pkg.go.dev/cmd/go#hdr-Compile_and_run_Go_program.) Provided the go_…_exec wrapper is doing its job as intended, at that point there are no further modifications needed to make a builder work. The builder runs the equivalent of make.bash followed by

In the case of mobile builders, their go_…_exec scripts invoke appropriate emulators/simulators, but in this case your script would be either SSHing into a Plan 9 machine—or something equivalent like Plan 9 in qemu—in order to execute GOOS=plan9 test binaries (see the sketch after this comment). The scripts may also need to copy testdata that tests need. I recently saw @FiloSottile using this approach.

To make progress on all this, I suggest splitting the work into two fairly orthogonal halves: 1. getting a Swarming bot up and running on the (Linux) host side, and 2. providing a working go_plan9_{goarch}_exec wrapper that executes test binaries on a Plan 9 machine.
Once those two pieces are working independently, there's a clear path forward to connect them together to get a LUCI builder that provides initial signal on Plan 9 test execution. For 1, I think we can track the next steps in any of issues #63599, #63600, or #63601. Thanks again.
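To make the go_{goos}_{goarch}_exec idea above concrete, here is a minimal sketch of what such a wrapper could look like, written in Go purely for illustration. The host name plan9host, the use of ssh/scp, and the /tmp scratch path are assumptions rather than the builders' actual setup, and remote-shell quoting is simplified.

```go
// go_plan9_arm_exec: invoked by cmd/go as "<wrapper> <test binary> <args...>".
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

const host = "plan9host" // hypothetical; the real machine and transport may differ

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: go_plan9_arm_exec test-binary [args...]")
	}
	bin, args := os.Args[1], os.Args[2:]

	// Copy the freshly built test binary to a scratch path on the Plan 9 side.
	remote := "/tmp/" + filepath.Base(bin)
	scp := exec.Command("scp", bin, host+":"+remote)
	scp.Stdout, scp.Stderr = os.Stdout, os.Stderr
	if err := scp.Run(); err != nil {
		log.Fatal(err)
	}

	// Run it remotely, streaming output back, and forward the exit status
	// so cmd/go sees pass/fail exactly as if the binary had run locally.
	// (Quoting for the remote shell is simplified in this sketch.)
	remoteCmd := remote
	for _, a := range args {
		remoteCmd += fmt.Sprintf(" %q", a)
	}
	run := exec.Command("ssh", host, remoteCmd)
	run.Stdin, run.Stdout, run.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := run.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			os.Exit(ee.ExitCode())
		}
		log.Fatal(err)
	}
}
```

With a program like this on PATH under the right name (or passed via the -exec flag), cmd/go treats the remote exit status as the local test result.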
Thank you for the instructions. I think we should be good for 1; see #63599. The other architectures will follow. For 2, I've started experimenting with the exec wrapper.
One of the problems I have is that the test data file or directory location is not always deterministic, so we don't really know what should be copied to the remote host to be able to run the test.
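One partial mitigation, written as a fragment meant to slot into the wrapper sketch above (same illustrative assumptions: scp access to a hypothetical plan9host), is to mirror the test binary's working directory before running it, since cmd/go runs each test binary in its package's source directory, where a testdata/ directory conventionally lives. It does not solve the harder case described above, where a test reads files outside its own package through relative paths.

```go
package main

import (
	"os"
	"os/exec"
	"path/filepath"
)

// pushWorkDir mirrors the current directory (the package source
// directory, including any testdata/) to the Plan 9 machine and
// returns the remote path the test binary should be run from.
// The scp transport and /tmp destination are illustrative only.
func pushWorkDir(host string) (string, error) {
	cwd, err := os.Getwd()
	if err != nil {
		return "", err
	}
	remoteDir := "/tmp/" + filepath.Base(cwd)
	cmd := exec.Command("scp", "-r", cwd, host+":"+remoteDir)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return remoteDir, cmd.Run()
}
```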
Most of the tests are now passing successfully. There are two remaining issues.
Hi @dmitshur, @millerresearch and I finally succeeded in running all tests on a remote Plan 9 machine from Linux, using a go_plan9_386_exec script. What is the next step to start receiving builds on the LUCI bot, for example plan9-386?
That's great to hear, thanks!

One idea I wanted to briefly mention about testdata: I wonder if it could work well to rely on embedding it in the test binaries.

For next steps, using the current plan9-386 LUCI builder as the example, we'll need to update the builder definition at https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-plan9-386 to match your builder. If you follow the "Machine Pool" link there now, it filters by cipd_platform:plan9-386 pool:luci.golang.shared-workers (you'll need to log in to view the page content), which currently matches no bots. Updating cipd_platform to linux-amd64 (since your builder is a Linux AMD64 host) matches the plan9-386 builder and more; see here. We'll need to think about the dimensions to use long term, but maybe narrowing down by bot id can work initially. Sent CL 588756.
Change https://go.dev/cl/588756 mentions this issue:
The current Plan 9 builders are expected to be Linux AMD64 hosts in the shared-workers pool that will provide functional go_plan9_{goarch}_exec scripts in order to test the Plan 9 Go ports. For lack of a better alternative available now, rely on their unique bot id dimension to tell them apart from other Linux AMD64 hosts in the shared-workers pool.

For golang/go#62025.

Change-Id: Ib53de20dd4a56dd222b281abc8b7dbbeeb18a0ab
Reviewed-on: https://go-review.googlesource.com/c/build/+/588756
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: David du Colombier <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
@0intro https://chromium-swarm.appspot.com/task?id=69d9f835adb76f10 was a recent build after the aforementioned CL:
Our plan is to have one filesystem shared via 9p between the swarming-bot proxy and the Plan 9 test [virtual] machine, so the source and test files will be visible with the same names (presumably under /home/swarming/.swarming/c). Will this mean it's not necessary to embed testdata?
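Under that shared-filesystem plan, the exec wrapper would not need to copy anything at all: both sides see the tree at the same path, so it can simply re-run the command remotely from the identical working directory, which also suggests embedded testdata would be unnecessary for files reachable through the shared tree. A minimal sketch, again assuming an ssh-style transport to a hypothetical plan9host and simplified quoting:

```go
// go_plan9_arm_exec under a shared filesystem: both machines see the
// build tree at the same path, so only the execution hops across.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"strings"
)

func main() {
	cwd, err := os.Getwd()
	if err != nil {
		log.Fatal(err)
	}
	// Re-run the exact command line from the same working directory;
	// nothing is copied. (Remote-shell quoting is simplified here.)
	remoteCmd := fmt.Sprintf("cd %s && %s", cwd, strings.Join(os.Args[1:], " "))
	cmd := exec.Command("ssh", "plan9host", remoteCmd)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			os.Exit(ee.ExitCode())
		}
		log.Fatal(err)
	}
}
```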
This builder has python 3.11, so I think the missing
The first task seems to have got as far as building go for linux-amd64 and plan9-arm, then failed in the "update prebuilt go / cas archive" stage:
I've just installed the Python
That said, if we had the possibility to run Go tests without access to the full source tree, it would be great.
Sharing a filesystem seems good if it works well. Embedding is just an alternative path: it would make the test binaries larger and may require them to write out the embedded data to a temporary directory, so the main advantages are that it doesn't require copying files via a separate mechanism and that it might make it easier to avoid copying unnecessary files (a sketch of the embedding route follows this comment).

That "update prebuilt go / cas archive" build step ran for 6.2 minutes, which is quite long to upload a pre-built Go toolchain. Is the uplink on that machine expected to be slow? Let's also wait and see if the 6.2 minute time keeps happening more than just once.
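For comparison, here is a minimal sketch of that embedding route (the package, file, and test names are made up): the test carries its testdata inside the binary via go:embed and writes it back out to a temporary directory at run time, trading binary size for independence from any copying mechanism.

```go
package somepkg

import (
	"embed"
	"io/fs"
	"os"
	"path/filepath"
	"testing"
)

//go:embed testdata
var testdataFS embed.FS

// writeTestdata copies the embedded testdata tree into a temporary
// directory and returns that directory, so code that expects real
// files on disk keeps working. (On newer Go versions, os.CopyFS does
// roughly the same walk.)
func writeTestdata(t *testing.T) string {
	t.Helper()
	dir := t.TempDir()
	err := fs.WalkDir(testdataFS, "testdata", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		dst := filepath.Join(dir, p)
		if d.IsDir() {
			return os.MkdirAll(dst, 0o777)
		}
		data, err := testdataFS.ReadFile(p)
		if err != nil {
			return err
		}
		return os.WriteFile(dst, data, 0o666)
	})
	if err != nil {
		t.Fatal(err)
	}
	return dir
}

func TestWithEmbeddedTestdata(t *testing.T) {
	dir := writeTestdata(t)
	// The test now reads from dir/testdata/... instead of ./testdata/...
	if _, err := os.Stat(filepath.Join(dir, "testdata")); err != nil {
		t.Fatal(err)
	}
}
```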
The link is nominally 20 Mbit/s upstream.
In b8746572283039962513, the 'upload prebuilt go' step took 54 seconds. A pre-built toolchain is around 150 MB compressed, so that seems to line up with 20 Mbit/s (150 MB is about 1200 Mbit, and 1200 Mbit / 20 Mbit/s is about 60 seconds). The build ran for 57 minutes; 27020 tests passed, 74 failed. It's great to see this progress. It seems we should expect these builders to be slow, so please feel free to add them here so that they get increased timeouts.
Today's tests have been failing because I restarted the bot and accidentally didn't have go_plan9_arm_exec in the PATH, so it was trying to run arm test binaries on amd64. Fixed now. With the old build infrastructure we could use
We haven't updated |
Is there a way with the Swarming Bot to execute a script before or after running a task?
To answer my own question, there is actually a way to execute a script after running a task.
The remaining issue is that when the
Is there a way for us to become authorized to retry builds for the plan9-* builders? It would be easier for us to debug the new infrastructure we're having to develop because of LUCI if we could arrange to receive tasks in a controlled way, instead of having long and unpredictable waits until the coordinator decides to send one. At this moment I have a plan9-arm swarming bot running, and I see a long list of tasks for it showing state
@0intro Sorry it took a while before I had a chance to get back to this, but I'm glad you were able to make progress. There is open source documentation for the Swarming Bot at https://chromium.googlesource.com/infra/luci/luci-py.git/+/main/appengine/swarming/doc/Bot.md. Its Hooks section mentions

@millerresearch Yes, I agree you should be able to trigger these builds yourself. may-start-trybots is the relevant access tier (see the current LUCI configuration that uses it). Please follow the process there to request that access.
Now that we're making more progress with tests, I observe that LUCI is sending a larger set of tests to the Plan 9 builders than the old coordinator was doing. For example:
Were these changes intended, or just an unexpected side effect of switching to LUCI?
This is a consequence of the simplification of the build policy during the migration to LUCI. With the coordinator, there were many custom policies applied to builders individually, including what you mention. The LUCI configuration tried to generalize to a higher-level classification at the repository level, instead of doing it per builder. Because the x/benchmarks repo is considered a "library" category, it's tested on a wider set of platforms, which includes Plan 9. Also see the commit message of CL 515355 for a bit more motivation.

If the current strategy doesn't work well for testing the Plan 9 OS, I think it's reasonable for us to consider adjusting the LUCI configuration accordingly. For tests in the main repo, the change can also be done inside cmd/dist rather than in the LUCI configuration if the goal is to apply the test skips to all.bash as well. It'll probably work better to file a separate issue to discuss those details.
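For contrast with those repository-level knobs, and assuming per-test skips remain acceptable where a whole-repo policy is too coarse, the conventional in-tree way to exempt an individual test on Plan 9 looks like the following sketch (the test name and reason are placeholders):

```go
package somepkg

import (
	"runtime"
	"testing"
)

func TestSomethingNotSupportedOnPlan9(t *testing.T) {
	// Skip at the level of a single test; repository-wide policy is
	// instead expressed in the LUCI config or in cmd/dist, as above.
	if runtime.GOOS == "plan9" {
		t.Skip("skipping on plan9: <reason>")
	}
	// ... the rest of the test ...
}
```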
By setting
All tests have now passed on plan9-arm. A few are spectacularly slow (probably the build cache is confused by cross-compilation of packages on the swarming machine).
I'm trying to deploy a second
You're doing the "--02" suffix part right, but you're running into a limitation of the approach we used in #62025 (comment), which currently relies on the "id" dimension matching the exact bot ID. We need to make a change there before it can handle additional bots. I'll take a look at what we can do.
In #61671 (comment), @millerresearch reported that the plan9 builders don't support python3. The LUCI swarming bot requires python3, which means that we aren't going to be able to test plan9 once we finish our migration. We can probably keep the old infrastructure going through the 1.22 cycle, but after that we'll likely have to declare the ports as broken in 1.23.
I have no idea how much work it would be to port python3 to plan9. I imagine that if it were easy to do, it'd have been done already...? The LUCI team has long-term plans to port the bot to Go, but it won't happen on our timeline. (For the record, even with Python supported, the LUCI project has a number of other programs that we'd really like to have working. Fortunately, those are already written in Go and porting them should be pretty easy. Worst case, we may be able to work around them.)
Alternatively, maybe it's possible to run it virtualized somehow, say with the bot running under linux, then booting a qemu VM to do its work? The plan9 builders are already kinda slow, though.
cc @golang/plan9 @golang/release