x/build: add LUCI solaris-amd64 builder #61666
Please generate a new certificate signing request using |
I see. I'd already wondered if the fqdn would be desired here. |
solaris-amd64-1690923319.cert.txt |
Hi Carlos,
[solaris-amd64-1690923319.cert.txt](https://github.com/golang/go/files/12233926/solaris-amd64-1690923319.cert.txt)
I've generated the cert and registered your bot.
Thanks. As I'd mentioned, I've got the luci_machine_tokend running with
that now.
However, trying to run bootstrapswarm next got me into a new problem: the
log shows
2023/08/02 15:59:52 Bootstrapping the swarming bot with certificate authentication
2023/08/02 15:59:52 retrieving the luci-machine-token from the token file
2023/08/02 15:59:52 Downloading the swarming bot
2023/08/02 15:59:54 Starting the swarming bot /opt/golang/.swarming/swarming_bot.zip
4104 2023-08-02 13:59:56.578 E: ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
4104 2023-08-02 13:59:56.579 E: os.utilities.get_hostname_short() failed
Traceback (most recent call last):
File "/opt/golang/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 385, in _get_botid_safe
return os_utilities.get_hostname_short()
File "/opt/golang/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
v = func(*args)
File "/opt/golang/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 421, in get_hostname_short
return get_hostname().split('.', 1)[0]
File "/opt/golang/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
v = func(*args)
File "/opt/golang/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 399, in get_hostname
if platforms.is_gce() and not os.path.isfile('/.dockerenv'):
AttributeError: module 'api.platforms' has no attribute 'is_gce'
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/golang/.swarming/swarming_bot.1.zip/__main__.py", line 336, in <module>
File "/opt/golang/.swarming/swarming_bot.1.zip/__main__.py", line 324, in main
File "/opt/golang/.swarming/swarming_bot.1.zip/__main__.py", line 203, in CMDstart_bot
File "/opt/golang/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 2218, in main
File "/opt/golang/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1130, in _run_bot
File "/opt/golang/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1168, in _run_bot_inner
AttributeError: module 'api.platforms' has no attribute 'is_gce'
Sending the crash report ... done.
Report URL: https://chromium-swarm.appspot.com/restricted/ereporter2/errors/5378321203200000
Process exited due to exception
module 'api.platforms' has no attribute 'is_gce'
2023/08/02 15:59:57 command execution /usr/bin/python3 /opt/golang/.swarming/swarming_bot.zip start_bot: exit status 1
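The crash above boils down to the bot assuming every platform module defines `is_gce`, which isn't true on Solaris. A defensive guard would avoid the hard crash — this is a hypothetical sketch, not the actual swarming code, using a stand-in namespace for `api.platforms`:

```python
import types

# Stand-in for api.platforms on an unsupported OS such as Solaris,
# where the is_gce helper is simply absent.
platforms = types.SimpleNamespace()

def is_gce(mod):
    """Probe for the helper instead of assuming it exists."""
    probe = getattr(mod, "is_gce", None)
    return probe() if callable(probe) else False

print(is_gce(platforms))  # False instead of AttributeError
```

On a supported platform, where the module does define the helper, the guard is transparent and simply calls through.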
Looking at the python code, I find (in api/os_utilities.py):
def get_hostname_short():
"""Returns the base host name."""
return get_hostname().split('.', 1)[0]
The machine's hostname is "s11-i386.foss", i.e. it's qualified, but not
an FQDN. The function above thus yields "s11-i386", which cannot be
resolved.
In this case, the host's fqdn is s11-i386.foss.cebitec.uni-bielefeld.de,
the domain part being cebitec.uni-bielefeld.de. Maybe one could use a
heuristic of checking for more than one dot to distinguish between
unqualified hostnames and FQDNs or, better yet, let the admin decide what
they prefer for hostname conventions?
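The suggested dot-counting heuristic could be sketched like this (hypothetical; the bot's real logic lives in `api/os_utilities.py` and only does the unconditional split):

```python
def get_hostname_short(hostname):
    """Mirror of the bot's behaviour: keep only the first label."""
    return hostname.split('.', 1)[0]

def bot_id(hostname):
    """Hypothetical heuristic: only shorten names that look like real
    FQDNs (two or more dots); keep partially qualified names intact."""
    return get_hostname_short(hostname) if hostname.count('.') >= 2 else hostname

print(bot_id("s11-i386.foss"))                           # s11-i386.foss
print(bot_id("s11-i386.foss.cebitec.uni-bielefeld.de"))  # s11-i386
```

Under this rule "s11-i386.foss" would survive unshortened, while a genuine FQDN still collapses to its first label.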
Besides, I've got a couple of other questions:
* How can I control the parallelism of the resulting bot? Just with
GOMAXPROCS in the bot's environment? That's particularly important
for this host, which co-hosts other buildbots that shouldn't interfere
with each other by taking over the whole machine.
* Same for the bot's working directory. Is this hardcoded as
$HOME/.swarming?
Thanks.
Rainer
|
cc/ @golang/release |
@rorth We've looked into this error. The swarming bot doesn't seem to support solaris-amd64. Thanks for doing this work; it revealed that this would be an issue. We've added the work to add support to our roadmap. I will comment on this issue once that work has started. |
Hi @rorth, I think we're now in a state where running the latest bootstrapswarm will work. Please give it a try? Regarding your earlier questions: The `-hostname` flag to `bootstrapswarm` overrides the bot's hostname calculation; I believe the crash was unrelated. There isn't any official way to control the builder's parallelism. Feel free to try setting GOMAXPROCS, especially if that's what you were doing before, and we can verify that it's getting propagated into the child processes. But that will only have some effect, since there's a lot of parallelism across many processes. It is currently hardcoded as `$HOME/.swarming` in bootstrapswarm, yes. If there's a compelling need to change it (or make it overridable) we can look into that. |
Hi Heschi,
Hi @rorth, I think we're now in a state where running the latest
bootstrapswarm will work. Please give it a try?
sure, just did so. I get along further now, but still fail with
28804 2023-09-05 12:21:35.703 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
28804 2023-09-05 12:21:35.703 E: Failed to contact for handshake, retrying in 0 sec...
I get a similar error when I try to access that URL with wget:
bash-5.2$ wget https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
--2023-09-05 14:26:20-- https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
Resolving chromium-swarm.appspot.com (chromium-swarm.appspot.com)... 142.250.184.244, 2a00:1450:4001:831::2014
Connecting to chromium-swarm.appspot.com (chromium-swarm.appspot.com)|142.250.184.244|:443... connected.
HTTP request sent, awaiting response... 405 Method Not Allowed
2023-09-05 14:26:20 ERROR 405: Method Not Allowed.
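The 405 from wget is expected: wget issues a GET, while the handshake endpoint presumably accepts only POST, and an unauthenticated POST is what produces the 403 the bot sees. A local sketch of that status behaviour (the server and the token header name are assumptions, not the real Swarming implementation):

```python
import http.server
import threading
import urllib.error
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Wrong method for a handshake endpoint.
        self.send_response(405)
        self.end_headers()

    def do_POST(self):
        # "X-Luci-Machine-Token" is a hypothetical header name.
        status = 200 if self.headers.get("X-Luci-Machine-Token") else 403
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):  # keep the sketch quiet
        pass

srv = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=srv.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{srv.server_port}/handshake"

def status(req):
    try:
        return urllib.request.urlopen(req).status
    except urllib.error.HTTPError as e:
        return e.code

get_status = status(url)  # 405, like the wget attempt
post_status = status(urllib.request.Request(url, data=b"{}", method="POST"))  # 403
srv.shutdown()
print(get_status, post_status)
```

So the wget result doesn't indicate a connectivity problem; it's the missing/ignored token on the bot's POST that matters.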
Regarding your earlier questions:
The `-hostname` flag to `bootstrapswarm` overrides the bot's hostname
calculation; I believe the crash was unrelated.
Ok, I see.
There isn't any official way to control the builder's parallelism. Feel
free to try setting GOMAXPROCS, especially if that's what you were doing
before, and we can verify that it's getting propagated into the child
processes. But that will only have some effect, since there's a lot of
parallelism across many processes.
Good: I'll continue with GOMAXPROCS as before; it worked well enough.
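For reference, the propagation being discussed is plain environment inheritance: variables set for the bot process are inherited by the children it spawns. A minimal sketch (the value 4 is arbitrary):

```python
import os
import subprocess

# Children inherit the parent's environment, so GOMAXPROCS set for the
# bot should reach the Go toolchain processes it launches.
env = dict(os.environ, GOMAXPROCS="4")
out = subprocess.run(["sh", "-c", "echo $GOMAXPROCS"],
                     env=env, capture_output=True, text=True).stdout.strip()
print(out)  # 4
```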
It is currently hardcoded as `$HOME/.swarming` in bootstrapswarm, yes. If
there's a compelling need to change it (or make it overridable) we can look
into that.
It would certainly be nice to be able to control that: software
enforcing directory layouts when there's no strong reason to do so feels
inflexible.
Rainer
|
Just to eliminate any confusion, what `--hostname` argument did you pass? |
Heschi Kreinick ***@***.***> writes:
Just to eliminate any confusion, what `--hostname` argument did you pass?
I used -hostname solaris-amd64, matching what I passed to genbotcert
-bot-hostname.
|
Oh, are you passing `--token-file-path`? |
Heschi Kreinick ***@***.***> writes:
Oh, are you passing `--token-file-path`? I think that's an attractive
nuisance at the moment -- `bootstrapswarm` understands it, but the actual
Swarming bot doesn't, so its requests are coming in without a token. In
this case I definitely agree that you should be able to override the token
path, but it'll take a little while to get that implemented in the bot. In
the meantime, can you use the default path of
`/var/lib/luci_machine_tokend/token.json` just to see if there are other
surprises in store?
Sure: I hadn't used the default because /var/lib is completely alien to
Solaris, so I went for a more convenient/appropriate location instead.
Switching to the default definitely helped, thanks.
However, I noticed this in swarming_bot.log:
557 2023-09-06 07:18:29.947 D: State {'audio': None, 'cpu_name': None, 'cost_usd_hour': 0.6155357096354167, 'cwd': '/opt/golang/tokend/.swarming', 'disks': {'/system/shared': {'free_mb': 135725.0, 'size_mb': 135731.7}}, 'env': {'PATH': '/usr/bin'}, 'gpu': None, 'hostname': 's11-i386.foss', 'ip': '129.70.161.63', 'nb_files_in_temp': 22, 'pid': 557, 'python': {'executable': '/usr/bin/python3', 'packages': ['asn1crypto==1.5.1', 'attrs==22.2.0', 'Babel==2.10.1', 'cffi==1.15.1', 'chardet==5.1.0', 'cheroot==9.0.0', 'CherryPy==18.8.0', 'cmd2==2.4.0', 'colorama==0.4.6', 'cryptography==37.0.2', 'filelock==3.8.0', 'idna==3.4', 'jsonrpclib-pelix==0.4.3.2', 'jsonschema==4.5.1', 'lxml==4.9.1', 'Mako==1.2.4', 'MarkupSafe==2.1.2', 'more-itertools==9.0.0', 'numpy==1.23.4', 'Pillow==9.3.0', 'pkg==0.1', 'ply==3.8', 'portend==3.1.0', 'prettytable==3.3.0', 'psutil==5.9.4', 'pyasn1==0.5.0', 'pyasn1-modules==0.3.0', 'pybonjour==1.1.1', 'pycairo==1.21.0', 'pycparser==2.21', 'pycurl==7.45.2', 'Pygments==2.15.1', 'PyGObject==3.42.0', 'pyOpenSSL==22.0.0', 'pyparsing==3.0.9', 'pyperclip==1.8.2', 'pyrsistent==0.19.3', 'python-ldap==3.4.2', 'python-rapidjson==1.6', 'pytz==2022.5', 'requests==2.28.2', 'simplejson==3.18.4', 'six==1.16.0', 'tempora==5.0.1', 'urllib3==1.26.15', 'wcwidth==0.2.6'], 'version': '3.9.17 (main, Aug 4 2023, 07:38:38) \n[GCC 12.2.0]'}, 'ram': 0, 'running_time': 43072, 'ssd': [], 'started_ts': 1693941638, 'uptime': 0, 'user': 'golang-luci', 'quarantined': 'Bot denied running for user "golang-luci"', 'original_bot_id': 'solaris-amd64', 'sleep_streak': 767}
The "Bot denied" error certainly bothered me ;-) I've switched to a new
"swarming" user instead and the message vanished. However, no builds so
far AFAICS.
It seems to me that LUCI has an unfortunate habit of forcing its own
conventions (directories, usernames, probably more) on the bots when
it's completely unnecessary and uncalled for ;-(
One other issue: I noticed this in ~swarming/.swarming/README:
### Maintenance
Installing, shutting down, etc.
https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/luci/swarming_bot.md
This URL seems to be accessible only inside Google, which is unfortunate
if it's visible on (and expected to be usable from) the bots.
Thanks.
Rainer
|
The bot seems to have died, possibly because I sent it work for the first time. Can you take a look? Thanks for your patience. |
Heschi Kreinick ***@***.***> writes:
The bot seems to have died, possibly because I sent it work for the first
time. Can you take a look? Thanks for your patience.
Right: I'd started it in the foreground and my DSL connection died.
Restarted now. The log ends in
10780 2023-09-07 17:36:35.059 I: [UFS]: zone ['cloud'] is not managed by UFS, skipping UFS state checks
10780 2023-09-07 17:36:35.059 I: get_dimensions(): 0s
10780 2023-09-07 17:36:35.096 I: get_state(): 0.036s
10780 2023-09-07 17:36:35.097 I: get_settings(): 0s
10780 2023-09-07 17:36:35.098 I: Unknown OS sunos5
10780 2023-09-07 17:36:35.098 I: Restarting machine with command sudo -n shutdown -r now (Internal failure)
10780 2023-09-07 17:36:35.107 E: Failed to run sudo -n shutdown -r now: [Errno 2] No such file or directory: 'sudo'
10780 2023-09-07 17:36:35.108 I: Sleeping for 300
Do I need to provide a fake sudo to work around this?
|
No, that won't help much. Thanks; I have a list of things to work on now and I'll let you know when it makes sense for you to try again.
|
Heschi Kreinick ***@***.***> writes:
No, that won't help much.
Thanks; I have a list of things to work on now and I'll let you know when
it makes sense for you to try again.
Excellent, thanks a lot.
|
Hi @rorth, most of the stuff that needs doing should be done now. Can you try again? I've added a couple of environment variables you can use:
|
Sure, thanks for working on this.
Will do: the
It seems your changes haven't made it to the repo yet: when I rebuild |
It looks like the builder is doing fine on x/ repos, but hangs indefinitely when testing the main repo. (example: http://ci.chromium.org/b/8768937412553978545) Unfortunately result-adapter is swallowing all the test output so we have nothing to go on from our side. @rorth, can you see anything happening on the machine? |
I don't find anything obvious in |
I've now converted the swarming bot to a proper Solaris SMF service. Let's see how it fares now... |
On the one hand it's not hanging any more, but on the other it's crashing hard in the same test every time with a strange runtime bug that doesn't happen on the old builders:
Any idea what's going on there? I can ask someone from the runtime team to take a look if necessary. |
I have no idea, unfortunately. I tried running the tests manually as the bot user inside the build tree (
and all tests just … I've got two questions:
|
It seems there's something amiss with the JSON output: when I run
I get the expected
while for
the command returns with no output and exit status 0. Very weird. I suspect the current buildbot (like running |
@mknyszek spotted the more interesting stack trace: it appears to be crashing inside libc. You mentioned configuring it as a Solaris service. Perhaps that's causing a problem somehow? It seems unlikely to be LUCI-related per se.
|
Thanks, that led me way further: the failure can be reproduced with
Running the test under
many more with absurd fds, then a large number of
ultimately
which lets me suspect that Go gets |
I've confirmed that now:
This way, depending on the vagaries of memory allocation various fields in the struct are set to random values, e.g. The attached patch seems to fix this (it's a bit hard to tell since the success or failure of the resolver tests depends very much on memory layout). |
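The failure mode described — stale bytes in a struct that is never cleared before its fields are set — can be sketched generically. This is a conceptual illustration of the bug class using ctypes, not the actual Go runtime code or the real `addrinfo` layout:

```python
import ctypes

class Hints(ctypes.Structure):
    """Simplified stand-in for a getaddrinfo-style hints struct."""
    _fields_ = [("ai_flags", ctypes.c_int), ("ai_family", ctypes.c_int),
                ("ai_socktype", ctypes.c_int), ("ai_protocol", ctypes.c_int)]

# Stand-in for stale heap contents: raw memory is not guaranteed to be zero.
buf = bytearray(b"\xaa" * ctypes.sizeof(Hints))
h = Hints.from_buffer(buf)
print(h.ai_flags != 0)  # True: fields hold garbage before clearing

# The fix: zero the whole struct before setting individual fields.
ctypes.memset(ctypes.byref(h), 0, ctypes.sizeof(h))
print(h.ai_flags)  # 0
```

Any field the caller never assigns then reliably reads as zero, instead of whatever happened to be in memory — which is exactly why the failures varied with memory layout.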
Thanks for investigating. Makes sense, but this is slightly out of my area of expertise and also we can't accept patches on GitHub issues. Would you mind sending a CL or PR? See https://go.dev/doc/contribute. If not I can dig into it deeper or find someone else to take a look. |
I'll send a different patch. @rorth thanks for finding the problem. |
Great, thanks a lot for your help. |
Change https://go.dev/cl/534516 mentions this issue: |
For #61666

Change-Id: I7a0a849fba0abebe28804bdd6d364b154456e399
Reviewed-on: https://go-review.googlesource.com/c/go/+/534516
Run-TryBot: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
The memory clearing patch is committed. |
We got a green build! Thanks all. |
The |
@rorth It initially failed due to a bad rollout on the LUCI side, but then the machine got quarantined due to too many consecutive failures (that were not related to a build). I'll figure out how to get that resolved ASAP. Thanks for flagging this! |
I'm told the machine just has to be rebooted. :( Sorry for the inconvenience. Whenever you get the chance, can you do that please? Thanks. EDIT: Er, sorry, I misunderstood. Not the machine, just the swarming bot. |
Anyway: I've just rebooted the zone. |
Thanks, it appears to be back online. |
Right, thanks for your help. |
s11-i386.foss.cebitec.uni-bielefeld.de.csr.txt
Somehow I'm not able to add the
new-builder
as required in the installation docs.