x/build: gotip-windows-arm64 builders stops working occasionally #66962
Comments
I got an access grant and logged into the VMs to inspect them. Both were up (not hung or dead), but the "swarming" user was completely inactive, which is not supposed to happen if the systems are healthy. I inspected the system event logs but don't see any red flags: the last entry for anything useful done by swarming is on Apr 7th, and after that the user just vanishes. The Apr 7th swarming bot log ("C:\Users\swarming.swarming\logs\bot_stdout.log.1") contains this: "Found a previous bot, 11832 rebooting as a workaround for https://crbug.com/1061531". We have SWARMING_NEVER_REBOOT set to true for these VMs, but the code in question doesn't seem to respect that. Of course, that doesn't explain why we would have two copies of the swarming bot running at the same time in the first place. It is also a mystery why we don't get a proper auto-logon of the swarming user after this happens (when I do manual restarts we don't seem to have this issue). If anyone has ideas on how to debug this, let me know. I restarted both VMs and they seem to be processing jobs again. |
From what I can tell, SWARMING_NEVER_REBOOT takes effect for the most frequent causes that would otherwise trigger a reboot, but it doesn't catch all of them: the swarming bot still occasionally triggers a reboot in some edge cases. We can try to catch and report those edge cases, and aim to get them fixed so the variable does what its name implies in all situations. Even then, future instances may be missed and an unintended restart may still happen. Other options include making this builder come back automatically after a restart (which removes the need to set the variable), or just handling the occasional restart manually when it happens. Since the builders are now back online and working, let's close this particular issue. Thanks. |
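To illustrate the "variable does what its name implies in all situations" goal, here is a minimal Go sketch of routing every reboot reason through one guarded entry point. This is not the swarming bot's code (the bot is a separate project); the names `neverReboot` and `requestReboot`, and the set of values treated as "true", are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// neverReboot reports whether SWARMING_NEVER_REBOOT is set to a truthy value.
// Which spellings count as "true" is an assumption of this sketch.
func neverReboot(getenv func(string) string) bool {
	switch strings.ToLower(strings.TrimSpace(getenv("SWARMING_NEVER_REBOOT"))) {
	case "1", "true", "yes":
		return true
	}
	return false
}

// requestReboot is a hypothetical single entry point that every reboot
// reason goes through. It returns true only if the reboot should proceed.
func requestReboot(reason string, getenv func(string) string) bool {
	if neverReboot(getenv) {
		// Log the suppressed reason so edge cases can be caught and reported.
		fmt.Printf("reboot suppressed (%s): SWARMING_NEVER_REBOOT is set\n", reason)
		return false
	}
	fmt.Printf("rebooting: %s\n", reason)
	return true
}

func main() {
	requestReboot("found a previous bot", os.Getenv)
}
```

With a single choke point like this, a code path that reboots without consulting the variable cannot exist by construction, and each suppressed reason leaves a log line that can be reported upstream.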
It seems the builder has stopped working again since earlier this week: https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-windows-arm64 |
I'll take a look. Wish I could figure out how to make this builder a bit more bulletproof. |
OK, VMs restarted again. VMs were in the same state as last time, e.g. the
problem. |
Still working on trying to make our LUCI windows-arm64 builders more reliable. The latest set of problems here seems to relate to system oversubscription. I restart the VMs, and they run for a few days or a week, then at a certain point jobs launched on them wind up failing early with "out of memory" errors. Sometimes the problems are in the LUCI infrastructure (ex: cas_download), e.g.
and sometimes the out of memory errors happen during test build:
or this during a test run:
I am not sure what could have changed with the VMs to start triggering these sorts of issues. The swarming account logs seem to be clean for the most part, and I don't see anything odd in the system event logs. When I log into the builders and examine them, there are no tests running, but the system commit charge is at or near 100%. Pictures from process explorer: Paging @golang/windows experts -- if anyone has debugged these sorts of issues before and might have ideas on how to proceed, let me know (I am certainly out of ideas). My gut is that there is some sort of zombie process here, but given that I can't see any processes active from LUCI, I'm not sure how to debug this. |
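The symptom described above (an idle host whose commit charge is at or near 100%) can be written down as a simple check. The following Go sketch is only an illustration: the `hostSnapshot` type, its fields, and the threshold are hypothetical, and collecting the real numbers on Windows is left out.

```go
package main

import "fmt"

// hostSnapshot holds hypothetical values a health probe might collect.
type hostSnapshot struct {
	CommitUsedBytes  uint64 // current system commit charge
	CommitLimitBytes uint64 // system commit limit
	ActiveTasks      int    // tasks the bot is currently running
}

// commitPercent returns the commit charge as a percentage of the limit.
// It returns 0 when the limit is unknown (zero) to avoid dividing by zero.
func commitPercent(s hostSnapshot) float64 {
	if s.CommitLimitBytes == 0 {
		return 0
	}
	return 100 * float64(s.CommitUsedBytes) / float64(s.CommitLimitBytes)
}

// suspectLeak flags a host that is idle yet at or above the given commit
// threshold, which matches the state observed on the builders.
func suspectLeak(s hostSnapshot, thresholdPercent float64) bool {
	return s.ActiveTasks == 0 && commitPercent(s) >= thresholdPercent
}

func main() {
	s := hostSnapshot{CommitUsedBytes: 31 << 30, CommitLimitBytes: 32 << 30}
	fmt.Println(suspectLeak(s, 90))
}
```

A check like this does not identify the process holding the memory; it only marks the point at which an otherwise idle builder is worth inspecting before jobs start failing.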
errno of 1455 is also So I suspect your page file is too small or similar. I am not an expert in this area anymore. I even doubt the page file still exists on modern Windows. But I agree with you that I googled for "page file windows task manager", and I find but I cannot find any good suggestions there. Hopefully other Windows experts will help. Alex |
Hi there 👋🏼 I'm the Go group manager at Microsoft and I'd like to set up a call to discuss improving the stability of these builders. Who should I include? Thanks! |
Recording here that there was another instance of the builders disconnecting and needing to be restarted around September 20-23: Purple boxes are where it was missing until being restarted. The 3 failed builds before that all failed with "cannot allocate memory" errors:
Since the restart it's been working okay again. That suggests the work to have the builder fully start up after a restart is complete and working, so it could work well to stop setting SWARMING_NEVER_REBOOT and allow LUCI to restart the builders when it deems that necessary. |
https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-windows-arm64
seems all builders are offline.
cc @golang/release @thanm