Identify why Ampere altras are restarting and not booting properly #2894
Comments
I thought the problematic one was ubuntu2004_docker-arm64-1? |
Changed the title |
And today it looks like test-equinix-ubuntu2004_docker-arm64-2 is down 😞. Logged into the out-of-band console and it was on the UEFI CLI. Typed |
Looks like test-equinix-ubuntu2004_docker-arm64-2 is down again. It was stuck on the UEFI CLI again -- I've exited it and it's booting. |
And again test-equinix-ubuntu2004_docker-arm64-2 had restarted and was stuck on the UEFI CLI. |
test-equinix-ubuntu2004_docker-arm64-2 had restarted again and was stuck on the UEFI CLI. Logged into the OOB console and exited the CLI. |
Noticed the containers on test-equinix-ubuntu2004_docker-arm64-2 are all down again. Logged into the OOB console and exited the UEFI CLI again. |
Containers on test-equinix-ubuntu2004_docker-arm64-2 are all offline again. |
(Is it too optimistic to hope the planned maintenance makes a difference? 🙂) |
I suspect so ;-) I brought it back online earlier today and will contact WorksOnArm regarding the failures. It seems to be throwing a few of these before it dies, although it manages to recover from quite a lot of them too:
|
Both machines were offline over the weekend, stuck on the UEFI CLI #2959. I've logged into the OOB console on both and exited the CLI. |
It looks like one of them may not have been started after the previous maintenance window. For the other one (which has been unreliable for us) Equinix have provided me with a replacement which I'm provisioning with Ubuntu 20.04 just now and will be up as test-equinix-ubuntu2004-arm64-3 so we can migrate off the unstable one and leave it to them to analyse the fault. |
The second one (-2) was offline again. I've gone into the OOB console and exited the UEFI prompt. |
Rescued the second Altra again this morning. |
Looks to be down again. Let's not bring it back. I've got the playbook running at the moment which will bring up the (For anyone watching along, the firewall rules have been switched to replace |
@sxa , @richardlau , Request you to delete the problematic Altra server (Mt Jade under WoA) that is not used so that there is no confusion when the Equinix support team reclaims it. We need that deleted and freed for further investigation. Currently, all the 3 Mt Jade servers are showing as provisioned and active. |
I've deleted the Altra that had ip address 139.178.85.13. |
Confirmed via email |
Looks like the first Altra restarted around 5 and a half hours ago and was stuck on the UEFI prompt. I've logged into the OOB console and exited. |
Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt.
|
Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt.
|
Most recent jobs before the crash seem to have been
NOTES: In case there are any issues specific to |
Have taken the second centos7 container offline and am currently running the centos7 gcc6 job repeatedly on the "failing" altra. I will also add in the ubuntu2004-armv7l combination in future runs as that is potentially more suspect than the others, and bring test-equinix-centos7_container-arm64-2 from the other machine offline for now too. Running as builds https://ci.nodejs.org/job/node-test-commit-arm 42988 up to 43000 which is running:
And builds https://ci.nodejs.org/job/node-test-commit-arm 43001 up to 43010 which is running:
|
It seems the issue is happening again #3022, it has been blocking the CI for a while |
I've brought https://ci.nodejs.org/computer/test-equinix-ubuntu2004_container-armv7l-2/ back online to clear the backlog. test-equinix-ubuntu2004-arm64-1 - 145.40.81.219 - had gone offline for the first time in a while so we'll need to re-evaluate what's going on here. That's the first outage we've had in a few weeks on that server. It's now back and so there are two executors for the |
Had to log into the oob console for test-equinix-ubuntu2004-arm64-1 today to exit the UEFI prompt. |
Had to recover test-equinix-ubuntu2004-arm64-1 today in the usual way. |
test-equinix-ubuntu2004-arm64-1 had rebooted/was stuck again today 😞. I've recovered it. |
@sxa FYI I've brought back the second container to help process the job queue. |
test-equinix-ubuntu2004-arm64-1 was stuck again and has now been recovered. |
Looks like all the containers on test-equinix-ubuntu2004-arm64-1 are offline again. I'm not sure for how long as there's no build history for any of them (we delete old build history, but I forget how far back the cut off is). I'm in a meeting now, but I'll look at the host after it -- I suspect the host is stuck on the UEFI boot prompt again. |
It was. I've logged into the out of band console and exited the UEFI prompt. Host is back online and the containers are processing jobs. |
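The symptom described above (every container agent on one host going offline together) could be spotted automatically rather than by noticing the job queue. Below is a minimal sketch, assuming the standard Jenkins remote API response from /computer/api/json, which carries a computer list whose entries have displayName and offline fields. The function names and the example agent names used with it are hypothetical, and the code works on an already-fetched, parsed response rather than calling the CI server itself.

```python
from __future__ import annotations

from typing import Any


def offline_agents(payload: dict[str, Any], name_fragment: str) -> list[str]:
    """Return the sorted names of offline agents whose name contains name_fragment."""
    names = []
    for computer in payload.get("computer", []):
        name = computer.get("displayName", "")
        if name_fragment in name and computer.get("offline", False):
            names.append(name)
    return sorted(names)


def all_offline(payload: dict[str, Any], name_fragment: str) -> bool:
    """True only if at least one agent matches and every matching agent is offline.

    All agents of one host being offline at once is a hint (not proof) that the
    host itself is down, for example sitting at the UEFI prompt.
    """
    matching = [
        c for c in payload.get("computer", [])
        if name_fragment in c.get("displayName", "")
    ]
    return bool(matching) and all(c.get("offline", False) for c in matching)
```

Matching on a name fragment is a simplification: it only works if the agents on one host share a common substring in their names, which would need checking against the real inventory.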
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
Interesting - I thought we had that applied previously on the machines - @richardlau how confident are you that we're ok with this on all the systems now? We had two issues - the fact it was falling over on its own and the fact that it didn't come back up (which sounds like it's what's resolved on |
Re. "didn't come back up" we had two issues:
I don't think we ever worked out why the machines restarted themselves in the first place. |
Hmmm ok if it's been about a year since we last had an unexplained reboot then I think I'm ok with closing this and we can re-open if required. Hadn't realised it had been so long :-) |
This has happened multiple times recently. For some reason it's restarting itself and not coming back. We need to identify why it's rebooting (Error condition, patching, or something else) and then see why it's not coming back (Separate test - perhaps try rebooting in an idle time and see if it comes back)
Current recovery process is to connect to the out-of-band console (details in the Equinix UI) and exit from the Shell> prompt.
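
The "reboot in an idle time and see if it comes back" test could be scripted along these lines. This is only a sketch: it assumes the host counts as back once its SSH port accepts TCP connections, and the names port_is_open and wait_for_port, the timeout values, and whatever host name is passed in are illustrative rather than part of any existing tooling. If the wait times out, the host is probably sitting at the UEFI Shell> prompt and needs the out-of-band console.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def wait_for_port(host: str, port: int = 22,
                  deadline_s: float = 900.0, interval_s: float = 15.0) -> bool:
    """Poll host:port until it accepts a connection or deadline_s has elapsed.

    Always makes at least one attempt, so deadline_s=0 is a single check.
    """
    end = time.monotonic() + deadline_s
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

An open SSH port only shows that the operating system booted; whether the Jenkins agents and containers reconnected would still need a separate check.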