Identify why Ampere altras are restarting and not booting properly #2894

sxa · 2022-03-14T12:20:57Z

This has happened multiple times recently. For some reason it's restarting itself and not coming back. We need to identify why it's rebooting (Error condition, patching, or something else) and then see why it's not coming back (Separate test - perhaps try rebooting in an idle time and see if it comes back)

The current recovery process is to connect to the out-of-band console (details in the Equinix UI) and exit from the Shell> prompt.

richardlau · 2022-03-14T12:26:49Z

I thought the problematic one was ubuntu2004_docker-arm64-1?
Refs: #2820 (comment)
Refs: #2835 (comment)

sxa · 2022-03-14T12:39:44Z

Changed the title

richardlau · 2022-04-14T14:30:45Z

And today it looks like test-equinix-ubuntu2004_docker-arm64-2 is down 😞. Logged into the out-of-band console and it was on the UEFI CLI. Typed exit at the prompt and then selected GNU/Linux at the GRUB menu and the machine booted.

richardlau · 2022-04-27T14:28:34Z

Looks like test-equinix-ubuntu2004_docker-arm64-2 is down again. It was stuck on the UEFI CLI again -- I've exited it and it's booting.

richardlau · 2022-04-29T15:01:36Z

And again test-equinix-ubuntu2004_docker-arm64-2 had restarted and was stuck on the UEFI CLI.

richardlau · 2022-05-09T11:39:50Z

test-equinix-ubuntu2004_docker-arm64-2 had restarted again and was stuck on the UEFI CLI. Logged into the OOB console and exited the CLI.

richardlau · 2022-05-13T11:26:08Z

Noticed the containers on test-equinix-ubuntu2004_docker-arm64-2 are all down again. Logged into the OOB console and exited the UEFI CLI again.

richardlau · 2022-05-16T11:07:33Z

Containers on test-equinix-ubuntu2004_docker-arm64-2 are all offline again.

richardlau · 2022-05-16T11:21:03Z

(Is it too optimistic to hope the planned maintenance makes a difference? 🙂)

sxa · 2022-05-17T19:17:50Z

(Is it too optimistic to hope the #2948 makes a difference? 🙂)

I suspect so ;-)

I brought it back online earlier today and will contact WorksOnArm regarding the failures.

It seems to be throwing a few of these before it dies, although it manages to recover from quite a lot of them too:

May 13 17:56:46 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23448.790563] "node" (999554) uses deprecated CP15 Barrier instruction at 0x11a4a9c
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526304] {73}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526311] {73}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526314] {73}[Hardware Error]: event severity: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526317] {73}[Hardware Error]:  Error 0, type: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526324] {73}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526326] {73}[Hardware Error]:   section length: 0x30
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526332] {73}[Hardware Error]:   00000000: 40000003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526336] {73}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526338] {73}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666503] {74}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666509] {74}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666512] {74}[Hardware Error]: event severity: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666515] {74}[Hardware Error]:  Error 0, type: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666522] {74}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666524] {74}[Hardware Error]:   section length: 0x30
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666531] {74}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666534] {74}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666537] {74}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879202] {75}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879208] {75}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879211] {75}[Hardware Error]: event severity: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879214] {75}[Hardware Error]:  Error 0, type: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879221] {75}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879222] {75}[Hardware Error]:   section length: 0x30
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879229] {75}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879232] {75}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879235] {75}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326137] {76}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326145] {76}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326147] {76}[Hardware Error]: event severity: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326150] {76}[Hardware Error]:  Error 0, type: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326157] {76}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326158] {76}[Hardware Error]:   section length: 0x30
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326166] {76}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326169] {76}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326172] {76}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754400] {77}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754406] {77}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754408] {77}[Hardware Error]: event severity: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754411] {77}[Hardware Error]:  Error 0, type: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754418] {77}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754419] {77}[Hardware Error]:   section length: 0x30
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754427] {77}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754430] {77}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754433] {77}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069449] {78}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069456] {78}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069458] {78}[Hardware Error]: event severity: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069461] {78}[Hardware Error]:  Error 0, type: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069470] {78}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069471] {78}[Hardware Error]:   section length: 0x30
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069478] {78}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069481] {78}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069484] {78}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552450] {79}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552457] {79}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552460] {79}[Hardware Error]: event severity: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552463] {79}[Hardware Error]:  Error 0, type: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552471] {79}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552473] {79}[Hardware Error]:   section length: 0x30
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552480] {79}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552483] {79}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552486] {79}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123337] {80}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123344] {80}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123346] {80}[Hardware Error]: event severity: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123349] {80}[Hardware Error]:  Error 0, type: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123356] {80}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123357] {80}[Hardware Error]:   section length: 0x30
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123364] {80}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123367] {80}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123370] {80}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802232] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802239] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802242] {81}[Hardware Error]: event severity: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802245] {81}[Hardware Error]:  Error 0, type: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802253] {81}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802254] {81}[Hardware Error]:   section length: 0x30
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802262] {81}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802265] {81}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802267] {81}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949286] {82}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949293] {82}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949295] {82}[Hardware Error]: event severity: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949298] {82}[Hardware Error]:  Error 0, type: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949306] {82}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949307] {82}[Hardware Error]:   section length: 0x30
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949315] {82}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949318] {82}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949321] {82}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 16 11:54:43 test-equinix-ubuntu2004-docker-arm64-2 kernel: [    0.000000] Booting Linux on physical CPU 0x0000120000 [0x413fd0c1]
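
For anyone wanting to check other log windows for the same pattern, here is a rough Python sketch that counts the corrected APEI records and measures the silent gap between the last one and the next "Booting Linux" line. The function names are made up for illustration, and the year is an assumption passed in by hand because syslog timestamps do not carry one (2022 is taken from the dates in this thread).

```python
import re
from datetime import datetime

# First line of each APEI record, as seen in the kernel log above.
RECORD_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] \{(?P<seq>\d+)\}\[Hardware Error\]: "
    r"Hardware error from APEI Generic Hardware Error Source: (?P<source>\d+)"
)
# A fresh boot shows up as kernel uptime 0.000000.
BOOT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[\s*0\.000000\] Booting Linux"
)


def parse_stamp(stamp, year):
    # Assumption: the caller supplies the year, since syslog omits it.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def summarise(lines, year=2022):
    """Return (errors, boots): errors are (time, sequence, uptime) tuples."""
    errors, boots = [], []
    for line in lines:
        m = RECORD_RE.match(line)
        if m:
            errors.append(
                (parse_stamp(m["stamp"], year), int(m["seq"]), float(m["uptime"]))
            )
            continue
        m = BOOT_RE.match(line)
        if m:
            boots.append(parse_stamp(m["stamp"], year))
    return errors, boots


def silent_gaps(errors, boots):
    """For each boot, pair it with the last error record logged before it."""
    gaps = []
    for boot in boots:
        earlier = [when for when, _seq, _uptime in errors if when < boot]
        if earlier:
            last = max(earlier)
            gaps.append((last, boot, boot - last))
    return gaps


# Three lines copied from the log above.
SAMPLE = [
    "May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526304] {73}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5",
    "May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949286] {82}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5",
    "May 16 11:54:43 test-equinix-ubuntu2004-docker-arm64-2 kernel: [    0.000000] Booting Linux on physical CPU 0x0000120000 [0x413fd0c1]",
]

errors, boots = summarise(SAMPLE)
for last, boot, gap in silent_gaps(errors, boots):
    print(f"last corrected error {last}, next boot {boot}, gap {gap}")
```

Note the gap only says how long the log was quiet before the machine was brought back; it does not by itself show when the machine actually went down.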

richardlau · 2022-06-13T12:31:02Z

Both machines were offline over the weekend, stuck on the UEFI CLI #2959. I've logged into the OOB console on both and exited the CLI.

sxa · 2022-06-16T10:21:20Z

It looks like one of them may not have been started after the previous maintenance window. For the other one (which has been unreliable for us) Equinix have provided me with a replacement which I'm provisioning with Ubuntu 20.04 just now and will be up as test-equinix-ubuntu2004-arm64-3 so we can migrate off the unstable one and leave it to them to analyse the fault.

richardlau · 2022-06-17T12:05:39Z

The second one (-2) was offline again. I've gone into the OOB console and exited the UEFI prompt.

richardlau · 2022-06-20T10:44:46Z

Rescued the second Altra again this morning.

sxa · 2022-06-20T17:13:35Z

Looks to be down again. Let's not bring it back. I've got the playbook running at the moment which will bring up the -3 machine with direct replacements (same names) for the containers on the defective -2 system.

(For anyone watching along, the firewall rules have been switched to replace -2 with -3 so there should be no risk of both machines connecting together)

pgmwoa · 2022-06-29T19:49:42Z

@sxa , @richardlau , Request you to delete the problematic Altra server (Mt Jade under WoA) that is not used so that there is no confusion when the Equinix support team reclaims it. We need that deleted and freed for further investigation. Currently, all the 3 Mt Jade servers are showing as provisioned and active.
@sxa Please confirm via response to the email dated 27th Jun w/ subject " Node.js - Works On Arm Sponsored - Stability issue".
Thnx
WoA Program Team

richardlau · 2022-06-30T12:34:04Z

I've deleted the Altra that had ip address 139.178.85.13.

sxa · 2022-06-30T14:29:30Z

Confirmed via email

richardlau · 2022-06-30T15:28:10Z

Looks like the first Altra restarted around 5 and a half hours ago and was stuck on the UEFI prompt. I've logged into the OOB console and exited.

richardlau · 2022-07-12T14:43:43Z

Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt.
I saw this while the machine was booting (after the prompt was exited):

[    0.925839] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[    1.011928] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.018605] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.025254] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.031897] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[    1.039030] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.045686] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.052330] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.058972] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12

Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0

test-equinix-ubuntu2004-docker-arm64-1 login:

richardlau · 2022-07-18T12:41:53Z

Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt.
Same messages as before when booting:

[    0.892690] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[    0.980799] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    0.987482] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    0.994141] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.000805] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[    1.008286] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.014963] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.021617] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.028270] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12

Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0

test-equinix-ubuntu2004-docker-arm64-1 login:

sxa · 2022-07-22T13:31:07Z

Most recent jobs before the crash seem to have been centos7-arm64-gcc6 ones, although they were listed as SUCCESS (this is from the Jenkins server log):

2022-07-16 06:08:53:620 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42769 Started by upstream project "node-test-commit-arm" build number 42,769, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-16T10:02:12Z completed in 392437ms completed: SUCCESS
2022-07-17 06:09:04:086 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42781 Started by upstream project "node-test-commit-arm" build number 42,781, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-17T10:02:14Z completed in 400434ms completed: SUCCESS
2022-07-18 06:11:07:881 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42808 Started by upstream project "node-test-commit-arm" build number 42,808, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-18T10:02:16Z completed in 523189ms completed: SUCCESS

NOTES:
The above is the output of running egrep with the pattern "test-equinix-centos7_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-2|test-equinix-ubuntu1804_sharedlibs_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-1|test-equinix-ubuntu1804_container-arm64-1|test-equinix-centos8_container-arm64-1|test-equinix-rhel8_container-arm64-1|test-equinix-ubuntu2004_container-armv7l-1|test-equinix-centos7_container-arm64-1|test-equinix-ubuntu2004_sharedlibs_container-arm64-3|test-equinix-ubuntu1804_sharedlibs_container-arm64-1|test-equinix-ubuntu2004_container-arm64-1|test-equinix-debian10_container-armv7l-1|test-equinix-ubuntu1804_sharedlibs_container-arm64-3" against the Jenkins log, which matches every entry for the containers on that host.
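
The same filtering can be sketched in Python if it's useful to pull out the build number, node and finish time as well as the matching lines. This is only an illustration: the container list is a subset of the names in the pattern, the field names are invented here, and the regex assumes the AuditLog line layout seen in the three entries above. The second sample line is hypothetical and only there to show a non-matching node.

```python
import re
from datetime import datetime, timedelta

# Subset of the container names from the egrep pattern; extend as needed.
CONTAINERS = [
    "test-equinix-centos7_container-arm64-2",
    "test-equinix-ubuntu2004_container-armv7l-1",
    "test-equinix-rhel8_container-arm64-1",
]
HOST_RE = re.compile("|".join(re.escape(name) for name in CONTAINERS))

# Assumed layout, based on the AuditLog lines quoted above.
AUDIT_RE = re.compile(
    r"^(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):\d{3} - AuditLog - "
    r"(?P<job>.+?) #(?P<build>\d+) .*? on node (?P<node>\S+) "
    r"started at (?P<started>\S+) completed in (?P<ms>\d+)ms "
    r"completed: (?P<result>\w+)$"
)


def builds_on_host(lines):
    """Yield one dict per AuditLog line that ran on a listed container."""
    for line in lines:
        if not HOST_RE.search(line):
            continue
        m = AUDIT_RE.match(line.rstrip("\n"))
        if m is None:
            continue
        started = datetime.strptime(m["started"], "%Y-%m-%dT%H:%M:%SZ")
        yield {
            "job": m["job"],
            "build": int(m["build"]),
            "node": m["node"],
            "started": started,
            "finished": started + timedelta(milliseconds=int(m["ms"])),
            "result": m["result"],
        }


SAMPLE = [
    # Copied from the Jenkins log excerpt above.
    '2022-07-16 06:08:53:620 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42769 Started by upstream project "node-test-commit-arm" build number 42,769, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-16T10:02:12Z completed in 392437ms completed: SUCCESS',
    # Hypothetical line on a node that is not in CONTAINERS.
    '2022-07-16 06:10:00:000 - AuditLog - some-job #1 Started by user on node some-other-node started at 2022-07-16T10:05:00Z completed in 1000ms completed: SUCCESS',
]

for build in builds_on_host(SAMPLE):
    print(build["build"], build["node"], build["finished"], build["result"])
```

Sorting the results by the "finished" value and comparing the last one against the time the host dropped off would give the most recent job before each crash.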

In case there are any issues specific to centos7-arm64-gcc6 I'm going to run a few rebuilds of https://ci.nodejs.org/job/node-test-commit-arm/42880 which is ONLY building that one.

sxa · 2022-07-28T12:11:10Z

Have taken the second centos7 container offline and am currently running the centos7 gcc6 job repeatedly on the "failing" Altra. I will also add in the ubuntu2004-armv7l combination in future runs as that is potentially more suspect than the others, and bring test-equinix-centos7_container-arm64-2 from the other machine offline for now too.

Running as builds https://ci.nodejs.org/job/node-test-commit-arm 42988 up to 43000 which is running:

https://ci.nodejs.org/job/node-test-commit-arm/nodes=centos7-arm64-gcc6 42988 up to 43000

And builds https://ci.nodejs.org/job/node-test-commit-arm 43001 up to 43010 which is running:

https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu2004-armv7l 43001 up to 43010

joyeecheung · 2022-08-30T05:02:25Z

It seems the issue is happening again #3022, it has been blocking the CI for a while

sxa · 2022-08-30T09:53:06Z

I've brought https://ci.nodejs.org/computer/test-equinix-ubuntu2004_container-armv7l-2/ back online to clear the backlog.

test-equinix-ubuntu2004-arm64-1 - 145.40.81.219 - had gone offline, the first outage we've had on that server in a few weeks, so we'll need to re-evaluate what's going on here. It's now back and so there are two executors for the ubuntu2004-armv7l jobs available again.

richardlau · 2022-10-17T16:14:17Z

Had to log into the OOB console for test-equinix-ubuntu2004-arm64-1 today to exit the UEFI prompt.

richardlau · 2022-11-01T12:25:23Z

Had to recover test-equinix-ubuntu2004-arm64-1 today in the usual way.

richardlau · 2022-11-02T12:52:36Z

test-equinix-ubuntu2004-arm64-1 had rebooted/was stuck again today 😞. I've recovered it.

richardlau · 2022-11-02T12:54:31Z

Have taken the second centos7 container offline

@sxa FYI I've brought back the second container to help process the job queue.

richardlau · 2022-11-24T21:36:31Z

test-equinix-ubuntu2004-arm64-1 was stuck again and has now been recovered.

richardlau · 2023-02-03T15:05:23Z

Looks like all the containers on test-equinix-ubuntu2004-arm64-1 are offline again. I'm not sure for how long as there's no build history for any of them (we delete old build history, but I forget how far back the cut off is).

I'm in a meeting now, but I'll look at the host after it -- I suspect the host is stuck on the UEFI boot prompt again..

richardlau · 2023-02-03T17:10:15Z

Looks like all the containers on test-equinix-ubuntu2004-arm64-1 are offline again.
...
I suspect the host is stuck on the UEFI boot prompt again..

It was. I've logged into the out of band console and exited the UEFI prompt. Host is back online and the containers are processing jobs.

github-actions · 2023-12-01T00:18:09Z

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

targos · 2024-01-03T11:23:46Z

@sxa I think this was fixed in the context of #3492

sxa · 2024-01-03T17:30:24Z

Interesting - I thought we had that applied previously on the machines - @richardlau how confident are you that we're ok with this on all the systems now? We had two issues - the fact it was falling over on its own and the fact that it didn't come back up (which sounds like it's what's resolved on -3)

richardlau · 2024-01-03T17:38:42Z

Re. "didn't come back up" we had two issues:

1. Machine rebooted into UEFI prompt. No idea what was causing this, but I don't believe we've hit this for a while. (If #2894 (comment) was the last case then almost a year.)
2. Machine rebooted into grub prompt. This is fixed by applying https://gist.github.com/vielmetti/dafb5128ef7535c218f6d963c5bc624e#prevention-of-boot-failures which I believe has been done to both machines.

I don't think we ever worked out why the machines restarted themselves in the first place.

sxa · 2024-01-04T10:04:42Z

Hmmm ok if it's been about a year since we last had an unexplained reboot then I think I'm ok with closing this and we can re-open if required. Hadn't realised it had been so long :-)

sxa changed the title ~~Identify why ubuntu2004_docker-arm64-2 is restarting and not booting properly~~ Identify why ubuntu2004_docker-arm64-1 is restarting and not booting properly Mar 14, 2022

richardlau changed the title ~~Identify why ubuntu2004_docker-arm64-1 is restarting and not booting properly~~ Identify why Ampere altras are restarting and not booting properly Apr 14, 2022

richardlau added incident platform:arm labels Apr 14, 2022

richardlau mentioned this issue Jun 13, 2022

node-test-binary-armv7l stuck #2959

Closed

richardlau mentioned this issue Jun 13, 2022

Planned outage for upgrade - Jun 2 during US business hours: Equinix aarch64 "Altra" systems #2948

Closed

sxa self-assigned this Jun 16, 2022

sxa mentioned this issue Jun 20, 2022

ansible: replace altra 2 with 3 in the inventory #2969

Merged

richardlau mentioned this issue Jul 1, 2022

arm jobs failing with corrupted workspace #2983

Closed

joyeecheung mentioned this issue Aug 30, 2022

node-test-binary-armv7l pending indefinitely: All nodes of label ‘ubuntu2004-armv7l’ are offline #3022

Closed

richardlau mentioned this issue Oct 2, 2022

node-test-commit-arm is broken #3044

Closed

richardlau mentioned this issue Nov 16, 2022

Legacy Equinix Metal data facility closures on November 30th, 2022 #3028

Closed

richardlau mentioned this issue May 18, 2023

test-equinix-ubuntu2004_docker-arm64-3 (and all hosted containers on it) is down #3359

Closed

github-actions bot added the stale label Dec 1, 2023

github-actions bot closed this as not planned Jan 1, 2024

sxa reopened this Jan 3, 2024

github-actions bot removed the stale label Jan 4, 2024

sxa closed this as completed Jan 4, 2024
