After a reboot the node crashes while printing the hosts entries but without any error message.
Metropolis: this is /dev/ttyS1. Verbose node logs follow.
panichandler E Failed to open core runtime log file: read-only file system
panichandler W Continuing without persistent panic storage.
panichandler I Panic console: /dev/tty0
panichandler I Panic console: /dev/ttyS0
panichandler I Panic console: /dev/ttyS1
init I Starting Metropolis node init
root I Board name: "[REDACTED]"
root I No qemu fwcfg params.
root I Retrieved node parameters from ESP
k8s worker I Waiting for startup data...
cplane launcher I Waiting for start data...
k8s controller I Waiting for startup data...
hostsfile I Waiting for curator connection...
clusternet I Waiting for curator connection...
rolefetch I Waiting for curator connection...
nodemgmt I Waiting for cluster membership...
heartbeat I Waiting for curator connection...
net static I Configured interface "bond0"
net static I Configured interface "enp65s0f0"
root I Non-sealed configuration present. attempting to join cluster
root I Joining an existing cluster.
root I Using TPM-secured configuration: false
root I Node Join public key: 1d83cf94261775598949149bd5c37b028dc313033680a0571d54a7f370cd9b0b
root I Directory:
root I Addresses:
This isn't a fix, but it should make this failure mode clearer to cluster operators.
Not sure how much time we wanna spend investigating the silent aspect of the crash. I expect it might be a quiet panic due to us not catching them so early on in the boot process. And without such a handler, the panics go straight into /dev/stderr, which in our case is likely not /dev/ttyS1.
We found the reason for the silent failure last night: it's an issue with the BMC not being fast enough. I have a small patch, which I still have to push, that removes printing the whole directory on startup to reduce the amount of logs written to serial.
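A rough Go sketch of that kind of change, logging one summary line instead of one line per directory entry, might look like the following. The `nodeEntry` type and `summarizeDirectory` function are assumptions made for illustration; they are not the real Metropolis types and this is not the patch itself.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeEntry is a hypothetical stand-in for one cluster directory entry;
// the real types are not shown in this issue.
type nodeEntry struct {
	ID        string
	Addresses []string
}

// summarizeDirectory returns a single log line describing the directory
// instead of one line per node and address, which keeps serial output short.
// At most maxIDs node IDs (sorted) are included in the line.
func summarizeDirectory(entries []nodeEntry, maxIDs int) string {
	if maxIDs < 0 {
		maxIDs = 0
	}
	ids := make([]string, 0, len(entries))
	addrs := 0
	for _, e := range entries {
		ids = append(ids, e.ID)
		addrs += len(e.Addresses)
	}
	sort.Strings(ids)
	shown := ids
	if len(shown) > maxIDs {
		shown = shown[:maxIDs]
	}
	return fmt.Sprintf("Directory: %d nodes, %d addresses, first IDs: %v",
		len(entries), addrs, shown)
}

func main() {
	dir := []nodeEntry{
		{ID: "node-b", Addresses: []string{"10.0.0.2"}},
		{ID: "node-a", Addresses: []string{"10.0.0.1", "10.0.0.3"}},
	}
	fmt.Println(summarizeDirectory(dir, 1))
}
```

The amount written to serial then stays roughly constant as the cluster grows, instead of scaling with the number of nodes.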
After adding nodes fairly quickly we encountered a crash: