Describe the bug
After upgrading to 4.7.0-0.okd-2021-03-28-152009, two of our fifteen worker nodes did not come back after the reboot. It was as though their hard drives had disappeared: no entries in the boot menu, nothing.
So I tried re-igniting the nodes with the latest FCOS, but the ignition seemingly did not work. While it appears to unpack the image, modify the disk, and generally do installer-y things, upon reboot after a successful install it goes right back to the ISO boot screen, and the disk is still not visible in the boot menu.
I checked with a live disk, and the disk does in fact get partitioned and is visible and everything; it just seems to be corrupted in some way.
I have tried igniting (and re-creating the ignition configs) with the following versions of Fedora CoreOS:
fedora-coreos-33.20210217.3.0
fedora-coreos-33.20210314.3.0
fedora-coreos-33.20210328.2.1
A very similar issue happened the last time we updated, and I was able to solve it by re-igniting with fedora-coreos-33.20210217.3.0. This issue thread tipped me off to that solution. Apparently it had something to do with FCOS 33.20210301.3.1.
Reproduction steps
Steps to reproduce the behavior:
1. Use openshift-install to create Ignition configs for an OKD 4.7 cluster worker.
2. Boot Fedora CoreOS in a VM and apply the install parameters, either through the live environment or via kernel parameters.
3. Observe the flashing process appearing to complete, but then, upon reboot, the installation is not recognized as bootable and the node will not boot.
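For completeness, step 2 from the live environment boils down to a coreos-installer invocation along these lines. The target disk /dev/sda and the worker.ign filename are assumptions for illustration; substitute your actual device and the config generated by openshift-install:

```shell
#!/bin/sh
# Sketch of the bare-metal install step, run from the FCOS live environment.
# install_worker TARGET_DISK IGNITION_FILE
install_worker() {
    disk=$1
    ign=$2
    # --ignition-file embeds the OKD worker config into the installed system
    coreos-installer install "$disk" --ignition-file "$ign"
}

# Example (disk and filename are assumptions):
# install_worker /dev/sda worker.ign
```

The kernel-parameter route passes the same information via `coreos.inst.install_dev` and `coreos.inst.ignition_url` instead.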
Expected behavior
Installation completes, worker reboots and attempts to join the cluster.
Actual behavior
Upon reboot, the installation is not recognized as bootable, and the node will not boot.
System details
Bare Metal/QEMU/AWS/GCP/etc.: Proxmox 6.3 cluster / OKD 4.7.0-0.okd-2021-03-28-152009
Fedora CoreOS version: fedora-coreos-33.20210314.3.0
Ignition config
Please attach your FCC or Ignition config used to provision your system. Be sure to sanitize any private data. If not using FCCT to generate your Ignition config, does the Ignition config pass validation using ignition-validate?
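As a side note, a quick pre-flight check along these lines can catch a malformed config before provisioning. The worker.ign filename is an assumed example, and the JSON fallback is just a sketch for systems where ignition-validate is not installed (Ignition configs are plain JSON):

```shell
#!/bin/sh
# check_ign FILE -- sanity-check an Ignition config before provisioning.
check_ign() {
    f=$1
    if command -v ignition-validate >/dev/null 2>&1; then
        # Full schema validation from the ignition project
        ignition-validate "$f"
    else
        # Fallback sketch: at minimum confirm the file is well-formed JSON
        python3 -m json.tool "$f" >/dev/null
    fi
}

# Example (filename is an assumption):
# check_ign worker.ign
```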
Thanks for the report. This smells like a potential issue somewhere outside of FCOS/OKD, possibly at the Proxmox host level.
In particular, the GPT warning should be benign and should self-correct on the first complete boot (at least the backup header, not sure about the protective MBR mismatch).
I suggest starting with a memtest of those nodes, and with a check of the underlying physical storage. Also, if there is any caching involved anywhere in your setup, make sure everything has been properly flushed to disk before booting the virtual machine.
If the host is confirmed fine, do check that the ISO used as install media matches the SHA256 we publish.
It would be helpful to start from a brand-new, empty virtual disk, to ensure you aren't seeing side effects from ghost OSes.
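A minimal sketch of the checksum comparison suggested above, assuming sha256sum is available; the ISO filename and the published hash in the example are placeholders to be taken from the FCOS download page:

```shell
#!/bin/sh
# verify_iso FILE EXPECTED_SHA256
# Succeeds only if the file's SHA256 matches the published value.
verify_iso() {
    # sha256sum -c expects "HASH  FILENAME" (two spaces) on stdin
    printf '%s  %s\n' "$2" "$1" | sha256sum -c - >/dev/null 2>&1
}

# Example (filename and hash are placeholders from the download page):
# verify_iso fedora-coreos-33.20210314.3.0-live.x86_64.iso <published-sha256>
```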
Just tried remaking the virtual disk, but no luck. I don't understand what the deal is, because it worked just fine the first time I had this issue. There is one more trick I can try: manually modifying the ignition configs OKD gives me, to see if it is somehow a cert issue. I don't think it will work, so if you have any other suggestions I'm all ears.
So, it turns out it was a Proxmox issue. I ignited as normal, but upon reboot I removed all boot sources (ISO disk, PXE) except for the hard disk image, and it worked. I used my original ignition configs that I made when I initially created the cluster, along with fedora-coreos-33.20210217.3.0.
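For anyone else hitting this: the same fix can be scripted from the Proxmox host with the qm CLI. The VM ID and disk name in the example are assumptions (check the VM's hardware tab for the actual disk), and the `order=` syntax requires Proxmox VE 6.3 or newer; older versions use `--boot c --bootdisk <disk>` instead:

```shell
#!/bin/sh
# boot_from_disk VMID DISK -- restrict a Proxmox VM to boot only from DISK,
# removing the ISO and PXE entries from the boot order.
boot_from_disk() {
    qm set "$1" --boot "order=$2"
}

# Example (VM ID 101 and scsi0 are assumptions):
# boot_from_disk 101 scsi0
#
# Optionally also detach the install ISO from the (assumed) ide2 CD drive:
# qm set 101 --ide2 none,media=cdrom
```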
Possibly related issues:
okd-project/okd#566
okd-project/okd#580
Additional information
See a sister issue in the OKD github: okd-project/okd#590