

Cannot ignite worker nodes on OKD 4.7.0-0.okd-2021-03-28-152009 and fedora-coreos-33.20210314.3.0 #792

Closed
WillNilges opened this issue Apr 11, 2021 · 4 comments

@WillNilges

Describe the bug
After an upgrade to 4.7.0-0.okd-2021-03-28-152009, two of our fifteen worker nodes did not come back after the reboot. It appeared as though their hard drives had disappeared. No entries in the boot menu, nothing.

So I tried re-igniting the nodes with the latest fcos, but the ignition seemingly did not work. While it appears to unpack the image, modify the disk, and generally do installer-y things, upon reboot after a successful install, it goes right back to the ISO boot screen, and the disk is still not visible in the boot menu.

[screenshot: ISO boot screen after reboot]

I checked with a live disk, and the disk does in fact get partitioned and is visible and everything, it's just corrupted or something.

[screenshot: partition layout from the live disk]

I have tried igniting (and re-creating ignition configs) with the following versions of Fedora CoreOS:

  • fedora-coreos-33.20210217.3.0
  • fedora-coreos-33.20210314.3.0
  • fedora-coreos-33.20210328.2.1

A very similar issue happened last time we updated, and I was able to solve it by re-igniting
with fedora-coreos-33.20210217.3.0. This issue thread tipped me off to that solution. Apparently it had something to do with fcos 33.20210301.3.1.

Possibly related issues:
okd-project/okd#566
okd-project/okd#580

Reproduction steps
Steps to reproduce the behavior:

  1. Use openshift-install to create ignition configs for an OKD 4.7 cluster worker
  2. Load up Fedora CoreOS in a VM and apply the install parameters, either through the live environment or via kernel arguments
  3. Observe the flashing process appearing to complete, but then, upon reboot, the installation is not recognized as bootable, and the node will not boot.

Expected behavior
Installation completes, worker reboots and attempts to join the cluster.

Actual behavior
Upon reboot, the installation is not recognized as bootable, and the node will not boot.

System details

  • Bare Metal/QEMU/AWS/GCP/etc.: Proxmox 6.3 cluster / OKD 4.7.0-0.okd-2021-03-28-152009
  • Fedora CoreOS version: fedora-coreos-33.20210314.3.0

Ignition config
Please attach your FCC or Ignition config used to provision your system. Be sure to sanitize any private data. If not using FCCT to generate your Ignition config, does the Ignition config pass validation using ignition-validate?

{"ignition":{"config":{"merge":[{"source":"https://api-int.okd4.<url>:22623/config/worker"}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,<Private Cert>"}]}},"version":"3.2.0"}}
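The issue template asks whether the config passes ignition-validate; that tool is the authoritative check. As a hedged illustration of the kind of structural problems it would flag first, here is a minimal, hypothetical Python sanity check (`basic_ignition_check` is not a real FCOS tool, just a sketch):

```python
import json

def basic_ignition_check(text: str) -> list[str]:
    """Hypothetical minimal check: valid JSON plus a declared spec version.
    The real ignition-validate checks the full Ignition schema."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    problems = []
    ignition = cfg.get("ignition")
    if not isinstance(ignition, dict):
        problems.append("missing top-level 'ignition' object")
    elif "version" not in ignition:
        problems.append("missing 'ignition.version' (e.g. \"3.2.0\")")
    return problems

# Sanitized stand-in for the config above (example URL, no real cert data)
sample = ('{"ignition":{"config":{"merge":[{"source":'
          '"https://api-int.example:22623/config/worker"}]},"version":"3.2.0"}}')
print(basic_ignition_check(sample))  # prints []
```

A config that passes a check like this can still fail in practice (unreachable merge URL, bad CA data), which is why running the pasted config through ignition-validate is the step the template actually asks for.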

Additional information
See a sister issue in the OKD github: okd-project/okd#590

@lucab (Contributor) commented Apr 12, 2021

Thanks for the report. This smells like a potential issue somewhere outside of FCOS/OKD, possibly at the proxmox host level.

In particular, the GPT warning should be benign and should self-correct on the first complete boot (at least the backup header, not sure about the protective MBR mismatch).

I suggest starting with a memtest of those nodes and checking the underlying physical storage. Also, if there is any caching involved anywhere in your setup, make sure everything has been properly flushed to disk before booting the virtual machine.
If the host is confirmed fine, check that the ISO used as install media matches the SHA256 we publish.
It would be helpful to start from a brand new and empty virtual disk, to ensure you aren't seeing side effects from ghost OSes.
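The checksum verification suggested above can be sketched as follows. This uses a throwaway demo file; substitute the actual FCOS ISO and the `-CHECKSUM` file published on the Fedora CoreOS download page (the file names here are placeholders):

```shell
# Stand-in for the downloaded ISO (replace with the real FCOS ISO)
echo "demo payload" > fcos-demo.iso

# Stand-in for the published checksum file (normally downloaded, not generated)
sha256sum fcos-demo.iso > fcos-demo-CHECKSUM

# Verify the artifact against the checksum file; prints "<name>: OK" on a match
sha256sum -c fcos-demo-CHECKSUM
```

A mismatch makes `sha256sum -c` report `FAILED` and exit non-zero, which would point at corrupted install media rather than an FCOS/OKD bug.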

@WillNilges (Author) commented Apr 12, 2021

Just tried remaking the virtual disk, but no luck. I don't understand what the deal is, because it worked just fine the first time I had this issue. There is one more trick I can try: manually modifying the ignition configs OKD gives me, to see if it is somehow a cert issue. I don't think it will work, so if you have any other suggestions I'm all ears.

@WillNilges (Author)

So, it turns out it was a Proxmox issue. I ignited as normal, but upon reboot, I removed all boot sources (ISO disk, pxe) except for the hard disk image, and it worked. I used my original ignition configs that I made when I initially created the cluster, and fedora-coreos-33.20210217.3.0.

Thank you for the help!

@dustymabe (Member)

Thanks @WillNilges for letting us know. I'm glad you're unblocked!
