

Cannot ignite worker nodes on OKD 4.7.0-0.okd-2021-03-28-152009 and fedora-coreos-33.20210314.3.0 #792

Closed
WillNilges opened this issue Apr 11, 2021 · 4 comments

@WillNilges

Describe the bug
After an upgrade to 4.7.0-0.okd-2021-03-28-152009, two of our fifteen worker nodes did not come back after the reboot. It appeared as though their hard drives had disappeared. No entries in the boot menu, nothing.

So I tried re-igniting the nodes with the latest fcos, but the ignition seemingly did not work. While it appears to unpack the image, modify the disk, and generally do installer-y things, upon reboot after a successful install, it goes right back to the ISO boot screen, and the disk is still not visible in the boot menu.

[screenshot: ISO boot screen after reboot]

I checked with a live disk, and the disk does in fact get partitioned and is visible and everything, it's just corrupted or something.

[screenshot: partition layout from the live disk]

I have tried igniting (and re-creating ignition configs) with the following versions of Fedora CoreOS:

  • fedora-coreos-33.20210217.3.0
  • fedora-coreos-33.20210314.3.0
  • fedora-coreos-33.20210328.2.1

A very similar issue happened last time we updated, and I was able to solve it by re-igniting
with fedora-coreos-33.20210217.3.0. This issue thread tipped me off to that solution. Apparently it had something to do with fcos 33.20210301.3.1.

Possibly related issues:
okd-project/okd#566
okd-project/okd#580

Reproduction steps
Steps to reproduce the behavior:

  1. Use openshift-install to create ignition configs for an OKD 4.7 cluster worker
  2. Load up Fedora CoreOS in a VM and apply the install parameters, either through the live environment or via kernel arguments
  3. Observe the flashing process appearing to complete, but then, upon reboot, the installation is not recognized as bootable, and the node will not boot.

Expected behavior
Installation completes, worker reboots and attempts to join the cluster.

Actual behavior
Upon reboot, the installation is not recognized as bootable, and the node will not boot.

System details

  • Bare Metal/QEMU/AWS/GCP/etc.: Proxmox 6.3 cluster / OKD 4.7.0-0.okd-2021-03-28-152009
  • Fedora CoreOS version: fedora-coreos-33.20210314.3.0

Ignition config
Please attach your FCC or Ignition config used to provision your system. Be sure to sanitize any private data. If not using FCCT to generate your Ignition config, does the Ignition config pass validation using ignition-validate?

{"ignition":{"config":{"merge":[{"source":"https://api-int.okd4.<url>:22623/config/worker"}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,<Private Cert>"}]}},"version":"3.2.0"}}
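The issue template asks whether the config passes ignition-validate; that tool is the authoritative check. As a hedged illustration of the kind of structural problems it would flag first, here is a minimal, hypothetical Python sanity check (`basic_ignition_check` is not a real FCOS tool, just a sketch):

```python
import json

def basic_ignition_check(text: str) -> list[str]:
    """Hypothetical minimal check: valid JSON plus a declared spec version.
    The real ignition-validate checks the full Ignition schema."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    problems = []
    ignition = cfg.get("ignition")
    if not isinstance(ignition, dict):
        problems.append("missing top-level 'ignition' object")
    elif "version" not in ignition:
        problems.append("missing 'ignition.version' (e.g. \"3.2.0\")")
    return problems

# Sanitized stand-in for the config above (example URL, no real cert data)
sample = ('{"ignition":{"config":{"merge":[{"source":'
          '"https://api-int.example:22623/config/worker"}]},"version":"3.2.0"}}')
print(basic_ignition_check(sample))  # prints []
```

A config that passes a check like this can still fail in practice (unreachable merge URL, bad CA data), which is why running the pasted config through ignition-validate is the step the template actually asks for.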

Additional information
See a sister issue in the OKD github: okd-project/okd#590

@lucab (Contributor) commented Apr 12, 2021

Thanks for the report. This smells like a potential issue somewhere outside of FCOS/OKD, possibly at the proxmox host level.

In particular, the GPT warning should be benign and should self-correct on the first complete boot (at least the backup header, not sure about the protective MBR mismatch).

I suggest starting with a memtest of those nodes and checking the underlying physical storage. Also, if there is any caching involved anywhere in your setup, make sure everything has been properly flushed to disk before booting the virtual machine.
If the host is confirmed fine, check that the ISO used as install media matches the SHA256 we publish.
It would be helpful to start from a brand new and empty virtual disk, to ensure you aren't seeing side effects from ghost OSes.
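The checksum verification suggested above can be sketched as follows. This uses a throwaway demo file; substitute the actual FCOS ISO and the `-CHECKSUM` file published on the Fedora CoreOS download page (the file names here are placeholders):

```shell
# Stand-in for the downloaded ISO (replace with the real FCOS ISO)
echo "demo payload" > fcos-demo.iso

# Stand-in for the published checksum file (normally downloaded, not generated)
sha256sum fcos-demo.iso > fcos-demo-CHECKSUM

# Verify the artifact against the checksum file; prints "<name>: OK" on a match
sha256sum -c fcos-demo-CHECKSUM
```

A mismatch makes `sha256sum -c` report `FAILED` and exit non-zero, which would point at corrupted install media rather than an FCOS/OKD bug.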

@WillNilges (Author) commented Apr 12, 2021

Just tried remaking the virtual disk, but no luck. I don't understand what the deal is, because it worked just fine the first time I had this issue. There is one more trick I can try: manually modifying the ignition configs OKD gives me, to see if it is somehow a cert issue. I don't think it will work, so if you have any other suggestions I'm all ears.

@WillNilges (Author)

So, it turns out it was a Proxmox issue. I ignited as normal, but upon reboot, I removed all boot sources (ISO disk, pxe) except for the hard disk image, and it worked. I used my original ignition configs that I made when I initially created the cluster, and fedora-coreos-33.20210217.3.0.

Thank you for the help!

@dustymabe (Member)

Thanks @WillNilges for letting us know. I'm glad you're unblocked!
