Provide a way for Ignition to report provisioning success/failure #585

Open
coreosbot opened this issue Nov 8, 2017 · 8 comments

@coreosbot
Contributor

coreosbot commented Nov 8, 2017

Issue by @bgilbert


Issue Report

Feature Request

Environment

Any

Desired Feature

Consider adding a way (in the Ignition config itself, or on the kernel command line) to specify a URL that Ignition can POST to in order to report success or failure.

Other Information

Use cases:

  1. Cluster monitoring, e.g. nodes which fail provisioning due to a disk failure
  2. Platform-specific reporting via platform hooks, e.g. Container Linux should phone home to Packet on boot failure (bugs#2130)
  3. kola logging
@coreosbot
Contributor Author

Comment by @bgilbert


We'll need a way to invoke the failure hook from the command line. That way, if e.g. boot fails between stages because the disks stage clobbered the boot disk, the initramfs emergency shell can invoke the hook. Since bootengine also drops to an emergency shell if Ignition itself fails, we'll need an interlock to prevent the failure hook from running twice.

@ajeddeloh
Contributor

ajeddeloh commented Aug 7, 2018

Given that we now support reporting success/failure on Packet, I'm repurposing this bug: the cmdline "oem" should support a URL to POST to. Either that, or we check for something like coreos.status.url and always POST there as well, regardless of which OEM we're running on.
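Checking the kernel command line for such a key is straightforward. A Go sketch follows, where `coreos.status.url` is only the placeholder name floated above and `lookupKarg` is a hypothetical helper; it splits on whitespace and does not handle the kernel's double-quoted values.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// statusURLKey is the placeholder karg name from this thread, not a real one.
const statusURLKey = "coreos.status.url"

// lookupKarg returns the value of key=value from a kernel command line.
// In this sketch the last occurrence wins; values keep any further '='.
func lookupKarg(cmdline, key string) (string, bool) {
	value, found := "", false
	for _, field := range strings.Fields(cmdline) {
		k, v, ok := strings.Cut(field, "=")
		if ok && k == key {
			value, found = v, true
		}
	}
	return value, found
}

func main() {
	data, err := os.ReadFile("/proc/cmdline")
	if err != nil {
		fmt.Println("cannot read kernel command line:", err)
		return
	}
	if url, ok := lookupKarg(string(data), statusURLKey); ok {
		fmt.Println("would POST status to", url)
	} else {
		fmt.Println("no", statusURLKey, "set; falling back to OEM-specific reporting")
	}
}
```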

@ajeddeloh
Contributor

Capturing a bit of discussion with @arithx on this:

We figured it'd also be a good idea to support posting all logs, not just success/failure, to an external server/service. That leads down a whole other rabbit hole of which logging services to support. It shouldn't be too bad to implement, though, since we can just add additional loggers.

@cgwalters
Member

This is highly relevant for OpenShift too. We have all the pieces in place to make use of this with the machine-config-operator: we'd have the MCS (machine-config-server) inject this into the Ignition config it generates, pointing at itself. This would allow the MCO to make provisioning failures much more obvious to admins.

@cgwalters
Member

An interesting topic here is: is this a way for Ignition to report when it fails, or is it a way for Ignition to support configuring the initramfs so that it can report any failures, even from services not technically part of Ignition?

I lean a bit more towards the latter.

The main thing then is how much detail to provide here - we could give a list of the failing services, or we could support dumping the entire systemd journal, or something in between.

And how much should we codify these things as API, versus e.g. allowing the user to inject an arbitrary systemd service into the initramfs (eww?).

@cgwalters
Member

Strawman:

ignition:
  onFailure:
    postJournalUrl: http://example.com/ignition-provisioning

Semantics: if the initrd enters emergency.target, Ignition will POST the full gzip-compressed systemd journal to the target URL.

I'd also like to have:

ignition:
  onFailure:
    panic: true

Using kernel panics to signal this kind of permanent userspace failure is a useful technique in some circumstances; for example qemu has a pvpanic device that can be easily detected by the host.

cgwalters added a commit to cgwalters/ignition-dracut that referenced this issue Jan 4, 2020
Debugging failures in the initrd is annoying; this code
looks for a virtio-serial port named `com.coreos.ignition.journal`,
and runs as part of `emergency.target`.

I plan to change mantle to set up this port by default, so if
something fails in the initramfs we'll at least reliably get
the journal in a sane parsable format.

This is a special targeted subset of
coreos/ignition#585
@miabbott
Member

cgwalters added a commit to cgwalters/ignition-dracut that referenced this issue Mar 27, 2020
cgwalters added a commit to cgwalters/ignition-dracut that referenced this issue Apr 16, 2020

(cherry picked from commit 84c89f4)
@cgwalters
Member

cgwalters commented Sep 4, 2020

Random other idea:

ignition:
  onFailureSSHUser: core-ignition-failure

Basically if we fail in the initramfs, we'd revert all of the changes to /etc except the added SSH keys and create a core-ignition-failure user instead. This way an admin could ssh in interactively, but we'd be preventing the system from being used normally because the core user would still be locked out and no other services would run.

(A challenge with this likely generalizes into "but wait I need my networking Ignition config")

Or, perhaps the simple generalization of this is:

ignition:
  config:
    replaceOnFailure: 
      source: https://example.com/onfailure.ign

which works like the config/replace stanza but is invoked on failure.

(By analogy to the systemd unit directive OnFailure=.)

Projects
None yet
Development

No branches or pull requests

5 participants