[sled-agent] Re-construct Zone management state within sled agent after reboot #725

smklein · 2022-03-07T21:20:36Z

Note: This issue tracks a follow-up from #686

Problem

Sled Agent should re-initialize in-memory state for managing Zones on boot, including:

- Identifying Omicron-owned zones
- Identifying their VNICs and other attached devices
- Creating in-memory structures to accurately represent their state (e.g. "RunningZone") such that they can be handled within the context of other Zones, and cleaned up accordingly.
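
The three steps above could be sketched roughly as follows. This is only an illustration, not the Sled Agent's actual code: `RecoveredZone` and `rebuild_zone_state` are hypothetical names, and the `oxz_` prefix is an assumption based on the zone names that appear later in this thread. The inputs are plain (zone, link) observations, however they were gathered from the host.

```rust
use std::collections::BTreeMap;

/// Assumed naming convention for Omicron-owned zones (hypothetical constant).
const ZONE_PREFIX: &str = "oxz_";

/// Hypothetical stand-in for in-memory state such as "RunningZone".
#[derive(Debug, Clone, PartialEq, Eq)]
struct RecoveredZone {
    name: String,
    links: Vec<String>,
}

/// Rebuilds per-zone state from observations made on the host.
///
/// `zone_names` is every zone found on the system; `links` pairs a zone name
/// with a link (e.g. a VNIC) attached to it. Links whose zone is not an
/// Omicron-owned zone are returned separately so they can be cleaned up.
fn rebuild_zone_state(
    zone_names: &[&str],
    links: &[(&str, &str)],
) -> (BTreeMap<String, RecoveredZone>, Vec<String>) {
    // Step 1: keep only Omicron-owned zones.
    let mut zones: BTreeMap<String, RecoveredZone> = zone_names
        .iter()
        .filter(|name| name.starts_with(ZONE_PREFIX))
        .map(|name| {
            let zone = RecoveredZone { name: name.to_string(), links: Vec::new() };
            (name.to_string(), zone)
        })
        .collect();

    // Steps 2 and 3: attach each link to its zone, or record it as an orphan.
    let mut orphans = Vec::new();
    for (zone, link) in links {
        match zones.get_mut(*zone) {
            Some(z) => z.links.push(link.to_string()),
            None => orphans.push(link.to_string()),
        }
    }
    (zones, orphans)
}
```

Returning the orphaned links separately keeps the "cleaned up accordingly" part explicit: anything not claimed by a recovered zone is a candidate for deletion.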

Current workaround

As the Sled Agent "deletes all zones on initialization", we can punt on this issue slightly.

When this'll be a bigger problem

If we want to cope with Sled Agent itself crashing (without taking out all other software on the sled), this issue will become blocking.

bnaecker · 2022-05-10T19:01:25Z

This issue also needs to expand to include recovering the data links used for guest instances. These are currently the xde devices, plus VNICs over them (required by the Viona virtio-net driver).

We are currently deleting all of the guest Propolis zones, and also deleting these links. We'll need to recover and track them similar to the control VNICs used for communicating with Oxide services running in zones.

This PR consists of changes to the Sled Agent to allow internal services to re-launch successfully after a reboot. This was tested with the following script, on an initialized system running with SoftNPU:

```bash
# Emulate a "sled shutdown"
svcadm disable sled-agent

# Kill and restart softnpu, to "empty out switch state"
zlogin softnpu pkill softnpu
zlogin softnpu pgrep softnpu || {
    zlogin softnpu 'RUST_LOG=debug /stuff/softnpu /stuff/softnpu.toml &> /softnpu.log &'
}

# Restart the sled agent. Observe that it still tears down all running zones on boot,
# but it successfully re-initializes all sleds which are described by:
# /pool/int/<UUID>/config/services.toml
svcadm enable sled-agent
```

## Before this PR

- Sled Agent was capable of booting and initializing via RSS
- However, after a reboot, the Sled Agent did not successfully re-initialize all zones that had been running before reboot. It attempted to bring up the switch zone, failed, and would not be able to continue initializing other services

## This PR

- Fixes the routing issue described by #3461: Zone initialization only attempts to add a route to the "sled agent underlay address" as a gateway iff (1) the underlay is up, and (2) the zone-being-launched has an address on the same subnet.
- Delays launching all "Sled Agent services" until after the sled agent has an underlay address. This makes it more likely for services dependent on external connectivity (e.g., NTP, external DNS, Nexus) to initialize correctly.
- The bootstrap agent, which may send a `start_rack_initialize` request if an RSS configuration file is included, no longer throws an error if RSS has already been initialized.

Fixes #3461
Fixes #3106
Part of #725
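
The routing condition described above (add the gateway route only if the underlay is up and the zone has an address on the same subnet) can be illustrated with a small check. This is a sketch under assumptions, not the code from the PR: the function names are hypothetical, and the prefix length is passed in by the caller because the thread does not state which subnet size is used.

```rust
use std::net::Ipv6Addr;

/// Returns true if `a` and `b` share their first `prefix_len` bits.
fn same_subnet(a: Ipv6Addr, b: Ipv6Addr, prefix_len: u32) -> bool {
    assert!(prefix_len <= 128, "IPv6 prefix length must be at most 128");
    if prefix_len == 0 {
        // A zero-length prefix matches every address.
        return true;
    }
    let mask = u128::MAX << (128 - prefix_len);
    (u128::from(a) & mask) == (u128::from(b) & mask)
}

/// Hypothetical helper: decide whether to add a route via the sled agent's
/// underlay address. `underlay` is `None` while the underlay is not yet up.
fn should_add_underlay_route(
    underlay: Option<Ipv6Addr>,
    zone_addrs: &[Ipv6Addr],
    prefix_len: u32,
) -> bool {
    match underlay {
        // Condition (1): the underlay is up. Condition (2): some zone address
        // lies on the same subnet as the underlay address.
        Some(gateway) => zone_addrs
            .iter()
            .any(|addr| same_subnet(gateway, *addr, prefix_len)),
        None => false,
    }
}
```

Modelling "underlay not up" as `None` makes the first condition impossible to forget: there is no gateway address to compare against until the underlay exists.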

askfongjojo · 2023-07-08T23:00:21Z

Naive question: If sled-agent stops reaping propolis zones upon restart, would it be able to rely on info from the config files in /etc/zones to re-construct the necessary zone info?

BRM42220081 # cat /etc/zones/oxz_propolis-server_8e00d49b-29e2-4b4f-8ab4-aefab6043c5b.xml 
<?xml version="1.0"?>
<!DOCTYPE zone PUBLIC "-//Sun Microsystems Inc//DTD Zones//EN" "file:///usr/share/lib/xml/dtd/zonecfg.dtd.1">
<!--
    DO NOT EDIT THIS FILE.  Use zonecfg(8) instead.
-->
<zone name="oxz_propolis-server_8e00d49b-29e2-4b4f-8ab4-aefab6043c5b" zonepath="/zone/oxz_propolis-server_8e00d49b-29e2-4b4f-8ab4-aefab6043c5b" autoboot="false" brand="omicron1" ip-type="exclusive">
  <device match="/dev/vmm/*"/>
  <device match="/dev/vmmctl"/>
  <device match="/dev/viona"/>
  <network physical="vopte1"/>
  <network physical="oxControlInstance0"/>
</zone>

The zone running state isn't there, but putting a user VM into a stopped state would still be better than blowing it away as an initial solution to the problem.

In the future, we probably want to allow the user to configure the autoboot flag so that they can indicate whether the VM should be booted up by default if the last known running state is unavailable.

jclulow · 2023-07-09T07:29:29Z

Don't look in the configuration files, they're private; just use zoneadm list and zonecfg etc.

You would indeed be able to tell which zones are already running on the machine if sled agent is restarting without the system rebooting. If the system reboots, /etc is effectively restored to the pristine copy that ships in the ramdisk, which doesn't include any configured zones -- so the zone-level autoboot flag won't ever really have an effect on Gimlets.
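
As a rough illustration of the `zoneadm list` approach, the sketch below parses the machine-parsable output of `zoneadm list -cp`, assuming the documented colon-delimited layout `zoneid:zonename:state:zonepath:uuid:brand:ip-type` (any extra trailing fields are ignored). The type and function names are hypothetical, the `oxz_` prefix is an assumption taken from the zone name shown earlier, and escaped colons inside a zonepath are not handled.

```rust
/// Hypothetical summary of one line of `zoneadm list -cp` output.
#[derive(Debug, PartialEq, Eq)]
struct ZoneEntry {
    name: String,
    state: String,
    brand: String,
}

/// Parses one colon-delimited line; returns `None` if it has too few fields.
/// Assumption: no field contains an escaped colon.
fn parse_zoneadm_line(line: &str) -> Option<ZoneEntry> {
    let fields: Vec<&str> = line.trim().split(':').collect();
    if fields.len() < 7 {
        return None;
    }
    Some(ZoneEntry {
        name: fields[1].to_string(),
        state: fields[2].to_string(),
        brand: fields[5].to_string(),
    })
}

/// Keeps only zones whose name carries the assumed Omicron prefix.
fn omicron_zones(output: &str) -> Vec<ZoneEntry> {
    output
        .lines()
        .filter_map(parse_zoneadm_line)
        .filter(|zone| zone.name.starts_with("oxz_"))
        .collect()
}
```

The `state` field is what distinguishes a sled agent restart, where zones may still be running, from a full reboot, where no configured zones would be listed at all.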

smklein added the "Sled Agent" label (Related to the Per-Sled Configuration and Management) Mar 7, 2022

smklein mentioned this issue Mar 7, 2022

Overhaul packaging, launching services within a omicron1 branded zones #686

Merged

smklein mentioned this issue Nov 10, 2022

[sled-agent] Launch switch zone automatically on scrimlets #1933

Merged

bnaecker mentioned this issue Mar 23, 2023

Sled agent tears down all running zones (including running VMs) if it crashes & restarts #2646

Open

smklein mentioned this issue Apr 30, 2023

Stop deleting chelsio addresses during uninstall #2953

Merged

askfongjojo added this to the FCS milestone Jun 2, 2023

smklein mentioned this issue Jul 2, 2023

[sled agent] Fixes to enable the reboot after RSS initialization #3466

Merged

smklein mentioned this issue Jul 5, 2023

Confirm that it's safe to reboot with transient VNICs #3493

Closed

morlandi7 modified the milestones: FCS, MVP Aug 15, 2023

gjcolombo mentioned this issue Nov 17, 2023

Stop request for an instance with an abandoned VMM succeeds, returns state of "Running," and never stops the instance #4511

Closed

smklein mentioned this issue Nov 7, 2024

propolis logs lost on reboot #7012

Open
