[sled-agent] Re-construct Zone management state within sled agent after reboot #725
Comments
This issue also needs to expand to include recovering the data links used for guest instances. These are currently the We are currently deleting all of the guest Propolis zones, and also deleting these links. We'll need to recover and track them similar to the control VNICs used for communicating with Oxide services running in zones.
This PR consists of changes to the Sled Agent to allow internal services to re-launch successfully after a reboot.

This was tested with the following script, on an initialized system running with SoftNPU:

```bash
# Emulate a "sled shutdown"
svcadm disable sled-agent

# Kill and restart softnpu, to "empty out switch state"
zlogin softnpu pkill softnpu
zlogin softnpu pgrep softnpu || {
    zlogin softnpu 'RUST_LOG=debug /stuff/softnpu /stuff/softnpu.toml &> /softnpu.log &'
}

# Restart the sled agent. Observe that it still tears down all running zones on boot,
# but it successfully re-initializes all sleds which are described by:
# /pool/int/<UUID>/config/services.toml
svcadm enable sled-agent
```

## Before this PR

- Sled Agent was capable of booting and initializing via RSS
- However, after a reboot, the Sled Agent did not successfully re-initialize all zones that had been running before reboot. It attempted to bring up the switch zone, failed, and could not continue initializing other services

## This PR

- Fixes the routing issue described by #3461: Zone initialization only attempts to add a route to the "sled agent underlay address" as a gateway iff (1) the underlay is up, and (2) the zone-being-launched has an address on the same subnet.
- Delays launching all "Sled Agent services" until after the sled agent has an underlay address. This makes it more likely for services dependent on external connectivity (e.g., NTP, external DNS, Nexus) to initialize correctly.
- The bootstrap agent, which may send a `start_rack_initialize` request if an RSS configuration file is included, no longer throws an error if RSS has already been initialized.

Fixes #3461
Fixes #3106
Part of #725
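The routing condition in this PR description (add a route via the sled agent underlay address only if the underlay is up and the zone has an address on the same subnet) can be sketched roughly as below. This is an illustration only, not the code from the PR: `same_subnet`, `gateway_for_zone`, and the `/64` prefix length used in the usage note are all hypothetical names and assumptions.

```rust
use std::net::Ipv6Addr;

/// Hypothetical helper: true when `a` and `b` share their first `prefix_len` bits.
fn same_subnet(a: Ipv6Addr, b: Ipv6Addr, prefix_len: u32) -> bool {
    assert!(prefix_len <= 128, "IPv6 prefix length must be at most 128");
    if prefix_len == 0 {
        // A zero-length prefix matches every address.
        return true;
    }
    let mask = u128::MAX << (128 - prefix_len);
    (u128::from(a) & mask) == (u128::from(b) & mask)
}

/// Hypothetical sketch of the decision: returns the gateway to route through,
/// or `None` when no route should be added.
///
/// `underlay` is `None` while the underlay is not yet up (condition 1).
/// `zone_addrs` are the addresses of the zone being launched (condition 2).
fn gateway_for_zone(
    underlay: Option<Ipv6Addr>,
    zone_addrs: &[Ipv6Addr],
    prefix_len: u32,
) -> Option<Ipv6Addr> {
    let gateway = underlay?;
    zone_addrs
        .iter()
        .any(|addr| same_subnet(*addr, gateway, prefix_len))
        .then_some(gateway)
}
```

Under these assumptions, `gateway_for_zone(Some(gw), &[zone_addr], 64)` yields `Some(gw)` only when `zone_addr` falls inside the same `/64` as `gw`, and a zone launched before the underlay exists (`underlay == None`) never gets a route.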
Naive question: If sled-agent stops reaping propolis zones upon restart, would it be able to rely on info from the config files
The zone running state isn't there but putting a user VM at a stopped state would still be better than blowing it away as an initial solution to the problem. In the future, we probably want to allow the user to configure the
Don't look in the configuration files, they're private; just use You would indeed be able to tell which zones are already running on the machine if sled agent is restarting without the system rebooting. If the system reboots, /etc is effectively restored to the pristine copy that ships in the ramdisk, which doesn't include any configured zones -- so the zone-level autoboot flag won't ever really have an effect on Gimlets.
Note: This issue tracks a follow-up from #686
Problem
Sled Agent should re-initialize in-memory state for managing Zones on boot, including:
Current workaround
As the Sled Agent "deletes all zones on initialization", we can punt on this issue slightly.
When this'll be a bigger problem
If we want to cope with Sled Agent itself crashing (without taking out all other software on the sled), this issue will become blocking.