Sled agent tears down all running zones (including running VMs) if it crashes & restarts #2646

gjcolombo · 2023-03-23T00:28:27Z

Repro steps: none yet; observed by inspection.

I've been looking around and can't find any mechanism by which a sled agent that dies and restarts is able to rediscover the Propolis zones running on the sled. The VMs might actually still survive and be reachable by users, but attempts to do anything with them in Nexus will either do nothing or move the instance into a Failed state (depending on the precise error code sled agent returns when Nexus asks to change the instance's state).

We probably need some combination of the following code to recover from this:

- Sled agent has to enumerate the running Propolis zones on the system during startup
  - Each Propolis needs to be registered into the instance manager
  - Sled agent also needs to interrogate each Propolis to get the instance's current state
- Sled agent must find out (either by asking or being told by Nexus) what instance state Nexus last received from the sled & must reconcile this with the current Propolis state, so that subsequent state updates will bear the correct generation numbers
  - Another possible approach is to cut Propolis in on generation numbering so that, if the VM survives, its state generation survives too
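
To make the reconciliation step concrete, here is a rough sketch of what it might look like. Everything in it is hypothetical: `DiscoveredZone`, `NexusRecord`, `ManagedInstance`, and `reconcile` are illustrative names, not existing sled agent or Propolis APIs, and the rule for instances Nexus has no record of (start at generation 1) is an assumption made only so the example is complete.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the instance state a Propolis server reports.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VmState {
    Running,
    Stopped,
}

/// Hypothetical result of probing one Propolis zone found at startup.
#[derive(Clone, Debug)]
pub struct DiscoveredZone {
    pub instance_id: String,
    pub state: VmState,
}

/// The last state and generation Nexus received from this sled (assumed shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NexusRecord {
    pub state: VmState,
    pub generation: u64,
}

/// Entry rebuilt in the in-memory instance manager after a restart.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ManagedInstance {
    pub state: VmState,
    pub generation: u64,
    /// True if sled agent should push a fresh state update to Nexus.
    pub needs_publish: bool,
}

/// Re-register every discovered zone and pick a generation number that is
/// consistent with what Nexus last saw.
pub fn reconcile(
    zones: &[DiscoveredZone],
    nexus: &HashMap<String, NexusRecord>,
) -> HashMap<String, ManagedInstance> {
    let mut manager = HashMap::new();
    for zone in zones {
        let entry = match nexus.get(&zone.instance_id) {
            // Nexus already has the current state: keep its generation.
            Some(rec) if rec.state == zone.state => ManagedInstance {
                state: zone.state,
                generation: rec.generation,
                needs_publish: false,
            },
            // The state changed while sled agent was down: bump the
            // generation so the next update supersedes the stale one.
            Some(rec) => ManagedInstance {
                state: zone.state,
                generation: rec.generation + 1,
                needs_publish: true,
            },
            // Assumption: an instance unknown to Nexus starts at generation 1.
            None => ManagedInstance {
                state: zone.state,
                generation: 1,
                needs_publish: true,
            },
        };
        manager.insert(zone.instance_id.clone(), entry);
    }
    manager
}
```

The alternative in the last bullet (having Propolis keep the generation itself) would replace the `NexusRecord` lookup with a value read back from each Propolis, which avoids the round trip to Nexus but means Propolis has to participate in generation numbering.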

Initial triage: Marking this for MVP since, unless I've overlooked something, a sled agent restart may functionally wedge a sled: sled agent loses track of all the instances, but they'll still hang around consuming the sled's resources. (We might not try to provision anything there, depending on how Nexus does bookkeeping for the zombie instances, but either way this is, ah, suboptimal for capacity management purposes.)

Again, I could totally be overlooking something here, and I'd be happy if I were, but wanted to file something to make sure this isn't lost if this is indeed a problem--sled agent seems especially load-bearing in this respect and we could stand to try to mitigate this if it is.

bnaecker · 2023-03-23T03:03:49Z

Related #725

gjcolombo · 2023-03-23T03:20:16Z

According to my browser history, I searched for "crash" and "restart" but not "boot" or "crashing" and so did not find that issue. I am going to go ponder for a while.

(Thanks @bnaecker, I appreciate the pointer! #725 is also referenced in a TODO in omicron_sled_agent::bootstrap::agent::cleanup_all_old_global_state, which does the cleanup described in that issue.)

gjcolombo added the bug (Something that isn't working) and Sled Agent (Related to the Per-Sled Configuration and Management) labels Mar 23, 2023

gjcolombo added this to the MVP milestone Mar 23, 2023

gjcolombo changed the title ~~Sled agent appears to lose track of running Propolis zones if it crashes & restarts~~ Sled agent tears down all running zones (including running VMs) if it crashes & restarts Jun 2, 2023

rcgoodfellow mentioned this issue Jun 2, 2023

Launching a mega instance kills neighbors #3286

Closed

smklein mentioned this issue Nov 7, 2024

propolis logs lost on reboot #7012

Open
