I've been looking around and can't find any mechanism by which a sled agent that dies and restarts is able to rediscover the Propolis zones running on the sled. The VMs might actually still survive and be reachable by users, but attempts to do anything with them in Nexus will either do nothing or move the instance into a Failed state (depending on the precise error code sled agent returns when Nexus asks to change the instance's state).
We probably need some combination of the following work to recover from this:

- Sled agent has to enumerate the running Propolis zones on the system during startup.
- Each Propolis needs to be registered with the instance manager.
- Sled agent also needs to interrogate each Propolis to get the instance's current state.
- Sled agent must find out (either by asking Nexus or by being told) what instance state Nexus last received from the sled, and must reconcile this with the current Propolis state, so that subsequent state updates bear the correct generation numbers.

Another possible approach is to cut Propolis in on generation numbering so that, if the VM survives, its state generation survives too.
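To make the reconciliation step concrete, here is a minimal sketch of the generation-number bookkeeping described above. All names (`InstanceState`, `reconcile`) are hypothetical, not actual omicron APIs; the sketch only assumes that Nexus ignores state updates whose generation is not newer than the last one it recorded.

```rust
/// Hypothetical, simplified view of an instance's runtime state as
/// recovered from a surviving Propolis after a sled agent restart.
#[derive(Debug, Clone, PartialEq)]
struct InstanceState {
    run_state: String, // e.g. "running", as reported by Propolis
    gen: u64,          // state generation number
}

/// Reconcile state recovered from a surviving Propolis with the last
/// generation Nexus is known to have received, so that the next state
/// update the sled agent publishes carries a newer generation and is
/// not discarded as stale.
fn reconcile(nexus_last_gen: u64, recovered: &InstanceState) -> InstanceState {
    // Advance past whichever generation is larger: the one the restarted
    // sled agent recovered locally, or the one Nexus last saw.
    let gen = recovered.gen.max(nexus_last_gen) + 1;
    InstanceState {
        run_state: recovered.run_state.clone(),
        gen,
    }
}

fn main() {
    // Sled agent restarts; the surviving Propolis reports gen 4, but
    // Nexus last recorded gen 6 before the crash.
    let recovered = InstanceState {
        run_state: "running".into(),
        gen: 4,
    };
    let reconciled = reconcile(6, &recovered);
    assert_eq!(reconciled.gen, 7);
    println!("reconciled gen = {}", reconciled.gen);
}
```

The key property is monotonicity: regardless of which side's generation is ahead after the crash, the reconciled generation strictly exceeds both, so Nexus will accept the update.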
Initial triage: Marking this for MVP since, unless I've overlooked something, a sled agent reboot may functionally wedge a sled: sled agent loses track of all the instances, but they'll still hang around consuming the sled's resources. (We might not try to provision anything there, depending on how Nexus does bookkeeping for the zombie instances, but either way this is, ah, suboptimal for capacity management purposes.)
Again, I could totally be overlooking something here, and I'd be happy if I were, but I wanted to file something to make sure this isn't lost if it is indeed a problem. Sled agent seems especially load-bearing in this respect, and we could stand to mitigate this if so.
According to my browser history, I searched for "crash" and "restart" but not "boot" or "crashing" and so did not find that issue. I am going to go ponder for a while.
(Thanks @bnaecker, I appreciate the pointer! #725 is also referenced in a TODO in omicron_sled_agent::bootstrap::agent::cleanup_all_old_global_state, which does the cleanup described in that issue.)
gjcolombo changed the title from "Sled agent appears to lose track of running Propolis zones if it crashes & restarts" to "Sled agent tears down all running zones (including running VMs) if it crashes & restarts" on Jun 2, 2023.
Repro steps: none yet; observed by inspection.