Sled agent tears down all running zones (including running VMs) if it crashes & restarts #2646

Open
gjcolombo opened this issue Mar 23, 2023 · 2 comments
Labels
bug (Something that isn't working), Sled Agent (Related to the Per-Sled Configuration and Management)
Milestone

Comments

@gjcolombo
Contributor

Repro steps: none yet; observed by inspection.

I've been looking around and can't find any mechanism by which a sled agent that dies and restarts is able to rediscover the Propolis zones running on the sled. The VMs might actually still survive and be reachable by users, but attempts to do anything with them in Nexus will either do nothing or move the instance into a Failed state (depending on the precise error code sled agent returns when Nexus asks to change the instance's state).

We probably need some combination of the following code to recover from this:

  • Sled agent has to enumerate the running Propolis zones on the system during startup
    • Each Propolis needs to be registered into the instance manager
    • Sled agent also needs to interrogate each Propolis to get the instance's current state
  • Sled agent must find out (either by asking or being told by Nexus) what instance state Nexus last received from the sled & must reconcile this with the current Propolis state, so that subsequent state updates will bear the correct generation numbers
    • Another possible approach is to cut Propolis in on generation numbering so that, if the VM survives, its state generation survives too
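
The reconciliation step in the list above could look roughly like the sketch below. Everything in it is hypothetical: `VmState`, `DiscoveredZone`, `NexusRecord`, `TrackedInstance`, `reconcile`, and `rebuild_instance_manager` are illustrative names, not existing sled agent or Propolis APIs, and the "bump the generation only when the observed state differs" rule is an assumption about how the numbering might work. Zone enumeration and the Propolis query are represented by plain input data rather than real calls.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the instance state a Propolis server reports.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VmState {
    Running,
    Stopped,
}

/// Hypothetical result of enumerating one Propolis zone at startup and
/// asking its Propolis server for the instance's current state.
#[derive(Clone, Debug)]
pub struct DiscoveredZone {
    pub instance_id: String,
    pub observed: VmState,
}

/// Hypothetical record of the last state (and its generation number) that
/// Nexus received from this sled for an instance.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NexusRecord {
    pub state: VmState,
    pub generation: u64,
}

/// Hypothetical entry re-registered into the instance manager.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TrackedInstance {
    pub state: VmState,
    pub generation: u64,
    /// True if Nexus has not yet seen this state and should be told.
    pub needs_publish: bool,
}

/// Reconcile what Propolis reports now with what Nexus last heard.
/// Assumption: the generation advances only when the state has changed.
pub fn reconcile(observed: VmState, last: Option<NexusRecord>) -> TrackedInstance {
    match last {
        // Nexus is already up to date: keep the generation it knows.
        Some(rec) if rec.state == observed => TrackedInstance {
            state: observed,
            generation: rec.generation,
            needs_publish: false,
        },
        // The VM changed state while sled agent was down: the next update
        // must carry a generation newer than the one Nexus holds.
        Some(rec) => TrackedInstance {
            state: observed,
            generation: rec.generation + 1,
            needs_publish: true,
        },
        // Nexus has no record for this instance: start numbering at 1.
        None => TrackedInstance {
            state: observed,
            generation: 1,
            needs_publish: true,
        },
    }
}

/// Rebuild the instance manager's table from the discovered zones.
pub fn rebuild_instance_manager(
    zones: &[DiscoveredZone],
    nexus: &BTreeMap<String, NexusRecord>,
) -> BTreeMap<String, TrackedInstance> {
    zones
        .iter()
        .map(|zone| {
            let last = nexus.get(&zone.instance_id).copied();
            (zone.instance_id.clone(), reconcile(zone.observed, last))
        })
        .collect()
}
```

In this sketch, an instance whose observed state matches Nexus's record keeps its generation and publishes nothing, while one that changed state during the outage gets the next generation and is flagged for an update. The alternative in the last bullet (letting Propolis own the generation number) would replace the `rec.generation + 1` arm with a value read back from Propolis.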

Initial triage: Marking this for MVP since, unless I've overlooked something, a sled agent restart may functionally wedge a sled: sled agent loses track of all the instances, but they'll still hang around consuming the sled's resources. (We might not try to provision anything there, depending on how Nexus does bookkeeping for the zombie instances, but either way this is, ah, suboptimal for capacity management purposes.)

Again, I could totally be overlooking something here, and I'd be happy if I were, but wanted to file something to make sure this isn't lost if this is indeed a problem--sled agent seems especially load-bearing in this respect and we could stand to try to mitigate this if it is.

@gjcolombo added the bug and Sled Agent labels Mar 23, 2023
@gjcolombo added this to the MVP milestone Mar 23, 2023
@bnaecker
Collaborator

Related #725

@gjcolombo
Contributor Author

According to my browser history, I searched for "crash" and "restart" but not "boot" or "crashing" and so did not find that issue. I am going to go ponder for a while.

(Thanks @bnaecker, I appreciate the pointer! #725 is also referenced in a TODO in omicron_sled_agent::bootstrap::agent::cleanup_all_old_global_state, which does the cleanup described in that issue.)

@gjcolombo changed the title from "Sled agent appears to lose track of running Propolis zones if it crashes & restarts" to "Sled agent tears down all running zones (including running VMs) if it crashes & restarts" Jun 2, 2023

2 participants