Nexus reserves the "Destroyed" instance state for instances that have actually been deleted. If sled agent pushes an instance state update (or returns an instance runtime state) to Nexus that bears the "Destroyed" state, things get messy, because Nexus won't let you destroy an instance that already appears to be destroyed.
The two places where this appears to happen are:
- `put_state` on an instance that has been registered (so it's known to sled agent's instance manager) but not started yet
- an explicit call to `terminate` an instance (used when explicitly unregistering an instance without going through its Propolis; Nexus requests this in some saga unwind paths)
This can cause instances to leak into the "Destroyed" state relatively easily. The most straightforward way to achieve this is to ask Nexus to stop an instance that's in the process of stopping:
- T1: Propolis reaches its internal Destroyed state and publishes this back up to sled agent
- T2: Parallel stop request arrives, reaches `instance_manager::ensure_state`, and successfully looks up the instance, obtaining a reference to it
- T1: The Propolis state monitor task takes the instance lock, tears down the zone, clears the instance's `running_state`, and releases the lock
- T2: The stop request takes the instance lock, sees that there's no zone, and erroneously moves the instance to the Destroyed state and returns this state to Nexus
I can hit this within just a few seconds of running a stress test that I wrote for the occasion.
Switching the sled agent calls that transition instances to Destroyed to instead try to transition them to Stopped seems to mitigate the problem.
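The interleaving and the mitigation can be sketched roughly as follows. This is a simplified model, not sled agent's actual code: `SketchInstance`, `SketchState`, `monitor_observes_vmm_destroyed`, and `handle_stop_request` are hypothetical names, and a plain `has_zone` flag stands in for the real `running_state`. The point is only that the stop path, on finding no zone, reports Stopped rather than Destroyed.

```rust
use std::sync::Mutex;

/// Hypothetical instance states that this sketch lets sled agent report.
/// "Destroyed" is deliberately absent: Nexus reserves it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SketchState {
    Running,
    Stopping,
    Stopped,
}

/// Hypothetical stand-in for sled agent's per-instance bookkeeping.
pub struct SketchInstance {
    pub state: SketchState,
    /// Stands in for the real `running_state`; false once the zone is gone.
    pub has_zone: bool,
}

/// T1: the state monitor sees that Propolis destroyed its VM, takes the
/// lock, tears down the zone, and records the instance as Stopped.
pub fn monitor_observes_vmm_destroyed(instance: &Mutex<SketchInstance>) {
    let mut guard = instance.lock().unwrap();
    guard.has_zone = false;
    guard.state = SketchState::Stopped;
}

/// T2: a stop request that may run after T1 has already torn down the zone.
pub fn handle_stop_request(instance: &Mutex<SketchInstance>) -> SketchState {
    let mut guard = instance.lock().unwrap();
    if guard.has_zone {
        // A zone still exists, so a real stop is in progress.
        guard.state = SketchState::Stopping;
    } else {
        // No zone: the buggy path reported Destroyed here. The mitigation
        // reports Stopped instead, which Nexus can still delete later.
        guard.state = SketchState::Stopped;
    }
    guard.state
}
```

With this shape, running the T1 step before the T2 step yields Stopped regardless of how the two requests interleave around the lock.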
Nexus reserves the "Creating" and "Destroyed" instance states to refer
to instances that it (Nexus) is in the process of creating and that it
has deleted. Propolis uses these states to refer to VMMs that it has
created and destroyed. Sled agent is responsible for correcting this
impedance mismatch by turning Propolis's per-VMM states into appropriate
Nexus instance states. Specifically:
- A Propolis that's `Creating` its internal VM must be represented as an
instance that's `Starting`
- A Propolis that has `Destroyed` its internal VM must be represented as
an instance that's `Stopped`
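
One way to picture this translation is a total mapping from Propolis VMM states into a separate type that has no `Creating` or `Destroyed` variants at all, so the reserved states cannot be published by accident. The sketch below is an illustration under that assumption, not the actual change: `PropolisVmmState` is a simplified subset of Propolis's states, and `PublishedInstanceState` and `to_published_state` are hypothetical names.

```rust
/// Simplified, hypothetical subset of the per-VMM states Propolis reports.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PropolisVmmState {
    Creating,
    Starting,
    Running,
    Stopping,
    Stopped,
    Destroyed,
}

/// Hypothetical type for states sled agent may publish to Nexus. It has no
/// `Creating` or `Destroyed` variants because Nexus reserves those.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PublishedInstanceState {
    Starting,
    Running,
    Stopping,
    Stopped,
}

/// Translate a Propolis VMM state into a state that is safe to publish.
pub fn to_published_state(vmm: PropolisVmmState) -> PublishedInstanceState {
    match vmm {
        // A Propolis that is creating its VM is an instance that is starting.
        PropolisVmmState::Creating | PropolisVmmState::Starting => {
            PublishedInstanceState::Starting
        }
        PropolisVmmState::Running => PublishedInstanceState::Running,
        PropolisVmmState::Stopping => PublishedInstanceState::Stopping,
        // A Propolis that destroyed its VM is an instance that has stopped.
        PropolisVmmState::Stopped | PropolisVmmState::Destroyed => {
            PublishedInstanceState::Stopped
        }
    }
}
```

Because the `match` is exhaustive, adding a new Propolis state would force a decision about how to represent it before the code compiles.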
Fix two places where sled agent was erroneously publishing the Destroyed
state to Nexus. Also, adjust some types and function signatures to make
it harder to make this mistake in the future.
Fixes #3260.