sled agent sometimes publishes the Destroyed instance state when it should publish Stopped #3260

gjcolombo · 2023-05-30T21:18:22Z

Nexus reserves the "Destroyed" instance state for instances that have actually been deleted. If sled agent pushes an instance state update (or returns an instance runtime state) to Nexus that bears the "Destroyed" state, things get messy, because Nexus won't let you destroy an instance that already appears to be destroyed.

The two places where this appears to happen are:

- `put_state` on an instance that has been registered (so it's known to sled agent's instance manager) but not started yet
- an explicit call to terminate an instance (used when explicitly unregistering an instance without going through its Propolis; Nexus requests this in some saga unwind paths)


gjcolombo · 2023-06-02T04:09:38Z

This can cause instances to leak into the "Destroyed" state relatively easily. The most straightforward way to achieve this is to ask Nexus to stop an instance that's in the process of stopping:

T1: Propolis reaches its internal Destroyed state and publishes this back up to sled agent
T2: Parallel stop request arrives, reaches instance_manager::ensure_state, and successfully looks up the instance, obtaining a reference to it
T1: The Propolis's state monitor task takes the instance lock, tears down the zone, clears the instance's running_state, and releases the lock
T2: The stop request takes the instance lock, sees that there's no zone, and erroneously moves the instance to the Destroyed state and returns this state to Nexus

I can hit this within just a few seconds of running a stress test that I wrote for the occasion.

Switching the sled agent calls that transition instances to Destroyed to instead try to transition them to Stopped seems to mitigate the problem.
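The interleaving above, together with the mitigation, can be sketched roughly as follows. This is a simplified, single-threaded replay under stated assumptions, not the sled agent code: `Instance`, `running_zone`, `on_propolis_destroyed`, `request_stop`, and `replay_race` are hypothetical names invented for illustration, and the real instance manager is asynchronous.

```rust
use std::sync::{Arc, Mutex};

/// States this sketch allows sled agent to report (hypothetical subset).
/// There is deliberately no `Destroyed` variant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InstanceState {
    Running,
    Stopping,
    Stopped,
}

/// Hypothetical stand-in for sled agent's per-instance record.
pub struct Instance {
    state: InstanceState,
    /// `Some` while a Propolis zone exists, `None` after teardown.
    running_zone: Option<String>,
}

/// T1: the state monitor observes Propolis's internal Destroyed state,
/// takes the lock, tears down the zone, and records the instance as Stopped.
pub fn on_propolis_destroyed(instance: &Mutex<Instance>) {
    let mut inst = instance.lock().unwrap();
    inst.running_zone = None;
    inst.state = InstanceState::Stopped;
}

/// T2: a stop request that already holds a reference to the instance.
pub fn request_stop(instance: &Mutex<Instance>) -> InstanceState {
    let mut inst = instance.lock().unwrap();
    if inst.running_zone.is_some() {
        // A zone still exists, so a stop is genuinely in progress.
        inst.state = InstanceState::Stopping;
    } else {
        // The zone is already gone. Report Stopped; Destroyed is not
        // representable here, so it cannot leak to the caller.
        inst.state = InstanceState::Stopped;
    }
    inst.state
}

/// Replays the T1/T2 ordering from the comment above in a fixed order.
pub fn replay_race() -> InstanceState {
    let instance = Arc::new(Mutex::new(Instance {
        state: InstanceState::Running,
        running_zone: Some("propolis-zone".to_string()),
    }));
    // T2 looks up the instance and obtains a reference to it.
    let t2_ref = Arc::clone(&instance);
    // T1 tears down the zone before T2 acquires the lock.
    on_propolis_destroyed(&instance);
    // T2 now takes the lock and finds no zone.
    request_stop(&t2_ref)
}
```

In this sketch the late stop request returns `Stopped`, which Nexus can still act on, instead of a state that Nexus treats as already deleted.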

Nexus reserves the "Creating" and "Destroyed" instance states to refer to instances that it (Nexus) is in the process of creating and that it has deleted. Propolis uses these states to refer to VMMs that it has created and destroyed. Sled agent is responsible for correcting this impedance mismatch by turning Propolis's per-VMM states into appropriate Nexus instance states. Specifically:

- A Propolis that's `Creating` its internal VM must be represented as an instance that's `Starting`
- A Propolis that has `Destroyed` its internal VM must be represented as an instance that's `Stopped`

Fix two places where sled agent was erroneously publishing the Destroyed state to Nexus. Also, adjust some types and function signatures to make it harder to make this mistake in the future.

Fixes #3260.
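One way to make this mistake harder at the type level is to give sled agent a published-state type that has no `Creating` or `Destroyed` variant at all. The following is a minimal sketch of that idea, not the change made in #3284: `PropolisState`, `PublishedInstanceState`, and their variant sets are assumptions chosen for illustration and are simplified relative to the real Propolis and Nexus state enums.

```rust
/// States a Propolis VMM might report (hypothetical, simplified subset).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PropolisState {
    Creating,
    Starting,
    Running,
    Stopping,
    Stopped,
    Destroyed,
}

/// States sled agent may publish to Nexus in this sketch. `Creating` and
/// `Destroyed` are absent because Nexus reserves them for its own use.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PublishedInstanceState {
    Starting,
    Running,
    Stopping,
    Stopped,
}

impl From<PropolisState> for PublishedInstanceState {
    fn from(state: PropolisState) -> Self {
        match state {
            // A Propolis that is creating its VM is an instance that is starting.
            PropolisState::Creating | PropolisState::Starting => {
                PublishedInstanceState::Starting
            }
            PropolisState::Running => PublishedInstanceState::Running,
            PropolisState::Stopping => PublishedInstanceState::Stopping,
            // A Propolis that has destroyed its VM is an instance that is stopped.
            PropolisState::Stopped | PropolisState::Destroyed => {
                PublishedInstanceState::Stopped
            }
        }
    }
}
```

Because the conversion is the only path from a Propolis state to a published state in this sketch, any code that tries to publish `Destroyed` fails to compile rather than misbehaving at runtime.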

gjcolombo added the "Sled Agent" label (Related to the Per-Sled Configuration and Management) May 30, 2023

gjcolombo self-assigned this May 30, 2023

gjcolombo mentioned this issue May 31, 2023

sagas::instance_create::test::test_action_failure_can_unwind doesn't test failure after all saga nodes #3265

Closed

gjcolombo mentioned this issue Jun 2, 2023

sled agent: don't publish the Destroyed instance state #3284

Merged

gjcolombo closed this as completed in #3284 Jun 7, 2023

gjcolombo mentioned this issue Jun 8, 2023

Instance stuck in Starting with a Propolis zone but no server process #3319

Open
