sled agent sometimes publishes the Destroyed instance state when it should publish Stopped #3260

Closed
gjcolombo opened this issue May 30, 2023 · 1 comment · Fixed by #3284
Labels
Sled Agent Related to the Per-Sled Configuration and Management

Comments

@gjcolombo
Contributor

Nexus reserves the "Destroyed" instance state for instances that have actually been deleted. If sled agent pushes an instance state update (or returns an instance runtime state) to Nexus that bears the "Destroyed" state, things get messy, because Nexus won't let you destroy an instance that already appears to be destroyed.

The two places where this appears to happen are:

  • put_state on an instance that has been registered (so it's known to sled agent's instance manager) but not started yet
  • an explicit call to terminate an instance (used when explicitly unregistering an instance without going through its Propolis; Nexus requests this in some saga unwind paths)
@gjcolombo
Contributor Author

This can cause instances to leak into the "Destroyed" state relatively easily. The most straightforward way to trigger this is to ask Nexus to stop an instance that's already in the process of stopping:

  1. T1: Propolis reaches its internal Destroyed state and publishes this back up to sled agent
  2. T2: Parallel stop request arrives, reaches instance_manager::ensure_state, and successfully looks up the instance, obtaining a reference to it
  3. T1: Propolis's state monitor task takes the instance lock, tears down the zone, clears the instance's running_state, and releases the lock
  4. T2: The stop request takes the instance lock, sees that there's no zone, and erroneously moves the instance to the Destroyed state and returns this state to Nexus

I can hit this within just a few seconds of running a stress test that I wrote for the occasion.

Switching the sled agent calls that transition instances to Destroyed to instead try to transition them to Stopped seems to mitigate the problem.
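
For illustration, the interleaving above can be replayed deterministically in a small, single-threaded sketch. Everything here is hypothetical and heavily simplified: `Instance`, `monitor_teardown`, `stop_request`, and the `mitigated` flag are stand-ins invented for this example and are not sled agent's actual types or functions.

```rust
use std::sync::{Arc, Mutex};

/// Instance states as published to Nexus (hypothetical, simplified subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    Destroyed,
}

/// Hypothetical stand-in for sled agent's per-instance bookkeeping.
struct Instance {
    state: InstanceState,
    /// `Some` while a Propolis zone exists, `None` once it has been torn down.
    running_state: Option<&'static str>,
}

/// T1: the state monitor has seen Propolis reach its internal Destroyed
/// state, so it takes the lock, tears down the zone, and clears
/// `running_state`.
fn monitor_teardown(inst: &Mutex<Instance>) {
    let mut guard = inst.lock().unwrap();
    guard.running_state = None;
}

/// T2: a stop request that takes the lock after the teardown. With
/// `mitigated == false` it reproduces the reported behavior (no zone means
/// Destroyed); with `mitigated == true` it reports Stopped instead.
fn stop_request(inst: &Mutex<Instance>, mitigated: bool) -> InstanceState {
    let mut guard = inst.lock().unwrap();
    if guard.running_state.is_none() {
        guard.state = if mitigated {
            InstanceState::Stopped
        } else {
            InstanceState::Destroyed
        };
    }
    // The path where a zone still exists (forwarding the stop to Propolis)
    // is omitted from this sketch.
    guard.state
}

/// Replays the T1/T2 ordering from the list above and returns the state
/// that would be handed back to Nexus.
fn simulate(mitigated: bool) -> InstanceState {
    let inst = Arc::new(Mutex::new(Instance {
        state: InstanceState::Running,
        running_state: Some("propolis-zone"),
    }));
    // Step 2: T2 looks up the instance and holds a reference to it.
    let t2_ref = Arc::clone(&inst);
    // Step 3: T1 tears down the zone and releases the lock.
    monitor_teardown(&inst);
    // Step 4: T2 finally takes the lock and finds no zone.
    stop_request(&t2_ref, mitigated)
}

fn main() {
    println!("before mitigation: {:?}", simulate(false));
    println!("after mitigation:  {:?}", simulate(true));
}
```

The point of the sketch is only that the lookup in step 2 and the lock acquisition in step 4 are not atomic, so the stop path has to treat "no zone" as an already-stopped instance rather than a deleted one.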

gjcolombo added a commit that referenced this issue Jun 7, 2023
Nexus reserves the "Creating" and "Destroyed" instance states to refer
to instances that it (Nexus) is in the process of creating and that it
has deleted. Propolis uses these states to refer to VMMs that it has
created and destroyed. Sled agent is responsible for correcting this
impedance mismatch by turning Propolis's per-VMM states into appropriate
Nexus instance states. Specifically:

- A Propolis that's `Creating` its internal VM must be represented as an
instance that's `Starting`
- A Propolis that has `Destroyed` its internal VM must be represented as
an instance that's `Stopped`

Fix two places where sled agent was erroneously publishing the Destroyed
state to Nexus. Also, adjust some types and function signatures to make
it harder to make this mistake in the future.

Fixes #3260.
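
One way to make this mistake harder at the type level, sketched here under assumptions, is to give sled agent a separate enum for the states it may publish, with no `Creating` or `Destroyed` variants, and to funnel every Propolis state through a single conversion. The type names below are hypothetical and are not taken from the actual commit. Only the `Creating` to `Starting` and `Destroyed` to `Stopped` mappings come from the commit message; the remaining arms are assumed to pass through unchanged.

```rust
/// States Propolis reports for its VMM (hypothetical, simplified subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PropolisVmmState {
    Creating,
    Starting,
    Running,
    Stopping,
    Stopped,
    Destroyed,
}

/// States sled agent is allowed to publish to Nexus in this sketch.
/// There is deliberately no `Creating` or `Destroyed` variant, so code
/// that tries to publish either one does not compile.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PublishedInstanceState {
    Starting,
    Running,
    Stopping,
    Stopped,
}

impl From<PropolisVmmState> for PublishedInstanceState {
    fn from(vmm: PropolisVmmState) -> Self {
        match vmm {
            // From the commit message: a Propolis that is `Creating` its VM
            // is represented as an instance that is `Starting`.
            PropolisVmmState::Creating | PropolisVmmState::Starting => {
                PublishedInstanceState::Starting
            }
            // Assumed pass-through mappings.
            PropolisVmmState::Running => PublishedInstanceState::Running,
            PropolisVmmState::Stopping => PublishedInstanceState::Stopping,
            // From the commit message: a Propolis that has `Destroyed` its VM
            // is represented as an instance that is `Stopped`.
            PropolisVmmState::Stopped | PropolisVmmState::Destroyed => {
                PublishedInstanceState::Stopped
            }
        }
    }
}

fn main() {
    let published = PublishedInstanceState::from(PropolisVmmState::Destroyed);
    println!("{:?}", published);
}
```

Because the `match` is exhaustive, adding a new Propolis state later forces a decision about how it should appear to Nexus.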