
The API to instance_update_runtime is misleading #611

Open
smklein opened this issue Jan 19, 2022 · 5 comments
Labels
api Related to the API.

Comments

@smklein (Collaborator)

smklein commented Jan 19, 2022

Background

instance_update_runtime returns a boolean value meaning "updated": true if the update operation was applied, and false if it was skipped because of a concurrent modification, even though the object to be updated does exist:

```rust
pub async fn instance_update_runtime(
    &self,
    instance_id: &Uuid,
    new_runtime: &InstanceRuntimeState,
) -> Result<bool, Error> {
```

Implications

The generation numbers used within the Instance table are intended to guard access to the state object from concurrent modification. In a "read-modify-write" pattern, they can help the caller optimistically manage concurrency control.
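In essence, the generation guard is a compare-and-swap on the generation number. A minimal in-memory sketch (the types below are hypothetical stand-ins, not the real omicron schema):

```rust
// Hypothetical stand-in for a row's runtime state; not the real schema.
#[derive(Clone, Debug, PartialEq)]
struct RuntimeState {
    gen: u64,
    state: String,
}

/// Applies `new` only if it carries a newer generation than what is
/// stored. The boolean result is analogous to the one returned by
/// `instance_update_runtime`: true means "applied", false means the
/// write lost the concurrency race and was skipped.
fn update_if_newer(stored: &mut RuntimeState, new: &RuntimeState) -> bool {
    if new.gen > stored.gen {
        *stored = new.clone();
        true
    } else {
        false
    }
}
```

In the real datastore this comparison happens inside a conditional SQL UPDATE rather than in memory, but the semantics are the same.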

I believe the intent behind ignoring the result was to imply "whoever was able to use the right generation number 'wins' the concurrency race, and their operation occurs". The other update operation - which does not occur - can be treated as if it happened earlier and was immediately overwritten.

This is supported by documentation in instance_set_runtime:

omicron/nexus/src/nexus.rs

Lines 1167 to 1170 in 3d132cc

* Ask the sled agent to begin the state change. Then update the
* database to reflect the new intermediate state. If this update is
* not the newest one, that's fine. That might just mean the sled agent
* beat us to it.

Problem

What if we modify more than the state column?

The API for instance_update_runtime takes an entire InstanceRuntimeState object as input, and updates the entire portion of the DB row based on the generation number.

This happens to be safe today because we exclusively use the following pattern:

  • Read Instance row
  • Modify the State column exclusively
  • Write InstanceRuntimeState back to the Instance row
  • If the update was skipped, that's fine - it's as if we transitioned to our state, and then transitioned immediately to the new state.

However, InstanceRuntimeState contains many fields, not just the state. It is entirely possible for a user of this API to perform the following operation:

  • Read Instance row
  • Modify the State column and another column (perhaps changing the hostname, increasing memory, etc)
  • Write InstanceRuntimeState back to the Instance row
  • If the update is skipped here, our associated data is lost! In most cases, this will not be acceptable, regardless of concurrent state changes.
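The hazard in that second pattern can be replayed with a small in-memory model (hypothetical types, not the real datastore):

```rust
// Hypothetical row model for illustration only.
#[derive(Debug)]
struct Row {
    gen: u64,
    state: &'static str,
    hostname: &'static str,
}

fn write_if_newer(row: &mut Row, new: Row) -> bool {
    if new.gen > row.gen {
        *row = new;
        true
    } else {
        false
    }
}

/// Replays the hazardous interleaving: Nexus reads the row at gen 1 and
/// prepares a combined state + hostname change at gen 2, but a
/// state-only update lands at gen 2 first. Nexus's write is then
/// silently skipped, and the hostname change vanishes.
fn lost_update_demo() -> (bool, Row) {
    let mut row = Row { gen: 1, state: "stopped", hostname: "h1" };
    let nexus_write = Row { gen: 2, state: "starting", hostname: "h2" };
    // A concurrent state-only update wins the race.
    write_if_newer(&mut row, Row { gen: 2, state: "starting", hostname: "h1" });
    let applied = write_if_newer(&mut row, nexus_write);
    (applied, row)
}
```

The caller never finds out the rename was dropped unless it checks the returned boolean.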

Further, even in the case where we are only modifying the state, there are often cases where "skipping" an intermediate state would be unacceptable. For example, setting the state of an instance to failed should be terminal, not skipped.

Although this is technically avoidable by checking the function's return value, I'd argue that it's still dangerous and subtle, especially given the name of the pub function, which doesn't imply this "conditional" aspect.

Proposal

  1. Avoid the multi-field update issue by changing the arguments. Instead of operating on an entire InstanceRuntimeState object, I propose this method act on the inputs of a UUID, the new desired state, and the new (expected) generation number.

  2. Update the name of the function to imply that it is conditional, and only updates the state. I'd argue for a name like instance_try_update_state. Although all of our database operations may fail, it would be useful to be able to distinguish unconditional from conditional update functions, since the distinction always has implications for the caller.
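A minimal in-memory sketch of what the narrower, conditional API could look like (the map-backed table and u64 ids are stand-ins for the real database and Uuid; all names here are hypothetical):

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the proposed `instance_try_update_state`:
/// it takes only an id, the desired new state, and the new (expected)
/// generation, rather than a whole InstanceRuntimeState, so no other
/// fields can be silently overwritten or lost.
fn instance_try_update_state(
    table: &mut HashMap<u64, (u64, String)>, // id -> (gen, state)
    id: u64,
    new_state: &str,
    new_gen: u64,
) -> bool {
    match table.get_mut(&id) {
        Some((gen, state)) if new_gen > *gen => {
            *gen = new_gen;
            *state = new_state.to_string();
            true // update applied
        }
        _ => false, // row missing, or a newer generation already landed
    }
}
```

Because only the state column (plus the generation) can change, the multi-field lost-update hazard described above cannot arise through this entry point.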


Current Usage, for Reference

@davepacheco (Collaborator)

The intent behind InstanceRuntimeState involved three assumptions:

  1. all the fields in that object are owned by the Sled Agent (i.e., Sled Agent is the source of truth for all those fields)
  2. all the fields in that object are reported as one unit by the Sled Agent
  3. all the fields in that object are together protected by the generation number. (If any one of them changes, the generation number should change.)

The canonical case is that Nexus receives an updated InstanceRuntimeState from the Sled Agent (via notify_instance_updated). In that case, if the update is applied, then the state we received (and applied) is the latest. If the update was not applied, then there was a newer update in the database. Since a database update always updates all the fields, and they're all received together from the Sled Agent, I don't see how an update could be lost.

You're right that if Nexus read the Instance, modified some fields of the runtime state, wrote it back to the database, and ignored the case where the update wasn't applied, then some of Nexus's updates may have been lost. But that's a violation of the invariant that Sled Agent owns the runtime state -- Nexus isn't supposed to be the immediate source of changes to it.

For the three callers today:

  • instance_set_runtime: when Nexus does want to change something itself, it uses this function to ask Sled Agent to make the change, then gets back an updated InstanceRuntimeState, and then uses instance_update_runtime to apply it to the database. This satisfies the assumptions I said above: the new runtime comes from Sled Agent, including all fields together.
  • notify_instance_updated: discussed above -- this is a direct update from the Sled Agent of all the fields in the runtime state
  • sic_instance_ensure: during Instance creation, we tell the Sled Agent about the new Instance, then it returns the initial runtime state, then we use instance_update_runtime to apply the change. Again, we're satisfying the assumption that the fields are all coming from the Sled Agent in one unit.

Further, even in the case where we are only modifying the state, there are often cases where "skipping" an intermediate state would be unacceptable. For example, setting the state of an instance to failed should be terminal, not skipped.

That's true, but I'm not sure that's Nexus's responsibility. If the Sled Agent has produced a runtime state with state = "running" after it was previously "failed", then it is running, and the database state should reflect that. There may be some bug that allowed us to get there, and we may want to understand that bug, but the database state should still reflect reality, and Sled Agent is the source of truth for this information.

I believe the intent behind ignoring the result was to imply "whoever was able to use the right generation number 'wins' the concurrency race, and their operation occurs". The other update operation - which does not occur - can be treated as if it happened earlier and was immediately overwritten.

That's basically right -- and this is essential. If you look at instance_set_runtime or sic_instance_ensure: in those cases, Sled Agent returns one runtime state, then later will notify us about a new one (e.g., the returned state might be "starting", then later it becomes "running"). But there's no guarantee that we'll get the returned value and apply it to the database before Sled Agent notifies us about the next state change. If we didn't ignore a newer generation, we might clobber a "running" state with an earlier-in-time "starting" state if we got those events in the wrong order.

Avoid the multi-field update issue by changing the arguments. Instead of operating on an entire InstanceRuntimeState object, I propose this method act on the inputs of a UUID, the new desired state, and the new (expected) generation number.

I think this would break the intended behavior. If we start at generation N, hostname = H1, state = "stopped", and Sled Agent issues two updates: generation N+1, hostname = H2, state = "starting"; and generation N+2, hostname = H3, state = "running"; how would we ensure that the database state always converges to generation = N+2, hostname = H3, state = "running", regardless of the order in which we received those two events?
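The convergence property at stake can be sketched with a toy model (hypothetical types; each update carries the full state plus a generation, and is applied only if newer):

```rust
// Hypothetical full-state update carrying a generation number.
#[derive(Clone, Debug, PartialEq)]
struct Rt {
    gen: u64,
    hostname: &'static str,
    state: &'static str,
}

/// Applies `new` only if it carries a newer generation.
fn apply(row: &mut Rt, new: &Rt) {
    if new.gen > row.gen {
        *row = new.clone();
    }
}

/// Folds a sequence of updates, in whatever order they arrived, onto a row.
fn deliver(mut row: Rt, updates: &[Rt]) -> Rt {
    for u in updates {
        apply(&mut row, u);
    }
    row
}
```

Because every update carries all the fields and a total order (the generation), delivering the updates in any order converges to the same final row: the out-of-order older update is simply ignored. A state-plus-generation-only API would lose this property for the other fields.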

@smklein (Collaborator, Author)

smklein commented Jan 19, 2022

So this issue was motivated by the use case within live migration: https://github.com/oxidecomputer/omicron/pull/447/files/fc6d00749336b276d722f1a1edce572425b4dab6#r784341062

From what I can tell, @luqmana was updating the state from Nexus, and setting a UUID identifying that a migration is underway.

If "the Sled Agent owns this data" is indeed an invariant, this seems like an incorrect call to instance_update_runtime: the Sled Agent owns the information, and Nexus shouldn't directly modify it.

It sounds like the correct behavior to start a migration would need to be:

  • Nexus tells the sled agent to try to start a migration with a UUID
  • Nexus awaits a response from the Sled Agent with the new runtime, which should include the updated information
  • Only after that information is returned should Nexus record the "instance is migrating" info to the DB

Does that seem right?

@davepacheco (Collaborator)

That sounds right, given what we've got. instance_set_runtime may be helpful for this -- it takes a requested runtime state, sends it to the sled agent (which returns the new actual runtime), and saves that to the database. If the migration can be expressed as part of the runtime state, I hope a lot of this would just work. I haven't taken a look at the migration PR in a while -- this may be way off.

If that works out, maybe there are ways to better communicate or enforce these invariants. I can see how it's pretty implicit today. It's also possible this approach is too simple and needs work. Let me know if this feels like a bad fit. (The reason it works this way today is to avoid having to synchronize in Nexus for create/boot/halt/reboot/etc. The synchronization happens at Sled Agent, where it has to happen anyway, and that makes the Nexus and database state quite a lot simpler.)

@smklein (Collaborator, Author)

smklein commented Jan 19, 2022

I think the ownership semantics totally make sense, and do simplify things a lot, but I'll think about expressiveness in code.

I've been wondering about this for a bit, but I wonder if we'd benefit from restructuring of the datastore into more isolated subcomponents, so we can enforce these semantics behind slightly-more-private interfaces. I even see that the concept of "sled agent owns the runtime state" is documented briefly in model.rs, but that isn't visible if someone is just eyeballing the public interface to the datastore, and stumbles across this instance_update_runtime method.
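One way such "slightly-more-private" interfaces could look, sketched with hypothetical names (not the real omicron datastore): Rust module privacy can keep the raw generation-guarded write private, while the only public entry point encodes the "Sled Agent owns the runtime state" invariant:

```rust
mod instance_datastore {
    // Hypothetical module sketch. The raw generation-guarded write is
    // private, so code outside this module can only apply runtime
    // state that came back from the sled agent.
    pub struct Row {
        pub gen: u64,
        pub state: String,
    }

    /// A runtime-state report as received from the sled agent.
    pub struct SledAgentReport {
        pub gen: u64,
        pub state: String,
    }

    // Private: the dangerous conditional write is not callable directly.
    fn update_runtime_raw(row: &mut Row, gen: u64, state: &str) -> bool {
        if gen > row.gen {
            row.gen = gen;
            row.state = state.to_string();
            true
        } else {
            false
        }
    }

    /// The only public way to change runtime state: from a sled agent report.
    pub fn apply_sled_agent_report(row: &mut Row, report: &SledAgentReport) -> bool {
        update_runtime_raw(row, report.gen, &report.state)
    }
}
```

With this layout, a caller eyeballing the public interface cannot reach the raw conditional update at all, so the invariant is enforced by visibility rather than by documentation in model.rs.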

@davepacheco (Collaborator)

I've been wondering about this for a bit, but I wonder if we'd benefit from restructuring of the datastore into more isolated subcomponents, so we can enforce these semantics behind slightly-more-private interfaces.

I think that's a good idea. It'd be great to do this in a way that doesn't recreate the problems of traditional ORMs, where it's not always clear where database operations happen, what things can fail for what reasons, etc.

I've been thinking of doing something like this for the authz-aware lookup APIs at some point.

@morlandi7 morlandi7 added the api Related to the API. label Jun 23, 2023
leftwo pushed a commit that referenced this issue Jan 28, 2024
Crucible changes
Remove a superfluous copy during write serialization (#1087)
Update to progenitor v0.5.0, pull in required Omicron updates (#1115)
Update usdt to v0.5.0 (#1116)
Do not panic on reinitialize of a downstairs client. (#1114)
Bump (tracing-)opentelemetry(-jaeger) (#1113)
Make the Guest -> Upstairs queue fully async (#1086)
Switch to per-block ownership (#1107)
Handle timeout in the client IO task (#1109)
Enforce buffer alignment (#1106)
Block size buffers (#1105)
New dtrace probes and a counter struct in the Upstairs. (#1104)
Implement read decryption offloading (#1089)
Remove Arc + Mutex from Buffer (#1094)
Comment cleanup and rename of DsState::Repair -> Reconcile (#1102)
do not panic the dynamometer for OOB writes (#1101)
Allow dsc to start the downstairs in read-only mode. (#1098)
Use the omicron-zone-package methods for topo sorting (#1099)
Package with topological sorting (#1097)
Fix clippy lints in dsc (#1095)

Propolis changes:
PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626)
PHD: fix missing newlines in serial.log (#622)
PHD: fix run_shell_command with multiline commands (#621)
PHD: fix `--artifact-directory` not doing anything (#618)
Update h2 dependency
Update Crucible (and Omicron) dependencies
PHD: refactor guest serial console handling (#615)
phd: add basic "migration-from-base" tests + machinery (#609)
phd: Ensure min disk size fits read-only parents (#611)
phd: automatically fetch `crucible-downstairs` from Buildomat (#604)
Mitigate behavior from illumos#16183
PHD: add guest adapter for WS2022 (#607)
phd: include error cause chain in failure output (#606)
add QEMU pvpanic ISA device (#596)
Add crucible-mem backend
Make crucible opt parsing more terse in standalone
leftwo added a commit that referenced this issue Jan 29, 2024