The API to `instance_update_runtime` is misleading #611
Comments
The intent behind
The canonical case is that Nexus receives an updated …

You're right that if Nexus read the `Instance`, modified some fields of the runtime state, wrote it back to the database, and ignored the case where the update wasn't applied, then some of Nexus's updates may have been lost. But that's a violation of the invariant that Sled Agent owns the runtime state -- Nexus isn't supposed to be the immediate source of changes to it. For the three callers today:
That's true, but I'm not sure that's Nexus's responsibility. If the Sled Agent has produced a runtime state with state = "running" after it was previously "failed", then it is running and the database state should reflect that. There may be some bug that allowed us to get there, and we may want to understand that, but the database state should still reflect reality, and Sled Agent is the source of truth for this information.
That's basically right -- and this is essential. If you look at
I think this would break the intended behavior. If we start at generation N, hostname = H1, state = "stopped"; and Sled Agent issues two updates: generation N + 1, hostname = H2, state = "starting"; and generation N + 2, hostname = H3, state = "running"; how would we ensure that the database state always converges to generation = N + 2, hostname = H3, state = "running", regardless of the order we received those two events?
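For illustration, here is a minimal, self-contained sketch (hypothetical types, not the actual omicron code) of how a strictly-increasing generation guard gives exactly this convergence:

```rust
// Hypothetical types for illustration only -- not the actual omicron code.
#[derive(Clone, Debug, PartialEq)]
struct RuntimeState {
    generation: u64,
    hostname: &'static str,
    state: &'static str,
}

/// Apply an update only if it is strictly newer than what is stored.
fn apply(stored: &mut RuntimeState, update: &RuntimeState) -> bool {
    if update.generation > stored.generation {
        *stored = update.clone();
        true
    } else {
        false
    }
}

fn main() {
    let start = RuntimeState { generation: 1, hostname: "H1", state: "stopped" };
    let u1 = RuntimeState { generation: 2, hostname: "H2", state: "starting" };
    let u2 = RuntimeState { generation: 3, hostname: "H3", state: "running" };

    // In-order delivery: both updates apply.
    let mut in_order = start.clone();
    assert!(apply(&mut in_order, &u1));
    assert!(apply(&mut in_order, &u2));

    // Out-of-order delivery: the stale update is rejected,
    // but the stored state still ends up the same.
    let mut out_of_order = start.clone();
    assert!(apply(&mut out_of_order, &u2));
    assert!(!apply(&mut out_of_order, &u1));

    assert_eq!(in_order, out_of_order); // both converge to generation 3, H3, "running"
}
```

Whichever order the two updates arrive in, the stale one is rejected and both histories end at generation N + 2.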
So this issue was motivated by the use case within live migration: https://github.com/oxidecomputer/omicron/pull/447/files/fc6d00749336b276d722f1a1edce572425b4dab6#r784341062

From what I can tell, @luqmana was updating the state from Nexus, and setting a UUID identifying that a migration is underway. If what you're saying about "the Sled owns this data" is an invariant, this seems like an incorrect call to …

It sounds like the correct behavior to start a migration would need to be:
Does that seem right?
That sounds right, given what we've got. If that works out, maybe there are ways to better communicate or enforce these invariants. I can see how it's pretty implicit today. It's also possible this approach is too simple and needs work. Let me know if this feels like a bad fit. (The reason it works this way today is to avoid having to synchronize in Nexus for create/boot/halt/reboot/etc. The synchronization happens at Sled Agent, where it has to happen anyway, and that makes the Nexus and database state quite a lot simpler.)
I think the ownership semantics totally make sense, and do simplify things a lot, but I'll think about expressiveness in code. I've been wondering about this for a bit: would we benefit from restructuring the datastore into more isolated subcomponents, so we can enforce these semantics behind slightly-more-private interfaces? I even see that the concept of "sled agent owns the runtime state" is documented briefly in …
I think that's a good idea. It'd be great to do this in a way that doesn't recreate the problems of traditional ORMs, where it's not always clear where database operations happen, what things can fail for what reasons, etc. I've been thinking of doing something like this for the authz-aware lookup APIs at some point.
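As a rough illustration of the "slightly-more-private interfaces" idea (hypothetical module layout and names, not something proposed concretely in this thread), Rust module privacy could gate runtime-state writes behind a capability that only the sled-agent-notification path can construct:

```rust
mod datastore {
    pub mod instance_runtime {
        /// Only code that can obtain one of these may write runtime state.
        pub struct RuntimeStateWriter(());

        impl RuntimeStateWriter {
            // Restricted constructor: visible only to the parent `datastore` module.
            pub(super) fn new() -> Self {
                RuntimeStateWriter(())
            }

            pub fn try_update_state(&self, _instance_id: u128, _new_state: &str, _generation: u64) -> bool {
                // ... a generation-guarded, state-only update would live here ...
                true
            }
        }
    }

    /// The sled-agent-facing entry point is the only place that hands out a writer.
    pub fn writer_for_sled_agent_notification() -> instance_runtime::RuntimeStateWriter {
        instance_runtime::RuntimeStateWriter::new()
    }
}

fn main() {
    let writer = datastore::writer_for_sled_agent_notification();
    assert!(writer.try_update_state(0, "running", 2));
    // `datastore::instance_runtime::RuntimeStateWriter::new()` would not compile
    // here: the constructor is private outside the `datastore` module tree.
}
```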
Background

`instance_update_runtime` returns a boolean value meaning "updated": `true` if the update operation succeeded, or `false` if it failed because of a concurrent modification (even though the object-to-be-updated does exist):

omicron/nexus/src/db/datastore.rs, lines 816 to 820 in 3d132cc
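To make the return-value semantics concrete, here is a minimal sketch using hypothetical stand-in types; the real implementation is the database query at the lines referenced above, so treat this only as an illustration of the behavior being described:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types -- a sketch of the semantics,
// not the actual datastore code.
#[derive(Clone, Debug, PartialEq)]
struct InstanceRuntimeState {
    generation: u64,  // guards against concurrent modification
    state: String,    // e.g. "stopped", "starting", "running", "failed"
    hostname: String,
}

struct Datastore {
    instances: HashMap<u128, InstanceRuntimeState>, // keyed by instance UUID
}

impl Datastore {
    /// Conditional update: the new runtime state is written only if its
    /// generation is newer than what is stored. Returns `true` if the row
    /// was updated and `false` if the caller lost a concurrency race; the
    /// row exists in both cases.
    fn instance_update_runtime(&mut self, id: u128, new: &InstanceRuntimeState) -> bool {
        match self.instances.get_mut(&id) {
            Some(current) if new.generation > current.generation => {
                *current = new.clone();
                true
            }
            _ => false,
        }
    }
}
```

The later sketches in this writeup build on these hypothetical types.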
Implications
The generation numbers used within the Instance table are intended to guard access to the state object from concurrent modification. In a "read-modify-write" pattern, they can help the caller optimistically manage concurrency control.
I believe the intent behind ignoring the result was to imply "whoever was able to use the right generation number 'wins' the concurrency race, and their operation occurs". The other update operation - which does not occur - can be treated as if it happened earlier and was immediately overwritten.
This is supported by documentation in `instance_set_runtime`:

omicron/nexus/src/nexus.rs, lines 1167 to 1170 in 3d132cc
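Concretely, using the hypothetical sketch from the Background section, "ignoring the result" plays out like this (illustrative only):

```rust
// Two writers race to publish a successor of generation N. Whoever lands
// first "wins"; the loser's write is a no-op, as if it had happened earlier
// and been immediately overwritten.
fn race_demo(ds: &mut Datastore, id: u128) {
    let current = ds.instances.get(&id).unwrap().clone(); // generation N

    let mut a = current.clone();
    a.generation += 1;
    a.state = "starting".to_string();

    let mut b = current.clone();
    b.generation += 1;
    b.state = "stopping".to_string();

    assert!(ds.instance_update_runtime(id, &a));  // A wins the race
    assert!(!ds.instance_update_runtime(id, &b)); // B's update is not applied
}
```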
Problem

What if we modify more than the `state` column?

The API for `instance_update_runtime` takes an entire `InstanceRuntimeState` object as input, and updates the entire portion of the DB row based on the generation number. This happens to be safe with our current usage, because we exclusively use the following pattern:

1. Read the `Instance` row
2. Modify the `State` column exclusively
3. Write the `InstanceRuntimeState` back to the `Instance` row
rowHowever,
InstanceRuntimeState
contains many fields, not just the state. It is entirely possible for a user of this API to perform the following operation:Instance
rowState
column and another column (perhaps changing the hostname, increasing memory, etc)InstanceRuntimeState
back to theInstance
rowFurther, even in the case where we are only modifying the state, there are often cases "skipping" an intermediate state would be unacceptable. For example, setting the state of an instance to
failed
should be terminal, not skipped.Although this is technically avoidable by reacting to the return code of the function, I'd argue that it's still dangerous and subtle, especially with the naming of the
pub
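Continuing the hypothetical sketch from the Background section, here is how the multi-field case can silently lose an update when the returned boolean is ignored:

```rust
// A caller edits the state *and* another column, writes the whole object
// back, and ignores the returned bool: its hostname change is silently lost
// when it loses the generation race.
fn lost_update_demo(ds: &mut Datastore, id: u128) {
    // Caller reads the row at generation N and edits two fields.
    let mut edited = ds.instances.get(&id).unwrap().clone();
    edited.generation += 1;
    edited.state = "running".to_string();
    edited.hostname = "new-hostname".to_string();

    // Meanwhile, Sled Agent publishes its own generation N + 1.
    let mut from_sled = ds.instances.get(&id).unwrap().clone();
    from_sled.generation += 1;
    from_sled.state = "starting".to_string();
    assert!(ds.instance_update_runtime(id, &from_sled));

    // The caller's whole-object write is rejected. If the bool is ignored,
    // the state change *and* the hostname change vanish without a trace.
    assert!(!ds.instance_update_runtime(id, &edited));
    assert_ne!(ds.instances.get(&id).unwrap().hostname, "new-hostname");
}
```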
Proposal
Avoid the multi-field update issue by changing the arguments: instead of operating on an entire `InstanceRuntimeState` object, I propose this method take as inputs a UUID, the new desired state, and the new (expected) generation number.

Update the name of the function to imply that it is conditional and only updates the state. I'd argue for a name like `instance_try_update_state`. Although all of our database operations may fail, it would be useful to be able to distinguish unconditional from conditional update functions, since the distinction always has implications for the caller.
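For reference, one possible shape of the proposed narrower API, again using the hypothetical types from the Background sketch (the name and argument list follow the proposal above; everything else is a stand-in):

```rust
impl Datastore {
    /// Conditionally update *only* the state column, guarded by the
    /// generation number. `true` means the row was updated; `false` means
    /// the caller lost the concurrency race.
    fn instance_try_update_state(
        &mut self,
        instance_id: u128,
        new_state: &str,
        new_generation: u64,
    ) -> bool {
        match self.instances.get_mut(&instance_id) {
            Some(current) if new_generation > current.generation => {
                current.generation = new_generation;
                current.state = new_state.to_string();
                true
            }
            _ => false,
        }
    }
}
```

With this shape, a caller simply cannot clobber other columns by accident, and the "try" in the name keeps the conditional behavior visible at the call site.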
Current Usage, for Reference

- `instance_set_runtime`
- `notify_instance_updated`
- `sic_instance_ensure`