-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nexus] Re-implement disk attach/detach #1106
Conversation
/// // /* Make a decision on whether or not to apply ANY updates */ | ||
/// // do_update AS ( | ||
/// // SELECT IF( | ||
/// // EXISTS(SELECT id FROM collection_info) AND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this may be a simpler implementation of the "dueling administrators" solution proposed in RFD 192.
Basically, to perform a conditional UPDATE and check for existence of an object, that proposal relies on a division-by-zero to do an early exit before performing the UPDATE itself.
However, it seems possible to:
- Avoid the early exit
- Make the UPDATE conditional on whatever subqueries we want
- Return the results of multiple rows, letting the caller figure out what happened (did the object exist? did we perform the update? etc.)
This version is slightly more verbose, but the lack of "division by zero" makes it seem simpler to me.
This change should also resolve the race condition mentioned in this comment by @plotnick : https://github.com/oxidecomputer/omicron/pull/1068/files#r873063471 |
@@ -0,0 +1,1480 @@ | |||
// This Source Code Form is subject to the terms of the Mozilla Public |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there's a lot here, I'm going to go ahead and try to give a "tour" of the contents of this CTE.
collection_detach.rs
and collection_detach_many.rs
are similar in layout.
nexus/src/db/collection_attach.rs
Outdated
use diesel::sql_types::{BigInt, Nullable, SingleValue}; | ||
use std::fmt::Debug; | ||
|
||
/// The table representing the collection. The resource references |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right out the gate, I'm using a ton of type aliases. The DatastoreAtttachTarget<ResourceType>
trait is implemented on a collection (after all, "collections" hold "resources") so nearly all of these act on "resource type, collection type".
Most of these are just short-hands for "get me the key/column/table/etc belonging to the resource/column".
/// type ResourceTimeDeletedColumn = disk::dsl::time_deleted; | ||
/// } | ||
/// ``` | ||
pub trait DatastoreAttachTarget<ResourceType>: Selectable<Pg> + Sized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the trait I would expect clients to implement - just pointing to the types we actually need, by steering towards the right columns.
A real implementation exists in db/model/instance.rs
.
type CollectionIdColumn: Column; | ||
|
||
/// The time deleted column in the CollectionTable | ||
type CollectionTimeDeletedColumn: Column<Table = <Self::CollectionIdColumn as Column>::Table> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By stating Column<Table = <Self::CollectionIdColumn as Column>::Table>
, we're forcing this column to belong to the "Collection" table.
This prevents users of this API from accidentally pointing the "Collection" columns at the "Resource" table.
+ Default | ||
+ ExpressionMethods; | ||
|
||
/// Creates a statement for attaching a resource to the given collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wanna compare this CTE with the others, the signature of this function is something to keep an eye out for.
It's basically our statement of "how do we expect the API user to interact with this CTE"?
My goal was to ingest:
- UUIDs
- "Query statements"
- "Update statements"
Since we use them elsewhere in the datastore, so I figured that language would be easy to understand.
A fair bit of the CTE trait bounds are devoted to preventing misuse of these APIs.
.filter( | ||
Self::ResourceCollectionIdColumn::default() | ||
.eq(collection_id), | ||
) | ||
.filter(Self::ResourceTimeDeletedColumn::default().is_null()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ensure the "count of attached resources" clause is always scoped to non-deleted resources that point back to the collection, based on the provided collection_id
.
.filter(collection_table().primary_key().eq(collection_id)) | ||
.filter(Self::CollectionTimeDeletedColumn::default().is_null()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We force the user-supplied queries to look at the same collection_id
, and still be non-deleted.
This means someone can't get accidentally call this to modify the wrong resource:
Instance::attach_resource(
collection_id,
resource_id,
collection::table.into_boxed().filter(collection_id.eq(a_different_collection_id))
...
)
If they did, it would always result in "no update", since we'd filter by ID = 1 AND ID = 2
.filter(resource_table().primary_key().eq(resource_id)) | ||
.filter(Self::ResourceTimeDeletedColumn::default().is_null()) | ||
.filter(Self::ResourceCollectionIdColumn::default().is_null()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We enforce a similar constraint on the resource query. We also enforce that the "foreign key" field is actually NULL, as it should be before we call "attach".
/// Result of [`AttachToCollectionStatement`] when executed asynchronously | ||
pub type AsyncAttachToCollectionResult<ResourceType, C> = | ||
Result<(C, ResourceType), AttachError<ResourceType, C, PoolError>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the result type I expect most users to interact with.
Either:
- "Here is the collection + resource you updated", or
- "We did not do the update, and here's an
AttachError
explaining why"
/// 1. (collection_by_id, resource_by_id): Identify if the collection and | ||
/// resource objects exist at all. | ||
/// 2. (resource_count): Identify if the number of resources already attached to | ||
/// the collection exceeds a threshold. | ||
/// 3. (collection_info, resource_info): Checks for arbitrary user-provided | ||
/// constraints on the collection and resource objects. | ||
/// 4. (do_update): IFF all previous checks succeeded, make a decision to perfom | ||
/// an update. | ||
/// 5. (updated_resource): Apply user-provided updates on the resource - | ||
/// presumably, setting the collection ID value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hopefully describes the meat of the CTE.
The TL;DR: We want to check a few things, then do an update if all those checks pass, and we want it to be atomic, even though multiple tables are involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 with some questions and suggestions
let resource_exists_query = Box::new( | ||
resource_table() | ||
.into_boxed() | ||
.filter(resource_table().primary_key().eq(resource_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we guarantee that primary_key is always only id? why not the more specific ResourceIdColumn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, you're suggesting "can we guarantee this is the only call to filter
which contains an ID?"
In this case, we're constructing the entire resource_exists_query
from scratch - the only user input is the UUID - so I don't think this check is explicitly necessary. Re-phrasing: I don't such a check would prevent any user behavior.
Also, unfortunately, it's a little tricky to parse the whole list of filters applied through Diesel (it's easy to ask are there any filters, but hard to find a specific one, especially since this query is boxed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, we do have the bound:
BoxedQuery<ResourceTable<ResourceType, Self>>: FilterBy<Eq<ResourcePrimaryKey<ResourceType, Self>, Self::Id>>
which states that the resource table should be filterable comparing the ID
type with the ResourcePrimaryKey
.
However, the Diesel table type:
https://docs.diesel.rs/master/diesel/prelude/trait.Table.html
has an associated type PrimaryKey, but I'm not 100% sure how to connect it back to the Column type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the linked PrimaryKey documentation, I'm concerned about the case of
If the table has a composite primary key, this will be a tuple.
where ResourceIdColumn
is not a tuple. Though, looking at that bound, I can squint and convince myself that if primary_key()
returns a tuple then compilation would fail :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the primary_key()
and Self::Id
types would need to match for this to work... since the CollectionIdColumn
must implement Column
, I think this CTE as currently implemented would not allow for composite PKs.
... but as you said, that would at least be a compile-time error, rather than a runtime error.
/// The unchanged resource and collection are returned as a part of this | ||
/// error; it is the responsibility of the caller to determine which | ||
/// condition was not met. | ||
NoUpdate { attached_count: i64, resource: ResourceType, collection: C }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a couple of match branches on this variant that state something like "this error didn't tell us enough", leading me to think we should attach more context or information here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What more information would you want? Is there a match branch in particular you're looking at?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more context below - if you're talking about cases where we return 500 errors, those should be impossible to hit, so I'm not sure more context would help.
// We can't attach, but the error hasn't | ||
// helped us infer why. | ||
return Err(Error::internal_error( | ||
"cannot attach disk" | ||
)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is an example of the error not providing enough context, referenced in another review comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any case where I return "internal_error" implies a CTE bug to me - that's why we're returning internal_error
for these cases, or a 500 error.
In this particular case:
- We are trying to attach a disk to an instance
- The disk and instance both exist, and "Time deleted" is NULL
- We are not at the capacity of "number of attached disks"
- Both instance and disk appeared in valid states
- and yet the update did not occur
This should not happen - If I saw this case, I'd want to debug it, hence the 500 error - but I do not think any valid input should be able to trigger it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - I'd want to debug that too, but all I would see internally is "cannot attach disk". The error hasn't helped us infer why at all, like the comment says, and I don't see any other existing context in parse_result
that can be attached to the NoUpdate
errors which would point to additional investigation steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this, but I'm still not sure what other info to attach. I think if this case happened, we'd need to perform manual inspection of the DB to infer why it happened.
I view this as akin to "attempting, but failing, to deserialize a value from the database". I don't think it's appropriate to dump raw strings all the way back to the client, but it's still worth flagging.
Crucible changes Remove a superfluous copy during write serialization (#1087) Update to progenitor v0.5.0, pull in required Omicron updates (#1115) Update usdt to v0.5.0 (#1116) Do not panic on reinitialize of a downstairs client. (#1114) Bump (tracing-)opentelemetry(-jaeger) (#1113) Make the Guest -> Upstairs queue fully async (#1086) Switch to per-block ownership (#1107) Handle timeout in the client IO task (#1109) Enforce buffer alignment (#1106) Block size buffers (#1105) New dtrace probes and a counter struct in the Upstairs. (#1104) Implement read decryption offloading (#1089) Remove Arc + Mutex from Buffer (#1094) Comment cleanup and rename of DsState::Repair -> Reconcile (#1102) do not panic the dynamometer for OOB writes (#1101) Allow dsc to start the downstairs in read-only mode. (#1098) Use the omicron-zone-package methods for topo sorting (#1099) Package with topological sorting (#1097) Fix clippy lints in dsc (#1095) Propolis changes: PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626) PHD: fix missing newlines in serial.log (#622) PHD: fix run_shell_command with multiline commands (#621) PHD: fix `--artifact-directory` not doing anything (#618) Update h2 dependency Update Crucible (and Omicron) dependencies PHD: refactor guest serial console handling (#615) phd: add basic "migration-from-base" tests + machinery (#609) phd: Ensure min disk size fits read-only parents (#611) phd: automatically fetch `crucible-downstairs` from Buildomat (#604) Mitigate behavior from illumos#16183 PHD: add guest adapter for WS2022 (#607) phd: include error cause chain in failure output (#606) add QEMU pvpanic ISA device (#596) Add crucible-mem backend Make crucible opt parsing more terse in standalone
Crucible changes Remove a superfluous copy during write serialization (#1087) Update to progenitor v0.5.0, pull in required Omicron updates (#1115) Update usdt to v0.5.0 (#1116) Do not panic on reinitialize of a downstairs client. (#1114) Bump (tracing-)opentelemetry(-jaeger) (#1113) Make the Guest -> Upstairs queue fully async (#1086) Switch to per-block ownership (#1107) Handle timeout in the client IO task (#1109) Enforce buffer alignment (#1106) Block size buffers (#1105) New dtrace probes and a counter struct in the Upstairs. (#1104) Implement read decryption offloading (#1089) Remove Arc + Mutex from Buffer (#1094) Comment cleanup and rename of DsState::Repair -> Reconcile (#1102) do not panic the dynamometer for OOB writes (#1101) Allow dsc to start the downstairs in read-only mode. (#1098) Use the omicron-zone-package methods for topo sorting (#1099) Package with topological sorting (#1097) Fix clippy lints in dsc (#1095) Propolis changes: PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626) PHD: fix missing newlines in serial.log (#622) PHD: fix run_shell_command with multiline commands (#621) PHD: fix `--artifact-directory` not doing anything (#618) Update h2 dependency Update Crucible (and Omicron) dependencies PHD: refactor guest serial console handling (#615) phd: add basic "migration-from-base" tests + machinery (#609) phd: Ensure min disk size fits read-only parents (#611) phd: automatically fetch `crucible-downstairs` from Buildomat (#604) Mitigate behavior from illumos#16183 PHD: add guest adapter for WS2022 (#607) phd: include error cause chain in failure output (#606) add QEMU pvpanic ISA device (#596) Add crucible-mem backend Make crucible opt parsing more terse in standalone Co-authored-by: Alan Hanson <[email protected]>
Fixes #1073
Summary of previous version
When a request to attach a disk arrived at Nexus, the following would be performed:
Problems with previous version
New implementation
TODOs: