
[nexus] Re-implement disk attach/detach #1106

Merged: 32 commits into main from fix-attach, May 27, 2022
Conversation

@smklein (Collaborator) commented May 23, 2022

Fixes #1073

Summary of previous version

When a request to attach a disk arrived at Nexus, the following steps would be performed (a toy sketch follows the list):

  1. Check the number of disks attached to an instance
  2. Check the disk state (valid for attach?)
  3. Check the instance state (valid for disk attach?)
  4. (If the instance is running) Send a request to the Sled agent to update the state
  5. Update the disk state in the DB
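
The race is easiest to see in a toy model. The sketch below is not the real Nexus code — the names and the in-memory "database" are made up for illustration — it just mirrors the check-then-act structure of the old flow and marks where a concurrent request can interleave.

use std::collections::HashMap;

// Toy stand-in for the database: disk id -> instance it is attached to, if any.
struct Db {
    attachments: HashMap<u32, Option<u32>>,
}

const MAX_DISKS_PER_INSTANCE: usize = 8; // illustrative limit

impl Db {
    fn attached_disk_count(&self, instance_id: u32) -> usize {
        self.attachments.values().filter(|a| **a == Some(instance_id)).count()
    }
    fn disk_is_attachable(&self, disk_id: u32) -> bool {
        self.attachments.get(&disk_id) == Some(&None)
    }
    fn set_disk_attached(&mut self, disk_id: u32, instance_id: u32) {
        self.attachments.insert(disk_id, Some(instance_id));
    }
}

// Mirrors steps 1-5 above; each call stands in for a separate DB round-trip.
fn attach_disk_old(db: &mut Db, instance_id: u32, disk_id: u32) -> Result<(), String> {
    // 1. Check the number of disks attached to the instance.
    if db.attached_disk_count(instance_id) >= MAX_DISKS_PER_INSTANCE {
        return Err("instance already has the maximum number of disks".into());
    }
    // 2. Check the disk state (valid for attach?).
    if !db.disk_is_attachable(disk_id) {
        return Err("disk is not attachable".into());
    }
    // 3. Instance-state check and 4. sled-agent notification omitted from the toy.
    //
    // Race window: between the reads above and the write below, a concurrent
    // request can attach another disk (invalidating the count check) or change
    // the disk's state (invalidating the state check).
    //
    // 5. Update the disk state in the DB.
    db.set_disk_attached(disk_id, instance_id);
    Ok(())
}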

Problems with previous version

  • None of the steps are guarded against concurrent access - it's possible to attach a disk after the "max disk attach count" check, but before the DB has been updated. Similarly, it's possible to modify the disk/instance state after the check, but before the update.
  • The previous version allowed attaching disks to running instances (by sending requests to the sled agent), but this path is (1) not implemented and (2) subject to subtle races that need a more holistic solution. For example: what if two concurrent requests tried to attach the same disk to two different instances? Although Nexus should prevent this case, if both requests pass the checks before either updates the DB, both operations would be issued.

New implementation

  • Add CTEs for attach and detach which atomically: (1) check instance/disk state, (2) update instance/disk state, and (3) return information (e.g., whether the resources exist) within a single statement. (A rough usage sketch follows.)
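
For a sense of how this gets used, here is a rough sketch of a datastore-level attach call with the new CTE. It is an approximation rather than the PR's exact code: the state filters, the MAX_DISKS_PER_INSTANCE constant, and the column names (e.g. attach_instance_id, disk_state) are assumptions based on the discussion below.

// Approximate sketch only; identifiers are assumptions, not exact quotes.
let attach = Instance::attach_resource(
    instance_id,
    disk_id,
    // Only instances in states that allow attach (time_deleted is checked by the CTE).
    instance::table
        .into_boxed()
        .filter(instance::dsl::state.eq_any(ok_to_attach_instance_states)),
    // Only disks in states that allow attach.
    disk::table
        .into_boxed()
        .filter(disk::dsl::disk_state.eq_any(ok_to_attach_disk_states)),
    MAX_DISKS_PER_INSTANCE,
    // The update applied if, and only if, every check passes.
    diesel::update(disk::dsl::disk).set((
        disk::dsl::disk_state.eq(attached_label),
        disk::dsl::attach_instance_id.eq(instance_id),
    )),
);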

TODOs:

  • Q: Do we need the rcgen in instances? A: No! Removed.
  • Deal with the disk-attach-during-instance-destroy race condition. Done via "collection_detach_many"
  • Be more consistent about units for the "max attached resources" value
  • Would love to de-duplicate more trait bounds between the "collection_*" CTEs. However, I think I may be close to my limits here until more progress is made on rust-lang/rust#44491 (🔬 Tracking issue for RFC 2089: Extended Implied Bounds).

/// // /* Make a decision on whether or not to apply ANY updates */
/// // do_update AS (
/// // SELECT IF(
/// // EXISTS(SELECT id FROM collection_info) AND
@smklein (Collaborator, Author) commented May 23, 2022:

Note: this may be a simpler implementation of the "dueling administrators" solution proposed in RFD 192.

Basically, to perform a conditional UPDATE and check for existence of an object, that proposal relies on a division-by-zero to do an early exit before performing the UPDATE itself.

However, it seems possible to:

  • Avoid the early exit
  • Make the UPDATE conditional on whatever subqueries we want
  • Return the results of multiple rows, letting the caller figure out what happened (did the object exist? did we perform the update? etc.)

This version is slightly more verbose, but the lack of "division by zero" makes it seem simpler to me.
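
To illustrate the "return the results of multiple rows, letting the caller figure out what happened" part, here is a sketch of the classification logic on the Rust side. The names are hypothetical; the PR's actual parse_result differs in detail.

// Hypothetical names; sketch of how the caller can classify the returned rows.
enum Outcome<C, R> {
    Updated(C, R),                            // the UPDATE ran
    CollectionNotFound,                       // collection lookup returned no row
    ResourceNotFound,                         // resource lookup returned no row
    NoUpdate { collection: C, resource: R },  // both exist, but some check failed
}

fn classify<C, R>(
    collection: Option<C>,       // row from the "does the collection exist?" subquery
    resource: Option<R>,         // row from the "does the resource exist?" subquery
    updated_resource: Option<R>, // row returned by the UPDATE, if it ran
) -> Outcome<C, R> {
    match (collection, resource, updated_resource) {
        (Some(c), Some(_), Some(r)) => Outcome::Updated(c, r),
        (None, _, _) => Outcome::CollectionNotFound,
        (_, None, _) => Outcome::ResourceNotFound,
        (Some(c), Some(r), None) => Outcome::NoUpdate { collection: c, resource: r },
    }
}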

@smklein marked this pull request as ready for review, May 25, 2022 04:14
@smklein requested review from jmpesp and plotnick, May 25, 2022 04:14
@smklein (Collaborator, Author) commented May 25, 2022:

This change should also resolve the race condition mentioned in this comment by @plotnick : https://github.com/oxidecomputer/omicron/pull/1068/files#r873063471

@@ -0,0 +1,1480 @@
// This Source Code Form is subject to the terms of the Mozilla Public
@smklein (Collaborator, Author):

Since there's a lot here, I'm going to go ahead and try to give a "tour" of the contents of this CTE.

collection_detach.rs and collection_detach_many.rs are similar in layout.

use diesel::sql_types::{BigInt, Nullable, SingleValue};
use std::fmt::Debug;

/// The table representing the collection. The resource references
@smklein (Collaborator, Author):

Right out of the gate, I'm using a ton of type aliases. The DatastoreAttachTarget<ResourceType> trait is implemented on a collection (after all, "collections" hold "resources"), so nearly all of these act on a "resource type, collection type" pair.

Most of these are just short-hands for "get me the key/column/table/etc. belonging to the resource/collection".
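
For example, one of these aliases looks roughly like the following. This is a paraphrase of the idea rather than an exact quote: "the table that the resource's ID column belongs to", derived from the trait's associated types.

// Paraphrased sketch of one alias; the real definitions differ in detail.
type ResourceTable<ResourceType, C> =
    <<C as DatastoreAttachTarget<ResourceType>>::ResourceIdColumn as Column>::Table;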

/// type ResourceTimeDeletedColumn = disk::dsl::time_deleted;
/// }
/// ```
pub trait DatastoreAttachTarget<ResourceType>: Selectable<Pg> + Sized {
@smklein (Collaborator, Author):

This is the trait I would expect clients to implement: it just points at the types we actually need, steering implementers toward the right columns.

A real implementation exists in db/model/instance.rs; a rough sketch of its shape follows.
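
This is an approximation rather than the exact code; the disk column names (id, attach_instance_id, time_deleted) are taken from the doc example above and the schema as I understand it.

// Approximate sketch of the impl in db/model/instance.rs; details may differ.
impl DatastoreAttachTarget<Disk> for Instance {
    type Id = Uuid;

    // Columns on the collection (instance) table.
    type CollectionIdColumn = instance::dsl::id;
    type CollectionTimeDeletedColumn = instance::dsl::time_deleted;

    // Columns on the resource (disk) table; ResourceCollectionIdColumn is the
    // "foreign key" that points back at the instance.
    type ResourceIdColumn = disk::dsl::id;
    type ResourceCollectionIdColumn = disk::dsl::attach_instance_id;
    type ResourceTimeDeletedColumn = disk::dsl::time_deleted;
}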

type CollectionIdColumn: Column;

/// The time deleted column in the CollectionTable
type CollectionTimeDeletedColumn: Column<Table = <Self::CollectionIdColumn as Column>::Table>
@smklein (Collaborator, Author):

By stating Column<Table = <Self::CollectionIdColumn as Column>::Table>, we're forcing this column to belong to the "Collection" table.

This prevents users of this API from accidentally pointing the "Collection" columns at the "Resource" table.

+ Default
+ ExpressionMethods;

/// Creates a statement for attaching a resource to the given collection.
@smklein (Collaborator, Author):

If you want to compare this CTE with the others, the signature of this function is something to keep an eye on.

It's basically our statement of "how do we expect the API user to interact with this CTE"?

My goal was to ingest:

  1. UUIDs
  2. "Query statements"
  3. "Update statements"

We use them elsewhere in the datastore, so I figured that language would be easy to understand.
A fair bit of the CTE's trait bounds is devoted to preventing misuse of these APIs. (An approximate sketch of the signature follows.)
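
The constructor has roughly this shape. It is an approximation, not the exact signature from the PR: the real version carries more generics and trait bounds, and U/V here stand in for Diesel's where-clause and changeset parameters.

// Approximate shape only, not the exact signature.
fn attach_resource<U, V>(
    collection_id: Self::Id,   // UUID of the collection (e.g. the instance)
    resource_id: Self::Id,     // UUID of the resource (e.g. the disk)
    collection_query: BoxedQuery<CollectionTable<ResourceType, Self>>,
    resource_query: BoxedQuery<ResourceTable<ResourceType, Self>>,
    max_attached_resources: u32,
    update: UpdateStatement<ResourceTable<ResourceType, Self>, U, V>,
) -> AttachToCollectionStatement<ResourceType, V, Self>;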

Comment on lines +267 to +271
.filter(
Self::ResourceCollectionIdColumn::default()
.eq(collection_id),
)
.filter(Self::ResourceTimeDeletedColumn::default().is_null())
@smklein (Collaborator, Author):

We ensure the "count of attached resources" clause is always scoped to non-deleted resources that point back to the collection, based on the provided collection_id.

Comment on lines +286 to +287
.filter(collection_table().primary_key().eq(collection_id))
.filter(Self::CollectionTimeDeletedColumn::default().is_null()),
@smklein (Collaborator, Author):

We force the user-supplied queries to look at the same collection_id, and still be non-deleted.

This means someone can't accidentally call this to modify the wrong resource:

Instance::attach_resource(
  collection_id,
  resource_id,
  collection::table.into_boxed().filter(collection_id.eq(a_different_collection_id))
  ...
)

If they did, it would always result in "no update", since we'd filter by ID = 1 AND ID = 2

Comment on lines +291 to +293
.filter(resource_table().primary_key().eq(resource_id))
.filter(Self::ResourceTimeDeletedColumn::default().is_null())
.filter(Self::ResourceCollectionIdColumn::default().is_null()),
@smklein (Collaborator, Author):

We enforce a similar constraint on the resource query. We also enforce that the "foreign key" field is actually NULL, as it should be before we call "attach".

Comment on lines +352 to +354
/// Result of [`AttachToCollectionStatement`] when executed asynchronously
pub type AsyncAttachToCollectionResult<ResourceType, C> =
Result<(C, ResourceType), AttachError<ResourceType, C, PoolError>>;
@smklein (Collaborator, Author):

This is the result type I expect most users to interact with.

Either:

  1. "Here is the collection + resource you updated", or
  2. "We did not do the update, and here's an AttachError explaining why"

Comment on lines +489 to +498
/// 1. (collection_by_id, resource_by_id): Identify if the collection and
/// resource objects exist at all.
/// 2. (resource_count): Identify if the number of resources already attached to
/// the collection exceeds a threshold.
/// 3. (collection_info, resource_info): Checks for arbitrary user-provided
/// constraints on the collection and resource objects.
/// 4. (do_update): IFF all previous checks succeeded, make a decision to perform
/// an update.
/// 5. (updated_resource): Apply user-provided updates on the resource -
/// presumably, setting the collection ID value.
@smklein (Collaborator, Author):

This hopefully describes the meat of the CTE.

The TL;DR: we want to check a few things, then do an update if all those checks pass, and we want it to be atomic even though multiple tables are involved. (A rough sketch of the statement's overall shape follows.)
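
To make that concrete, the generated statement has roughly the shape sketched below. This is a simplified, hypothetical rendering: the subquery names come from the doc comment above, but everything else is illustrative, and the real SQL differs in detail (for instance, it must still return the collection and resource rows when the update does not run). It is written as a Rust raw string purely for illustration.

// Simplified, hypothetical sketch of the CTE's shape; not the generated SQL.
const ATTACH_CTE_SKETCH: &str = r#"
WITH
  collection_by_id AS (SELECT * FROM instance WHERE id = $1 AND time_deleted IS NULL),
  resource_by_id   AS (SELECT * FROM disk     WHERE id = $2 AND time_deleted IS NULL),
  resource_count   AS (SELECT COUNT(*) FROM disk
                       WHERE attach_instance_id = $1 AND time_deleted IS NULL),
  collection_info  AS (SELECT id FROM instance WHERE id = $1 /* + caller's filters */),
  resource_info    AS (SELECT id FROM disk     WHERE id = $2 /* + caller's filters */),
  do_update        AS (SELECT IF(
                         EXISTS(SELECT id FROM collection_info) AND
                         EXISTS(SELECT id FROM resource_info) AND
                         (SELECT * FROM resource_count) < $3,
                         TRUE, FALSE)),
  updated_resource AS (UPDATE disk
                       SET attach_instance_id = $1, disk_state = 'attached'
                       WHERE id = $2 AND (SELECT * FROM do_update)
                       RETURNING *)
-- The real query also returns collection_by_id / resource_by_id so the caller
-- can tell "not found" apart from "found but not updated".
SELECT * FROM updated_resource;
"#;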

@smklein requested a review from bnaecker, May 25, 2022 16:04
@jmpesp (Contributor) left a review:

🚀 with some questions and suggestions

nexus/src/app/instance.rs: outdated review threads (resolved)
let resource_exists_query = Box::new(
resource_table()
.into_boxed()
.filter(resource_table().primary_key().eq(resource_id))
Contributor:

Can we guarantee that primary_key is always only id? Why not use the more specific ResourceIdColumn?

@smklein (Collaborator, Author):

To be clear, you're suggesting "can we guarantee this is the only call to filter which contains an ID?"

In this case, we're constructing the entire resource_exists_query from scratch - the only user input is the UUID - so I don't think this check is explicitly necessary. Rephrasing: I don't think such a check would prevent any user behavior.

Also, unfortunately, it's a little tricky to parse the whole list of filters applied through Diesel (it's easy to ask whether there are any filters, but hard to find a specific one, especially since this query is boxed).

@smklein (Collaborator, Author):

FWIW, we do have the bound:

BoxedQuery<ResourceTable<ResourceType, Self>>: FilterBy<Eq<ResourcePrimaryKey<ResourceType, Self>, Self::Id>>

which states that the resource table should be filterable by comparing the ID type with the ResourcePrimaryKey.

However, the Diesel table type:

https://docs.diesel.rs/master/diesel/prelude/trait.Table.html

has an associated type PrimaryKey, but I'm not 100% sure how to connect it back to the Column type.

Contributor:

As mentioned in the linked PrimaryKey documentation, I'm concerned about the case of

If the table has a composite primary key, this will be a tuple.

where ResourceIdColumn is not a tuple. Though, looking at that bound, I can squint and convince myself that if primary_key() returns a tuple then compilation would fail :)

@smklein (Collaborator, Author):

Yeah, I think the primary_key() and Self::Id types would need to match for this to work... since the CollectionIdColumn must implement Column, I think this CTE as currently implemented would not allow for composite PKs.

... but as you said, that would at least be a compile-time error, rather than a runtime error.

nexus/src/db/collection_attach.rs: review thread (resolved)
/// The unchanged resource and collection are returned as a part of this
/// error; it is the responsibility of the caller to determine which
/// condition was not met.
NoUpdate { attached_count: i64, resource: ResourceType, collection: C },
Contributor:

There are a couple of match branches on this variant that state something like "this error didn't tell us enough", which leads me to think we should attach more context or information here.

@smklein (Collaborator, Author):

What more information would you want? Is there a match branch in particular you're looking at?

@smklein (Collaborator, Author):

I added more context below - if you're talking about cases where we return 500 errors, those should be impossible to hit, so I'm not sure more context would help.

nexus/src/db/collection_detach.rs: review thread (resolved)
Comment on lines +1237 to +1242
// We can't attach, but the error hasn't
// helped us infer why.
return Err(Error::internal_error(
"cannot attach disk"
));
}
Contributor:

(this is an example of the error not providing enough context, referenced in another review comment)

@smklein (Collaborator, Author):

Any case where we return internal_error implies a CTE bug to me; that's why these cases surface as internal_error (a 500).

In this particular case:

  • We are trying to attach a disk to an instance
  • The disk and instance both exist, and "Time deleted" is NULL
  • We are not at the capacity of "number of attached disks"
  • Both instance and disk appeared in valid states
  • and yet the update did not occur

This should not happen. If I saw this case, I'd want to debug it, hence the 500 error, but I do not think any valid input should be able to trigger it.
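
For reference, the NoUpdate disambiguation in the disk-attach path looks roughly like the sketch below. It is paraphrased rather than exact: the helper names (ok_to_attach_disk_state, ok_to_attach_instance_state) and the messages are illustrative.

// Paraphrased sketch of the NoUpdate branch; helpers and messages are illustrative.
Err(AttachError::NoUpdate { attached_count, resource: disk, collection: instance }) => {
    // Check each precondition in turn and translate the first failure into a
    // client-facing error.
    if attached_count >= MAX_DISKS_PER_INSTANCE as i64 {
        return Err(Error::invalid_request("too many disks attached to this instance"));
    }
    if !ok_to_attach_disk_state(&disk) {
        return Err(Error::invalid_request("disk cannot be attached in its current state"));
    }
    if !ok_to_attach_instance_state(&instance) {
        return Err(Error::invalid_request("cannot attach a disk to an instance in its current state"));
    }
    // Every observable precondition held, yet the CTE performed no update.
    // That indicates a bug in the CTE itself, hence the 500.
    return Err(Error::internal_error("cannot attach disk"));
}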

Contributor:

Right - I'd want to debug that too, but all I would see internally is "cannot attach disk". The error hasn't helped us infer why at all, like the comment says, and I don't see any other existing context in parse_result that can be attached to the NoUpdate errors which would point to additional investigation steps.

@smklein (Collaborator, Author):

I agree with this, but I'm still not sure what other info to attach. I think if this case happened, we'd need to perform manual inspection of the DB to infer why it happened.

I view this as akin to "attempting, but failing, to deserialize a value from the database". I don't think it's appropriate to dump raw strings all the way back to the client, but it's still worth flagging.

@smklein merged commit 5a0d1e0 into main, May 27, 2022
@smklein deleted the fix-attach branch, May 27, 2022 19:14
leftwo pushed a commit that referenced this pull request Jan 28, 2024
Crucible changes
Remove a superfluous copy during write serialization (#1087)
Update to progenitor v0.5.0, pull in required Omicron updates (#1115)
Update usdt to v0.5.0 (#1116)
Do not panic on reinitialize of a downstairs client. (#1114)
Bump (tracing-)opentelemetry(-jaeger) (#1113)
Make the Guest -> Upstairs queue fully async (#1086)
Switch to per-block ownership (#1107)
Handle timeout in the client IO task (#1109)
Enforce buffer alignment (#1106)
Block size buffers (#1105)
New dtrace probes and a counter struct in the Upstairs. (#1104)
Implement read decryption offloading (#1089)
Remove Arc + Mutex from Buffer (#1094)
Comment cleanup and rename of DsState::Repair -> Reconcile (#1102)
do not panic the dynamometer for OOB writes (#1101)
Allow dsc to start the downstairs in read-only mode. (#1098)
Use the omicron-zone-package methods for topo sorting (#1099)
Package with topological sorting (#1097)
Fix clippy lints in dsc (#1095)

Propolis changes:
PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626)
PHD: fix missing newlines in serial.log (#622)
PHD: fix run_shell_command with multiline commands (#621)
PHD: fix `--artifact-directory` not doing anything (#618)
Update h2 dependency
Update Crucible (and Omicron) dependencies
PHD: refactor guest serial console handling (#615)
phd: add basic "migration-from-base" tests + machinery (#609)
phd: Ensure min disk size fits read-only parents (#611)
phd: automatically fetch `crucible-downstairs` from Buildomat (#604)
Mitigate behavior from illumos#16183
PHD: add guest adapter for WS2022 (#607)
phd: include error cause chain in failure output (#606)
add QEMU pvpanic ISA device (#596)
Add crucible-mem backend
Make crucible opt parsing more terse in standalone
leftwo added a commit that referenced this pull request Jan 29, 2024, with the same Crucible and Propolis changes as above (Co-authored-by: Alan Hanson <[email protected]>).
Successfully merging this pull request may close these issues: Disk Attach has some race conditions.