[nexus] Re-implement disk attach/detach #1106

Merged
merged 32 commits on May 27, 2022

Changes from all commits

Commits (32)
ae77727
a compiling (... but wrong) version of attach
smklein May 16, 2022
1f1230b
wip collection attach
smklein May 16, 2022
ddcd66e
more wip
smklein May 17, 2022
3fa448e
Merge branch 'main' into fix-attach
smklein May 18, 2022
bee020b
collection attach - at least the query builds
smklein May 18, 2022
7e3d520
... it... works?
smklein May 19, 2022
2d79a1c
expand tests
smklein May 19, 2022
f599e3f
touch-ups
smklein May 19, 2022
ead8ea4
Enforced update WHERE
smklein May 19, 2022
6e8214b
polishing docs
smklein May 19, 2022
ba1acc5
more cleanup
smklein May 19, 2022
6da1643
Sync test added
smklein May 19, 2022
d7db1ed
wip detach
smklein May 20, 2022
e3e9807
use 'disk_detach'
smklein May 23, 2022
71fa5d9
tests
smklein May 23, 2022
e96237d
More readable bounds
smklein May 23, 2022
b709179
more trait bound cleanup
smklein May 24, 2022
362dfba
removing rcgen, WIP moving bounds to associated types
smklein May 24, 2022
f38c456
less destroying
smklein May 24, 2022
897b311
Removed ExpressionMethods bound
smklein May 24, 2022
36810b3
Working through it
smklein May 24, 2022
3146caf
Removed dead code
smklein May 24, 2022
438526b
fmt
smklein May 24, 2022
5952462
Detach many working, sorta. Needs cleanup
smklein May 25, 2022
03f9d2f
Use 'detach_all'
smklein May 25, 2022
e168d4a
Merge branch 'main' into fix-attach
smklein May 25, 2022
2d1531f
de-duplicate type aliases
smklein May 25, 2022
12ab587
patch docs
smklein May 25, 2022
a043a37
review feedback
smklein May 25, 2022
11fea7c
Updated comment
smklein May 26, 2022
a1b10d0
Merge branch 'main' into fix-attach
smklein May 27, 2022
317633c
u32 max attach count
smklein May 27, 2022
5 changes: 5 additions & 0 deletions nexus/src/app/disk.rs
@@ -163,6 +163,11 @@ impl super::Nexus {

/// Modifies the runtime state of the Disk as requested. This generally
/// means attaching or detaching the disk.
// TODO(https://github.com/oxidecomputer/omicron/issues/811):
// This will be unused until we implement hot-plug support.
// However, it has been left for reference until then, as it will
// likely be needed once that feature is implemented.
#[allow(dead_code)]
pub(crate) async fn disk_set_runtime(
&self,
opctx: &OpContext,
293 changes: 47 additions & 246 deletions nexus/src/app/instance.rs
@@ -19,7 +19,6 @@ use omicron_common::api::external;
use omicron_common::api::external::CreateResult;
use omicron_common::api::external::DataPageParams;
use omicron_common::api::external::DeleteResult;
use omicron_common::api::external::DiskState;
use omicron_common::api::external::Error;
use omicron_common::api::external::InstanceState;
use omicron_common::api::external::ListResultVec;
@@ -152,17 +151,9 @@ impl super::Nexus {
Ok(db_instance)
}

// TODO-correctness It's not totally clear what the semantics and behavior
// should be here. It might be nice to say that you can only do this
// operation if the Instance is already stopped, in which case we can
// execute this immediately by just removing it from the database, with the
// same race we have with disk delete (i.e., if someone else is requesting
// an instance boot, we may wind up in an inconsistent state). On the other
// hand, we could always allow this operation, issue the request to the SA
// to destroy the instance (not just stop it), and proceed with deletion
// when that finishes. But in that case, although the HTTP DELETE request
// completed, the object will still appear for a little while, which kind of
// sucks.
// This operation may only occur on stopped instances, which implies that
// the attached disks do not have any running "upstairs" process within
// the sled.
pub async fn project_destroy_instance(
&self,
opctx: &OpContext,
@@ -173,58 +164,14 @@
// TODO-robustness We need to figure out what to do with Destroyed
// instances? Presumably we need to clean them up at some point, but
// not right away so that callers can see that they've been destroyed.
let (.., authz_instance, db_instance) =
let (.., authz_instance, _) =
LookupPath::new(opctx, &self.db_datastore)
.organization_name(organization_name)
.project_name(project_name)
.instance_name(instance_name)
.fetch()
.await?;

opctx.authorize(authz::Action::Delete, &authz_instance).await?;

match db_instance.runtime_state.state.state() {
InstanceState::Stopped | InstanceState::Failed => {
// ok
}

state => {
return Err(Error::InvalidRequest {
message: format!(
"instance cannot be deleted in state \"{}\"",
state,
),
});
}
}

// Detach all attached disks
let disks = self
.instance_list_disks(
opctx,
organization_name,
project_name,
instance_name,
&DataPageParams {
marker: None,
direction: dropshot::PaginationOrder::Ascending,
limit: std::num::NonZeroU32::new(MAX_DISKS_PER_INSTANCE)
.unwrap(),
},
)
.await?;

for disk in &disks {
self.instance_detach_disk(
opctx,
organization_name,
project_name,
instance_name,
&disk.name(),
)
.await?;
}

self.db_datastore.project_delete_instance(opctx, &authz_instance).await
}

@@ -586,144 +533,43 @@ impl super::Nexus {
instance_name: &Name,
disk_name: &Name,
) -> UpdateResult<db::model::Disk> {
let (.., authz_project, authz_disk, db_disk) =
let (.., authz_project, authz_disk, _) =
LookupPath::new(opctx, &self.db_datastore)
.organization_name(organization_name)
.project_name(project_name)
.disk_name(disk_name)
.fetch()
.await?;
let (.., authz_instance, db_instance) =
let (.., authz_instance, _) =
LookupPath::new(opctx, &self.db_datastore)
.project_id(authz_project.id())
.instance_name(instance_name)
.fetch()
.await?;
let instance_id = &authz_instance.id();

// Enforce attached disks limit
let attached_disks = self
.instance_list_disks(
opctx,
organization_name,
project_name,
instance_name,
&DataPageParams {
marker: None,
direction: dropshot::PaginationOrder::Ascending,
limit: std::num::NonZeroU32::new(MAX_DISKS_PER_INSTANCE)
.unwrap(),
},
// TODO(https://github.com/oxidecomputer/omicron/issues/811):
// Disk attach is only implemented for instances that are not
// currently running. This operation therefore can operate exclusively
// on database state.
//
// To implement hot-plug support, we should do the following in a saga:
// - Update the state to "Attaching", rather than "Attached".
// - If the instance is running...
// - Issue a request to "disk attach" to the associated sled agent,
// using the "state generation" value from the moment we attached.
// - Update the DB if the request succeeded (hopefully to "Attached").
// - If the instance is not running...
// - Update the disk state in the DB to "Attached".
let (_instance, disk) = self
.db_datastore
.instance_attach_disk(
&opctx,
&authz_instance,
&authz_disk,
MAX_DISKS_PER_INSTANCE,
)
.await?;

if attached_disks.len() == MAX_DISKS_PER_INSTANCE as usize {
return Err(Error::invalid_request(&format!(
"cannot attach more than {} disks to instance!",
MAX_DISKS_PER_INSTANCE
)));
}

fn disk_attachment_error(
disk: &db::model::Disk,
) -> CreateResult<db::model::Disk> {
let disk_status = match disk.runtime().state().into() {
DiskState::Destroyed => "disk is destroyed",
DiskState::Faulted => "disk is faulted",
DiskState::Creating => "disk is detached",
DiskState::Detached => "disk is detached",

// It would be nice to provide a more specific message here, but
// the appropriate identifier to provide the user would be the
// other instance's name. Getting that would require another
// database hit, which doesn't seem worth it for this.
DiskState::Attaching(_) => {
"disk is attached to another instance"
}
DiskState::Attached(_) => {
"disk is attached to another instance"
}
DiskState::Detaching(_) => {
"disk is attached to another instance"
}
};
let message = format!(
"cannot attach disk \"{}\": {}",
disk.name().as_str(),
disk_status
);
Err(Error::InvalidRequest { message })
}

match &db_disk.state().into() {
// If we're already attaching or attached to the requested instance,
// there's nothing else to do.
// TODO-security should it be an error if you're not authorized to
// do this and we did not actually have to do anything?
DiskState::Attached(id) if id == instance_id => return Ok(db_disk),

// If the disk is currently attaching or attached to another
// instance, fail this request. Users must explicitly detach first
// if that's what they want. If it's detaching, they have to wait
// for it to become detached.
// TODO-debug: the error message here could be better. We'd have to
// look up the other instance by id (and gracefully handle it not
// existing).
DiskState::Attached(id) => {
assert_ne!(id, instance_id);
return disk_attachment_error(&db_disk);
}
DiskState::Detaching(_) => {
return disk_attachment_error(&db_disk);
}
DiskState::Attaching(id) if id != instance_id => {
return disk_attachment_error(&db_disk);
}
DiskState::Destroyed => {
return disk_attachment_error(&db_disk);
}
DiskState::Faulted => {
return disk_attachment_error(&db_disk);
}

DiskState::Creating => (),
DiskState::Detached => (),
DiskState::Attaching(id) => {
assert_eq!(id, instance_id);
}
}

match &db_instance.runtime_state.state.state() {
// If there's a propolis zone for this instance, ask the Sled Agent
// to hot-plug the disk.
//
// TODO this will probably involve volume construction requests as
// well!
InstanceState::Running | InstanceState::Starting => {
self.disk_set_runtime(
opctx,
&authz_disk,
&db_disk,
self.instance_sled(&db_instance).await?,
sled_agent_client::types::DiskStateRequested::Attached(
*instance_id,
),
)
.await?;
}

_ => {
// If there is not a propolis zone, then disk attach only occurs
// in the DB.
let new_runtime = db_disk.runtime().attach(*instance_id);

self.db_datastore
.disk_update_runtime(opctx, &authz_disk, &new_runtime)
.await?;
}
}

self.db_datastore.disk_refetch(opctx, &authz_disk).await
Ok(disk)
}
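
The state machine that the removed Nexus-side checks encoded (disk_attachment_error plus the match on the disk state) is now expected to be enforced inside the single instance_attach_disk query. A standalone sketch of those rules, with hypothetical types rather than Omicron's actual model or query:

type InstanceId = u32; // stand-in for the real instance UUID

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DiskState {
    Creating,
    Detached,
    Attaching(InstanceId),
    Attached(InstanceId),
    Detaching(InstanceId),
    Destroyed,
    Faulted,
}

// Attach succeeds only when the disk is Creating/Detached (or already bound
// to this instance, which is a no-op) and the instance is under its disk
// limit; every other state is rejected with a caller-visible error.
fn attach(
    disk: &DiskState,
    instance: InstanceId,
    attached_count: u32,
    max_disks: u32,
) -> Result<DiskState, String> {
    // Idempotent case: the disk is already attached to this instance.
    if *disk == DiskState::Attached(instance) {
        return Ok(disk.clone());
    }
    if attached_count >= max_disks {
        return Err(format!("cannot attach more than {} disks to instance", max_disks));
    }
    match disk {
        DiskState::Creating | DiskState::Detached => Ok(DiskState::Attached(instance)),
        DiskState::Attaching(id) if *id == instance => Ok(DiskState::Attached(instance)),
        DiskState::Destroyed => Err("disk is destroyed".to_string()),
        DiskState::Faulted => Err("disk is faulted".to_string()),
        _ => Err("disk is attached to another instance".to_string()),
    }
}

fn main() {
    assert_eq!(attach(&DiskState::Detached, 7, 0, 8), Ok(DiskState::Attached(7)));
    assert_eq!(attach(&DiskState::Attached(7), 7, 8, 8), Ok(DiskState::Attached(7)));
    assert!(attach(&DiskState::Attached(3), 7, 0, 8).is_err());
    assert!(attach(&DiskState::Detached, 7, 8, 8).is_err());
}

Expressing these rules in one conditional database statement, rather than as a read-then-write sequence in Nexus, is what removes the race between concurrent attach requests that the old code was exposed to.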

/// Detach a disk from an instance.
@@ -735,83 +581,38 @@ impl super::Nexus {
instance_name: &Name,
disk_name: &Name,
) -> UpdateResult<db::model::Disk> {
let (.., authz_project, authz_disk, db_disk) =
let (.., authz_project, authz_disk, _) =
LookupPath::new(opctx, &self.db_datastore)
.organization_name(organization_name)
.project_name(project_name)
.disk_name(disk_name)
.fetch()
.await?;
let (.., authz_instance, db_instance) =
let (.., authz_instance, _) =
LookupPath::new(opctx, &self.db_datastore)
.project_id(authz_project.id())
.instance_name(instance_name)
.fetch()
.await?;
let instance_id = &authz_instance.id();

match &db_disk.state().into() {
// This operation is a noop if the disk is not attached or already
// detaching from the same instance.
// TODO-security should it be an error if you're not authorized to
// do this and we did not actually have to do anything?
DiskState::Creating => return Ok(db_disk),
DiskState::Detached => return Ok(db_disk),
DiskState::Destroyed => return Ok(db_disk),
DiskState::Faulted => return Ok(db_disk),
DiskState::Detaching(id) if id == instance_id => {
return Ok(db_disk)
}

// This operation is not allowed if the disk is attached to some
// other instance.
DiskState::Attaching(id) if id != instance_id => {
return Err(Error::InvalidRequest {
message: String::from("disk is attached elsewhere"),
});
}
DiskState::Attached(id) if id != instance_id => {
return Err(Error::InvalidRequest {
message: String::from("disk is attached elsewhere"),
});
}
DiskState::Detaching(_) => {
return Err(Error::InvalidRequest {
message: String::from("disk is attached elsewhere"),
});
}

// These are the cases where we have to do something.
DiskState::Attaching(_) => (),
DiskState::Attached(_) => (),
}

// If there's a propolis zone for this instance, ask the Sled
// Agent to hot-remove the disk.
match &db_instance.runtime_state.state.state() {
InstanceState::Running | InstanceState::Starting => {
self.disk_set_runtime(
opctx,
&authz_disk,
&db_disk,
self.instance_sled(&db_instance).await?,
sled_agent_client::types::DiskStateRequested::Detached,
)
.await?;
}

_ => {
// If there is not a propolis zone, then disk detach only occurs
// in the DB.
let new_runtime = db_disk.runtime().detach();

self.db_datastore
.disk_update_runtime(opctx, &authz_disk, &new_runtime)
.await?;
}
}

self.db_datastore.disk_refetch(opctx, &authz_disk).await
// TODO(https://github.com/oxidecomputer/omicron/issues/811):
// Disk detach is only implemented for instances that are not
// currently running. This operation therefore can operate exclusively
// on database state.
//
// To implement hot-unplug support, we should do the following in a saga:
// - Update the state to "Detaching", rather than "Detached".
// - If the instance is running...
// - Issue a request to "disk detach" to the associated sled agent,
// using the "state generation" value from the moment we attached.
// - Update the DB if the request succeeded (hopefully to "Detached").
// - If the instance is not running...
// - Update the disk state in the DB to "Detached".
let disk = self
.db_datastore
.instance_detach_disk(&opctx, &authz_instance, &authz_disk)
.await?;
Ok(disk)
}
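
The TODO above leans on the disk's "state generation" for the eventual hot-unplug saga: a DB update computed from an older read should only land if nothing else has advanced the state since. A standalone sketch of that compare-and-swap pattern, with hypothetical fields rather than Omicron's actual schema:

// A disk's runtime state carries a monotonically increasing generation. An
// update is applied only if the generation still matches the value read when
// the saga step started, so a stale step silently loses to newer state
// instead of overwriting it.
#[derive(Debug, PartialEq)]
struct DiskRuntime {
    state: &'static str,
    generation: u64,
}

fn apply_guarded_update(
    row: &mut DiskRuntime,
    expected_generation: u64,
    new_state: &'static str,
) -> bool {
    if row.generation != expected_generation {
        return false; // another writer advanced the state first; do nothing
    }
    row.state = new_state;
    row.generation += 1;
    true
}

fn main() {
    let mut disk = DiskRuntime { state: "detaching", generation: 7 };
    assert!(apply_guarded_update(&mut disk, 7, "detached"));
    // A second update based on the stale generation (7) is rejected.
    assert!(!apply_guarded_update(&mut disk, 7, "attached"));
    assert_eq!(disk, DiskRuntime { state: "detached", generation: 8 });
}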

/// Create a network interface attached to the provided instance.