
set up Recovery Silo and user during rack initialization #2943

Merged: 9 commits merged into main on May 3, 2023

Conversation

@davepacheco (Collaborator) commented Apr 27, 2023:

This change plumbs configuration for the initial (Recovery) Silo (see RFD 234), the initial (privileged) user in that Silo, and the delegated DNS name for external DNS all the way from RSS to Nexus. Nexus now creates that Silo and user, grants the user Fleet Admin privileges, and so on.

Fixes #2304.
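For context, a rough sketch of the shape of the configuration handed from RSS to Nexus. The silo_name and user_name fields appear in a test snippet later in this review; the password-hash field and the helper types are assumptions based on the review discussion below, not the exact struct:

// Rough sketch only -- field and type names other than silo_name and
// user_name are stand-ins inferred from the review discussion.
pub struct RecoverySiloConfig {
    pub silo_name: Name,
    pub user_name: UserId,
    // PHC-format argon2id hash of the initial user's password; the
    // cleartext password is never carried over the RSS -> Nexus handoff.
    pub user_password_hash: PasswordHashString,
}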

@davepacheco marked this pull request as ready for review on April 28, 2023, 20:23
@davepacheco requested a review from smklein on April 28, 2023, 21:17

@smklein (Collaborator) left a comment:

Looks great, thanks for adding this!

    let conn = self.pool_authorized(opctx).await?;
    self.silo_create_conn(
        conn,
        opctx,

@smklein (Collaborator):

I know this isn't totally new in this PR, but what's the justification for separately passing an opctx and a group_opctx?

I've always considered OpContext a representation of "who called this function, and with what authn/authz", so seeing two of them in a single function is a little jarring. After all, isn't only one "caller" asking for the silo to be created?

(I think this has something to do with "automatically giving admin access to a particular group within a newly created silo". I guess I'm just a little surprised to see that take the form of an OpContext, since I would have expected this group name to just be a "name" parameter. Passing an OpContext may be totally fine here, but it's probably worth clarifying the distinction between two parameters to a function that happen to have the same type.)

@davepacheco (Collaborator, Author):

Yeah, the problem here is that two different actors perform this operation together: (1) the end user creating the Silo, and (2) Nexus itself, acting on behalf of that same user. The reason is that while the person is probably a Fleet Admin, they do not generally have the privilege to create a group inside the Silo (see #1580, #1681), and they certainly don't have permission to read and modify the fleet-wide DNS configuration. But we do want them to be able to create the group and read/modify the DNS configuration in this specific, highly controlled case. This is similar to how, during provisioning, Nexus needs access to the list of Sleds in the system even though the user doing the provisioning doesn't have permission to see that. In both cases, Nexus assumes its own identity to do those operations. This one is a little more complicated because we need both OpContexts: we can't just use Nexus's, because we want to use the user's OpContext to decide whether they're allowed to do other things as part of this operation, like create the Silo itself.

There's a comment about this in the caller:

// Silo group creation happens as Nexus's "external authn" context,
// not the user's context here. The user may not have permission to
// create arbitrary groups in the Silo, but we allow them to create
// this one in this case.

I will update that comment to talk about DNS and add some comments here to explain this.
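
To make the division of responsibilities concrete, here is a minimal sketch of how the two contexts might be used. This is not the actual silo_create_conn() code; the datastore method names and details are hypothetical stand-ins:

// Hypothetical sketch -- not the real datastore code. `opctx` is the end
// user's context and authorizes what they explicitly asked for (creating
// the Silo); `nexus_opctx` is Nexus's own "external authn" identity, used
// only for the steps the user is not normally allowed to perform.
async fn silo_create_sketch(
    datastore: &DataStore,
    opctx: &OpContext,       // the end user's context
    nexus_opctx: &OpContext, // Nexus's privileged context
    new_silo: SiloCreateParams,
) -> Result<Silo, Error> {
    // The user themselves must be authorized to create the Silo.
    let silo = datastore.silo_insert(opctx, &new_silo).await?;

    // Group creation and fleet-wide DNS changes run under Nexus's own
    // identity, because the caller typically lacks those privileges.
    datastore.silo_group_ensure(nexus_opctx, &silo, "admin").await?;
    datastore.dns_update(nexus_opctx, &silo).await?;

    Ok(silo)
}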

@davepacheco (Collaborator, Author):

Fixed in the latest commit.

        HashMap::new(),
    )
}
/// returns a

@smklein (Collaborator):

?

@davepacheco (Collaborator, Author):

Oops!

@davepacheco (Collaborator, Author):

Fixed in the latest commit.

Comment on lines +712 to +721
// It should *not* show up in the list because it's not discoverable.
assert_eq!(silos.len(), 0);
let (authz_silo, db_silo) = LookupPath::new(&opctx, &datastore)
    .silo_name(&nexus_db_model::Name(
        rack_init.recovery_silo.identity.name.clone(),
    ))
    .fetch()
    .await
    .expect("Failed to lookup Silo");
assert!(!db_silo.discoverable);

@smklein (Collaborator):

Do we confirm that it's not possible for an operator to create a new silo with the same name as the recovery silo?

@davepacheco (Collaborator, Author):

Only by way of existing tests that verify that you cannot create a Silo with a name that conflicts with an existing Silo. I could add an explicit test here.

@smklein (Collaborator):

Yeah, I basically just wanted to confirm that all the otherwise-valid CRUD operations are unable to touch this one.

@davepacheco (Collaborator, Author):

I filed #2985 for part of this. I'm also adding a test that we cannot re-use the name while it's in use. (Turns out there was a bug here where we would fail with a 500 because we were not handling the database error correctly. This is not specific to the recovery Silo.)
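
Roughly, the kind of check being added (a hypothetical sketch; the helper try_create_silo and the test-harness types are stand-ins, not the actual nexus-test-utils API):

// Hypothetical sketch of the test being described, not the real test code.
async fn check_recovery_silo_name_is_reserved(client: &TestClient) {
    // The recovery Silo already exists from rack initialization, so creating
    // another Silo with the same name should fail with a client error (an
    // "already exists" 400), not a 500 from an unhandled database error.
    let err = try_create_silo(client, "recovery")
        .await
        .expect_err("creating a Silo with a conflicting name should fail");
    assert_eq!(err.status(), 400);
    assert!(err.message().contains("already exists"));
}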

@davepacheco (Collaborator, Author):

Test added in the latest commit.

nexus/passwords/src/lib.rs (resolved review thread)
let recovery_silo = RecoverySiloConfig {
    silo_name: silo_name.clone(),
    user_name: user_name.clone(),
    // The test suite's password is "oxide". This password is only used by

@smklein (Collaborator):

Suppose we need to change this value, maybe because we're using a new hash function. How would I go about updating this value?

I noticed the string "oxide" hard-coded in some tests, so this seems a little subtle to change.

@davepacheco (Collaborator, Author):

Good question. What belongs here is a password hash in PHC format, using the argon2id algorithm and parameters at least as strong as what we currently use to generate password hashes for new users. The API documentation for this type explains this. It's a little vague about what "secure enough parameters" means because (1) I didn't want to encode that information in multiple places and have it go out of date, (2) it's an internal API anyway, and (3) if you get it wrong, the failure mode is very explicit (e.g., "password hash: parameter 'm': expected at least 98304 (KiB), found 4096").

As you pointed out, it also needs to be consistent with any places that use the password. There shouldn't be many of these. Presumably tests will fail (somewhat explicitly -- with a login failure) if that happens. There are a few different places that use the same password (e.g., omicron-dev run-all), but that's only for convenience. If you changed this one and "forgot" to update that one, that's fine.

Given these constraints, you can construct this hash with any tool that can generate Argon2 hashes. I use argon2.online for test stuff like this.

Background: I started from the position that it's safer to transfer the password hash in the APIs (from Wicket -> RSS and from RSS -> Nexus) rather than the password itself. That said, for the test suite and omicron-dev run-all, we could define a Rust constant that is the password itself, hash it ourselves, and put the hash into the API calls. That might require more things outside of Nexus to depend on nexus-passwords, but maybe that's okay, and it probably has to happen eventually if Wicket is to hash whatever password the user provides.


I'll think a bit more about whether to commonize some of the password and hash uses in the test suite, run-all, and the simulated sled agent. (I would probably not commonize these with the two config-rss.toml files, even though they use the same password, because (1) nothing depends on those being the same as each other or anything else, (2) those files are supposed to contain exactly what would come in via RSS, so I don't want to put a cleartext password there, and (3) I don't think there's a way to refer to a Rust constant in them.)
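
For illustration, a minimal sketch of producing such a PHC-format hash with the RustCrypto argon2 crate. The cost parameters are only examples keyed off the error message quoted above; the authoritative minimums live in nexus-passwords:

// Sketch only: the cost parameters below are illustrative (m taken from the
// error message above); the real minimums are enforced by nexus-passwords.
use argon2::{
    password_hash::{rand_core::OsRng, PasswordHasher, SaltString},
    Algorithm, Argon2, Params, Version,
};

fn example_phc_hash(password: &str) -> String {
    // m = 98304 KiB of memory, t = 13 iterations, p = 1 lane (example values).
    let params = Params::new(98304, 13, 1, None).expect("valid Argon2 params");
    let argon2 = Argon2::new(Algorithm::Argon2id, Version::V0x13, params);
    let salt = SaltString::generate(&mut OsRng);
    argon2
        .hash_password(password.as_bytes(), &salt)
        .expect("hashing failed")
        .to_string() // e.g. "$argon2id$v=19$m=98304,t=13,p=1$<salt>$<hash>"
}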

@davepacheco (Collaborator, Author):

Updated in the latest commit. With each hardcoded hash I added a comment explaining what depends on it and referring people to the relevant docs. I also had the test suite itself encode the password rather than the hash so that it can generate its own hash and we don't need this hardcoded there.

@davepacheco (Collaborator, Author) commented:

I think I've addressed all of the review feedback (and it was approved anyway), so I'm going to enable auto-merge. Feel free to disable if I've missed something, and of course I'm happy to iterate post-integration as needed.

@davepacheco enabled auto-merge (squash) on May 2, 2023, 23:32
@davepacheco merged commit 3072e20 into main on May 3, 2023
@davepacheco deleted the dap/initial-setup branch on May 3, 2023, 01:09