set up Recovery Silo and user during rack initialization #2943
Conversation
Looks great, thanks for adding this!
let conn = self.pool_authorized(opctx).await?;
self.silo_create_conn(
    conn,
    opctx,
I know this isn't totally new in this PR, but what's the justification for separately passing an opctx and a group_opctx?

I've always considered OpContext a representation of "who called this function, and with what authn/authz", so seeing two of them in a single function is a little jarring. After all, isn't only one "caller" asking for the silo to be created?

(I think this has something to do with "automatically giving admin access to a particular group within a newly created silo". I guess I'm just a little surprised to see that take the form of an OpContext, since I would have suspected that this group name would just be a "name" parameter. Passing an OpContext may be totally fine here, but it's probably worth clarifying the distinction between two parameters to a function that happen to have the same type.)
Yeah, the problem here is that you have two different actors performing this operation together: (1) the end user creating the Silo, and (2) Nexus itself, acting on behalf of that same user. The reason is that while the person is probably a Fleet Admin, they do not generally have the privilege to create a group inside the Silo (see #1580, #1681), and they certainly don't have permission to read and modify the fleet-wide DNS configuration. But we do want them to be able to create a group and read/modify the DNS configuration in this specific, highly controlled case.

This is similar to the way that, during provisioning, Nexus needs access to the list of Sleds in the system even though the user doing the provisioning doesn't have permission to see that. In both cases, Nexus assumes an identity as itself to do those operations. This one's a little more complicated because we need both OpContexts: we can't just use Nexus's, because we want to use the user's OpContext to decide whether they're allowed to do the other things that are part of this operation, like creating the Silo itself.
There's a comment about this in the caller:
Lines 85 to 88 in 2317079
// Silo group creation happens as Nexus's "external authn" context,
// not the user's context here. The user may not have permission to
// create arbitrary groups in the Silo, but we allow them to create
// this one in this case.
I will update that comment to talk about DNS and add some comments here to explain this.
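To make the shape concrete, here is a minimal, self-contained sketch using stub types (this is not the real Omicron code; DataStore and silo_create here are only illustrative): the first OpContext is the calling user's and authorizes the Silo creation itself, while the second carries Nexus's "external authn" identity and is used for the privileged sub-steps.

struct OpContext {
    actor: &'static str,
}

struct DataStore;

impl DataStore {
    // Sketch only: the real datastore method takes more arguments and does
    // real authz checks; this just shows which identity performs which step.
    fn silo_create(&self, opctx: &OpContext, nexus_opctx: &OpContext) {
        // The Silo itself is created (and authorized) as the calling user...
        println!("creating Silo as {}", opctx.actor);
        // ...but group creation and fleet-wide DNS changes happen as Nexus,
        // because the user doesn't hold those privileges directly.
        println!("creating admin group as {}", nexus_opctx.actor);
        println!("updating external DNS as {}", nexus_opctx.actor);
    }
}

fn main() {
    let user = OpContext { actor: "fleet-admin user" };
    let nexus = OpContext { actor: "Nexus external authn" };
    DataStore.silo_create(&user, &nexus);
}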
Fixed in the latest commit.
    HashMap::new(),
)
}
/// returns a
?
Oops!
Fixed in the latest commit.
// It should *not* show up in the list because it's not discoverable.
assert_eq!(silos.len(), 0);
let (authz_silo, db_silo) = LookupPath::new(&opctx, &datastore)
    .silo_name(&nexus_db_model::Name(
        rack_init.recovery_silo.identity.name.clone(),
    ))
    .fetch()
    .await
    .expect("Failed to lookup Silo");
assert!(!db_silo.discoverable);
Do we confirm that it's not possible for an operator to create a new silo with the same name as the recovery silo?
Only by way of existing tests that verify that you cannot create a Silo with a name that conflicts with an existing Silo. I could add an explicit test here.
Yeah, I basically just wanted to confirm that all the otherwise-valid CRUD operations are unable to touch this one.
I filed #2985 for part of this. I'm also adding a test that we cannot re-use the name while it's in use. (Turns out there was a bug here where we would fail with a 500 because we were not handling the database error correctly. This is not specific to the recovery Silo.)
Test added in the latest commit.
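As a rough, self-contained illustration of the invariant that test checks (stub logic only, not the actual Omicron test helpers or datastore): creating a second Silo with the recovery Silo's name should produce an explicit "already exists" error rather than a 500.

use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum SiloCreateError {
    AlreadyExists,
}

// Stand-in for the uniqueness check the database enforces on Silo names.
fn silo_create(existing: &mut HashSet<String>, name: &str) -> Result<(), SiloCreateError> {
    if !existing.insert(name.to_string()) {
        return Err(SiloCreateError::AlreadyExists);
    }
    Ok(())
}

fn main() {
    let mut silos = HashSet::new();
    silo_create(&mut silos, "recovery").expect("recovery Silo should be created");
    assert_eq!(
        silo_create(&mut silos, "recovery"),
        Err(SiloCreateError::AlreadyExists),
        "reusing the recovery Silo's name must fail with a conflict"
    );
}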
nexus/test-utils/src/lib.rs
let recovery_silo = RecoverySiloConfig {
    silo_name: silo_name.clone(),
    user_name: user_name.clone(),
    // The test suite's password is "oxide". This password is only used by
Suppose we need to change this value, maybe because we're using a new hash function. How would I go about updating this value?
I noticed the string "oxide" hard-coded in some tests, so this seems a little subtle to change.
Good question. What belongs here is a password hash in PHC format using the argon2id algorithm and parameters at least as strong as we currently use to generate password hashes for new users. The API documentation for this type explains this. It's a little vague about what "secure enough parameters" means because (1) I didn't want to encode that information in multiple places and have it go out of date, (2) it's an internal API anyway, and (3) if you get it wrong, the failure mode is very explicit, e.g.:
omicron/nexus/passwords/src/lib.rs
Lines 548 to 549 in 2317079
"password hash: parameter 'm': expected at least 98304 (KiB), \ | |
found 4096" |
As you pointed out, it also needs to be consistent with any places that use the password. There shouldn't be many of these. Presumably tests will fail (somewhat explicitly, with a login failure) if that happens. There are a few different places that use the same password (e.g., omicron-dev run-all), but that's only for convenience. If you changed this one and "forgot" to update that one, that's fine.
Given these constraints, you can construct this hash with any tool that can generate Argon2 hashes. I use argon2.online for test stuff like this.
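If you'd rather generate the hash programmatically than with an online tool, something along these lines should work (a sketch using the argon2 crate; only the m >= 98304 KiB figure comes from the error message above, while the t and p values here are illustrative guesses, so check them against what nexus-passwords actually requires):

use argon2::{
    password_hash::{rand_core::OsRng, PasswordHasher, SaltString},
    Algorithm, Argon2, Params, Version,
};

fn main() {
    // m (memory cost, KiB) matches the minimum in the error above; t and p
    // are illustrative and may need adjusting.
    let params = Params::new(98304, 13, 1, None).expect("valid Argon2 params");
    let argon2 = Argon2::new(Algorithm::Argon2id, Version::V0x13, params);
    let salt = SaltString::generate(&mut OsRng);
    let hash = argon2
        .hash_password(b"oxide", &salt)
        .expect("hashing should succeed");
    // Prints a PHC-format string, e.g. "$argon2id$v=19$m=98304,t=13,p=1$..."
    println!("{hash}");
}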
Background: I started from the position that it's safer to transfer the password hash in the APIs (from Wicket -> RSS and from RSS -> Nexus) rather than the password itself. That said, for the test suite and omicron-dev run-all, we could define a Rust constant that is the password itself, hash it ourselves, and put the hash into the API calls. That might require more things outside of Nexus to depend on nexus-passwords, but maybe that's okay, and it probably has to happen eventually if Wicket is to hash whatever password the user provides.
I'll think a bit more about whether to commonize some of the password and hash uses in the test suite, run-all, and the sim sled agent. (I would probably not commonize these with the two config-rss.toml files, even though they use the same password, because (1) nothing depends on these being the same as each other or anything else, (2) these are supposed to be exactly what would come in via RSS, so I don't want to put a cleartext password there, and (3) I don't think there's a way to refer to a Rust constant in those.)
Updated in the latest commit. With each hardcoded hash I added a comment explaining what depends on it and referring people to the relevant docs. I also had the test suite itself encode the password rather than the hash so that it can generate its own hash and we don't need this hardcoded there.
I think I've addressed all of the review feedback (and it was approved anyway), so I'm going to enable auto-merge. Feel free to disable if I've missed something, and of course I'm happy to iterate post-integration as needed.
This change plumbs configuration for the initial (Recovery) Silo (see RFD 234), the initial (privileged) user in that Silo, and the delegated DNS name for external DNS all the way from RSS to Nexus. Nexus now creates that Silo and the user, grants the user Fleet Admin privileges, etc.
Fixes #2304.
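For orientation, a hedged sketch of roughly what such a configuration might carry (the silo_name and user_name fields appear in the test-utils excerpt above; the password-hash field name and the types are guesses, not the actual Omicron definition):

// Hypothetical shape only; the real RecoverySiloConfig lives in Omicron and
// may differ.
struct RecoverySiloConfig {
    silo_name: String,
    user_name: String,
    // PHC-format argon2id hash the initial user will log in with
    // (field name assumed for illustration).
    user_password_hash: String,
}

fn main() {
    let config = RecoverySiloConfig {
        silo_name: "recovery".to_string(),
        user_name: "recovery-admin".to_string(),
        user_password_hash: "$argon2id$v=19$m=98304,t=13,p=1$...".to_string(),
    };
    println!(
        "recovery Silo: {}, initial user: {}, hash: {}",
        config.silo_name, config.user_name, config.user_password_hash
    );
}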