-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nexus] Add HTTPS support, plumbing x509 certificates #1500
Conversation
Part of #249 This PR forces Nexus's external interface to be served via HTTPS when deployed by the sled-agent. - The packaging system expects to find these certificates within `./out/certs`, named `cert.pem` and `key.pem`. - `./tools/create_self_signed_cert.sh` is capable of creating a self-signed certificate.
# Note, we could just map the whole "out/certs" directory, but this ensures | ||
# both files exist. | ||
[[package.omicron-nexus.paths]] | ||
from = "out/certs/cert.pem" | ||
to = "/var/nexus/certs/cert.pem" | ||
[[package.omicron-nexus.paths]] | ||
from = "out/certs/key.pem" | ||
to = "/var/nexus/certs/key.pem" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a longer-term issue to resolve, but I'd like for this to not necessarily be bundled with the package.
However, this may require a different mechanism for acquiring / rotation of the certificate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed up today and met with Rick to discuss this.
My takeaways:
- RSS must provide a way for users to input the Nexus IP address + initial certificates
- Later, an interface through Nexus will support modifying these values
- A key storage mechanism like https://github.com/hashicorp/vault - not necessarily this particular technology, but something highly available - should be used to store keys for both internal and externally facing services.
- For internal services, we could concoct a scheme where:
- When booted, services generate a public/private key pair
- The private key stays private to the service while it exists in-memory
- The public key gets published to "vault" - or whatever key management facility we use - and signed by the internal root CA. This should rely on trusting "who is launching the service" (namely, sled agent) to extend this trust.
- For external services, we rely on user input for these certificates. We could store them in this "vault", but cannot generate them ourselves.
- For internal services, we could concoct a scheme where:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tracked by #1528
Nexus's external interface will typically be served using public-facing x.509 | ||
certificate. While we are still configuring the mechanism to integrate this real | ||
certificate into the package system, `./tools/create_self_signed_cert.sh` can be | ||
used to generate an equivalent self-signed certificate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO(me): I wonder if I can fix #1398 and make this easier at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: #1528. We're gonna need to grab / insert certs from somewhere.
I've updating the packaging hints to make this more obvious if an error is encountered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this...it'll be great to have TLS here!
I think all of my comments boil down to the way this gets configured. I didn't follow why we don't just pass through the corresponding dropshot config. I may well have missed something here about why that wouldn't work. But if we could do that, it'd be less code and at the same time preserve more configurability -- some of which seems missing here for the development machine use case.
@@ -53,6 +53,13 @@ This script requires Omicron be uninstalled, e.g., with `pfexec | |||
that is not the case. The script will then remove the file-based vdevs and the | |||
VNICs created by `create_virtual_hardware.sh`. | |||
|
|||
=== Make me a certificate! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an analogous change for how-to-run-simulated
? In general, the test suite matches what how-to-run-simulated
does so if you've updated that then things should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not, because the implementation within Nexus checks for the existence of the certificate files before deciding whether or not to launch the HTTPS server.
For the simulated server, they won't exist, so only the HTTP server will be launched.
I definitely think this would be required before moving to "HTTPS only", but considered this PR an intermediate step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioned in the other comment, but that doesn't seem like the right behavior. It seems like in production, if they don't exist, that should be a fatal error. And in development, if they do exist, we should start the HTTP server. If this were part of configuration, then whoever's running Nexus can express their intent. (I'm not saying we have to do all that in this PR, but if we decide that is what we want, then I think this behavior is not really an intermediate step.)
nexus/src/lib.rs
Outdated
// would be the more secure long-term plan, but would make gradual | ||
// deployment of this feature more difficult. | ||
|
||
let cert_file = PathBuf::from("/var/nexus/certs/cert.pem"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unlikely to work on development machines.
Why not make this part of dropshot_external
, with /var/nexus/certs/cert.pem
the value in the SMF config file and something like "out/certs" the value in nexus/examples/config.toml?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this file is only used if cert_file.exists() && key_file.exists()
.
I was kinda opposed to using the config-file-supplied certificate, because longer-term, it seemed like this value would be:
- Passed from RSS into Nexus: RSS -> Nexus handoff plumbing #1528 , as part of the rack initialization request
- Stored in CRDB (and eventually a more secure key store, like Vault): https://github.com/oxidecomputer/omicron/pull/1500/files#r932591199
So although we are currently plumbing the certificates through the package system, I don't expect that to stay the case for very long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this file is only used if cert_file.exists() && key_file.exists().
What happens if the path is missing on a production system? It seems like that should produce a fatal error.
I was kinda opposed to using the config-file-supplied certificate
I can see where you're coming from. At the same time, it seems like an essential developer or demo tool to be able to start Nexus with a specific certificate?
Is that a good idea for the production config? I'm not sure. It might depend on the expected behavior in the longer term. Is it that Nexus reaches out to CockroachDB/Vault to get the certificate, then loads that into memory, and then starts the HTTPS server with it? If that's true, then what do we do if Vault is down? I could see: (1) not coming up at all until we can get the cert, (2) storing the last-known-good certificate to a file and trying to use that until we can get the cert, (3) generating a self-signed certificate and using that until we're able to access the real one. These all have different tradeoffs. I've been assuming Nexus could come up (albeit not very usefully) even if everything else in the world was down so that we could at least ask it why it wasn't able to come up. Anyway, if we go with (2) or (3), then it seems like accepting a file path in the config file would be a reasonable stepping stone.
(Along these lines, I guess hardcoded, deployment-specific absolute paths in source files seem like a code smell to me, even if we gracefully handle them being missing.)
I've tried to make this more amenable to the idea of "allow the Since Sled Agent is the entity actually responsible for launching Nexus, it is now plumbing this information, by passing a I do still think there's a legitimate question of "how do we pass these certificates when they are not part of the Nexus package" - it seems like sled agent will need to load them into the filesystem somehow to force them into the Nexus zone, and ultimately the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! I like this a lot better. Thanks. (I hope you at least don't hate it!)
Crucible changes Add test and fix for replay race condition (#1519) Fix clippy warnings (#1517) Add edition to `crucible-workspace-hack` (#1516) Split out Downstairs-specific stats (#1511) Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track both jobs and bytes in each IO state (#1507) Fix new `rustc` and `clippy` warnings (#1509) Remove IOP/BW limits (for now) (#1506) Move `GuestBlockRes` into `DownstairsIO` (#1502) Update actions/checkout digest to eef6144 (#1499) Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off test-up-2region-encrypted.sh (#1504) Add `IOop::Barrier` (#1494) Fix IPv6 addresses in `crutest` (#1503) Add region set options to more tests. (#1496) Simplify `CompleteJobs` (#1493) Removed ignored CI jobs (#1497) Minor cleanups to `print_last_completed` (#1501) Remove remaining `Arc<Volume>` instances (#1500) Add `VolumeBuilder` type (#1492) remove old unused scripts (#1495) More multiple region support. (#1484) Simplify matches (#1490) Move complete job tracker to a helper object (#1489) Expand summary and add documentation references to the README. (#1486) Remove `GuestWorkId` (2/2) (#1482) Remove `JobId` from `DownstairsIO` (1/2) (#1481) Remove unused `#[derive(..)]` (#1483) Update more tests to use dsc (#1480) Crutest now Volume only (#1479) Propolis changes manually impl Deserialize for PciPath for validation purposes (#801) phd: gate OS-specific tests, make others more OS-agnostic (#799) lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795) add marker trait to help check safety of guest memory reads (#794) clippy fixes for 1.82 (#796) lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782) PHD: write efivars in one go (#786) PHD: support guest-initiated reboot (#785) server: accept CPUID values in instance specs and plumb them to bhyve (#780) PHD: allow patched Crucible dependencies (#778) server: add a first-class error type to machine init (#777) PciPath to Bdf conversion is infallible; prove it and refactor (#774) instance spec rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when waiting to init Less anxiety-inducing `Vm::{get, state_watcher}`
Crucible changes Add test and fix for replay race condition (#1519) Fix clippy warnings (#1517) Add edition to `crucible-workspace-hack` (#1516) Split out Downstairs-specific stats (#1511) Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track both jobs and bytes in each IO state (#1507) Fix new `rustc` and `clippy` warnings (#1509) Remove IOP/BW limits (for now) (#1506) Move `GuestBlockRes` into `DownstairsIO` (#1502) Update actions/checkout digest to eef6144 (#1499) Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off test-up-2region-encrypted.sh (#1504) Add `IOop::Barrier` (#1494) Fix IPv6 addresses in `crutest` (#1503) Add region set options to more tests. (#1496) Simplify `CompleteJobs` (#1493) Removed ignored CI jobs (#1497) Minor cleanups to `print_last_completed` (#1501) Remove remaining `Arc<Volume>` instances (#1500) Add `VolumeBuilder` type (#1492) remove old unused scripts (#1495) More multiple region support. (#1484) Simplify matches (#1490) Move complete job tracker to a helper object (#1489) Expand summary and add documentation references to the README. (#1486) Remove `GuestWorkId` (2/2) (#1482) Remove `JobId` from `DownstairsIO` (1/2) (#1481) Remove unused `#[derive(..)]` (#1483) Update more tests to use dsc (#1480) Crutest now Volume only (#1479) Propolis changes manually impl Deserialize for PciPath for validation purposes (#801) phd: gate OS-specific tests, make others more OS-agnostic (#799) lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795) add marker trait to help check safety of guest memory reads (#794) clippy fixes for 1.82 (#796) lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782) PHD: write efivars in one go (#786) PHD: support guest-initiated reboot (#785) server: accept CPUID values in instance specs and plumb them to bhyve (#780) PHD: allow patched Crucible dependencies (#778) server: add a first-class error type to machine init (#777) PciPath to Bdf conversion is infallible; prove it and refactor (#774) instance spec rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when waiting to init Less anxiety-inducing `Vm::{get, state_watcher}` --------- Co-authored-by: Alan Hanson <[email protected]>
Another attempt at #1287
In addition to launching an HTTPS server, this also launches an HTTP server so we can smoothly migrate clients (like the CLI).
Part of #249