Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] Add HTTPS support, plumbing x509 certificates #1500

Merged
merged 16 commits into from
Aug 8, 2022
Merged

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jul 26, 2022

Another attempt at #1287

In addition to launching an HTTPS server, this also launches an HTTP server so we can smoothly migrate clients (like the CLI).

Part of #249

Part of #249

This PR forces Nexus's external interface to be served via HTTPS when deployed by the sled-agent.

- The packaging system expects to find these certificates within `./out/certs`, named `cert.pem` and `key.pem`.
- `./tools/create_self_signed_cert.sh` is capable of creating a self-signed certificate.
Comment on lines +33 to +40
# Note, we could just map the whole "out/certs" directory, but this ensures
# both files exist.
[[package.omicron-nexus.paths]]
from = "out/certs/cert.pem"
to = "/var/nexus/certs/cert.pem"
[[package.omicron-nexus.paths]]
from = "out/certs/key.pem"
to = "/var/nexus/certs/key.pem"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a longer-term issue to resolve, but I'd like for this to not necessarily be bundled with the package.

However, this may require a different mechanism for acquiring / rotation of the certificate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed up today and met with Rick to discuss this.

My takeaways:

  • RSS must provide a way for users to input the Nexus IP address + initial certificates
    • Later, an interface through Nexus will support modifying these values
  • A key storage mechanism like https://github.com/hashicorp/vault - not necessarily this particular technology, but something highly available - should be used to store keys for both internal and externally facing services.
    • For internal services, we could concoct a scheme where:
      • When booted, services generate a public/private key pair
      • The private key stays private to the service while it exists in-memory
      • The public key gets published to "vault" - or whatever key management facility we use - and signed by the internal root CA. This should rely on trusting "who is launching the service" (namely, sled agent) to extend this trust.
    • For external services, we rely on user input for these certificates. We could store them in this "vault", but cannot generate them ourselves.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tracked by #1528

Comment on lines +58 to +61
Nexus's external interface will typically be served using public-facing x.509
certificate. While we are still configuring the mechanism to integrate this real
certificate into the package system, `./tools/create_self_signed_cert.sh` can be
used to generate an equivalent self-signed certificate.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO(me): I wonder if I can fix #1398 and make this easier at the same time?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: #1528. We're gonna need to grab / insert certs from somewhere.

I've updating the packaging hints to make this more obvious if an error is encountered.

common/src/nexus_config.rs Outdated Show resolved Hide resolved
@smklein smklein marked this pull request as ready for review August 4, 2022 14:17
@smklein smklein requested a review from davepacheco August 4, 2022 14:17
@smklein smklein changed the title X509 again [nexus] Add HTTPS support, plumbing x509 certificates Aug 4, 2022
@smklein smklein added the nexus Related to nexus label Aug 4, 2022
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this...it'll be great to have TLS here!

I think all of my comments boil down to the way this gets configured. I didn't follow why we don't just pass through the corresponding dropshot config. I may well have missed something here about why that wouldn't work. But if we could do that, it'd be less code and at the same time preserve more configurability -- some of which seems missing here for the development machine use case.

README.adoc Outdated Show resolved Hide resolved
common/src/nexus_config.rs Outdated Show resolved Hide resolved
@@ -53,6 +53,13 @@ This script requires Omicron be uninstalled, e.g., with `pfexec
that is not the case. The script will then remove the file-based vdevs and the
VNICs created by `create_virtual_hardware.sh`.

=== Make me a certificate!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an analogous change for how-to-run-simulated? In general, the test suite matches what how-to-run-simulated does so if you've updated that then things should work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not, because the implementation within Nexus checks for the existence of the certificate files before deciding whether or not to launch the HTTPS server.

For the simulated server, they won't exist, so only the HTTP server will be launched.

I definitely think this would be required before moving to "HTTPS only", but considered this PR an intermediate step.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned in the other comment, but that doesn't seem like the right behavior. It seems like in production, if they don't exist, that should be a fatal error. And in development, if they do exist, we should start the HTTP server. If this were part of configuration, then whoever's running Nexus can express their intent. (I'm not saying we have to do all that in this PR, but if we decide that is what we want, then I think this behavior is not really an intermediate step.)

common/src/nexus_config.rs Outdated Show resolved Hide resolved
nexus/src/lib.rs Outdated
// would be the more secure long-term plan, but would make gradual
// deployment of this feature more difficult.

let cert_file = PathBuf::from("/var/nexus/certs/cert.pem");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unlikely to work on development machines.

Why not make this part of dropshot_external, with /var/nexus/certs/cert.pem the value in the SMF config file and something like "out/certs" the value in nexus/examples/config.toml?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this file is only used if cert_file.exists() && key_file.exists().

I was kinda opposed to using the config-file-supplied certificate, because longer-term, it seemed like this value would be:

So although we are currently plumbing the certificates through the package system, I don't expect that to stay the case for very long.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this file is only used if cert_file.exists() && key_file.exists().

What happens if the path is missing on a production system? It seems like that should produce a fatal error.

I was kinda opposed to using the config-file-supplied certificate

I can see where you're coming from. At the same time, it seems like an essential developer or demo tool to be able to start Nexus with a specific certificate?

Is that a good idea for the production config? I'm not sure. It might depend on the expected behavior in the longer term. Is it that Nexus reaches out to CockroachDB/Vault to get the certificate, then loads that into memory, and then starts the HTTPS server with it? If that's true, then what do we do if Vault is down? I could see: (1) not coming up at all until we can get the cert, (2) storing the last-known-good certificate to a file and trying to use that until we can get the cert, (3) generating a self-signed certificate and using that until we're able to access the real one. These all have different tradeoffs. I've been assuming Nexus could come up (albeit not very usefully) even if everything else in the world was down so that we could at least ask it why it wasn't able to come up. Anyway, if we go with (2) or (3), then it seems like accepting a file path in the config file would be a reasonable stepping stone.

(Along these lines, I guess hardcoded, deployment-specific absolute paths in source files seem like a code smell to me, even if we gracefully handle them being missing.)

@smklein
Copy link
Collaborator Author

smklein commented Aug 5, 2022

I've tried to make this more amenable to the idea of "allow the ConfigDropshot structure to be plugged directly into Nexus", which attempted to address your comment here: #1500 (comment)

Since Sled Agent is the entity actually responsible for launching Nexus, it is now plumbing this information, by passing a Vec<ConfigDropshot> to represent any number of external interfaces. I'm hoping this addresses your comment here, in that it leaves the input to Nexus "somewhat general-purpose", so developer environments + test harnesses can do whatever they want to do.

I do still think there's a legitimate question of "how do we pass these certificates when they are not part of the Nexus package" - it seems like sled agent will need to load them into the filesystem somehow to force them into the Nexus zone, and ultimately the ConfigDropshot path, if we are still trying to stick to that configuration.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I like this a lot better. Thanks. (I hope you at least don't hate it!)

nexus/tests/config.test.toml Show resolved Hide resolved
@smklein smklein enabled auto-merge (squash) August 8, 2022 13:58
@smklein smklein merged commit d72a148 into main Aug 8, 2022
@smklein smklein deleted the x509-again branch August 8, 2022 14:48
leftwo pushed a commit that referenced this pull request Oct 25, 2024
Crucible changes
Add test and fix for replay race condition (#1519)
Fix clippy warnings (#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510)
Track both jobs and bytes in each IO state (#1507)
Fix new `rustc` and `clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411)
Turn off test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489)
Expand summary and add documentation references to the README. (#1486)
Remove `GuestWorkId` (2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799)
lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795)
add marker trait to help check safety of guest memory reads (#794)
clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782)
PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve (#780)
PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777)
PciPath to Bdf conversion is infallible; prove it and refactor (#774)
instance spec rework: flatten InstanceSpecV0 (#767)
Make PUT /instance/state 503 when waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`
leftwo added a commit that referenced this pull request Oct 30, 2024
Crucible changes
Add test and fix for replay race condition (#1519) Fix clippy warnings
(#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track
both jobs and bytes in each IO state (#1507) Fix new `rustc` and
`clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off
test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489) Expand summary and
add documentation references to the README. (#1486) Remove `GuestWorkId`
(2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799) lib:
log vCPU diagnostics on triple fault and for some unhandled exit types
(#795) add marker trait to help check safety of guest memory reads
(#794) clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts
(#782) PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve
(#780) PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777) PciPath to
Bdf conversion is infallible; prove it and refactor (#774) instance spec
rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when
waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`

---------

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nexus Related to nexus
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants