Tracking issue for "Self-assembling Zones" #1898

Closed
19 tasks done
smklein opened this issue Nov 1, 2022 · 0 comments · Fixed by #6162
@smklein
Collaborator

smklein commented Nov 1, 2022

This topic was discussed in the latter half of https://drive.google.com/file/d/185bFxdvDo_1aA5-T5ywp9t3ZCJbc52B7/view , but I'll briefly summarize here:

  • Currently, after sled agent boots zones, they require additional configuration before they can be made usable.
  • This configuration includes: Creating IP addresses, setting up routes, setting SMF configuration options, etc.
  • These configuration changes are made by invoking zlogin to access the zone.
  • Across reboot, a portion of this information (namely the SMF properties) is saved into /var/oxide. This instructs the sled agent how to relaunch zones across reboots.

Instead, we should do the following:

  • Make zones "self-assembling" as much as possible.
    • Before the zone is booted, the sled agent should inject a file into /var/svc/profile/site.xml to load an SMF profile with the run-time parameters
    • The zone should rely on services starting from the manifest-import service
    • These dynamic parameters can then be processed by a "method script" inside the zone
    • As a result, we should not need to call zlogin with a set of commands to run inside the zone
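
As a rough illustration of the first sub-bullet above, here is a minimal Rust sketch of rendering an SMF profile that could be written to /var/svc/profile/site.xml before the zone boots. The `PropVal` type, the `render_profile` function, and any service, property group, or property names passed to it are hypothetical and not taken from sled agent; the sketch only assumes the usual SMF `service_bundle` profile layout.

```rust
/// One SMF property to set in the zone's profile (hypothetical helper type).
pub struct PropVal<'a> {
    pub name: &'a str,
    pub smf_type: &'a str, // for example "astring"
    pub value: &'a str,
}

/// Escape the characters that are special inside XML attribute values.
fn xml_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}

/// Render a profile that sets `props` in property group `group` on the
/// default instance of `service`. All names are supplied by the caller.
pub fn render_profile(service: &str, group: &str, props: &[PropVal<'_>]) -> String {
    let mut xml = String::new();
    xml.push_str("<?xml version=\"1.0\"?>\n");
    xml.push_str(
        "<!DOCTYPE service_bundle SYSTEM \"/usr/share/lib/xml/dtd/service_bundle.dtd.1\">\n",
    );
    xml.push_str("<service_bundle type=\"profile\" name=\"site\">\n");
    xml.push_str(&format!(
        "  <service name=\"{}\" version=\"1\" type=\"service\">\n",
        xml_escape(service)
    ));
    xml.push_str("    <instance name=\"default\" enabled=\"true\">\n");
    xml.push_str(&format!(
        "      <property_group name=\"{}\" type=\"application\">\n",
        xml_escape(group)
    ));
    for p in props {
        xml.push_str(&format!(
            "        <propval name=\"{}\" type=\"{}\" value=\"{}\"/>\n",
            xml_escape(p.name),
            xml_escape(p.smf_type),
            xml_escape(p.value)
        ));
    }
    xml.push_str("      </property_group>\n");
    xml.push_str("    </instance>\n");
    xml.push_str("  </service>\n");
    xml.push_str("</service_bundle>\n");
    xml
}
```

In this sketch, the caller would write the returned string into the zone's filesystem before boot, so the run-time parameters are already in place when services start from the manifest-import service and no zlogin call is needed to set them.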

What are the advantages of doing this?

  • Less likely to have race conditions with SMF. We have had issues in the past, like #1115 ("[sled-agent] Propolis server SMF service listen address is not always set correctly"), related to asynchronous setting of SMF properties relative to launching the service. Additionally, we've dealt with race conditions between the sled agent and the manifest-import service. With this new mechanism, sled agent will set all configuration before starting the zone, so the ownership of the zone is clearer.
  • Less configuration outside the zone to re-launch the zone. Management of zones across reboots is a fair bit easier, no longer requiring auxiliary information in /var/oxide to know how to re-launch the zone.
  • Fewer calls to zlogin. This should be a minor efficiency boost.
  • Sled agent should become simpler. Hopefully. Let's see!

Tracking pieces:

@smklein smklein added Sled Agent Related to the Per-Sled Configuration and Management bootstrap services For those occasions where you want the rack to turn on labels Nov 1, 2022
@smklein smklein self-assigned this Nov 1, 2022
smklein added a commit that referenced this issue Apr 21, 2023
…1902)

Part of #1898
Relies on oxidecomputer/crucible#498

Converts Crucible, Cockroach, and Clickhouse to be (mostly)
self-assembling.

I'm happy to proceed and convert the rest of the zones we're launching
using a similar format, if we like how this looks.

Fixes #2886
@askfongjojo askfongjojo added this to the FCS milestone May 19, 2023
@morlandi7 morlandi7 modified the milestones: FCS, 1.0.3 Aug 15, 2023
@morlandi7 morlandi7 modified the milestones: 1.0.3, 3, 4 Oct 2, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5 Nov 14, 2023
@smklein smklein removed their assignment Nov 21, 2023
@karencfv karencfv self-assigned this Nov 21, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Nov 30, 2023
karencfv added a commit that referenced this issue Jan 15, 2024
As part of the work for [self assembling
zones](#1898), it was
[suggested](#4534 (comment))
to break the network configuration out into a separate service.

## Implementation

This PR introduces a new SMF service `oxide/zone-network-setup`, which
sets up the common initial zone networking configuration for each self
assembled zone.

Each of the "self assembled zone" services will now depend on this new
service to run, and all properties relating to zone network
configuration have been removed from these services.

The executable that does the actual zone networking setup is built as
a tiny CLI. It takes advantage of clap's parsing validation to make sure
that all of the properties are present and in the format they are
intended to be.
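
To illustrate the idea of validating properties at parse time, here is a minimal sketch that uses only the Rust standard library rather than clap, so it is not the CLI from this PR. The flag names (`--datalink`, `--gateway`, `--static-addr`), the `NetworkArgs` type, and the choice of IPv6 addresses are assumptions made for the example.

```rust
use std::net::Ipv6Addr;

/// Typed form of the (hypothetical) zone networking properties.
#[derive(Debug, PartialEq)]
pub struct NetworkArgs {
    pub datalink: String,
    pub gateway: Ipv6Addr,
    pub static_addr: Ipv6Addr,
}

/// Parse one address value, naming the offending flag on failure.
fn parse_addr(flag: &str, value: &str) -> Result<Ipv6Addr, String> {
    value
        .parse::<Ipv6Addr>()
        .map_err(|e| format!("invalid value {value:?} for {flag}: {e}"))
}

/// Parse `--flag value` pairs, rejecting unknown flags, missing values,
/// malformed addresses, and absent required properties.
pub fn parse_network_args(args: &[&str]) -> Result<NetworkArgs, String> {
    let mut datalink = None;
    let mut gateway = None;
    let mut static_addr = None;

    let mut iter = args.iter();
    while let Some(flag) = iter.next() {
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {flag}"))?;
        match *flag {
            "--datalink" => datalink = Some(value.to_string()),
            "--gateway" => gateway = Some(parse_addr(flag, value)?),
            "--static-addr" => static_addr = Some(parse_addr(flag, value)?),
            other => return Err(format!("unknown flag {other}")),
        }
    }

    // Every property is required; fail before touching the network.
    Ok(NetworkArgs {
        datalink: datalink.ok_or("missing --datalink")?,
        gateway: gateway.ok_or("missing --gateway")?,
        static_addr: static_addr.ok_or("missing --static-addr")?,
    })
}
```

The point of the design is that a missing or malformed property is reported before any networking commands run, instead of surfacing as a half-configured zone.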

## Caveats

There are two remaining self assembled zones that don't depend on this
new service yet (crucible and crucible-pantry). As these two zones need
coordinated PRs with the crucible repo, I'd like to implement these in a
follow up PR once this one is approved and merged.
@karencfv karencfv modified the milestones: 6, 7 Jan 23, 2024
karencfv added a commit that referenced this issue Jan 25, 2024
karencfv added a commit that referenced this issue Feb 21, 2024
Create new packages for crucible and pantry to include the zone network
config service.

Depends on oxidecomputer/crucible#1096.

These two PRs should be merged in coordination.

Related: #1898

### Crucible updates

This PR also merges a few changes from Crucible:

* fe0c5c7 - [smf] Use new zone network config service  
* 3d48060 - (upstream/main) Move a few methods into downstairs 
* b01e15c - Remove extra clone in upstairs read 
* b4f37b4 - Make `crucible-downstairs` not depend on upstairs 
* 733b7f9 - Update Rust crate rusqlite to 0.31 
* 961e971 - Update Rust crate reedline to 0.29.0 
* b946a04 - Update Rust crate clap to 4.5 
* 39f1f3f - Update Rust crate indicatif to 0.17.8 
* 4ea9387 - Update progenitor to bc0bb4b 
* ace10f4 - Do not 500 on snapshot delete for deleted region 
* 4105133 - Drop jobs from Offline downstairs. 
* 43dace9 - `Mutex<Work>` → `Work` 
* a1f3207 - Added a contributing.md 
* 13b8669 - Remove ExtentFlushClose::source_downstairs 
* 9b3f366 - Remove unnecessary mutexes from Downstairs
karencfv added a commit that referenced this issue Feb 23, 2024
### Overview

In addition to implementing the external DNS self assembling zone, this
PR contains a new SMF service called `opte-interface-setup`.

Closes: #2881
Related: #1898
### Implementation

This service makes use of the zone-network CLI tool to avoid having too
many CLIs doing things.

The CLI is now shipped independently so it can be called by two
different services.

The [`zone-networking opte-interface-set-up`](
https://github.com/oxidecomputer/omicron/pull/5059/files#diff-5fb7b70dc87176e02517181b0887ce250b6a4e4079e495990551deeca741dc8bR181-R202)
command sets up what the `ensure_address_for_port()` method used to set
up.

### Justification

The reasoning behind this new service is to avoid setting up too many
things via the method_script.sh file, and to avoid code duplication. The
Nexus zone will also be using this service to set up the OPTE interface.
karencfv added a commit that referenced this issue Mar 7, 2024
karencfv added a commit that referenced this issue Mar 7, 2024
@karencfv karencfv modified the milestones: 7, 8 Apr 1, 2024
karencfv added a commit that referenced this issue Apr 21, 2024
## Overview

This PR repurposes the zone-network CLI into a zone-setup CLI, in order
to remove as many zone start-up scripts as possible. This is also in
preparation to use this zone-setup CLI with the self assembling switch
zone.

Related: #1898

---------

Co-authored-by: Andy Fiddaman <[email protected]>
@karencfv karencfv modified the milestones: 8, 9 May 1, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 17, 2024
karencfv added a commit that referenced this issue Jul 22, 2024
## Overview

This PR migrates the switch zone to a self assembling format. There are
a few bits of old code I'll be cleaning up, more logs I'll be adding and
some documentation about how the switch zone flow works, but I'll do
this in follow up PRs to keep this one as compact as possible.

## Caveats

I've tested this in a local single node deployment and in the a4x2
testbed. Unfortunately, this is not enough testing to make sure all of
the services play nice together on a real rack. We'll have to keep an
eye on dogfood when this is deployed.

The only services that depend on the [common
networking](https://github.com/oxidecomputer/omicron/blob/30eb1ee38987201ac71f2115fdd89da4b08710c7/zone-setup/src/bin/zone-setup.rs#L578-L690)
[service](https://github.com/oxidecomputer/omicron/blob/30eb1ee38987201ac71f2115fdd89da4b08710c7/smf/zone-network-setup/manifest.xml)
are dendrite and MGS. While this makes sense on the a4x2 testbed, I'd
like to verify that these dependencies make sense when running on a real
rack.

As several people have worked on different parts of this zone, I've
tagged a whole bunch of people for review, sorry if this is overkill!
Just want to make sure I've got the right eyes on each service of the
zone.

Related: #1898
Closes: #2884

TODO:

- [x] Update Dendrite hashes after merging
oxidecomputer/dendrite#990