Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Oximeter self-assembling zone #4534

Merged
merged 18 commits into from
Jan 25, 2024

Conversation

karencfv
Copy link
Contributor

@karencfv karencfv commented Nov 21, 2023

Related #1898

Closes: #2883

@jclulow
Copy link
Collaborator

jclulow commented Nov 21, 2023

I think we need to break the network configuration out into a separate service. It really shouldn't be in the method script for Clickhouse; to the extent that shortcut has been taken in other places we'll need to go unwind that as well.

@karencfv
Copy link
Contributor Author

@jclulow Thanks for taking a look! This configuration is in all of the self assembled zones right now (cockroachdb, clickhouse, clickhouse-keeper).

It looks like some of this configuration was meant to be short lived https://github.com/oxidecomputer/omicron/pull/3520/files#diff-2149edd4e04f9ab246b1eaf806daf472e0599252e3c72d3fccbc9db983ab5262R20-R25

@jmpesp, https://github.com/oxidecomputer/stlouis/issues/435 has been closed now. Can we remove some of the current configuration from the method script files?

karencfv added a commit that referenced this pull request Jan 15, 2024
As part of the work for [self assembling
zones](#1898), it was
[suggested](#4534 (comment))
to break the network configuration out into a separate service.

## Implementation

This PR introduces a new SMF service `oxide/zone-network-setup`, which
sets up the common initial zone networking configuration for each self
assembled zone.

Each of the "self assembled zone" services will now depend on this new
service to run, and all properties relating to zone network
configuration have been removed from these services.

The executable which does the actual zone networking setup, is built as
a tiny CLI. It takes advantage of clap's parsing validation to make sure
we have all of the properties present, and in the format they are
intended to be.

## Caveats

There are two remaining self assembled zones that don't depend on this
new service yet (crucible and crucible-pantry). As these two zones need
coordinated PRs with the crucible repo, I'd like to implement these in a
follow up PR once this one is approved and merged.
@@ -3682,7 +3723,7 @@ mod test {
const GLOBAL_ZONE_BOOTSTRAP_IP: Ipv6Addr = Ipv6Addr::LOCALHOST;
const SWITCH_ZONE_BOOTSTRAP_IP: Ipv6Addr = Ipv6Addr::LOCALHOST;

const EXPECTED_ZONE_NAME_PREFIX: &str = "oxz_oximeter";
const EXPECTED_ZONE_NAME_PREFIX: &str = "oxz_ntp";
Copy link
Contributor Author

@karencfv karencfv Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently the mock zones don't work very well with self-assembled zones.

I attempted to adapt the tests to work with self assembled zones but it appears to be that this is a much larger body of work out of the scope of this PR. Additionally, we can confirm this new zone works as the "helios / deploy" test is passing.

This is what the tests fail with:

Could not create service: ZoneInitialize([("oxz_oximeter_181462b0-2f83-4cc8-9c95-95884a038c26", Io { message: "Failed to setup Oximeter profile", err: Os { code: 2, kind: NotFound, message: "No such file or directory" } })])

Output from the logs:

{"msg":"Creating ServiceManager","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828335743Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"sled agent started","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828532057Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","underlay_address":"::1"}
{"msg":"new DNS resolver","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828553155Z","hostname":"pop-os","pid":676988,"component":"DnsResolver","component":"ServiceManager","addresses":"[[::1:0:0:0:1]:53, [::2:0:0:0:1]:53, [::3:0:0:0:1]:53, [::4:0:0:0:1]:53, [::5:0:0:0:1]:53]"}
{"msg":"No ledger in /tmp/.tmpvxo60Q/omicron-zones.json","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828568418Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"Failed to read ledger: Not found in storage","v":0,"name":"test_ensure_service","level":20,"time":"2024-01-17T22:13:09.828581612Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","path":"/tmp/.tmpvxo60Q/omicron-zones.json"}
{"msg":"No ledger in /tmp/.tmpvxo60Q/omicron-zones.json","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828601728Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"Failed to read ledger: Not found in storage","v":0,"name":"test_ensure_service","level":20,"time":"2024-01-17T22:13:09.828614765Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","path":"/tmp/.tmpvxo60Q/omicron-zones.json"}
{"msg":"ensure_all_omicron_zones_persistent","v":0,"name":"test_ensure_service","level":20,"time":"2024-01-17T22:13:09.828633403Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","ledger_generation":"1","request_generation":"2"}
{"msg":"Configured to skip timesync checks","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828649126Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"top of bundle cleanup loop","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828823024Z","hostname":"pop-os","pid":676988,"component":"auto-cleanup-task","time_to_next_cleanup":"599.999378173s","next_cleanup":"Instant { tv_sec: 263767, tv_nsec: 242078616 }"}
{"msg":"Profile for oxz_oximeter_181462b0-2f83-4cc8-9c95-95884a038c26:\n\n<service_bundle type="profile" name="omicron">\n <service version="1" type="service" name="oxide/zone-network-setup">\n <property_group type="application" name="config">\n <propval type="astring" name="datalink" value="oxControlService0"/>\n <propval type="astring" name="gateway" value="::1"/>\n <propval type="astring" name="static_addr" value="::1"/>\n </property_group>\n <instance enabled="true" name="default">\n \n \n <service version="1" type="service" name="network/ssh">\n <instance enabled="false" name="default">\n \n \n <service version="1" type="service" name="oxide/oximeter">\n <instance enabled="true" name="default">\n <property_group type="application" name="config">\n <propval type="astring" name="id" value="181462b0-2f83-4cc8-9c95-95884a038c26"/>\n <propval type="astring" name="address" value="[::1]:12223"/>\n </property_group>\n \n \n</service_bundle>","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.829194091Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've traditionally used this function to manipulate the local filesystem paths to make any local file checking happier:

fn override_paths(&self, mgr: &ServiceManager) {
let dir = self.config_dir.path();
mgr.override_ledger_directory(dir.to_path_buf());
mgr.override_image_directory(dir.to_path_buf());
// We test launching "fake" versions of the zones, but the
// logic to find paths relies on checking the existence of
// files.
std::fs::write(dir.join("oximeter.tar.gz"), "Not a real file")
.unwrap();
std::fs::write(dir.join("ntp.tar.gz"), "Not a real file").unwrap();
}

Admittedly, this is trickier to do when manipulating paths in zones that haven't been started. I suspect that's what you're hitting -- instead of putting the profile xml file into the zone, we're finding that some /zone path doesn't exist.

I think "swapping the zone we're acting on" is fine, but if we make NTP self-assembling, we'll probably bump into this issue again.

I'd be fine with either:

  • Injecting a hook to manipulate "where that file gets dropped", so the test can inspect it, or
  • Doing a bigger overhaul of the tests, to validate the properties we want to check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "swapping the zone we're acting on" is fine, but if we make NTP self-assembling, we'll probably bump into this issue again.

yep! my idea was to do:

Doing a bigger overhaul of the tests, to validate the properties we want to check

In a follow up PR dedicated only to tests before porting the NTP zone to be self-assembling.

@karencfv karencfv marked this pull request as ready for review January 17, 2024 22:59
@karencfv karencfv requested a review from smklein January 17, 2024 22:59
@karencfv karencfv changed the title [sled-agent] Self-assembling zones [sled-agent] Oximeter self-assembling zone Jan 17, 2024
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple questions!

package-manifest.toml Outdated Show resolved Hide resolved
@@ -123,7 +123,7 @@ source.type = "local"
source.rust.binary_names = ["oximeter", "clickhouse-schema-updater"]
source.rust.release = true
source.paths = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action needed, just trying to grok this) - I thought you were trying to figure out how to bundle the network zone into each zone that needs it, but I don't see that as a dependency of this package anywhere.

I see that the call to zone_network_setup_install adds the SMF-level dependency, but where are we actually shoving the "rust network service binary" into this zone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooof... good catch! 😬

Why are the "helios / deploy" tests passing though? Shouldn't the zone crap out because of the SMF-level dependency?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a byproduct of making the zone self-assembling, the zone might be "booting" successfully, but then failing to start services correctly later. On that note -- I'm not sure if that presubmit test needs Oximeter to be functional to go green. "helios / deploy" basically "just starts an instance, then passes". It's possible to do that while Oximeter is out to lunch?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Practically speaking, oximeter being unavailable doesn't preclude launching an instance. Propolis was reworked some time ago to continue attempting to register as a producer in the background.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need some tests that verify that all zones are running. I'll write up an issue

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! This is all working as intended, TL;DR is "don't trust the helios/deploy job to tell you that Oximeter is up".

@@ -116,15 +116,15 @@ setup_hint = """
- Run `pkg install library/postgresql-13` to download Postgres libraries
"""

[package.oximeter-collector]
[package.oximeter]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smklein I had to rename this because the other intermediary package seems to have to be named oximeter-collector because of the rust package name? Do you think this will cause problems down the line?

Copy link
Collaborator

@smklein smklein Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted offline about this, TL;DR:

  • The "real oximeter package" needs to be called package.oximeter-collector because that's the name of the rust crate we're building
  • The "real oximeter package" choice of service_name influences where the binary in this tarball gets placed

For the composite package, we have no such restrictions, and can pick whatever naming conventions we want (edit: as long as the package name doesn't conflict!)

@karencfv
Copy link
Contributor Author

After much fiddling around with packages I finally got to something that works 😅

Can confirm Oximeter zone is running successfully in the Helios deploy job. From the logs:

<...>
{"msg":"registered endpoint","v":0,"name":"oximeter","level":20,"time":"2024-01-25T02:33:03.345190123Z","hostname":"oxz_oximeter_7faae0da-ba40-4500-9cf9-516110aa835e","pid":24576,"local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","path":"/info","method":"GET"}
{"msg":"registered endpoint","v":0,"name":"oximeter","level":20,"time":"2024-01-25T02:33:03.345223044Z","hostname":"oxz_oximeter_7faae0da-ba40-4500-9cf9-516110aa835e","pid":24576,"local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","path":"/producers","method":"GET"}
{"msg":"registered endpoint","v":0,"name":"oximeter","level":20,"time":"2024-01-25T02:33:03.345238115Z","hostname":"oxz_oximeter_7faae0da-ba40-4500-9cf9-516110aa835e","pid":24576,"local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","path":"/producers","method":"POST"}
{"msg":"registered endpoint","v":0,"name":"oximeter","level":20,"time":"2024-01-25T02:33:03.345244555Z","hostname":"oxz_oximeter_7faae0da-ba40-4500-9cf9-516110aa835e","pid":24576,"local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","path":"/producers/{producer_id}","method":"DELETE"}
<...>

Further testing will be implemented in #4835

@karencfv karencfv merged commit 9e08978 into oxidecomputer:main Jan 25, 2024
15 checks passed
@karencfv karencfv deleted the self-assembling-zones-part-1 branch January 25, 2024 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert Oximeter zone to be self-assembling
4 participants