-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sled-agent] Oximeter self-assembling zone #4534
[sled-agent] Oximeter self-assembling zone #4534
Conversation
I think we need to break the network configuration out into a separate service. It really shouldn't be in the method script for Clickhouse; to the extent that shortcut has been taken in other places we'll need to go unwind that as well. |
@jclulow Thanks for taking a look! This configuration is in all of the self assembled zones right now (cockroachdb, clickhouse, clickhouse-keeper). It looks like some of this configuration was meant to be short lived https://github.com/oxidecomputer/omicron/pull/3520/files#diff-2149edd4e04f9ab246b1eaf806daf472e0599252e3c72d3fccbc9db983ab5262R20-R25 @jmpesp, https://github.com/oxidecomputer/stlouis/issues/435 has been closed now. Can we remove some of the current configuration from the method script files? |
As part of the work for [self assembling zones](#1898), it was [suggested](#4534 (comment)) to break the network configuration out into a separate service. ## Implementation This PR introduces a new SMF service `oxide/zone-network-setup`, which sets up the common initial zone networking configuration for each self assembled zone. Each of the "self assembled zone" services will now depend on this new service to run, and all properties relating to zone network configuration have been removed from these services. The executable which does the actual zone networking setup, is built as a tiny CLI. It takes advantage of clap's parsing validation to make sure we have all of the properties present, and in the format they are intended to be. ## Caveats There are two remaining self assembled zones that don't depend on this new service yet (crucible and crucible-pantry). As these two zones need coordinated PRs with the crucible repo, I'd like to implement these in a follow up PR once this one is approved and merged.
@@ -3682,7 +3723,7 @@ mod test { | |||
const GLOBAL_ZONE_BOOTSTRAP_IP: Ipv6Addr = Ipv6Addr::LOCALHOST; | |||
const SWITCH_ZONE_BOOTSTRAP_IP: Ipv6Addr = Ipv6Addr::LOCALHOST; | |||
|
|||
const EXPECTED_ZONE_NAME_PREFIX: &str = "oxz_oximeter"; | |||
const EXPECTED_ZONE_NAME_PREFIX: &str = "oxz_ntp"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently the mock zones don't work very well with self-assembled zones.
I attempted to adapt the tests to work with self assembled zones but it appears to be that this is a much larger body of work out of the scope of this PR. Additionally, we can confirm this new zone works as the "helios / deploy" test is passing.
This is what the tests fail with:
Could not create service: ZoneInitialize([("oxz_oximeter_181462b0-2f83-4cc8-9c95-95884a038c26", Io { message: "Failed to setup Oximeter profile", err: Os { code: 2, kind: NotFound, message: "No such file or directory" } })])
Output from the logs:
{"msg":"Creating ServiceManager","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828335743Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"sled agent started","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828532057Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","underlay_address":"::1"}
{"msg":"new DNS resolver","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828553155Z","hostname":"pop-os","pid":676988,"component":"DnsResolver","component":"ServiceManager","addresses":"[[::1:0:0:0:1]:53, [::2:0:0:0:1]:53, [::3:0:0:0:1]:53, [::4:0:0:0:1]:53, [::5:0:0:0:1]:53]"}
{"msg":"No ledger in /tmp/.tmpvxo60Q/omicron-zones.json","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828568418Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"Failed to read ledger: Not found in storage","v":0,"name":"test_ensure_service","level":20,"time":"2024-01-17T22:13:09.828581612Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","path":"/tmp/.tmpvxo60Q/omicron-zones.json"}
{"msg":"No ledger in /tmp/.tmpvxo60Q/omicron-zones.json","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828601728Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"Failed to read ledger: Not found in storage","v":0,"name":"test_ensure_service","level":20,"time":"2024-01-17T22:13:09.828614765Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","path":"/tmp/.tmpvxo60Q/omicron-zones.json"}
{"msg":"ensure_all_omicron_zones_persistent","v":0,"name":"test_ensure_service","level":20,"time":"2024-01-17T22:13:09.828633403Z","hostname":"pop-os","pid":676988,"component":"ServiceManager","ledger_generation":"1","request_generation":"2"}
{"msg":"Configured to skip timesync checks","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828649126Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
{"msg":"top of bundle cleanup loop","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.828823024Z","hostname":"pop-os","pid":676988,"component":"auto-cleanup-task","time_to_next_cleanup":"599.999378173s","next_cleanup":"Instant { tv_sec: 263767, tv_nsec: 242078616 }"}
{"msg":"Profile for oxz_oximeter_181462b0-2f83-4cc8-9c95-95884a038c26:\n\n<service_bundle type="profile" name="omicron">\n <service version="1" type="service" name="oxide/zone-network-setup">\n <property_group type="application" name="config">\n <propval type="astring" name="datalink" value="oxControlService0"/>\n <propval type="astring" name="gateway" value="::1"/>\n <propval type="astring" name="static_addr" value="::1"/>\n </property_group>\n <instance enabled="true" name="default">\n \n \n <service version="1" type="service" name="network/ssh">\n <instance enabled="false" name="default">\n \n \n <service version="1" type="service" name="oxide/oximeter">\n <instance enabled="true" name="default">\n <property_group type="application" name="config">\n <propval type="astring" name="id" value="181462b0-2f83-4cc8-9c95-95884a038c26"/>\n <propval type="astring" name="address" value="[::1]:12223"/>\n </property_group>\n \n \n</service_bundle>","v":0,"name":"test_ensure_service","level":30,"time":"2024-01-17T22:13:09.829194091Z","hostname":"pop-os","pid":676988,"component":"ServiceManager"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've traditionally used this function to manipulate the local filesystem paths to make any local file checking happier:
omicron/sled-agent/src/services.rs
Lines 3944 to 3955 in 213d5cc
fn override_paths(&self, mgr: &ServiceManager) { | |
let dir = self.config_dir.path(); | |
mgr.override_ledger_directory(dir.to_path_buf()); | |
mgr.override_image_directory(dir.to_path_buf()); | |
// We test launching "fake" versions of the zones, but the | |
// logic to find paths relies on checking the existence of | |
// files. | |
std::fs::write(dir.join("oximeter.tar.gz"), "Not a real file") | |
.unwrap(); | |
std::fs::write(dir.join("ntp.tar.gz"), "Not a real file").unwrap(); | |
} |
Admittedly, this is trickier to do when manipulating paths in zones that haven't been started. I suspect that's what you're hitting -- instead of putting the profile xml file into the zone, we're finding that some /zone
path doesn't exist.
I think "swapping the zone we're acting on" is fine, but if we make NTP self-assembling, we'll probably bump into this issue again.
I'd be fine with either:
- Injecting a hook to manipulate "where that file gets dropped", so the test can inspect it, or
- Doing a bigger overhaul of the tests, to validate the properties we want to check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "swapping the zone we're acting on" is fine, but if we make NTP self-assembling, we'll probably bump into this issue again.
yep! my idea was to do:
Doing a bigger overhaul of the tests, to validate the properties we want to check
In a follow up PR dedicated only to tests before porting the NTP zone to be self-assembling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a couple questions!
package-manifest.toml
Outdated
@@ -123,7 +123,7 @@ source.type = "local" | |||
source.rust.binary_names = ["oximeter", "clickhouse-schema-updater"] | |||
source.rust.release = true | |||
source.paths = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(No action needed, just trying to grok this) - I thought you were trying to figure out how to bundle the network zone into each zone that needs it, but I don't see that as a dependency of this package anywhere.
I see that the call to zone_network_setup_install
adds the SMF-level dependency, but where are we actually shoving the "rust network service binary" into this zone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooof... good catch! 😬
Why are the "helios / deploy" tests passing though? Shouldn't the zone crap out because of the SMF-level dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a byproduct of making the zone self-assembling, the zone might be "booting" successfully, but then failing to start services correctly later. On that note -- I'm not sure if that presubmit test needs Oximeter to be functional to go green. "helios / deploy" basically "just starts an instance, then passes". It's possible to do that while Oximeter is out to lunch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Practically speaking, oximeter
being unavailable doesn't preclude launching an instance. Propolis was reworked some time ago to continue attempting to register as a producer in the background.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need some tests that verify that all zones are running. I'll write up an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup! This is all working as intended, TL;DR is "don't trust the helios/deploy job to tell you that Oximeter is up".
@@ -116,15 +116,15 @@ setup_hint = """ | |||
- Run `pkg install library/postgresql-13` to download Postgres libraries | |||
""" | |||
|
|||
[package.oximeter-collector] | |||
[package.oximeter] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smklein I had to rename this because the other intermediary package seems to have to be named oximeter-collector
because of the rust package name? Do you think this will cause problems down the line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chatted offline about this, TL;DR:
- The "real oximeter package" needs to be called
package.oximeter-collector
because that's the name of the rust crate we're building - The "real oximeter package" choice of
service_name
influences where the binary in this tarball gets placed
For the composite package, we have no such restrictions, and can pick whatever naming conventions we want (edit: as long as the package name doesn't conflict!)
After much fiddling around with packages I finally got to something that works 😅 Can confirm Oximeter zone is running successfully in the Helios deploy job. From the logs:
Further testing will be implemented in #4835 |
Related #1898
Closes: #2883