-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zone network configuration service #4677
Zone network configuration service #4677
Conversation
smf/cockroachdb/method_script.sh
Outdated
@@ -18,15 +18,17 @@ if [[ $DATALINK == unknown ]] || [[ $GATEWAY == unknown ]]; then | |||
fi | |||
|
|||
# TODO remove when https://github.com/oxidecomputer/stlouis/issues/435 is addressed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmpesp I see this issue has been closed, are we OK to remove this part now?
I have had a quick squiz at the PR, and I think adding a Rust program to do the work is indeed much better than a shell script! Thanks for digging into this. Before looking too closely at the code, I have some architectural thoughts. Separate SMF serviceI think we should do the networking setup in a wholly separate SMF service. In the PR as it stands, we've replaced some direct ipadm invocations in the method script for the Cockroach DB SMF service -- but the moving parts are still essentially a part of the Cockroach DB service. What I would sort of expect is to add a new SMF service, with just a default instance; e.g., svc:/oxide/network-setup:default. This would be enabled by default. It should be of the transient duration (see svc.startd(8)), exiting successfully when it completes setup. It should be idempotent, so that if it is restarted (whether after interruption or not) it does the right thing. It probably does not need to tear things down in a stop method, and it may not presently need to provide a functional refresh method at this time. To the extent that we need to communicate any information to this service from outside (e.g., addresses or subnets or interface names) we would do so via properties in the default service instance. The name, value, type, and behaviour of each of those properties (and indeed the FMRI of the whole service instance) would essentially become a cross-version stable interface; if sled agent needs to populate them, we should commit to them so that the version of sled agent and the version of the zone image aren't locked together. I suspect this new service would depend on the existing svc:/network/physical service that ships with the operating system. We would probably then make the OS milestone service, svc:/milestone/network, depend on our new service here. Then we would redo any other services, like Cockroach DB, that probably don't want to start until the network is configured, such that they depend on the network milestone. That's roughly what we do with a similar piece of software, the illumos metadata agent, except in a different context -- the metadata agent is a replacement for cloud-init, and runs in the global zone of virtualised guests on cloud platforms. In this case, the network setup service is running inside a non-global zone on top of an Omicron-managed system. No shell at allI don't think we need to use a shell script as the start/stop method for the new service. I would restructure the new Rust program to be invoked as the start method directly, rather than invoked several times from a new shell script. I also wouldn't do any argument construction or parsing, except for perhaps passing the method name (see %m in Method Tokens of smf_method(7)), but instead just use the smf crate (which uses libscf(3LIB) directly) to access any properties that we need. There is, I suspect, a lot to digest here. I've tried to link some manual pages that contain some of the details. When reading and writing SMF manifests, it may also be worth looking at the service_bundle(5) XML DTD, which is shipped in the OS and also in the source base; it contains a lot of comments about the structure of those files. Definitely happy to answer any questions or talk more about this whenever it would be helpful! Let me know! |
Hey @jclulow thanks a bunch for the detailed feedback! There's definitely a lot of Illumos knowledge that I'm missing here. I'm going to read the links you left me here, and I'll definitely take you up on that offer to catch up as I'll most likely have a lot of questions 😄 |
@jclulow, as I understand it, as it stands today, a new SMF service would have to be created by RSS. This would mean that we cannot have a new service running on the current racks until zones are provisioned by Nexus instead of RSS 😞. Hopefully I'm wrong about this, WDYT? |
I don't know much about RSS, but an SMF service is defined by an XML file that can just be shipped in the correct place in the zone image. If you put it in |
There's a bit of term overloading here - we can't currently start new control plane services / service zones on deployed racks because only RSS sets them up today, but SMF services are a lower level thing. Sled agent already configures and starts various SMF services within many of the zones it starts (e.g., all the services within the switch zone, where the list of switch zone SMF services has grown over time without having to deal with RSS or control plane service management). I think the hard line is probably "does sled-agent have all the information it needs to configure this {SMF,control plane} service", which is usually/always true for "SMF service" and false for "control plane service". |
Ah, perfect! Thanks for the clarification @jclulow @jgallagher 🙇♀️ |
@jclulow thanks a bunch for your help! I've updated the PR to reflect the feedback you gave me.
Let me know what you think :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this, it's going to make things a lot cleaner around self-assembly zones.
I have a few comments, some of which are just suggestions (Github doesn't make it easy to differentiate). If anything isn't clear or you don't agree, please let me know.
(and apologies for you now having to re-base this on top of my disable-sshd PR that just merged!)
Thanks for the review @citrus-it ! I think I've addressed all of your comments, let me know what you think :) |
This looks good, thanks. Just added a comment about the I think it's fine to just have a single static address for now, since there's no consumer for multiple addresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, this is looking good to me.
Thanks for the review @citrus-it! Tiny ping for @jclulow , @davepacheco @jgallagher or @smklein to review on the rust side of things, and I'll merge this :) |
I will take a look soon! |
heyo! 👋 Sorry for the pressure here I know there's so much going on! Would it be possible to get a review on the rust side of things? @citrus-it kindly went over the Illumos side of this PR, so there's less to review! The thing is, I need this approved so I can move on with the rest of the work for the self assembled zones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
illumos-utils/src/route.rs
Outdated
// If the entry is not found in the table, | ||
// the exit status of the command will be 3 (ESRCH). | ||
// When that is the case, we'll add the route. | ||
Some(3) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a nitpick -- errno values aren't necessarily the same between systems. I think this value of 3 "happens to be" right, but could we use the constant from https://docs.rs/libc/0.2.152/libc/constant.ESRCH.html instead?
fn zone_network_setup_install( | ||
info: &SledAgentInfo, | ||
zone: &InstalledZone, | ||
static_addr: &String, | ||
) -> Result<ServiceBuilder, Error> { | ||
let datalink = zone.get_control_vnic_name(); | ||
let gateway = &info.underlay_address.to_string(); | ||
|
||
let mut config_builder = PropertyGroupBuilder::new("config"); | ||
config_builder = config_builder | ||
.add_property("datalink", "astring", datalink) | ||
.add_property("gateway", "astring", gateway) | ||
.add_property("static_addr", "astring", static_addr); | ||
|
||
Ok(ServiceBuilder::new("oxide/zone-network-setup") | ||
.add_property_group(config_builder) | ||
.add_instance(ServiceInstanceBuilder::new("default"))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rad, this looks great (totally makes sense to have this be a static config in zones before they start)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇♀️
Thanks everyone! |
oxidecomputer/omicron#4677 implements a new zone network configuration setup service so control plane services don't have to set this up themselves. This PR updates crucible to use said services.
As part of the work for self assembling zones, it was suggested to break the network configuration out into a separate service.
Implementation
This PR introduces a new SMF service
oxide/zone-network-setup
, which sets up the common initial zone networking configuration for each self assembled zone.Each of the "self assembled zone" services will now depend on this new service to run, and all properties relating to zone network configuration have been removed from these services.
The executable which does the actual zone networking setup, is built as a tiny CLI. It takes advantage of clap's parsing validation to make sure we have all of the properties present, and in the format they are intended to be.
Caveats
There are two remaining self assembled zones that don't depend on this new service yet (crucible and crucible-pantry). As these two zones need coordinated PRs with the crucible repo, I'd like to implement these in a follow up PR once this one is approved and merged.