Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zone network configuration service #4677

Merged
merged 21 commits into from
Jan 15, 2024

Conversation

karencfv
Copy link
Contributor

@karencfv karencfv commented Dec 12, 2023

As part of the work for self assembling zones, it was suggested to break the network configuration out into a separate service.

Implementation

This PR introduces a new SMF service oxide/zone-network-setup, which sets up the common initial zone networking configuration for each self assembled zone.

Each of the "self assembled zone" services will now depend on this new service to run, and all properties relating to zone network configuration have been removed from these services.

The executable which does the actual zone networking setup, is built as a tiny CLI. It takes advantage of clap's parsing validation to make sure we have all of the properties present, and in the format they are intended to be.

Caveats

There are two remaining self assembled zones that don't depend on this new service yet (crucible and crucible-pantry). As these two zones need coordinated PRs with the crucible repo, I'd like to implement these in a follow up PR once this one is approved and merged.

@@ -18,15 +18,17 @@ if [[ $DATALINK == unknown ]] || [[ $GATEWAY == unknown ]]; then
fi

# TODO remove when https://github.com/oxidecomputer/stlouis/issues/435 is addressed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmpesp I see this issue has been closed, are we OK to remove this part now?

@jclulow
Copy link
Collaborator

jclulow commented Dec 14, 2023

I have had a quick squiz at the PR, and I think adding a Rust program to do the work is indeed much better than a shell script! Thanks for digging into this. Before looking too closely at the code, I have some architectural thoughts.

Separate SMF service

I think we should do the networking setup in a wholly separate SMF service. In the PR as it stands, we've replaced some direct ipadm invocations in the method script for the Cockroach DB SMF service -- but the moving parts are still essentially a part of the Cockroach DB service.

What I would sort of expect is to add a new SMF service, with just a default instance; e.g., svc:/oxide/network-setup:default. This would be enabled by default. It should be of the transient duration (see svc.startd(8)), exiting successfully when it completes setup. It should be idempotent, so that if it is restarted (whether after interruption or not) it does the right thing. It probably does not need to tear things down in a stop method, and it may not presently need to provide a functional refresh method at this time.

To the extent that we need to communicate any information to this service from outside (e.g., addresses or subnets or interface names) we would do so via properties in the default service instance. The name, value, type, and behaviour of each of those properties (and indeed the FMRI of the whole service instance) would essentially become a cross-version stable interface; if sled agent needs to populate them, we should commit to them so that the version of sled agent and the version of the zone image aren't locked together.

I suspect this new service would depend on the existing svc:/network/physical service that ships with the operating system. We would probably then make the OS milestone service, svc:/milestone/network, depend on our new service here. Then we would redo any other services, like Cockroach DB, that probably don't want to start until the network is configured, such that they depend on the network milestone. That's roughly what we do with a similar piece of software, the illumos metadata agent, except in a different context -- the metadata agent is a replacement for cloud-init, and runs in the global zone of virtualised guests on cloud platforms. In this case, the network setup service is running inside a non-global zone on top of an Omicron-managed system.

No shell at all

I don't think we need to use a shell script as the start/stop method for the new service. I would restructure the new Rust program to be invoked as the start method directly, rather than invoked several times from a new shell script. I also wouldn't do any argument construction or parsing, except for perhaps passing the method name (see %m in Method Tokens of smf_method(7)), but instead just use the smf crate (which uses libscf(3LIB) directly) to access any properties that we need.


There is, I suspect, a lot to digest here. I've tried to link some manual pages that contain some of the details. When reading and writing SMF manifests, it may also be worth looking at the service_bundle(5) XML DTD, which is shipped in the OS and also in the source base; it contains a lot of comments about the structure of those files.

Definitely happy to answer any questions or talk more about this whenever it would be helpful! Let me know!

@karencfv
Copy link
Contributor Author

Hey @jclulow thanks a bunch for the detailed feedback! There's definitely a lot of Illumos knowledge that I'm missing here. I'm going to read the links you left me here, and I'll definitely take you up on that offer to catch up as I'll most likely have a lot of questions 😄

@karencfv
Copy link
Contributor Author

I think we should do the networking setup in a wholly separate SMF service. In the PR as it stands, we've replaced some direct ipadm invocations in the method script for the Cockroach DB SMF service -- but the moving parts are still essentially a part of the Cockroach DB service.

@jclulow, as I understand it, as it stands today, a new SMF service would have to be created by RSS. This would mean that we cannot have a new service running on the current racks until zones are provisioned by Nexus instead of RSS 😞. Hopefully I'm wrong about this, WDYT?

@jclulow
Copy link
Collaborator

jclulow commented Dec 14, 2023

I don't know much about RSS, but an SMF service is defined by an XML file that can just be shipped in the correct place in the zone image. If you put it in /lib/svc/manifest/site it ought, I expect, to be imported early enough to impose itself as a dependency of the network milestone. I believe manifests in that directory are imported by the early-manifest-import service at zone boot.

@jgallagher
Copy link
Contributor

@jclulow, as I understand it, as it stands today, a new SMF service would have to be created by RSS. This would mean that we cannot have a new service running on the current racks until zones are provisioned by Nexus instead of RSS 😞. Hopefully I'm wrong about this, WDYT?

There's a bit of term overloading here - we can't currently start new control plane services / service zones on deployed racks because only RSS sets them up today, but SMF services are a lower level thing. Sled agent already configures and starts various SMF services within many of the zones it starts (e.g., all the services within the switch zone, where the list of switch zone SMF services has grown over time without having to deal with RSS or control plane service management). I think the hard line is probably "does sled-agent have all the information it needs to configure this {SMF,control plane} service", which is usually/always true for "SMF service" and false for "control plane service".

@karencfv
Copy link
Contributor Author

Ah, perfect! Thanks for the clarification @jclulow @jgallagher 🙇‍♀️

@karencfv
Copy link
Contributor Author

karencfv commented Dec 21, 2023

@jclulow thanks a bunch for your help! I've updated the PR to reflect the feedback you gave me.

If this architecture makes more sense, I'll update the rest of the self assembled zones to follow this pattern and move the PR out of draft mode. Update: It wasn't much so I've updated everything and the PR is ready

Let me know what you think :)

@karencfv karencfv marked this pull request as ready for review December 21, 2023 23:37
Copy link
Contributor

@citrus-it citrus-it left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, it's going to make things a lot cleaner around self-assembly zones.
I have a few comments, some of which are just suggestions (Github doesn't make it easy to differentiate). If anything isn't clear or you don't agree, please let me know.

(and apologies for you now having to re-base this on top of my disable-sshd PR that just merged!)

illumos-utils/src/route.rs Show resolved Hide resolved
illumos-utils/src/route.rs Outdated Show resolved Hide resolved
sled-agent/src/services.rs Outdated Show resolved Hide resolved
smf/zone-network-setup/manifest.xml Outdated Show resolved Hide resolved
smf/zone-network-setup/method_script.sh Outdated Show resolved Hide resolved
@karencfv
Copy link
Contributor Author

karencfv commented Jan 9, 2024

Thanks for the review @citrus-it ! I think I've addressed all of your comments, let me know what you think :)

package-manifest.toml Outdated Show resolved Hide resolved
@citrus-it
Copy link
Contributor

Thanks for the review @citrus-it ! I think I've addressed all of your comments, let me know what you think :)

This looks good, thanks. Just added a comment about the package-manifest.toml.

I think it's fine to just have a single static address for now, since there's no consumer for multiple addresses.

Copy link
Contributor

@citrus-it citrus-it left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, this is looking good to me.

@karencfv
Copy link
Contributor Author

Thanks for the review @citrus-it!

Tiny ping for @jclulow , @davepacheco @jgallagher or @smklein to review on the rust side of things, and I'll merge this :)

@jclulow
Copy link
Collaborator

jclulow commented Jan 10, 2024

I will take a look soon!

@karencfv
Copy link
Contributor Author

heyo! 👋

Sorry for the pressure here I know there's so much going on! Would it be possible to get a review on the rust side of things? @citrus-it kindly went over the Illumos side of this PR, so there's less to review!

The thing is, I need this approved so I can move on with the rest of the work for the self assembled zones.

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

// If the entry is not found in the table,
// the exit status of the command will be 3 (ESRCH).
// When that is the case, we'll add the route.
Some(3) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a nitpick -- errno values aren't necessarily the same between systems. I think this value of 3 "happens to be" right, but could we use the constant from https://docs.rs/libc/0.2.152/libc/constant.ESRCH.html instead?

Comment on lines +1422 to +1439
fn zone_network_setup_install(
info: &SledAgentInfo,
zone: &InstalledZone,
static_addr: &String,
) -> Result<ServiceBuilder, Error> {
let datalink = zone.get_control_vnic_name();
let gateway = &info.underlay_address.to_string();

let mut config_builder = PropertyGroupBuilder::new("config");
config_builder = config_builder
.add_property("datalink", "astring", datalink)
.add_property("gateway", "astring", gateway)
.add_property("static_addr", "astring", static_addr);

Ok(ServiceBuilder::new("oxide/zone-network-setup")
.add_property_group(config_builder)
.add_instance(ServiceInstanceBuilder::new("default")))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rad, this looks great (totally makes sense to have this be a static config in zones before they start)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇‍♀️

@karencfv karencfv merged commit 3b4b935 into oxidecomputer:main Jan 15, 2024
17 checks passed
@karencfv karencfv deleted the zone-networking-service branch January 15, 2024 23:18
@karencfv
Copy link
Contributor Author

Thanks everyone!

karencfv added a commit to oxidecomputer/crucible that referenced this pull request Feb 20, 2024
oxidecomputer/omicron#4677 implements a new zone network configuration setup service so control plane services don't have to set this up themselves. This PR updates crucible to use said services.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants