[sled agent] Fixes to enable the reboot after RSS initialization #3466
Sean this is awesome. I'm very excited to see this go in. Just one question for my understanding purposes.
} else {
    // If the underlay doesn't exist, no routing occurs.
    info!(
Will the switch zone eventually get an underlay address and add a route?
Good question, and the answer is: Yes, it will.
When we first boot up, the BootstrapAgent is responsible for "hardware monitoring", which starts the switch zone iff the Tofino is detected:
omicron/sled-agent/src/bootstrap/agent.rs
Lines 397 to 404 in 3e020c9
let hardware_monitor = Self::hardware_monitor(
    &ba_log,
    &config.link,
    &sled_config,
    global_zone_bootstrap_link_local_address,
    storage_key_requester.clone(),
)
.await?;
This calls HardwareMonitor::new, which itself has a reference to the ServiceManager structure:
omicron/sled-agent/src/bootstrap/hardware.rs
Line 211 in 92258b4
let service_manager = ServiceManager::new(
(Note: this is the same ServiceManager structure that the SledAgent will later use to manage services once we've fully booted.)
The bootstrap's HardwareManager spawns a tokio task which monitors hardware for the Tofino, and in response, launches the switch zone:
omicron/sled-agent/src/bootstrap/hardware.rs
Lines 72 to 79 in 92258b4
let baseboard = self.hardware.baseboard();
let switch_zone_ip = None;
if let Err(e) = self.services.activate_switch(
    switch_zone_ip,
    baseboard,
).await {
    warn!(self.log, "Failed to activate switch: {e}");
}
It's important to note: the let switch_zone_ip = None line is critical. If the bootstrap agent is launching the switch zone, it might be doing so before the underlay is up.
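To make the role of that None concrete, here is a minimal sketch of how an optional underlay address could decide whether any routing gets set up. NetworkSetup and plan_network_setup are hypothetical names made up for illustration; they are not part of the sled agent, and the real logic lives in the ServiceManager.

```rust
use std::net::Ipv6Addr;

/// Hypothetical summary of the networking work to do inside the switch zone.
#[derive(Debug, PartialEq)]
enum NetworkSetup {
    /// The underlay is not up yet: bootstrap network only, no routing occurs.
    BootstrapOnly,
    /// The underlay is available: add this address and a route.
    Underlay { address: Ipv6Addr },
}

/// Hypothetical helper: `None` means "launched before the underlay exists".
fn plan_network_setup(switch_zone_ip: Option<Ipv6Addr>) -> NetworkSetup {
    match switch_zone_ip {
        Some(address) => NetworkSetup::Underlay { address },
        None => NetworkSetup::BootstrapOnly,
    }
}
```

Under this sketch, the bootstrap agent's call corresponds to plan_network_setup(None), so the zone comes up without an underlay address or route.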
Later on, when the BootstrapAgent tries to launch the SledAgent, it stops monitoring hardware:
omicron/sled-agent/src/bootstrap/agent.rs
Lines 526 to 527 in 92258b4
// Stop the bootstrap agent from monitoring for hardware, and
// pass control of service management to the sled agent.
And when the SledAgent starts, it does hardware monitoring for itself:
omicron/sled-agent/src/sled_agent.rs
Lines 336 to 340 in 92258b4
// Begin monitoring the underlying hardware, and reacting to changes.
let sa = sled_agent.clone();
tokio::spawn(async move {
    sa.hardware_monitor_task(log).await;
});
If the switch zone has already launched, the SledAgent will call activate_switch with the underlay IP address:
omicron/sled-agent/src/sled_agent.rs
Lines 353 to 365 in 92258b4
let scrimlet = self.inner.hardware.is_scrimlet_driver_loaded();
if scrimlet {
    let baseboard = self.inner.hardware.baseboard();
    let switch_zone_ip = Some(self.inner.switch_zone_ip());
    if let Err(e) = self
        .inner
        .services
        .activate_switch(switch_zone_ip, baseboard)
        .await
    {
        warn!(log, "Failed to activate switch: {e}");
    }
(This is poking at the same ServiceManager used by the bootstrap agent earlier!)
This case is handled explicitly: if the switch zone is already running, it is refreshed to add an underlay address and route. It's the same path we use during cold boot / RSS!
omicron/sled-agent/src/services.rs
Lines 2742 to 2748 in 2d46f8c
(SledLocalZone::Running { request, zone }, Some(new_request))
    if request.addresses != new_request.addresses =>
{
    // If the switch zone is running but we have new addresses, it
    // means we're moving from the bootstrap to the underlay
    // network. We need to add an underlay address and route in the
    // switch zone, so dendrite can communicate with nexus.
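As a rough, self-contained model of the hand-off described above, the sketch below reduces the zone state to two variants and returns what should happen on each activation. SwitchZoneRequest, SwitchZone, Action, and activate are simplified stand-ins invented for this illustration; the actual SledLocalZone type and match in services.rs carry more state and more arms.

```rust
use std::net::Ipv6Addr;

/// Hypothetical, simplified request: the addresses the switch zone should have.
#[derive(Clone, Debug, PartialEq)]
struct SwitchZoneRequest {
    addresses: Vec<Ipv6Addr>,
}

/// Hypothetical, simplified stand-in for the tracked zone state.
#[derive(Debug, PartialEq)]
enum SwitchZone {
    Disabled,
    Running { request: SwitchZoneRequest },
}

/// What the caller should do as a result of an activation request.
#[derive(Debug, PartialEq)]
enum Action {
    Launch,
    AddUnderlayAddressAndRoute,
    Nothing,
}

/// Reconcile the current zone state with a new activation request.
fn activate(zone: SwitchZone, new_request: SwitchZoneRequest) -> (SwitchZone, Action) {
    match zone {
        // First activation (e.g. from the bootstrap agent): launch the zone.
        SwitchZone::Disabled => {
            (SwitchZone::Running { request: new_request }, Action::Launch)
        }
        // Already running, but the addresses changed: we are moving from the
        // bootstrap network to the underlay, so refresh instead of relaunching.
        SwitchZone::Running { request } if request.addresses != new_request.addresses => (
            SwitchZone::Running { request: new_request },
            Action::AddUnderlayAddressAndRoute,
        ),
        // Already running with the same addresses: nothing to do.
        SwitchZone::Running { request } => {
            (SwitchZone::Running { request }, Action::Nothing)
        }
    }
}
```

In this model, the bootstrap agent's activation with only a bootstrap address yields Launch, the sled agent's later activation with the underlay address added yields AddUnderlayAddressAndRoute, and repeating that request is a no-op.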
Aha! It was those last two steps I was missing. Thanks for pointing this out!
This PR consists of changes to the Sled Agent to allow internal services to re-launch successfully after a reboot.
This was tested with the following script, on an initialized system running with SoftNPU:
Before this PR
This PR
The start_rack_initialize request (sent if an RSS configuration file is included) no longer throws an error if RSS has already been initialized.
Fixes #3461
Fixes #3106
Part of #725