[gateway] ingest sensor measurements from SPs into oximeter (#6354)
This branch adds code to the Management Gateway Service (MGS) for periodically polling sensor measurements from SPs and emitting them to Oximeter. In particular, this consists of:

- a task for managing the metrics endpoint, which waits until MGS knows its underlay network address before binding the endpoint and registering it with the control plane,
- tasks for polling sensor measurements from each individual SP that MGS knows about,
- a task that waits until SP discovery has completed and the rack ID is known, and then spawns a poller task for every discovered SP slot.

The SP poller tasks send samples to the Oximeter producer endpoint using a `tokio::sync::broadcast` channel, which I've chosen primarily because it can be used as a bounded ring buffer that actually overwrites the *oldest* value when the buffer is full. This way, we use a bounded amount of memory for samples, but prioritize the most recent samples if we have to throw anything away because Oximeter hasn't come along to collect them recently.

The poller tasks cache the component inventory and identifying information from the SP, so that we don't have to re-read all this data from the SP on every poll. While MGS, running on a host, would probably be fine with doing this, it seems better to avoid making the SP do unnecessary work at a 1Hz poll frequency, especially when *both* switch zones are polling it. Instead, every time we poll sensor data from an SP, we first ask it for its current state, and only invalidate our cached understanding of the SP when the state changes. This way, if an SP starts reporting new metrics due to a firmware update, or gets replaced with a different chassis with a new serial number, revision, etc., we won't continue to report metrics for stale targets, but we don't have to reload all of that once per second.
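
The drop-oldest buffering described above can be modeled without tokio. The following is a minimal, std-only sketch of that behavior, not the code in this branch: `DropOldest` and its methods are hypothetical names invented for illustration, and it models only the eviction policy, not the multi-consumer semantics of `tokio::sync::broadcast`.

```rust
use std::collections::VecDeque;

/// A bounded buffer that evicts the oldest item when full.
/// Hypothetical type for illustration; not part of MGS or tokio.
pub struct DropOldest<T> {
    capacity: usize,
    items: VecDeque<T>,
    /// Number of items discarded to make room for newer ones.
    dropped: u64,
}

impl<T> DropOldest<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be nonzero");
        Self {
            capacity,
            items: VecDeque::with_capacity(capacity),
            dropped: 0,
        }
    }

    /// Push a new item, discarding the oldest one if the buffer is full.
    pub fn push(&mut self, item: T) {
        if self.items.len() == self.capacity {
            self.items.pop_front();
            self.dropped += 1;
        }
        self.items.push_back(item);
    }

    /// Take everything currently buffered, oldest first.
    pub fn drain(&mut self) -> Vec<T> {
        self.items.drain(..).collect()
    }

    /// How many items have been evicted so far.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

Memory use is bounded by `capacity`, and a slow collector loses the oldest samples rather than the newest. With the real broadcast channel, a receiver that falls behind observes the loss as a `RecvError::Lagged` error and then resumes from the oldest value still retained.
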
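
The cache invalidation described above could be sketched roughly as follows. This is an illustration under assumptions, not the poller in this branch: `SpState`, `Sp`, `Poller`, and `FakeSp` are hypothetical names, and the real SP state and inventory carry more information than shown here.

```rust
/// Hypothetical stand-in for the state an SP reports about itself.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SpState {
    pub serial: String,
    pub revision: u32,
    pub firmware_version: String,
}

/// Hypothetical interface for the two SP requests the poller needs.
pub trait Sp {
    /// Cheap request: the SP's current state.
    fn state(&mut self) -> SpState;
    /// Expensive request: the full component inventory.
    fn inventory(&mut self) -> Vec<String>;
}

/// Caches the inventory together with the state it was read under.
pub struct Poller {
    cached: Option<(SpState, Vec<String>)>,
}

impl Poller {
    pub fn new() -> Self {
        Self { cached: None }
    }

    /// Return the component inventory, re-reading it from the SP only
    /// when the SP's reported state differs from the cached state.
    pub fn components<S: Sp>(&mut self, sp: &mut S) -> &[String] {
        let current = sp.state();
        let stale = match &self.cached {
            Some((state, _)) => *state != current,
            None => true,
        };
        if stale {
            let inventory = sp.inventory();
            self.cached = Some((current, inventory));
        }
        &self.cached.as_ref().expect("cache was just populated").1
    }
}

/// Test double that counts how often the expensive read happens.
pub struct FakeSp {
    pub state: SpState,
    pub inventory_reads: u32,
}

impl Sp for FakeSp {
    fn state(&mut self) -> SpState {
        self.state.clone()
    }

    fn inventory(&mut self) -> Vec<String> {
        self.inventory_reads += 1;
        vec![format!("dev-0@{}", self.state.firmware_version)]
    }
}
```

Every poll pays for one cheap state request, and the expensive inventory read happens only on the first poll and after a state change such as a firmware update or a chassis swap.
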
To detect scenarios where the SP's state and/or identity has changed in the midst of polling its sensors (which may result in mislabeled metrics), we check whether the SP's state at the end of the poll matches its state at the beginning, and if it does not, we poll again immediately with its new identity.

At present, the timestamps for these metric samples are generated by MGS: each one is the time when MGS received the sensor data from the SP, as MGS understands it. Because we don't currently collect data that was recorded prior to the switch zone coming up, we don't need to worry about figuring out timestamps for data recorded by the SP prior to the existence of a wall clock. Figuring out the SP/MGS timebase synchronization is probably a lot of additional work, although it would be nice to do in the future. At present, [metrics emitted by sled-agent prior to NTP sync will also be from 1987][1], so I think it's fine to do something similar here, especially because the potential solutions to that [also have their fair share of tradeoffs][2].

The new metrics use a schema in `oximeter/oximeter/schema/hardware-component.toml`. The target of these metrics is a `hardware_component` that includes:

- the rack ID and the identity of the MGS instance that collected the metric,
- information identifying the chassis[^1] and the SP that recorded them (its serial number, model number, revision, and whether it's a switch, a sled, or a power shelf),
- the SP's Hubris archive version (since the reported sensor data may change in future firmware releases),
- the SP's ID for the hardware component (e.g. "dev-7"), the kind of device (e.g. "tmp117", "max5970"), and the human-readable description (e.g. "Southeast temperature sensor", "U.2 Sharkfin A hot swap controller", etc.) reported by the SP.

Each kind of sensor reading has an individual metric (`hardware_component:temperature`, `hardware_component:current`, `hardware_component:voltage`, and so on).
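
As a rough picture of the target and per-kind metrics described above, here is a sketch in plain Rust. The type and field names are assumptions made for illustration and are not taken from `hardware-component.toml`; only the three metric names quoted in the text are used, and the real schema defines more.

```rust
/// The chassis kinds named in the text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ChassisKind {
    Switch,
    Sled,
    PowerShelf,
}

/// Hypothetical mirror of the `hardware_component` target fields.
/// Field names are illustrative, not the schema's actual names.
#[derive(Clone, Debug)]
pub struct HardwareComponent {
    pub rack_id: String,
    pub gateway_id: String,
    pub chassis_kind: ChassisKind,
    pub chassis_serial: String,
    pub chassis_model: String,
    pub chassis_revision: u32,
    pub hubris_archive_version: String,
    pub component_id: String,   // e.g. "dev-7"
    pub component_kind: String, // e.g. "tmp117"
    pub description: String,    // e.g. "Southeast temperature sensor"
}

/// The reading kinds named explicitly in the text (a subset).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReadingKind {
    Temperature,
    Current,
    Voltage,
}

impl ReadingKind {
    /// The metric that carries readings of this kind.
    pub fn metric_name(self) -> &'static str {
        match self {
            ReadingKind::Temperature => "hardware_component:temperature",
            ReadingKind::Current => "hardware_component:current",
            ReadingKind::Voltage => "hardware_component:voltage",
        }
    }
}
```
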
These metrics are labeled with the SP-reported name of the individual sensor measurement channel. For instance, a MAX5970 hotswap controller on sharkfin will have a voltage and current metric named "V12_U2A_A0" for the 12V rail, and a voltage and current metric named "V3P3_U2A_A0" for the 3.3V rail.

Finally, a `hardware_component:sensor_errors` metric records sensor errors reported by the SP, labeled with the sensor name, what kind of sensor it is, and a string representation of the error.

[1]: #6354 (comment)
[2]: #6354 (comment)

[^1]: I'm using "chassis" as a generic term to refer to "switch, sled, or power shelf".
Showing 28 changed files with 1,990 additions and 33 deletions.