[gateway] ingest sensor measurements from SPs into oximeter #6354
Conversation
Force-pushed from 021d9ce to 62c1e7c
While working on #6354, it turned out that having to manually assign these was really easy to mess up when you want to simulate a lot of sensors, and there wasn't an easy way to get "real" ID values from the inventory of a real-life SP, so the config writer has to assign them sequentially anyway. #6354 ended up using `component_details` rather than `read_sensor`, so being able to assign sensor IDs identical to those from Hubris turned out not to be that important, since (IIUC) the only way to get the `SensorId`s is to look at the Hubris archive, while *component* IDs can be discovered dynamically from the inventory APIs.
Hmm, testing on
It appears that the inventory contains non-present SPs for every slot, and we are endlessly starting new pollers for them. Whoopsie.
Also, it looks like we're getting 500 errors from Nexus whilst attempting to register the Oximeter producer. That seems weird...
Second thing appears to be because Nexus doesn't understand the "management_gateway" producer type. Seems like an oversight on my part:
Hmm, it looks like the OpenAPI specs for both do include it, though (lines 281 to 287 in b87dc22, and omicron/openapi/nexus-internal.json lines 4377 to 4383 in b87dc22). So, I'm not really sure what's gone wrong there...
Force-pushed from a38faf5 to fe35438
hey look, it works! (on
All the output from that OxQL query:
I meant to mention this after your wonderful demo on Friday, so thanks for the demo and apologies for the delay! I think it'd help to have a more descriptive target name than
Happy to bikeshed a name with you if that's helpful, and I'm also sure others have better ideas than me, especially @rmustacc. The naming schema in
@bnaecker I think you're right that a more descriptive name is in order --- I had originally called it

Of your recommendations, I'm inclined to go with

Incidentally, you'll note that in the current schema, the actual sensor name is a field on the metrics, rather than the component, so that components that provide multiple sensor readings share a target. Now that I think of it, perhaps the
I might also rename the
I think that makes sense! It gives us flexibility to add non-sensor metrics, and is also more accurate right now.
Yeah, I agree with you, and would vote for
At present, it's just
My inclination would be to add

I asked in chat during your demo, but didn't get a good answer: what does the `component` field mean, exactly?
The component is a unique identifier of the individual component on the board. For instance, if we look at the components discovered from a Real Life Gimlet on `jeeves`:

```
eliza@jeeves ~ $ faux-mgs --interface=madrid_sw0tp0 --discovery-addr='[fe80::aa40:25ff:fe04:147%15]:11111' inventory
Aug 20 22:02:51.993 INFO creating SP handle on interface madrid_sw0tp0, component: faux-mgs
Aug 20 22:02:52.025 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:147%15]:11111, interface: madrid_sw0tp0, component: faux-mgs
COMPONENT STATUS DEVICE DESCRIPTION (CAPABILITIES)
sp Present sp Service Processor (DeviceCapabilities(1))
sp3-host-cpu Present sp3-host-cpu Gimlet SP3 host cpu (DeviceCapabilities(4))
host-boot-flash Present host-boot-flash Gimlet host boot flash (DeviceCapabilities(1))
system-led Present system-led System attention LED (DeviceCapabilities(8))
dev-0 Present tmp117 Southwest temperature sensor (DeviceCapabilities(2))
dev-1 Present tmp117 South temperature sensor (DeviceCapabilities(2))
dev-2 Present tmp117 Southeast temperature sensor (DeviceCapabilities(2))
dev-3 Present pca9545 U.2 ABCD mux (DeviceCapabilities(0))
dev-4 Present pca9545 U.2 EFGH mux (DeviceCapabilities(0))
dev-5 Present pca9545 U.2 IJ/FRUID mux (DeviceCapabilities(0))
dev-6 Present at24csw080 U.2 Sharkfin A VPD (DeviceCapabilities(0))
dev-7 Present max5970 U.2 Sharkfin A hot swap controller (DeviceCapabilities(2))
dev-8 Present nvme_bmc U.2 A NVMe Basic Management Command (DeviceCapabilities(2))
dev-9 Present at24csw080 U.2 Sharkfin B VPD (DeviceCapabilities(0))
dev-10 Present max5970 U.2 Sharkfin B hot swap controller (DeviceCapabilities(2))
dev-11 Present nvme_bmc U.2 B NVMe Basic Management Control (DeviceCapabilities(2))
dev-12 Present at24csw080 U.2 Sharkfin C VPD (DeviceCapabilities(0))
dev-13 Present max5970 U.2 Sharkfin C hot swap controller (DeviceCapabilities(2))
dev-14 Present nvme_bmc U.2 C NVMe Basic Management Control (DeviceCapabilities(2))
dev-15 Present at24csw080 U.2 Sharkfin D VPD (DeviceCapabilities(0))
dev-16 Present max5970 U.2 Sharkfin D hot swap controller (DeviceCapabilities(2))
dev-17 Present nvme_bmc U.2 D NVMe Basic Management Control (DeviceCapabilities(2))
dev-18 Present at24csw080 U.2 Sharkfin E VPD (DeviceCapabilities(0))
dev-19 Present max5970 U.2 Sharkfin E hot swap controller (DeviceCapabilities(2))
dev-20 Present nvme_bmc U.2 E NVMe Basic Management Control (DeviceCapabilities(2))
dev-21 Present at24csw080 U.2 Sharkfin F VPD (DeviceCapabilities(0))
dev-22 Present max5970 U.2 Sharkfin F hot swap controller (DeviceCapabilities(2))
dev-23 Present nvme_bmc U.2 F NVMe Basic Management Control (DeviceCapabilities(2))
dev-24 Present at24csw080 U.2 Sharkfin G VPD (DeviceCapabilities(0))
dev-25 Present max5970 U.2 Sharkfin G hot swap controller (DeviceCapabilities(2))
dev-26 Present nvme_bmc U.2 G NVMe Basic Management Control (DeviceCapabilities(2))
dev-27 Present at24csw080 U.2 Sharkfin H VPD (DeviceCapabilities(0))
dev-28 Present max5970 U.2 Sharkfin H hot swap controller (DeviceCapabilities(2))
dev-29 Present nvme_bmc U.2 H NVMe Basic Management Control (DeviceCapabilities(2))
dev-30 Present at24csw080 U.2 Sharkfin I VPD (DeviceCapabilities(0))
dev-31 Present max5970 U.2 Sharkfin I hot swap controller (DeviceCapabilities(2))
dev-32 Present nvme_bmc U.2 I NVMe Basic Management Control (DeviceCapabilities(2))
dev-33 Present at24csw080 U.2 Sharkfin J VPD (DeviceCapabilities(0))
dev-34 Present max5970 U.2 Sharkfin J hot swap controller (DeviceCapabilities(2))
dev-35 Present nvme_bmc U.2 J NVMe Basic Management Control (DeviceCapabilities(2))
dev-36 Present at24csw080 Gimlet VPD (DeviceCapabilities(0))
dev-37 Present pca9545 M.2 mux (DeviceCapabilities(0))
dev-38 Present at24csw080 Fan VPD (DeviceCapabilities(0))
dev-39 Present tmp451 T6 temperature sensor (DeviceCapabilities(2))
dev-40 Present tps546b24a A2 3.3V rail (DeviceCapabilities(2))
dev-41 Present tps546b24a A0 3.3V rail (DeviceCapabilities(2))
dev-42 Present tps546b24a A2 5V rail (DeviceCapabilities(2))
dev-43 Present tps546b24a A2 1.8V rail (DeviceCapabilities(2))
dev-44 Present max5970 M.2 hot plug controller (DeviceCapabilities(2))
dev-45 Present sbrmi CPU via SB-RMI (DeviceCapabilities(0))
dev-46 Present sbtsi CPU temperature sensor (DeviceCapabilities(2))
dev-47 Present idt8a34003 Clock generator (DeviceCapabilities(0))
dev-48 Present raa229618 CPU power controller (DeviceCapabilities(2))
dev-49 Present raa229618 SoC power controller (DeviceCapabilities(2))
dev-50 Present isl68224 DIMM/SP3 1.8V A0 power controller (DeviceCapabilities(2))
dev-51 Present adm1272 Fan hot swap controller (DeviceCapabilities(2))
dev-52 Present adm1272 Sled hot swap controller (DeviceCapabilities(2))
dev-53 Present max31790 Fan controller (DeviceCapabilities(2))
dev-54 Present tps546b24a T6 power controller (DeviceCapabilities(2))
dev-55 Present tmp117 Northeast temperature sensor (DeviceCapabilities(2))
dev-56 Present tmp117 North temperature sensor (DeviceCapabilities(2))
dev-57 Present tmp117 Northwest temperature sensor (DeviceCapabilities(2))
dev-58 Present bmr491 Intermediate bus converter (DeviceCapabilities(2))
dev-59 Present tse2004av DIMM A0 (DeviceCapabilities(2))
dev-60 Present tse2004av DIMM A1 (DeviceCapabilities(2))
dev-61 Present tse2004av DIMM B0 (DeviceCapabilities(2))
dev-62 Present tse2004av DIMM B1 (DeviceCapabilities(2))
dev-63 Present tse2004av DIMM C0 (DeviceCapabilities(2))
dev-64 Present tse2004av DIMM C1 (DeviceCapabilities(2))
dev-65 Present tse2004av DIMM D0 (DeviceCapabilities(2))
dev-66 Present tse2004av DIMM D1 (DeviceCapabilities(2))
dev-67 Present tse2004av DIMM E0 (DeviceCapabilities(2))
dev-68 Present tse2004av DIMM E1 (DeviceCapabilities(2))
dev-69 Present tse2004av DIMM F0 (DeviceCapabilities(2))
dev-70 Present tse2004av DIMM F1 (DeviceCapabilities(2))
dev-71 Present tse2004av DIMM G0 (DeviceCapabilities(2))
dev-72 Present tse2004av DIMM G1 (DeviceCapabilities(2))
dev-73 Present tse2004av DIMM H0 (DeviceCapabilities(2))
dev-74 Present tse2004av DIMM H1 (DeviceCapabilities(2))
dev-75 Unavailable m2_hp_only M.2 A NVMe Basic Management Command (DeviceCapabilities(2))
dev-76 Unavailable m2_hp_only M.2 B NVMe Basic Management Command (DeviceCapabilities(2))
eliza@jeeves ~ $
```

Note that there are a bunch of entries with the same `device` (e.g. the various `tmp117` temperature sensors and `max5970` hot swap controllers).
Thanks @hawkw, that's some helpful context! I'm looking at the output you shared. Under the column labeled "Description," I see labels which do uniquely identify all the entries. Are those something we have access to, and could provide in a field for these timeseries? Or is that awkward or impossible to get at the point we generate the data?

This isn't a huge issue. If we need a unique-ifying key, and that's what we have, then let's use it. But it does appear to me that the component IDs are something of a Hubris implementation detail.

Thanks for entertaining my questions :)
Not sure if this is my fault, or a test flake: https://buildomat.eng.oxide.computer/wg/0/details/01J5RV4BWD351SBNY9T9XRJKHY/m0GQSxegFXQysQ1iiYDzeGpXvP15C3kRyedbTzT6EDwTqdqF/01J5RV4Q4DCXC4292E8E0G30RD#S5060 Will investigate further.
@bnaecker These are great questions, thanks!
We do also have access to the "description" field in MGS, and we could add it to the timeseries labels as well. I didn't originally, because I figured the descriptions could be hydrated from the component ID later using the inventory collected by Nexus. And, since the component ID is how the SP identifies the component in its RPC interface, it's guaranteed to be unique, while the description fields just happen to be unique.

I agree with your point that the component IDs are less user-friendly, since they don't actually describe what the thing is. But, they are also included in Nexus' inventory collection. And, I would argue they're not a Hubris implementation detail, in the sense that they're not totally internal to Hubris; they are an implementation detail of the Hubris <--> MGS communication protocol. For instance, if you wanted to use the MGS HTTP API to ask about a particular component, you would refer to it by this ID.

I'd like to keep the component IDs in the schema, since they are guaranteed to be unique relative to the SP's other components (while the description string is not), and because they might be cross-referenced against inventory collection or the MGS HTTP API. But, with that said, I think it would be fine to include the description field in the metric target as well! Including it is probably a good idea. I didn't because I thought it was extra data to store in Clickhouse, and it could theoretically be hydrated from the inventory collection later...but we don't currently have a mechanism for doing that, so maybe we're better off throwing it in? I guess, since Clickhouse is a TSDB, target fields don't comprise most of the data it stores anyway, so maybe it doesn't matter that much?
This is a good point! If the descriptions are not guaranteed to be unique, then we definitely should not use them to identify the timeseries. And the fact that we already collect them in the inventory is really good to know! That means someone can always correlate the data with the collection to get more information about the component itself. I agree that it's not a Hubris implementation detail anymore; we're pulling it up into the control plane.
I think this is a good idea too. Correlating the data today would have to be done by hand, using

And yeah, these fields will all be deduplicated over time. ClickHouse will periodically merge parts of the table to sort and compress them, and for these tables it keeps the most recent row with the same sorting key. So we will ultimately end up with one row per timeseries, per field, a minor cost!
@bnaecker regarding naming, do you have a preference between the following schemes:
I went with
I've got a mild preference for (2), but this also seems fine to me since it's unlikely to matter if one mentally applies the description to something other than the component itself! I'm also an outlier on the willingness-to-type-long-names axis :) I'll take a more detailed look at the code soon, though I trust your code is all great. The fact that we see the data we want in OxQL is excellent!
oh no

Ok, so, this test is super finicky, and I've gone back and reworked it a handful of times to make it not flaky. But the implementation is still super fragile, and it makes assumptions about who's talking to the simulated SP at various points during the test, so I suspect that if this branch causes some "extra" (from that test's perspective) traffic to the sim SP, it's throwing off the bits that are trying to step through progress values.

Given that, by lines of code, the test is probably like 30% test and 70% "make the test not flaky", maybe we should just delete the test, or maybe replace it with a unit test? Alternatively, if it were easy to tell
So, I'm happy to add the option to disable the metrics subsystem, and set that option for this test. I'm a bit concerned that if the metrics stuff does break update behavior in real life, we wouldn't catch it there. But, for now, I'll go ahead and do that.
If this turns out to be more painful than it sounds, I'd lean toward removing the test (and filing an issue to add some unit tests to cover what it was checking;
As suggested by @jgallagher in #6354 (comment)
Nah, disabling the metrics stuff from the config file is pretty easy, so I'm going to do that for now. My concern is more that, if there's a possibility that having the metrics stuff running can break SP updates (e.g. because the SP is kept too busy to handle the update requests), I'd really like to be able to catch that...
For the specific example there (keeping the SP too busy), I don't think this test is valid anyway, because it's hitting the simulated SP which has a lot more oomph than a real one.
I realized the test could read the actual SP simulator config file, and figure out how many metrics should be present. That way, if someone changes the config file later, it'll still work.
Yeah, I was kinda worried when I saw it was failing, but that quickly went from "worried" to "surprised" because the SP sim shouldn't get overloaded...unless the actual UDP packets were getting dropped? I would still like to know what's going on there. But, for now, I just made it turn off metrics collection.
When I said the test was fragile, I really meant it... the test hooks into the simulated SP and limits the number of packets it can receive so that it can inspect intermediate state, and it does that based on the number of packets the test expects the SP to receive at various points. If there's extra traffic the test doesn't know about, the simulated SP will certainly give the appearance of dropped packets (because it will ignore them!).
Ah, okay, yeah, that explains why I would have broken it (and is kinda ghastly!). I feel okay about just turning off the metrics stuff during that test.
there's really no reason to have a separate `dev` struct inside another layer of Option...
Poking around a bit on
I'd only deal with this if it's very easy. We already produce data from 1986 from the sled agents, so this won't be the only thing. Also, many of these are gauges, and we're almost always interested in the very recent past, in which case we'll likely only see timestamps from after sync. OTOH, if you want to put a tiny check somewhere that's, "If not sync, or this sample has a time < 2000, don't produce anything," then far be it from me to stand in your way :)
I think if we're not being fastidious about avoiding samples from The Land Before Timesync in other components, my preference would be not to worry about it, because the potential solutions I can think of involve collecting fewer data points. The easiest solution, IMO, is to make the metrics code wait until it's not 1986 anymore before it starts polling SPs, which seems like a bit of a bummer, because then we're just not collecting data at all. Alternatively, we could buffer all the data points along with a monotonic timestamp until we see a date that's in this millennium, and then go back and update their timestamps, like

So my preference for now is to accept that samples from prior to NTP sync will be listening to "Rock Me Amadeus" and watching Top Gun, if the alternative is either buffering them forever or throwing them away. We can always go back and implement one of those strategies later, if data from 1986 turns out to be too egregious, but we should probably also do something similar for sled-agent's metrics...
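For reference, the tiny check suggested above really would be small; here is a minimal sketch (not implemented in this branch; `plausibly_synced` is a made-up helper name):

```rust
use chrono::{DateTime, TimeZone, Utc};

/// Returns true if `timestamp` could plausibly have been taken after time
/// sync, i.e. it isn't from The Land Before Timesync.
fn plausibly_synced(timestamp: DateTime<Utc>) -> bool {
    timestamp >= Utc.with_ymd_and_hms(2000, 1, 1, 0, 0, 0).unwrap()
}

fn main() {
    let sample_time = Utc::now();
    if plausibly_synced(sample_time) {
        println!("produce the sample");
    } else {
        println!("skip it: recorded before NTP sync (hello, 1986)");
    }
}
```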
Now that #6354 has added an Oximeter producer endpoint to MGS for publishing SP sensor metrics, it seemed like a nice idea to also instrument the MGS HTTP server, similar to the existing instrumentation for other control plane services. I don't think we'll be doing a lot of tuning of MGS performance, but the metrics seem like they could still be useful because they also include the distribution of HTTP status codes, and in many cases, the latency measurements also serve as a proxy for how long it takes the *SP* to perform a certain operation, which could be a valuable signal. This commit adds an `oximeter_instruments::http::LatencyTracker` to the MGS HTTP servers. To test that it works, I started a local Clickhouse and a standalone Oximeter, and ran MGS and the SP simulator using `cargo xtask mgs-dev run`. Then, I made a few HTTP requests to various MGS APIs using `curl`; most of which were expected to succeed, and a few for SP slots that the simulator wasn't configured to simulate a SP in (to ensure that the request would fail). We can see the new metrics in OxQL: ``` 0x〉\d hardware_component:current hardware_component:fan_speed hardware_component:sensor_error_count hardware_component:temperature hardware_component:voltage http_service:request_latency_histogram oximeter_collector:collections 0x〉get http_service:request_latency_histogram | last 1 http_service:request_latency_histogram id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: sp_get status_code: 200 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125731231]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 1, 0.0006000000000000001: 1, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000556233, max: 0.000603704, mean: 0.0005799685000000001, std_dev: 0.00002373549999999997, p50: 0, p90: 0.000603704, p99: 0.000603704] id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: ignition_list status_code: 200 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125290346]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 
0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 1, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000427249, max: 0.000427249, mean: 0.000427249, std_dev: 0, p50: 0, p90: 0.000427249, p99: 0.000427249] id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: sp_get status_code: 400 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.126114126]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 2, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000020368, max: 0.000021581, mean: 0.0000209745, std_dev: 0.0000006064999999999992, p50: 0, p90: 0.000021581, p99: 0.000021581] 0x〉exit ```
This branch adds code to the Management Gateway Service for periodically polling sensor measurements from SPs and emitting them to Oximeter. In particular, this consists of:

- an Oximeter metrics producer endpoint, which uses MGS's underlay network address to bind the endpoint and register it with the control plane,
- a task that manages the set of SPs that MGS knows about,
- and a task that waits for the SP inventory to be known, and then spawns a poller task for every discovered SP slot (see the rough sketch after this list).
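As a rough illustration of that last piece (a sketch only, not the actual gateway code; `SpSlot`, `Sample`, and `poll_sp` are made-up stand-ins), spawning one poller task per discovered slot might look like:

```rust
// Sketch only: one tokio task per discovered SP slot, each sending samples
// into a shared broadcast channel. All names here are illustrative.
use std::collections::HashSet;
use std::time::Duration;
use tokio::sync::broadcast;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SpSlot(u8);

#[derive(Clone, Debug)]
struct Sample(String);

async fn poll_sp(slot: SpSlot, tx: broadcast::Sender<Sample>) {
    let mut interval = tokio::time::interval(Duration::from_secs(1));
    loop {
        interval.tick().await;
        // The real poller would read the SP's sensors here; `send` only
        // fails if there are currently no receivers, which we can ignore.
        let _ = tx.send(Sample(format!("sample from slot {}", slot.0)));
    }
}

fn spawn_pollers(
    discovered: &[SpSlot],
    known: &mut HashSet<SpSlot>,
    tx: &broadcast::Sender<Sample>,
) {
    for &slot in discovered {
        // Spawn exactly one poller task per newly discovered slot.
        if known.insert(slot) {
            tokio::spawn(poll_sp(slot, tx.clone()));
        }
    }
}
```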
The SP poller tasks send samples to the Oximeter producer endpoint using a `tokio::sync::broadcast` channel, which I've chosen primarily because it can be used as a bounded ring buffer that overwrites the oldest value when the buffer is full. This way, we use a bounded amount of memory for samples, but prioritize the most recent samples if we have to throw anything away because Oximeter hasn't come along to collect them recently.
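As a tiny standalone illustration of that ring-buffer behavior (a sketch, not the MGS code): a receiver that falls behind sees a `Lagged` error telling it how many old samples were overwritten, and then receives only the newest ones.

```rust
use tokio::sync::broadcast::{self, error::TryRecvError};

fn main() {
    // Capacity of 4: only the 4 most recent values are retained.
    let (tx, mut rx) = broadcast::channel::<u32>(4);

    // Send more samples than the channel can hold before the receiver
    // (standing in for Oximeter) gets around to collecting them.
    for sample in 0..10u32 {
        tx.send(sample).unwrap();
    }

    loop {
        match rx.try_recv() {
            // First, the receiver learns how many old samples were dropped...
            Err(TryRecvError::Lagged(missed)) => {
                println!("overwrote {missed} oldest samples");
            }
            // ...then it receives only the newest samples (6 through 9).
            Ok(sample) => println!("got sample {sample}"),
            Err(TryRecvError::Empty) | Err(TryRecvError::Closed) => break,
        }
    }
}
```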
The poller tasks cache the component inventory and identifying information from the SP, so that we don't have to re-read all of this data from the SP on every poll. While MGS, running on a host, would probably be fine doing this, it seems better to avoid making the SP do unnecessary work at a 1 Hz poll frequency, especially when both switch zones are polling it. Instead, every time we poll sensor data from an SP, we first ask it for its current state, and only invalidate our cached understanding of the SP when the state changes. This way, if an SP starts reporting new metrics due to a firmware update, or gets replaced with a different chassis with a new serial number, revision, etc., we won't continue to report metrics for stale targets, but we also don't have to reload all of that once per second. To detect scenarios where the SP's state and/or identity has changed in the midst of polling its sensors (which could result in mislabeled metrics), we check whether the SP's state at the end of the poll matches its state at the beginning, and if it doesn't, we poll again immediately with its new identity.
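Roughly, each poller's loop follows the shape below (a simplified sketch; `Sp`, `SpState`, `Inventory`, and `Sample` are illustrative stand-ins, not the actual gateway types):

```rust
// Sketch of the cache-invalidation strategy described above; not the real
// MGS implementation. All types and methods here are illustrative stand-ins.
#[derive(Clone, PartialEq)]
struct SpState {
    serial: String,
    model: String,
    revision: u32,
}

struct Inventory {
    components: Vec<String>,
}

struct Sample;

/// Stand-in for MGS's handle to a single SP.
trait Sp {
    fn state(&self) -> SpState;
    fn inventory(&self) -> Inventory;
    fn read_sensors(&self, inventory: &Inventory) -> Vec<Sample>;
}

struct Poller {
    cached: Option<(SpState, Inventory)>,
}

impl Poller {
    fn poll_once(&mut self, sp: &impl Sp) -> Vec<Sample> {
        loop {
            let state = sp.state();

            // Re-read the (comparatively expensive) component inventory only
            // when the SP's identity has changed: a firmware update, a new
            // chassis with a different serial number, and so on.
            let stale = match &self.cached {
                Some((cached_state, _)) => cached_state != &state,
                None => true,
            };
            if stale {
                self.cached = Some((state.clone(), sp.inventory()));
            }

            let (_, inventory) = self.cached.as_ref().unwrap();
            let samples = sp.read_sensors(inventory);

            // If the SP's state changed while we were reading its sensors,
            // these samples may be mislabeled; poll again immediately with
            // the new identity instead of returning them.
            if sp.state() == state {
                return samples;
            }
        }
    }
}
```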
At present, the timestamps for these metric samples are generated by MGS --- it's the time when MGS received the sensor data from the SP, as MGS understands it. Because we don't currently collect data that was recorded prior to the switch zone coming up, we don't need to worry about figuring out timestamps for data recorded by the SP prior to the existence of a wall clock. Figuring out the SP/MGS timebase synchronization is probably a lot of additional work, although it would be nice to do in the future. At present, metrics emitted by sled-agent prior to NTP sync will also be from 1987, so I think it's fine to do something similar here, especially because the potential solutions to that also have their fair share of tradeoffs.
The new metrics use a schema in oximeter/oximeter/schema/hardware-component.toml. The target of these metrics is a `hardware_component` that includes:

- the identity of the MGS instance reporting the metric,
- information identifying the chassis¹ of the SP that reported them (its serial number, model number, revision, and whether it's a switch, a sled, or a power shelf),
- the SP-reported ID of the component (which may change in future firmware releases),
- the name of the device (e.g. "tmp117", "max5970") and the human-readable description (e.g. "Southeast temperature sensor", "U.2 Sharkfin A hot swap controller", etc.) reported by the SP.
Each kind of sensor reading has an individual metric (`hardware_component:temperature`, `hardware_component:current`, `hardware_component:voltage`, and so on). These metrics are labeled with the SP-reported name of the individual sensor measurement channel. For instance, a MAX5970 hot swap controller on sharkfin will have a voltage and current metric named "V12_U2A_A0" for the 12V rail, and a voltage and current metric named "V3P3_U2A_A0" for the 3.3V rail. Finally, a `hardware_component:sensor_errors` metric records sensor errors reported by the SP, labeled with the sensor name, what kind of sensor it is, and a string representation of the error.
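For a rough picture of the shape of this data, the target and one of the metrics could be written out as plain Rust structs like the following. This is illustrative only: the field names here are guesses, and the authoritative definitions live in the hardware-component.toml schema above.

```rust
// Illustrative only; field names are assumptions, not the real schema.
use uuid::Uuid;

/// The `hardware_component` target: one component on one chassis.
struct HardwareComponent {
    /// Identity of the rack and MGS instance reporting the metric.
    rack_id: Uuid,
    gateway_id: Uuid,
    /// Identity of the chassis whose SP reported the readings.
    chassis_serial: String,
    chassis_model: String,
    chassis_revision: u32,
    chassis_kind: String, // "switch", "sled", or "power shelf"
    /// SP-reported component ID (e.g. "dev-7"); may change across firmware.
    component_id: String,
    /// Device type (e.g. "tmp117", "max5970") and the human-readable
    /// description (e.g. "Southeast temperature sensor") reported by the SP.
    device: String,
    description: String,
}

/// One sensor-reading metric, e.g. `hardware_component:temperature`.
struct Temperature {
    /// SP-reported name of the individual measurement channel.
    sensor: String,
    /// The reading itself, in degrees Celsius.
    datum: f32,
}
```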
Footnotes
1. I'm using "chassis" as a generic term to refer to "switch, sled, or power shelf".