Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU temperature timeseries is not in degrees C #6634

Closed
bnaecker opened this issue Sep 23, 2024 · 3 comments · Fixed by #6656
Closed

CPU temperature timeseries is not in degrees C #6634

bnaecker opened this issue Sep 23, 2024 · 3 comments · Fixed by #6656
Assignees

Comments

@bnaecker
Copy link
Collaborator

Today, we report the CPU temperature via the SP, in the hardware_component:temperature timeseries, specifically those with the "CPU" sensor name. Like all the other temperature measurements, we report those in degrees C. Unfortunately, that's not correct for the CPU temperature metrics the SP has.

There's a lot of subtlety here, which Robert and Keith laid out in this stlouis block comment and this issue. The high-level summary is that these values are a "control temperature", a synthetic number that AMD computes in an unknown way on the basis of the actual hardware sensors a given CPU has. The point of this appears to be to provide a single value that is generic across all the different parts and resource groups (classes of CPUs with different TDP, for example). That number can be used in a thermal control loop, rather than the requiring the loop to figure out which part it's reading from and combine all the junction temperatures itself. (Those actually are in degrees C.)

The best thing to do here is probably cordon off the CPU temperature into its own timeseries. We definitely need to make clear that this is dimensionless, in both the actual unit and the description of the timeseries values as well.

@hawkw
Copy link
Member

hawkw commented Sep 23, 2024

Whoops, this one's on me. I guess we ought to have the MGS metrics task special-case sensor readings with the name "CPU". I'll look into fixing that.

@hawkw hawkw self-assigned this Sep 23, 2024
@hawkw
Copy link
Member

hawkw commented Sep 23, 2024

A quick fix would be to just change MGS' sensors task to special-case temperature readings with the name "CPU" and device kind "sbtsi", which I took a first pass at in 9c502de. In the long run, though, I think it would be more ideologically correct to change the gateway-sp-comms protocol and add a new MeasurementKind variant for synthetic, dimensionless (so-called) measurements. That's probably a bigger change as it involves plumbing that through Hubris in various places.

@hawkw
Copy link
Member

hawkw commented Sep 25, 2024

#6656 will fix this by having MGS special-case the SP-reported sensor name and device kind strings that represent Tctl measurements. In the future, I'd like to change how the SP reports these metrics in the gateway_sp_comms protocol, so that other tools (like faux-mgs) will also not see Tctl values as degrees Celsius. But, that should probably be tracked on the Hubris or management-gateway-service repos, rather than in Omicron.

hawkw added a commit that referenced this issue Sep 25, 2024
Currently, the AMD SP3 host CPU's T<sub>ctl</sub> value is reported to
Oximeter as an instance of the `hardware_component:temperature` metric
with `sensor = "CPU"`. This metric's unit is in degrees Celsius.

This is incorrect. T<sub>ctl</sub> is not a physical measurement from a
temperature sensor, but an internal parameter of the CPU's thermal
control loop in synthetic dimensionless units that range from 0-100. For
details, refer to [this comment][1] and oxidecomputer/stlouis#5.

This branch adds a new `hardware_component:amd_cpu_tctl` metric. Unlike
the `hardware_component:temperature` metric, this metric has no unit, as
it is a dimensionless value. The SP sensor metrics task in MGS has been
changed to special-case temperature measurements where the sensor name
is "CPU" and the device kind is "sbtsi" so that they are reported using
the `hardware_componentamd_:cpu_tctl` metric rather than the
`hardware_component:temperature` metric.

In the future, I think a more ideologically correct solution to this
would be to add a variant to the
`gateway_messages::measurement::MeasurementKind` enum to represent
dimensionless measurements (or, perhaps, specifically for
T<sub>ctl</sub>?), and change Hubris to report the SB-TSI
T<sub>ctl</sub> using that instead. However, this looks like it would
probably be a somewhat more complex change in Hubris, as the thermal
control loop would still need to consider that measurement. This change
introduces the new metric type and fixes the problem of us reporting
incorrectly-labeled metrics, so I think it's worth handling it this way
and then going back and making the Hubris change later.

Fixes #6634

[1]: https://github.com/illumos/illumos-gate/blob/fabe1c1dce7b8e0bda70c23798d91f63c2a5a2c5/usr/src/uts/intel/io/amdzen/smntemp.c#L21-L57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants