-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CPU temperature timeseries is not in degrees C #6634
Comments
Whoops, this one's on me. I guess we ought to have the MGS metrics task special-case sensor readings with the name "CPU". I'll look into fixing that. |
A quick fix would be to just change MGS' sensors task to special-case temperature readings with the name "CPU" and device kind "sbtsi", which I took a first pass at in 9c502de. In the long run, though, I think it would be more ideologically correct to change the |
#6656 will fix this by having MGS special-case the SP-reported sensor name and device kind strings that represent Tctl measurements. In the future, I'd like to change how the SP reports these metrics in the |
Currently, the AMD SP3 host CPU's T<sub>ctl</sub> value is reported to Oximeter as an instance of the `hardware_component:temperature` metric with `sensor = "CPU"`. This metric's unit is in degrees Celsius. This is incorrect. T<sub>ctl</sub> is not a physical measurement from a temperature sensor, but an internal parameter of the CPU's thermal control loop in synthetic dimensionless units that range from 0-100. For details, refer to [this comment][1] and oxidecomputer/stlouis#5. This branch adds a new `hardware_component:amd_cpu_tctl` metric. Unlike the `hardware_component:temperature` metric, this metric has no unit, as it is a dimensionless value. The SP sensor metrics task in MGS has been changed to special-case temperature measurements where the sensor name is "CPU" and the device kind is "sbtsi" so that they are reported using the `hardware_componentamd_:cpu_tctl` metric rather than the `hardware_component:temperature` metric. In the future, I think a more ideologically correct solution to this would be to add a variant to the `gateway_messages::measurement::MeasurementKind` enum to represent dimensionless measurements (or, perhaps, specifically for T<sub>ctl</sub>?), and change Hubris to report the SB-TSI T<sub>ctl</sub> using that instead. However, this looks like it would probably be a somewhat more complex change in Hubris, as the thermal control loop would still need to consider that measurement. This change introduces the new metric type and fixes the problem of us reporting incorrectly-labeled metrics, so I think it's worth handling it this way and then going back and making the Hubris change later. Fixes #6634 [1]: https://github.com/illumos/illumos-gate/blob/fabe1c1dce7b8e0bda70c23798d91f63c2a5a2c5/usr/src/uts/intel/io/amdzen/smntemp.c#L21-L57
Today, we report the CPU temperature via the SP, in the
hardware_component:temperature
timeseries, specifically those with the"CPU"
sensor name. Like all the other temperature measurements, we report those in degrees C. Unfortunately, that's not correct for the CPU temperature metrics the SP has.There's a lot of subtlety here, which Robert and Keith laid out in this stlouis block comment and this issue. The high-level summary is that these values are a "control temperature", a synthetic number that AMD computes in an unknown way on the basis of the actual hardware sensors a given CPU has. The point of this appears to be to provide a single value that is generic across all the different parts and resource groups (classes of CPUs with different TDP, for example). That number can be used in a thermal control loop, rather than the requiring the loop to figure out which part it's reading from and combine all the junction temperatures itself. (Those actually are in degrees C.)
The best thing to do here is probably cordon off the CPU temperature into its own timeseries. We definitely need to make clear that this is dimensionless, in both the actual unit and the description of the timeseries values as well.
The text was updated successfully, but these errors were encountered: