Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gimlet SPs should not report AMD CPU Tctl as a temperature measurement in Celsius #1881

Open
hawkw opened this issue Sep 26, 2024 · 0 comments
Assignees
Labels

Comments

@hawkw
Copy link
Member

hawkw commented Sep 26, 2024

The CPU temperature value we get from AMD host CPUs over SB-TSI represents the CPU's Tctl, which is not a measurement of physical temperature in Celsius like other temperature sensors. Instead, it's a dimensionless value from 0-100 calculated by the CPU's thermal control loop that represents a sort of "thermal throttling danger level". See oxidecomputer/omicron#6634,oxidecomputer/stlouis#5, or this block comment in stlouis for details on what this value actually means.

Right now, the control-plane-agent task will report Tctl values as a sensor measurement with MeasurementKind::Temperature, which suggests to other software that this is a physical temperature measurement in degrees Celsius. We should change this so that it's clearer that this is a dimensionless measurement. As of oxidecomputer/omicron#6656, the MGS subsystem that collects Oximeter metrics from SP sensor measurements special-cases Tctl based on the sensor name and device ID, so the Oximeter metrics are not recorded with incorrect units. This is probably the most important place to get it right, since Oximeter metrics are the primary way we expect customers to consume SP sensor data.

However, because we are currently special-casing Tctl only in the MGS sensor metrics task, these values will still be treated as physical temperature in Celsius elsewhere --- for instance, in faux-mgs, humility sensors, and to other consumers of the MGS HTTP API for reading SP component details. This is a bit of a bummer, so we should probably fix this in a more principled way.

Doing this in Hubris is probably a bit annoying, as we still want to represent the SB-TSI Tctl value as a
"temperature sensor" as far as task-thermal is concerned, as it's an important control parameter for the SP thermal control code...but, we would need to indicate that task-sensors and control-plane-agent should report the value from that particular temperature sensor with different units. Entirely doable, but just requires some plumbing.

While we're here, we probably want to consider changing the Hubris-reported names for this measurement to make it more clear that it's specifically the CPU's Tctl value. Right now, SB-TSI measurements are reported as a component with device: "sbtsi" and description: "CPU temperature sensor", which has one measurement channel with sensor: "CPU" (the Tctl value. We should probably consider changing this to sensor: "Tctl" or sensor: "CPU Tctl" to make this clearer --- especially if we were to support reading other CPU thermal measurements, such as Tdie or Tccd_n_ from Hubris in the future.1

Footnotes

  1. Note that, as I understand it, those values can't be read through SB-TSI over SMBus, and would instead need to come over IPCC from the host or something. So, I'm not sure if/when we'll actually expose them to Hubris...

@hawkw hawkw self-assigned this Sep 26, 2024
@hawkw hawkw added the gimlet label Sep 26, 2024
hawkw added a commit to oxidecomputer/management-gateway-service that referenced this issue Sep 26, 2024
This commit adds a `Dimensionless` variant to the `MeasurementKind` enum
to represent measurements without units. This is intended to be used to
represent AMD CPU T<sub>ctl</sub> thermal values, which are a unitless
value from 0-100 representing a kind of "abstract thermal throttling
danger level" --- see oxidecomputer/hubris#1881 for details on why we
need to differentiate this from other temperature measurements.

I wasn't totally sure as to the best approach for naming this variant. I
note that the name of the enum is "measurement kind", not "measurement
unit", so it doesn't directly represent the unit of the measurement but
rather, the kind of quantity being measured. So, perhaps this ought to
be a `CpuTctl` variant or similar, to *specifically* represent
T<sub>ctl</sub> measurements. If other dimensionless values show up in
future, they would then have their own variant. I'm open to being
convicned either way --- what do others think?
hawkw added a commit to oxidecomputer/management-gateway-service that referenced this issue Sep 30, 2024
This commit adds a `CpuTctl` variant to the `MeasurementKind` enum. This
is intended to be used to represent AMD CPU T<sub>ctl</sub> thermal
values, which are a unitless value from 0-100 representing a kind of
"abstract thermal throttling danger level" --- see
oxidecomputer/hubris#1881 for details on why we need to differentiate
this from other temperature measurements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant