[gateway] Report CPU Tctl as a dimensionless metric #6656
Agh --- this test failure is because I forgot to add the simulated component to the MGS tests. Will fix that.
This is a solid, very practical fix, thanks! I agree that we can look for a better long-term solution, but this is definitely good enough for now. Just one small comment suggestion, but LGTM!
@bnaecker Thanks for the speedy review! I belatedly realized that we should also distinguish between Tctl and other temperature metrics for the purpose of missing samples and error metrics, so 6552ab0 fixes that. I don't know if you feel the need to review again or if you're still good with the previous review.
Currently, the AMD SP3 host CPU's Tctl value is reported to
Oximeter as an instance of the `hardware_component:temperature`
metric with `sensor = "CPU"`. This metric's unit is degrees Celsius.
This is incorrect. Tctl is not a physical measurement from a
temperature sensor, but an internal parameter of the CPU's thermal
control loop in synthetic dimensionless units that range from 0-100. For
details, refer to this comment and oxidecomputer/stlouis#5.
This branch adds a new `hardware_component:cpu_tctl` metric. Unlike the
`hardware_component:temperature` metric, this metric has no unit, as it
is a dimensionless value. The SP sensor metrics task in MGS has been
changed to special-case temperature measurements where the sensor name
is "CPU" and the device kind is "sbtsi" so that they are reported using
the `hardware_component:cpu_tctl` metric rather than the
`hardware_component:temperature` metric.
In the future, I think a more ideologically correct solution to this
would be to add a variant to the
`gateway_messages::measurement::MeasurementKind` enum to represent
dimensionless measurements (or, perhaps, specifically for
Tctl?), and change Hubris to report the SB-TSI
Tctl using that instead. However, this looks like it would
probably be a somewhat more complex change in Hubris, as the thermal
control loop would still need to consider that measurement. This change
introduces the new metric type and fixes the problem of us reporting
incorrectly-labeled metrics, so I think it's worth handling it this way
and then going back and making the Hubris change later.
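For reviewers who want the shape of the special case without reading the diff, here is a minimal sketch of the selection rule described above. It is not the code in this branch: `MeasurementKind`, `Metric`, and `metric_for` are hypothetical stand-ins invented for illustration, and only the "CPU" and "sbtsi" strings and the two metric names come from this description.

```rust
/// Hypothetical stand-in for the kind of measurement an SP sensor reports.
/// (Not the real `gateway_messages` enum.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MeasurementKind {
    Temperature,
    Other,
}

/// Hypothetical label for the metric a sample is published under.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Metric {
    /// `hardware_component:temperature`, in degrees Celsius.
    Temperature,
    /// `hardware_component:cpu_tctl`, dimensionless.
    CpuTctl,
    /// Any other metric; out of scope for this sketch.
    Other,
}

/// Choose the metric for a sample. A temperature measurement is treated as
/// Tctl only when both the sensor name and the device kind match.
fn metric_for(kind: MeasurementKind, sensor: &str, device: &str) -> Metric {
    match kind {
        MeasurementKind::Temperature if sensor == "CPU" && device == "sbtsi" => {
            Metric::CpuTctl
        }
        MeasurementKind::Temperature => Metric::Temperature,
        MeasurementKind::Other => Metric::Other,
    }
}
```

Requiring both conditions keeps an unrelated sensor that happens to be named "CPU", or a different reading from an SB-TSI device, on the ordinary temperature metric.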
Fixes #6634