Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gateway] Report CPU Tctl as a dimensionless metric #6656

Merged
merged 10 commits into from
Sep 25, 2024
Merged

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Sep 24, 2024

Currently, the AMD SP3 host CPU's Tctl value is reported to
Oximeter as an instance of the hardware_component:temperature metric
with sensor = "CPU". This metric's unit is in degrees Celcius.

This is incorrect. Tctl is not a physical measurement from a
temperature sensor, but an internal parameter of the CPU's thermal
control loop in synthetic dimensionless units that range from 0-100. For
details, refer to this comment and oxidecomputer/stlouis#5.

This branch adds a new hardware_component:cpu_tctl metric. Unlike the
hardware_component:temperature metric, this metric has no unit, as it
is a dimensionless value. The SP sensor metrics task in MGS has been
changed to special-case temperature measurements where the sensor name
is "CPU" and the device kind is "sbtsi" so that they are reported using
the hardware_component:cpu_tctl metric rather than the
hardware_component:temperature metric.

In the future, I think a more ideologically correct solution to this
would be to add a variant to the
gateway_messages::measurement::MeasurementKind enum to represent
dimensionless measurements (or, perhaps, specifically for
Tctl?), and change Hubris to report the SB-TSI
Tctl using that instead. However, this looks like it would
probably be a somewhat more complex change in Hubris, as the thermal
control loop would still need to consider that measurement. This change
introduces the new metric type and fixes the problem of us reporting
incorrectly-labeled metrics, so I think it's worth handling it this way
and then going back and making the Hubris change later.

Fixes #6634

@hawkw hawkw added the oximeter label Sep 24, 2024
@hawkw hawkw self-assigned this Sep 24, 2024
@hawkw
Copy link
Member Author

hawkw commented Sep 24, 2024

Agh --- this test failure is because i forgot to add the simulated component to the MGS tests. Will fix that.

Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a solid, very practical fix, thanks! I agree that we can look for a better long-term solution, but this is definitely good enough for now. Just one small comment suggestion, but LGTM!

gateway/src/metrics.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member Author

hawkw commented Sep 24, 2024

@bnaecker Thanks for the speedy review! I belatedly realized that we should also distinguish between Tctl and other temperature metrics for the purpose of missing samples and error metrics, so 6552ab0 fixes that. I don't know if you feel the need to review again or if you're still good with the previous review.

@hawkw hawkw enabled auto-merge (squash) September 25, 2024 17:48
@hawkw hawkw merged commit dbbb77b into main Sep 25, 2024
16 checks passed
@hawkw hawkw deleted the eliza/cpu-temp-metric branch September 25, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CPU temperature timeseries is not in degrees C
2 participants