[receiver/hostmetrics] Divide by logical cores when calculating process.cpu.utilization #31378
Conversation
This PR would be correct according to the semantic conventions which define the metric as:
"Difference in process.cpu.time since the last measurement, divided by the elapsed time and number of CPUs available to the process."
I would definitely wait for @dmitryax to chime in, as part of system semantic conventions work he will be going through and clearing up all the definitions of CPU utilization and process.
Personally I think it's fine to merge this PR because based on the description of this metric in the current metadata.yaml, the existing implementation is categorically wrong and this fixes it as far as I can tell. I'm a bit fuzzy on whether this technically constitutes a breaking change, since we are now fixing something that was incorrect but technically users could be relying on that incorrect number. I'm not sure on that judgement call.
On a personal level I actually am not a huge fan of the described behaviour of this metric. At least in Linux-land, a process' CPU utilization is often interpreted as the percentage of a single core, thus the percentage actually could be over 1 if the process uses more than 1 full core. I discussed this in the last system semconv meeting, and I will open an issue about it for us to work out there.
All that aside, my opinion is that this PR should be merged to match the description, and the semconv working group should make a decision what we want this metric to be in the final process semantic conventions. Definitely wait for Dmitrii to weigh in though.
I agree that this PR aligns the receiver's behavior with the current description in the Semantic Conventions for OS Process Metrics.
The only caveat is that the OS process convention itself is apparently in "Experimental" status. We probably don't want to change this behavior twice. ;)
Perhaps worth marking this as a breaking change to make it more visible to users who might be relying on the current behavior.
Bump @dmitryax - would like to get your thoughts on this one!
I think it's the right fix given that we don't have an attribute for a particular core as we have for system.cpu.utilization
But, given that it's a widely used component, should we make the change with a feature gate? @djaglowski @astencel-sumo @braydonk
Yeah, I'm afraid I agree. 😉 It's always additional work adding a feature gate and changing it in later stages, but it's probably justified in this case. This might be quite a disruptive change for users relying on this metric.
Force-pushed from d646652 to 1db0017.
@dmitryax I've updated the PR to add a new feature gate, and to normalize the metric only when the gate is enabled.
Resolved review comments (now outdated) on:
- receiver/hostmetricsreceiver/internal/scraper/processscraper/ucal/cpu_utilization_calculator.go
- .../hostmetricsreceiver/internal/scraper/processscraper/ucal/cpu_utilization_calculator_test.go
@BinaryFissionGames, thanks for adding the feature gate. I've added a couple of comments to make the naming more consistent. Can you also please add documentation describing the behavior and schedule of the feature gate? See e.g. here for inspiration: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.87.0/extension/storage/filestorage/README.md#extensionfilestoragereplaceunsafecharacters.
The schedule for this feature gate is:
- Introduced in v0.97.0 (March 2024) as `alpha` - disabled by default.
- Moved to `beta` in v0.99.0 (April 2024) - enabled by default.
- Moved to `stable` in v0.101.0 (May 2024) - cannot be disabled.
- Removed three releases after `stable`.
Let me know if this schedule seems too compressed. I tried to go with the two-releases-per-advancement approach.
Love it! Thank you @BinaryFissionGames 👏
[receiver/hostmetrics] Divide by logical cores when calculating process.cpu.utilization (open-telemetry#31378)

**Description:** When calculating the process.cpu.utilization metric, values over 1 were possible because the number of cores was not taken into account (a single process may run on multiple logical cores, effectively multiplying the maximum amount of CPU time the process can accumulate). This PR adds a division by the number of logical cores to the CPU utilization calculation.

**Link to tracking issue:** Closes open-telemetry#31368

**Testing:**
* Added unit tests
* Tested locally on my system with the program I posted in the issue:

```json
{
  "name": "process.cpu.utilization",
  "description": "Percentage of total CPU time used by the process since last scrape, expressed as a value between 0 and 1. On the first scrape, no data point is emitted for this metric.",
  "unit": "1",
  "gauge": {
    "dataPoints": [
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "user" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.8811268516953904
      },
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "system" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.0029471002907659667
      },
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "wait" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0
      }
    ]
  }
}
```

In Activity Monitor, this process was clocking in at around 1000%-1100% CPU on my machine, which has 12 logical cores, so a value of around 90% total utilization looks correct here.

**Documentation:** N/A

Co-authored-by: Daniel Jaglowski <[email protected]>