[receiver/hostmetrics] Divide by logical cores when calculating process.cpu.utilization #31378
Conversation
This PR would be correct according to the semantic conventions which define the metric as:
"Difference in process.cpu.time since the last measurement, divided by the elapsed time and number of CPUs available to the process."
I would definitely wait for @dmitryax to chime in, as part of system semantic conventions work he will be going through and clearing up all the definitions of CPU utilization and process.
Personally I think it's fine to merge this PR because based on the description of this metric in the current metadata.yaml, the existing implementation is categorically wrong and this fixes it as far as I can tell. I'm a bit fuzzy on whether this technically constitutes a breaking change, since we are now fixing something that was incorrect but technically users could be relying on that incorrect number. I'm not sure on that judgement call.
On a personal level I actually am not a huge fan of the described behaviour of this metric. At least in Linux-land, a process' CPU utilization is often interpreted as the percentage of a single core, thus the percentage actually could be over 1 if the process uses more than 1 full core. I discussed this in the last system semconv meeting, and I will open an issue about it for us to work out there.
All that aside, my opinion is that this PR should be merged to match the description, and the semconv working group should make a decision what we want this metric to be in the final process semantic conventions. Definitely wait for Dmitrii to weigh in though.
I agree that this PR aligns the receiver's behavior with the current description in the Semantic Conventions for OS Process Metrics.
The only caveat is that the OS process convention itself is apparently in "Experimental" status. We probably don't want to change this behavior twice. ;)
Perhaps worth marking this as a breaking change to make it more visible to users who might be relying on the current behavior.
Bump @dmitryax - would like to get your thoughts on this one!
I think it's the right fix given that we don't have an attribute for a particular core as we have for system.cpu.utilization
But, given that it's a widely used component, should we make the change with a feature gate? @djaglowski @astencel-sumo @braydonk
Yeah, I'm afraid I agree. 😉 It's always additional work adding a feature gate and changing it in later stages, but it's probably justified in this case. This might be quite a disruptive change for users relying on this metric.
Force-pushed from d646652 to 1db0017.
@dmitryax I've updated the PR to add a new feature gate, and to normalize the metric only when the gate is enabled.
Resolved review comments (now outdated) on:
- receiver/hostmetricsreceiver/internal/scraper/processscraper/ucal/cpu_utilization_calculator.go
- .../hostmetricsreceiver/internal/scraper/processscraper/ucal/cpu_utilization_calculator_test.go
@BinaryFissionGames, thanks for adding the feature gate. I've added a couple of comments to make the naming more consistent. Can you also please add documentation describing the behavior and schedule of the feature gate? See e.g. here for inspiration: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.87.0/extension/storage/filestorage/README.md#extensionfilestoragereplaceunsafecharacters.
The schedule for this feature gate is:
- Introduced in v0.97.0 (March 2024) as `alpha` - disabled by default.
- Moved to `beta` in v0.99.0 (April 2024) - enabled by default.
- Moved to `stable` in v0.101.0 (May 2024) - cannot be disabled.
- Removed three releases after `stable`.
Let me know if this schedule seems too compressed. I tried to go with the two-releases-per-advancement approach.
Love it! Thank you @BinaryFissionGames 👏
[receiver/hostmetrics] Divide by logical cores when calculating process.cpu.utilization (open-telemetry#31378)

**Description:** When calculating the process.cpu.utilization metric, values over 1 were possible because the number of cores was not taken into account (a single process may run on multiple logical cores, effectively multiplying the maximum amount of CPU time the process can accumulate). This PR adds a division by the number of logical cores to the CPU utilization calculation.

**Link to tracking issue:** Closes open-telemetry#31368

**Testing:**
* Added unit tests
* Tested locally on my system with the program I posted in the issue:

```json
{
  "name": "process.cpu.utilization",
  "description": "Percentage of total CPU time used by the process since last scrape, expressed as a value between 0 and 1. On the first scrape, no data point is emitted for this metric.",
  "unit": "1",
  "gauge": {
    "dataPoints": [
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "user" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.8811268516953904
      },
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "system" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.0029471002907659667
      },
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "wait" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0
      }
    ]
  }
}
```

In Activity Monitor, this process was clocking in at around 1000%-1100% CPU on my machine, which has 12 logical cores, so a value of around 90% total utilization looks correct here.

**Documentation:** N/A

Co-authored-by: Daniel Jaglowski <[email protected]>