
[Internal Documentation] Data flow of CPU metrics presented in Fleet UI #4005

Merged: 16 commits into elastic:main on Jan 15, 2024

Conversation

@ycombinator (Contributor) commented Jan 4, 2024

What does this PR do?

This PR documents the end-to-end data flow of CPU metrics presented in the Fleet UI, including any transformations/calculations/aggregations that are done at the various stages along the way.

Why is it important?

To understand how the CPU utilization % metric presented in the Fleet UI for each Agent in the Agent Listing view is ultimately calculated; this has been a source of confusion for many users.

Related issues

Closes #4001

@ycombinator ycombinator added Team:Elastic-Agent Label for the Agent team docs labels Jan 4, 2024
mergify bot commented Jan 4, 2024

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • `backport-v./d./d./d` is the label to automatically backport to the `8./d` branch; `/d` is a digit

NOTE: backport-skip has been added to this pull request.

@nimarezainia (Contributor)

fyi @kpollich

@cmacknz (Member) commented Jan 8, 2024

Relates: elastic/kibana#174458

@ycombinator (Contributor, Author)

@cmacknz @fearful-symmetry @nchaulet I've completed documenting the data path taken by CPU utilization metrics all the way from /proc/{pid}/stat files (for Agent and Beats processes running on Linux) to the single value shown in the Fleet UI.

I've also included, at the end of this documentation, sections on my observations and suggested improvements. Please review the entire document but especially the observations and suggested improvements sections and provide any feedback. I'd like to file issues (or reuse existing ones) for, and link to them from, each of the improvements in this documentation.

The short/medium-term goal is that the next time there's an SDH about Agent CPU utilization metrics in the Fleet UI being confusing, we can link to this documentation. The medium/long-term goal would be to implement the improvements and then remove them from this documentation, while also updating the rest of the documentation as needed.

Thank you!

@ycombinator ycombinator marked this pull request as ready for review January 9, 2024 08:02
@ycombinator ycombinator requested a review from a team as a code owner January 9, 2024 08:03
@ycombinator ycombinator requested a review from leehinman January 9, 2024 08:03
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

the values for the Beat processes (along with the Agent process), we would be exposing internals of Agent to users, which
may be confusing for some users.

* We should enhance collection and aggregation to include CPU utilization for Beats, which are Agent components managed by
Contributor:

Can you elaborate on this? You mean provide the CPU metrics in the stats metrics, instead of requiring fleet to perform its own aggregation query?

@ycombinator (Contributor, Author) commented Jan 10, 2024

Sorry, I don't think I wrote this in the clearest way. I've changed the language now so please re-review.

What I want to convey is that we should also include and present CPU metrics about service-runtime components (e.g. Endpoint), not just command-runtime components (e.g. Beats). And I deliberately didn't want to get into exactly how we should do that.

Member:

I think we account for Endpoint CPU now as well, which will even further complicate users' attempts to correlate what is shown in Fleet with what is in `top`.

`top` will never account for Endpoint resource usage because it runs in a different process and cgroup.

@ycombinator (Contributor, Author) commented Jan 10, 2024

Hmmm, I'm not seeing Endpoint CPU being recorded today.

I just added the Elastic Defend integration to my policy and then, a few minutes later, queried the metrics-elastic_agent* indices with the following query:

POST metrics-elastic_agent*/_search?filter_path=hits.hits._source
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "_source": [
    "@timestamp", 
    "elastic_agent.process",
    "component.id"
  ],
  "collapse": {
    "field": "component.id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "elastic_agent.id": [
              "c406741d-7673-447f-b7e0-43417eeb10d6"
            ]
          }
        }
      ]
    }
  }
}

Results:

{
  "hits": {
    "hits": [
      {
        "_source": {
          "@timestamp": "2024-01-10T22:03:45.069Z",
          "component": {
            "id": "log-default"
          },
          "elastic_agent": {
            "process": "filebeat"
          }
        }
      },
      {
        "_source": {
          "@timestamp": "2024-01-10T22:03:45.069Z",
          "component": {
            "id": "elastic-agent"
          },
          "elastic_agent": {
            "process": "elastic-agent"
          }
        }
      }
    ]
  }
}

[EDIT] And I verified that the Endpoint component was, in fact, running.

$ sudo elastic-agent status --output full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: c406741d-7673-447f-b7e0-43417eeb10d6
   │  ├─ version: 8.11.3
   │  └─ commit: f4f6fbb3e6c81f37cec57a3c244f009b14abd74f
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1285688'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ endpoint-default
   │  ├─ status: (HEALTHY) Healthy: communicating with endpoint service
   │  ├─ endpoint-default
   │  │  ├─ status: (HEALTHY) Applied policy {a85e47d2-b214-4837-bbf2-8f424a9764a2}
   │  │  └─ type: OUTPUT
   │  └─ endpoint-default-a85e47d2-b214-4837-bbf2-8f424a9764a2
   │     ├─ status: (HEALTHY) Applied policy {a85e47d2-b214-4837-bbf2-8f424a9764a2}
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1285698'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ log-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1285682'
      ├─ log-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ log-default-logfile-logs-82c5d927-b806-4c08-96b7-4972d7767739
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT


## Suggested improvements

* We should document the observations above in an appropriate location and perhaps link to this documentation from the "i"
Contributor:

Ideally, the tooltip text should also be improved, as right now it's pretty vague.

@ycombinator (Contributor, Author)

Yeah, if we can briefly explain what the metric represents and a bit about how it's calculated, we should put all that in the tooltip. Otherwise, I think we should just replace the tooltip with a link to a documentation page.

Member:

IMO if we have to link to a documentation page we are probably presenting the wrong value. Most likely the CPU value should just describe itself as the 5-minute average CPU utilization, or we should switch to presenting an instantaneous value from the last check-in with a link to a graph of the values over time.

@ycombinator (Contributor, Author)

Agreed, I guess this suggestion should be more of a fallback if we can't come up with a good UX in the Fleet UI itself.

... just presenting an instantaneous value from the last checkin

I'm partial to the idea of showing a single value that's the sum of the latest CPU utilization %s for Agent + all its component processes, with a tooltip explaining that the value can range from 0 to the number of cores * 100 (e.g., up to 400% on a 4-core host). This value should be easy to correlate with what's seen in the output of top / htop.

... with a link to see a graph of the values over time.

++, the single value could link to a chart showing the CPU utilization % per process over time, in case the user wants to drill down further.
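
For illustration, here's a minimal sketch of how the latest sample per component could be fetched, reusing the structure of the query earlier in this thread. The `system.process.cpu.*` source filter is an assumption about where the per-process CPU fields live in these documents, so it may need adjusting; the per-component values would then be summed client-side into the single number.

# Sketch only: latest monitoring document per component for one Agent.
# Assumption: per-process CPU fields live under system.process.cpu.* in
# metrics-elastic_agent* documents; adjust _source to the actual field names.
POST metrics-elastic_agent*/_search?filter_path=hits.hits._source
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "_source": [
    "@timestamp",
    "component.id",
    "system.process.cpu.*"
  ],
  "collapse": {
    "field": "component.id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "elastic_agent.id": [
              "c406741d-7673-447f-b7e0-43417eeb10d6"
            ]
          }
        }
      ]
    }
  }
}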

* Also, CPU utilization is rarely constant. If the output of `top` or `htop` for Agent and Beat processes is observed over
time, the CPU utilization % shown varies for each process over time.

* To relate the output seen in `top` or `htop` for Agent and Beat processes with the single value shown in the Fleet UI,
Contributor:

I wonder if users also find this confusing? Do we need to average over 5 minutes? It smooths out spikes, but it might also confuse existing users who are used to peeking at metrics.

@ycombinator (Contributor, Author)

Added suggestion in e490363.
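
For illustration, a minimal sketch of what a shorter averaging window could look like, assuming the rollup is a date_histogram aggregation like the `cpu_time_series` one described in the doc (the real query Fleet runs has additional filters and metric sub-aggregations that are omitted here):

# Sketch only: roll CPU samples up into 1-minute buckets over the last minute
# instead of a single 5-minute average. A 30-second bucket would require
# fixed_interval, since calendar_interval only accepts single calendar units
# such as 1m.
POST metrics-elastic_agent*/_search?size=0
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1m"
      }
    }
  },
  "aggs": {
    "cpu_time_series": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "1m"
      }
    }
  }
}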

* We should document the observations above in an appropriate location and perhaps link to this documentation from the "i"
icon in CPU column in the Agent Listing page in the Fleet UI.

* Rather than showing a single value for every Agent in the Agent Listing page in the Fleet UI, we should consider showing
Member:

We also have to account for there being multiple instances of each individual Beat process when we display them. We have to break it down by unique input-output combination, as that is how we create the processes.

@ycombinator (Contributor, Author)

Added suggestion in 2ded3c4.

}
],
"collapse": {
"field": "elastic_agent.process"

Member:

Does this account for there being multiple instances of each Beat process? It could, depending on whether we use the component ID (filestream-default) or just the process name (e.g. filebeat) as the value.

For reference, the default Fleet policy with the system integration and monitoring runs 3 metricbeat instances (system/metrics-default, http/metrics-monitoring, beat/metrics-monitoring) and 2 filebeat instances (filestream-monitoring, log-default).

If this is averaging the 3 metricbeats and 2 filebeats together as if they were one process, instead of summing the usage of each distinct process, I don't think it is correct.

The memory usage query has this problem as described in elastic/kibana#174458

@ycombinator (Contributor, Author)

It doesn't. The value of this field is just filebeat. The query should be using the component.id field itself, which has values such as log-default. I will add this fix to the list of suggestions.
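
For illustration, a minimal sketch of the corrected clause (the surrounding Fleet query, with its filters and aggregations, stays the same and is omitted here):

# Sketch only: collapse on the component ID so each Beat instance
# (e.g. log-default, filestream-monitoring) is counted separately,
# rather than on the shared process name (e.g. filebeat).
POST metrics-elastic_agent*/_search
{
  "collapse": {
    "field": "component.id"
  }
}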

Member:

Looks like this is already tracked as a TODO in elastic/kibana#174458, now that I look closer at it.

@ycombinator (Contributor, Author)

If you're referring to elastic/kibana#174458 (comment), I added that based on this conversation here :).

Member:

One more question: did you observe us reporting CPU for the monitoring components?

elastic/kibana#174458 (comment)

There is some evidence we are not reporting them, causing us to undercount CPU and memory usage by omitting the monitoring component contributions.

@ycombinator (Contributor, Author)

I did not observe us reporting CPU for monitoring components themselves.
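
For reference, here's a sketch of how one could check, reusing the query from earlier in this thread and filtering on the monitoring component IDs that appear in the status output and default-policy list above:

# Sketch only: look for monitoring documents attributed to the monitoring
# components themselves. Component IDs are taken from the elastic-agent status
# output and the default-policy list mentioned above.
POST metrics-elastic_agent*/_search?filter_path=hits.hits._source
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "_source": [
    "@timestamp",
    "component.id",
    "elastic_agent.process"
  ],
  "collapse": {
    "field": "component.id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "elastic_agent.id": [
              "c406741d-7673-447f-b7e0-43417eeb10d6"
            ]
          }
        },
        {
          "terms": {
            "component.id": [
              "beat/metrics-monitoring",
              "http/metrics-monitoring",
              "filestream-monitoring"
            ]
          }
        }
      ]
    }
  }
}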

@ycombinator (Contributor, Author)

Added a suggestion for collecting CPU usage metrics for monitoring components.

@ycombinator (Contributor, Author)

@fearful-symmetry @cmacknz Thanks for all the great feedback! Based on it, I've reworked the suggestions section in the doc now. Please re-review and I will iterate further as needed. Once you're both happy with the suggestions, I'll file issues for them and link them from this doc before merging this PR.

@ycombinator ycombinator requested a review from cmacknz January 10, 2024 21:46
@cmacknz (Member) left a comment

I only have questions about why we are doing what we are doing at this point; the content here LGTM.

30-second or 1-minute average (making corresponding adjustments to the `calendar_interval` value in the `cpu_time_series`
aggregation). This would result in a value closer to what's observed in `top` / `htop` output.

* We should link the value shown in the Fleet UI to a chart that breaks it down for that Agent by `component.id` over time,
Member:

should this be the elastic_agent dashboard?

@ycombinator (Contributor, Author)

Yes, specifically the Agent Metrics one, e.g.:

[Screenshot: Agent metrics dashboard]

@fearful-symmetry (Contributor) left a comment

I would say this is all pretty reasonable. Do we plan to break the Suggested Improvements down into a meta-issue or something?

@ycombinator (Contributor, Author) commented Jan 12, 2024

I would say this is all pretty reasonable. Do we plan to break the Suggested Improvements down into a meta-issue or something?

Yes, I plan to file separate issues for them and then link to all of them from this doc.

@ycombinator (Contributor, Author)

Filed issues for each of the suggested improvements and linked to them from this doc.

@ycombinator ycombinator enabled auto-merge (squash) January 12, 2024 23:01
Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@cmacknz (Member) commented Jan 15, 2024

Force merging, documentation change only.

@cmacknz cmacknz disabled auto-merge January 15, 2024 14:08
@cmacknz cmacknz merged commit 17f0480 into elastic:main Jan 15, 2024
8 of 9 checks passed