[Internal Documentation] Data flow of CPU metrics presented in Fleet UI #4005
Conversation
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
NOTE:
fyi @kpollich
Relates: elastic/kibana#174458
@cmacknz @fearful-symmetry @nchaulet I've completed documenting the data path taken by CPU utilization metrics, all the way from collection in the Agent to their presentation in the Fleet UI. I've also included, at the end of this documentation, sections on my observations and suggested improvements. Please review the entire document, especially the observations and suggested improvements sections, and provide any feedback. I'd like to file issues (or reuse existing ones) for each of the improvements in this documentation and link to them from it. The short/medium-term goal is that the next time there's an SDH about Agent CPU utilization metrics in the Fleet UI being confusing, we can link to this documentation. The medium/long-term goal is to implement the improvements and then delete them from this documentation, updating the rest of the documentation as needed. Thank you!
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
docs/cpu-metrics-in-fleet.md
Outdated
the values for the Beat processes (along with the Agent process), we would be exposing internals of Agent to users, which
may be confusing for some users.

* We should enhance collection and aggregation to include CPU utilization for Beats, which are Agent components managed by
Can you elaborate on this? You mean provide the CPU metrics in the stats
metrics, instead of requiring fleet to perform its own aggregation query?
Sorry, I don't think I wrote this in the clearest way. I've changed the language now so please re-review.
What I want to convey is that we should also include and present CPU metrics about service-runtime components (e.g. Endpoint), not just command-runtime components (e.g. Beats). And I deliberately didn't want to get into exactly how we should do that.
I think we account for endpoint CPU now as well, which will further complicate users' attempts to correlate what is shown in Fleet with what is in top.
Top will never account for endpoint resource usage because it is in a different process and cgroup.
Hmmm, I'm not seeing Endpoint CPU being recorded today.
I just added the Elastic Defend integration to my policy and then, a few minutes later, queried the metrics-elastic_agent*
indices with the following query:
```
POST metrics-elastic_agent*/_search?filter_path=hits.hits._source
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "_source": [
    "@timestamp",
    "elastic_agent.process",
    "component.id"
  ],
  "collapse": {
    "field": "component.id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "elastic_agent.id": [
              "c406741d-7673-447f-b7e0-43417eeb10d6"
            ]
          }
        }
      ]
    }
  }
}
```
Results:
```json
{
  "hits": {
    "hits": [
      {
        "_source": {
          "@timestamp": "2024-01-10T22:03:45.069Z",
          "component": {
            "id": "log-default"
          },
          "elastic_agent": {
            "process": "filebeat"
          }
        }
      },
      {
        "_source": {
          "@timestamp": "2024-01-10T22:03:45.069Z",
          "component": {
            "id": "elastic-agent"
          },
          "elastic_agent": {
            "process": "elastic-agent"
          }
        }
      }
    ]
  }
}
```
[EDIT] And I verified that the Endpoint component was, in fact, running.
```
$ sudo elastic-agent status --output full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: c406741d-7673-447f-b7e0-43417eeb10d6
   │  ├─ version: 8.11.3
   │  └─ commit: f4f6fbb3e6c81f37cec57a3c244f009b14abd74f
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1285688'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ endpoint-default
   │  ├─ status: (HEALTHY) Healthy: communicating with endpoint service
   │  ├─ endpoint-default
   │  │  ├─ status: (HEALTHY) Applied policy {a85e47d2-b214-4837-bbf2-8f424a9764a2}
   │  │  └─ type: OUTPUT
   │  └─ endpoint-default-a85e47d2-b214-4837-bbf2-8f424a9764a2
   │     ├─ status: (HEALTHY) Applied policy {a85e47d2-b214-4837-bbf2-8f424a9764a2}
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1285698'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ log-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1285682'
      ├─ log-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ log-default-logfile-logs-82c5d927-b806-4c08-96b7-4972d7767739
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT
```
docs/cpu-metrics-in-fleet.md
Outdated
## Suggested improvements

* We should document the observations above in an appropriate location and perhaps link to this documentation from the "i"
Ideally, the tooltip text should also be improved, as right now it's pretty vague.
Yeah, if we can briefly explain what the metric represents and a bit about how it's calculated, we should put all that in the tooltip. Otherwise, I think we should just replace the tooltip with a link to a documentation page.
IMO if we have to link to a documentation page we are probably presenting the wrong value. Most likely the CPU value should just document itself as the 5-minute average CPU utilization, or we should switch to presenting an instantaneous value from the last check-in with a link to see a graph of the values over time.
Agreed, I guess this suggestion should be more of a fallback if we can't come up with a good UX in the Fleet UI itself.

> ... just presenting an instantaneous value from the last checkin

I'm partial to the idea of showing a single value that's the sum of the latest CPU utilization %s for the Agent + all its component processes, with a tooltip explaining that the value can range from 0 to the number of cores * 100. This value should be easily correlatable with what's seen in the output of `top` / `htop`.

> ... with a link to see a graph of the values over time.

++, the single value could link to a chart showing the CPU utilization %, per process, over time, in case the user wants to drill down further.
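The "sum of the latest per-process CPU %" idea above can be sketched in Python. This is a minimal illustration only: the document shape and `cpu_pct` field are hypothetical, loosely modeled on the `component.id` query results earlier in this thread.

```python
def total_cpu_pct(docs):
    """Sum the most recent CPU utilization % per component.

    `docs` is assumed sorted newest-first, mimicking the sort + collapse
    behavior of the Elasticsearch query shown earlier in this thread.
    The result ranges from 0 to (number of cores * 100), matching what
    `top` / `htop` would show for the process group.
    """
    latest = {}
    for doc in docs:
        cid = doc["component"]["id"]
        if cid not in latest:  # first hit per component is the newest
            latest[cid] = doc["cpu_pct"]
    return sum(latest.values())


# Synthetic documents for illustration (newest first):
docs = [
    {"component": {"id": "log-default"}, "cpu_pct": 12.5},
    {"component": {"id": "elastic-agent"}, "cpu_pct": 3.0},
    {"component": {"id": "log-default"}, "cpu_pct": 40.0},  # older sample, ignored
]
print(total_cpu_pct(docs))  # 15.5
```

The older `log-default` sample is skipped because only the newest document per component contributes, which is the same effect `collapse` has in the query.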
* Also, CPU utilization is rarely constant. If the output of `top` or `htop` for Agent and Beat processes is observed over
time, the CPU utilization % shown varies for each process over time.

* To relate the output seen in `top` or `htop` for Agent and Beat processes with the single value shown in the Fleet UI,
I wonder if users also find this confusing? Do we need to average over 5 minutes? It smooths out spikes, but might also confuse existing users used to peeking at metrics.
Added suggestion in e490363.
docs/cpu-metrics-in-fleet.md
Outdated
* We should document the observations above in an appropriate location and perhaps link to this documentation from the "i"
icon in the CPU column in the Agent Listing page in the Fleet UI.

* Rather than showing a single value for every Agent in the Agent Listing page in the Fleet UI, we should consider showing
We also have to account for there being multiple instances of each individual Beat process when we display them. We have to break it down by unique input-output combination, as that is how we create the processes.
Added suggestion in 2ded3c4.
      }
    ],
    "collapse": {
      "field": "elastic_agent.process"
Does this account for there being multiple instances of each Beat process? It could, depending on whether we use the component ID (e.g. `filestream-default`) or just the process name (e.g. `filebeat`) as the value.
For reference, the default Fleet policy with the system integration and monitoring runs 3 metricbeat instances (`system/metrics-default`, `http/metrics-monitoring`, `beat/metrics-monitoring`) and 2 filebeat instances (`filestream-monitoring`, `log-default`).
If this is averaging the 3 metricbeats and 2 filebeats together as if they were one process, instead of summing the usage of each distinct process, I don't think it is correct.
The memory usage query has this problem, as described in elastic/kibana#174458.
It doesn't. The value of this field is just `filebeat`. The query should be using the `component.id` field itself, which has values such as `log-default`. I will add this fix to the list of suggestions.
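To make the averaging-vs-summing distinction above concrete, here is a small Python sketch with synthetic numbers (the CPU values are made up; only the instance names come from the comment above). Grouping by process name collapses the three metricbeat instances into one value, while grouping by component ID lets each instance contribute to the sum:

```python
# Three distinct metricbeat instances, each (hypothetically) using 10% CPU.
samples = [
    {"process": "metricbeat", "component_id": "system/metrics-default", "cpu_pct": 10.0},
    {"process": "metricbeat", "component_id": "http/metrics-monitoring", "cpu_pct": 10.0},
    {"process": "metricbeat", "component_id": "beat/metrics-monitoring", "cpu_pct": 10.0},
]

# Grouping by process name treats the three instances as one process
# and averages their samples together:
by_process = {}
for s in samples:
    by_process.setdefault(s["process"], []).append(s["cpu_pct"])
avg_by_process = sum(sum(v) / len(v) for v in by_process.values())

# Grouping by component ID keeps each instance distinct, so their
# usage is summed:
by_component = {s["component_id"]: s["cpu_pct"] for s in samples}
total_by_component = sum(by_component.values())

print(avg_by_process, total_by_component)  # 10.0 30.0
```

The process-name grouping reports 10% where the host is actually spending 30% across the three instances, which is the undercounting described above.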
Looks like this is already tracked as a TODO in elastic/kibana#174458 now that I look closer at it
If you're referring to elastic/kibana#174458 (comment), I added that based on this conversation here :).
One more question: did you observe us reporting CPU for the monitoring components?
elastic/kibana#174458 (comment)
There is some evidence we are not reporting them, causing us to undercount CPU and memory usage by omitting the monitoring component contributions.
I did not observe us reporting CPU for monitoring components themselves.
Added a suggestion for collecting CPU usage metrics for monitoring components.
@fearful-symmetry @cmacknz Thanks for all the great feedback! Based on it, I've reworked the suggestions section in the doc. Please re-review and I will iterate further as needed. Once you're both happy with the suggestions, I'll file issues for them and link them from this doc before merging this PR.
I only have questions about why we are doing what we are doing at this point, the content here LGTM.
30-second or 1-minute average (making corresponding adjustments to the `calendar_interval` value in the `cpu_time_series`
aggregation). This would result in a value closer to what's observed in `top` / `htop` output.

* We should link the value shown in the Fleet UI to a chart that breaks it down for that Agent by `component.id` over time,
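A side note on the interval knob mentioned in the first bullet above: in Elasticsearch's `date_histogram` aggregation, `calendar_interval` accepts `1m` but not `30s`, so a 30-second bucket would require switching to `fixed_interval`. A minimal sketch of what the adjusted aggregation might look like (the `cpu_time_series` name comes from the quoted doc text; the surrounding structure is an assumption):

```json
{
  "aggs": {
    "cpu_time_series": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30s"
      }
    }
  }
}
```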
should this be the `elastic_agent` dashboard?
I would say this is all pretty reasonable. Do we plan to break the Suggested Improvements down into a meta-issue or something?

Yes, I plan to file separate issues for them and then link to all of them from this doc.
52fb942
to
2267b1a
Compare
Filed issues for each of the suggested improvements and linked to them from this doc.
Force merging, documentation change only.
What does this PR do?
This PR documents the end-to-end data flow of CPU metrics presented in the Fleet UI, including any transformations/calculations/aggregations that are done at the various stages along the way.
Why is it important?
To understand how the CPU utilization % metric presented in the Fleet UI for every Agent in the Agent Listing view is ultimately calculated, as this has resulted in confusion for a lot of users.
Related issues
Closes #4001