
[Internal Documentation] Data flow of CPU metrics presented in Fleet UI #4005

Merged: 16 commits into elastic:main on Jan 15, 2024

Conversation

@ycombinator (Contributor) commented Jan 4, 2024

What does this PR do?

This PR documents the end-to-end data flow of CPU metrics presented in the Fleet UI, including any transformations/calculations/aggregations that are done at the various stages along the way.

Why is it important?

To understand how the CPU utilization % metric presented in the Fleet UI for each Agent in the Agent Listing view is ultimately calculated; this has been a source of confusion for many users.

Related issues

Closes #4001

@ycombinator ycombinator added Team:Elastic-Agent Label for the Agent team docs labels Jan 4, 2024
mergify bot commented Jan 4, 2024

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • `backport-v./d./d./d` is the label to automatically backport to the `8./d` branch; `/d` is a digit

NOTE: backport-skip has been added to this pull request.

@nimarezainia (Contributor)

fyi @kpollich

@cmacknz (Member) commented Jan 8, 2024

Relates: elastic/kibana#174458

@ycombinator (Contributor, Author)

@cmacknz @fearful-symmetry @nchaulet I've completed documenting the data path taken by CPU utilization metrics all the way from /proc/{pid}/stat files (for Agent and Beats processes running on Linux) to the single value shown in the Fleet UI.

I've also included, at the end of this documentation, sections on my observations and suggested improvements. Please review the entire document but especially the observations and suggested improvements sections and provide any feedback. I'd like to file issues (or reuse existing ones) for, and link to them from, each of the improvements in this documentation.

The short/medium-term goal is that the next time there's an SDH about Agent CPU utilization metrics in the Fleet UI being confusing, we can link to this documentation. The medium/long-term goal would be to implement the improvements and then remove them from this documentation, while also updating the rest of the documentation as needed.

Thank you!

@ycombinator ycombinator marked this pull request as ready for review January 9, 2024 08:02
@ycombinator ycombinator requested a review from a team as a code owner January 9, 2024 08:03
@ycombinator ycombinator requested a review from leehinman January 9, 2024 08:03
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

the values for the Beat processes (along with the Agent process), we would be exposing internals of Agent to users, which
may be confusing for some users.

* We should enhance collection and aggregation to include CPU utilization for Beats, which are Agent components managed by
Contributor:

Can you elaborate on this? You mean provide the CPU metrics in the stats metrics, instead of requiring fleet to perform its own aggregation query?

@ycombinator (Contributor, Author) commented Jan 10, 2024

Sorry, I don't think I wrote this in the clearest way. I've changed the language now so please re-review.

What I want to convey is that we should also include and present CPU metrics about service-runtime components (e.g. Endpoint), not just command-runtime components (e.g. Beats). And I deliberately didn't want to get into exactly how we should do that.

Member:

I think we account for Endpoint CPU now as well, which will even further complicate users' attempts to correlate what is shown in Fleet with what is in `top`.

`top` will never account for Endpoint resource usage because it runs in a different process and cgroup.

@ycombinator (Contributor, Author) commented Jan 10, 2024

Hmmm, I'm not seeing Endpoint CPU being recorded today.

I just added the Elastic Defend integration to my policy and then, a few minutes later, queried the metrics-elastic_agent* indices with the following query:

POST metrics-elastic_agent*/_search?filter_path=hits.hits._source
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "_source": [
    "@timestamp", 
    "elastic_agent.process",
    "component.id"
  ],
  "collapse": {
    "field": "component.id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "elastic_agent.id": [
              "c406741d-7673-447f-b7e0-43417eeb10d6"
            ]
          }
        }
      ]
    }
  }
}

Results:

{
  "hits": {
    "hits": [
      {
        "_source": {
          "@timestamp": "2024-01-10T22:03:45.069Z",
          "component": {
            "id": "log-default"
          },
          "elastic_agent": {
            "process": "filebeat"
          }
        }
      },
      {
        "_source": {
          "@timestamp": "2024-01-10T22:03:45.069Z",
          "component": {
            "id": "elastic-agent"
          },
          "elastic_agent": {
            "process": "elastic-agent"
          }
        }
      }
    ]
  }
}

[EDIT] And I verified that the Endpoint component was, in fact, running.

$ sudo elastic-agent status --output full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: c406741d-7673-447f-b7e0-43417eeb10d6
   │  ├─ version: 8.11.3
   │  └─ commit: f4f6fbb3e6c81f37cec57a3c244f009b14abd74f
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1285688'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ endpoint-default
   │  ├─ status: (HEALTHY) Healthy: communicating with endpoint service
   │  ├─ endpoint-default
   │  │  ├─ status: (HEALTHY) Applied policy {a85e47d2-b214-4837-bbf2-8f424a9764a2}
   │  │  └─ type: OUTPUT
   │  └─ endpoint-default-a85e47d2-b214-4837-bbf2-8f424a9764a2
   │     ├─ status: (HEALTHY) Applied policy {a85e47d2-b214-4837-bbf2-8f424a9764a2}
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1285698'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ log-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1285682'
      ├─ log-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ log-default-logfile-logs-82c5d927-b806-4c08-96b7-4972d7767739
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT


## Suggested improvements

* We should document the observations above in an appropriate location and perhaps link to this documentation from the "i"
Contributor:

Ideally, the tooltip text should also be improved, as right now it's pretty vague.

@ycombinator (Contributor, Author)

Yeah, if we can briefly explain what the metric represents and a bit about how it's calculated, we should put all that in the tooltip. Otherwise, I think we should just replace the tooltip with a link to a documentation page.

Member:

IMO if we have to link to a documentation page we are probably presenting the wrong value. Most likely the CPU value should just describe itself as the 5-minute average CPU utilization, or we should switch to presenting an instantaneous value from the last check-in with a link to a graph of the values over time.

@ycombinator (Contributor, Author)

Agreed, I guess this suggestion should be more of a fallback if we can't come up with a good UX in the Fleet UI itself.

... just presenting an instantaneous value from the last checkin

I'm partial to the idea of showing a single value that's the sum of the latest CPU utilization %s for Agent + all its component processes, with a tooltip explaining that the value can range from 0 to the number of cores * 100 (e.g., up to 400% on a 4-core host). This value should be easy to correlate with what's seen in the output of top / htop.

... with a link to see a graph of the values over time.

++, the single value could link to a chart showing the CPU utilization % per process over time, in case the user wants to drill down further.
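
For illustration, here's a minimal sketch of how the latest sample per component could be fetched, reusing the structure of the query earlier in this thread. The `system.process.cpu.*` source filter is an assumption about where the per-process CPU fields live in these documents, so it may need adjusting; the per-component values would then be summed client-side into the single number.

# Sketch only: latest monitoring document per component for one Agent.
# Assumption: per-process CPU fields live under system.process.cpu.* in
# metrics-elastic_agent* documents; adjust _source to the actual field names.
POST metrics-elastic_agent*/_search?filter_path=hits.hits._source
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "_source": [
    "@timestamp",
    "component.id",
    "system.process.cpu.*"
  ],
  "collapse": {
    "field": "component.id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "elastic_agent.id": [
              "c406741d-7673-447f-b7e0-43417eeb10d6"
            ]
          }
        }
      ]
    }
  }
}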

* Also, CPU utilization is rarely constant. If the output of `top` or `htop` for Agent and Beat processes is observed over
time, the CPU utilization % shown varies for each process over time.

* To relate the output seen in `top` or `htop` for Agent and Beat processes with the single value shown in the Fleet UI,
Contributor:

I wonder if users also find this confusing? Do we need to average over 5 minutes? It smooths out spikes, but it might also confuse existing users who are used to peeking at metrics.

@ycombinator (Contributor, Author)

Added suggestion in e490363.
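
For illustration, a minimal sketch of what a shorter averaging window could look like, assuming the rollup is a date_histogram aggregation like the `cpu_time_series` one described in the doc (the real query Fleet runs has additional filters and metric sub-aggregations that are omitted here):

# Sketch only: roll CPU samples up into 1-minute buckets over the last minute
# instead of a single 5-minute average. A 30-second bucket would require
# fixed_interval, since calendar_interval only accepts single calendar units
# such as 1m.
POST metrics-elastic_agent*/_search?size=0
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1m"
      }
    }
  },
  "aggs": {
    "cpu_time_series": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "1m"
      }
    }
  }
}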

* We should document the observations above in an appropriate location and perhaps link to this documentation from the "i"
icon in CPU column in the Agent Listing page in the Fleet UI.

* Rather than showing a single value for every Agent in the Agent Listing page in the Fleet UI, we should consider showing
Member:

We also have to account for there being multiple instances of each individual Beat process when we display them. We have to break it down by unique input-output combination, as that is how we create the processes.

@ycombinator (Contributor, Author)

Added suggestion in 2ded3c4.

}
],
"collapse": {
"field": "elastic_agent.process"

Member:

Does this account for there being multiple instances of each Beat process? It could, depending on whether we use the component ID (filestream-default) or just the process name (e.g. filebeat) as the value.

For reference, the default Fleet policy with the system integration and monitoring runs 3 metricbeat instances (system/metrics-default, http/metrics-monitoring, beat/metrics-monitoring) and 2 filebeat instances (filestream-monitoring, log-default).

If this is averaging the 3 metricbeats and 2 filebeats together as if they were one process, instead of summing the usage of each distinct process, I don't think it is correct.

The memory usage query has this problem as described in elastic/kibana#174458

@ycombinator (Contributor, Author)

It doesn't. The value of this field is just filebeat. The query should be using the component.id field itself, which has values such as log-default. I will add this fix to the list of suggestions.
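
For illustration, a minimal sketch of the corrected clause (the surrounding Fleet query, with its filters and aggregations, stays the same and is omitted here):

# Sketch only: collapse on the component ID so each Beat instance
# (e.g. log-default, filestream-monitoring) is counted separately,
# rather than on the shared process name (e.g. filebeat).
POST metrics-elastic_agent*/_search
{
  "collapse": {
    "field": "component.id"
  }
}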

Member:

Looks like this is already tracked as a TODO in elastic/kibana#174458, now that I look closer at it.

@ycombinator (Contributor, Author)

If you're referring to elastic/kibana#174458 (comment), I added that based on this conversation here :).

Member:

One more question: did you observe us reporting CPU for the monitoring components?

elastic/kibana#174458 (comment)

There is some evidence we are not reporting them, causing us to undercount CPU and memory usage by omitting the monitoring component contributions.

@ycombinator (Contributor, Author)

I did not observe us reporting CPU for monitoring components themselves.
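
For reference, here's a sketch of how one could check, reusing the query from earlier in this thread and filtering on the monitoring component IDs that appear in the status output and default-policy list above:

# Sketch only: look for monitoring documents attributed to the monitoring
# components themselves. Component IDs are taken from the elastic-agent status
# output and the default-policy list mentioned above.
POST metrics-elastic_agent*/_search?filter_path=hits.hits._source
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "_source": [
    "@timestamp",
    "component.id",
    "elastic_agent.process"
  ],
  "collapse": {
    "field": "component.id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "elastic_agent.id": [
              "c406741d-7673-447f-b7e0-43417eeb10d6"
            ]
          }
        },
        {
          "terms": {
            "component.id": [
              "beat/metrics-monitoring",
              "http/metrics-monitoring",
              "filestream-monitoring"
            ]
          }
        }
      ]
    }
  }
}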

@ycombinator (Contributor, Author)

Added a suggestion for collecting CPU usage metrics for monitoring components.

@ycombinator (Contributor, Author)

@fearful-symmetry @cmacknz Thanks for all the great feedback! Based on it, I've reworked the suggestions section in the doc now. Please re-review and I will iterate further as needed. Once you're both happy with the suggestions, I'll file issues for them and link them from this doc before merging this PR.

@ycombinator ycombinator requested a review from cmacknz January 10, 2024 21:46
@cmacknz (Member) left a comment

I only have questions about why we are doing what we are doing at this point; the content here LGTM.

30-second or 1-minute average (making corresponding adjustments to the `calendar_interval` value in the `cpu_time_series`
aggregation). This would result in a value closer to what's observed in `top` / `htop` output.

* We should link the value shown in the Fleet UI to a chart that breaks it down for that Agent by `component.id` over time,
Member:

should this be the elastic_agent dashboard?

@ycombinator (Contributor, Author)

Yes, specifically the Agent Metrics one, e.g.:

[Screenshot: Agent metrics dashboard]

@fearful-symmetry (Contributor) left a comment

I would say this is all pretty reasonable. Do we plan to break the Suggested Improvements down into a meta-issue or something?

@ycombinator (Contributor, Author) commented Jan 12, 2024

I would say this is all pretty reasonable. Do we plan to break the Suggested Improvements down into a meta-issue or something?

Yes, I plan to file separate issues for them and then link to all of them from this doc.

@ycombinator (Contributor, Author)

Filed issues for each of the suggested improvements and linked to them from this doc.

@ycombinator ycombinator enabled auto-merge (squash) January 12, 2024 23:01
Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@cmacknz (Member) commented Jan 15, 2024

Force merging, documentation change only.

@cmacknz cmacknz disabled auto-merge January 15, 2024 14:08
@cmacknz cmacknz merged commit 17f0480 into elastic:main Jan 15, 2024
8 of 9 checks passed