
[Fleet] Fix agent memory query #174458

Open
4 tasks
jillguyonnet opened this issue Jan 8, 2024 · 6 comments
Labels
bug (Fixes for quality problems that affect the customer experience), Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@jillguyonnet
Contributor

jillguyonnet commented Jan 8, 2024

Context

Following the investigation carried out for https://github.com/elastic/sdh-beats/issues/4209, the agent memory reported in Fleet's agent table and agent details appears to be roughly 3 to 4 times lower than its actual value. One comparison point is the memory reported by running systemctl status elastic-agent.

The first round of analysis (see details below) suggests that the current query used to calculate the total memory for the agent incorrectly aggregates separate Beat instances together.

Furthermore, the agent memory displayed in the [Elastic Agent] Agent metrics dashboard appears to be similarly under-reported (this is the original issue raised in https://github.com/elastic/sdh-beats/issues/4209). Since that query should be very similar, it should be fixed as well.

It is likely that the agent CPU, which is calculated from the same query, should also be corrected. Note that this metric has also been reported to show unrealistic values (https://github.com/elastic/sdh-beats/issues/3834), and there is an ongoing effort to document how it works (elastic/elastic-agent#4005). It would make sense to do the same for agent memory (either as part of this issue or in a follow-up documentation issue).

Details

Steps to reproduce

  1. Run an Elastic stack with a Fleet server and enroll an agent (easiest might be to use Multipass):
    multipass launch --name agent1 --disk 10G
    multipass shell agent1
    # enroll agent with commands listed in Kibana (replace x86_64 with arm64 if needed)
    
  2. Once the agent is started, measure its memory with systemctl (from Multipass shell):
    systemctl status elastic-agent
    
  3. Compare the value with the one reported in Fleet's agent table and the agent's details page: the value reported by systemctl should be between 3 and 4 times higher than the one shown in Fleet.

Analysis

The issue seems to arise from the query used to calculate the agent's memory and CPU. This query computes, for each agent, two values called memory_size_byte_avg and cpu_avg.

In plain words, this query aggregates over the processes of the Elastic Agent (elastic-agent, filebeat and metricbeat), takes the average of system.process.memory.size for each process, and then sums these averages together.

The problem is that elastic_agent.process is not unique per Beat. For example, with the setup described in the steps above, running sudo elastic-agent status --output=full shows that the system integration and agent monitoring run 3 metricbeat instances (system/metrics-default, http/metrics-monitoring, beat/metrics-monitoring) and 2 filebeat instances (filestream-monitoring, log-default). Because these instances share the same elastic_agent.process value (metricbeat or filebeat), their documents land in a single terms bucket, so the average of system.process.memory.size reflects roughly the memory of one instance rather than the sum across all of them:

Output of elastic-agent status --output=full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 8d0b2d8a-b3b2-4fa1-8ca5-db5179bd856c
   │  ├─ version: 8.11.3
   │  └─ commit: f4f6fbb3e6c81f37cec57a3c244f009b14abd74f
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1739'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1731'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1744'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1719'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-b2274470-459c-4c26-ade3-7ddce7f1c614
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1724'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-b2274470-459c-4c26-ade3-7ddce7f1c614
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

See also this comment for added context and details.

It is possible (and helpful) to play with the query in the Kibana Console in order to tweak the aggregation. Here is a simplified version (memory only):

Agent memory query
GET metrics-elastic_agent.*/_search
{
  "size": 0, 
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "_tier": "data_hot"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        },
        {
          "term": {
            "elastic_agent.id": "<agentId>"
          }
        },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    {
                      "term": {
                        "data_stream.dataset": "elastic_agent.elastic_agent"
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "agents": {
      "terms": {
        "field": "elastic_agent.id"
      },
      "aggs": {
        "sum_memory_size": {
          "sum_bucket": {
            "buckets_path": "processes>avg_memory_size"
          }
        },
        "processes": {
          "terms": {
            "field": "elastic_agent.process"
          },
          "aggs": {
            "avg_memory_size": {
              "avg": {
                "field": "system.process.memory.size"
              }
            }
          }
        }
      }
    }
  }
}
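
Replacing <agentId> with a real agent id and running this in the Kibana Console makes it easy to inspect the per-process buckets and to try out different fields in the processes terms aggregation.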

Acceptance criteria

Tasks

jillguyonnet added the bug (Fixes for quality problems that affect the customer experience) and Team:Fleet (Team label for Observability Data Collection Fleet team) labels on Jan 8, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@ycombinator
Contributor

As mentioned in elastic/elastic-agent#4005 (comment), the processes aggregation should use the field component.id instead of elastic_agent.process.
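
For illustration, here is a minimal sketch of that change, based on the simplified query from the issue description (the query filters are trimmed to the elastic_agent.id term for brevity, and <agentId> remains a placeholder):

// Same aggregation as the simplified query above, but bucketing on component.id
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "term": {
      "elastic_agent.id": "<agentId>"
    }
  },
  "aggs": {
    "agents": {
      "terms": {
        "field": "elastic_agent.id"
      },
      "aggs": {
        "sum_memory_size": {
          "sum_bucket": {
            "buckets_path": "processes>avg_memory_size"
          }
        },
        "processes": {
          "terms": {
            "field": "component.id"
          },
          "aggs": {
            "avg_memory_size": {
              "avg": {
                "field": "system.process.memory.size"
              }
            }
          }
        }
      }
    }
  }
}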

@jillguyonnet
Contributor Author

As mentioned in elastic/elastic-agent#4005 (comment), the processes aggregation should use the field component.id instead of elastic_agent.process.

👍 FYI I reported a quick comparison of the terms aggregation on component.id vs. elastic_agent.process in https://github.com/elastic/sdh-beats/issues/4209#issuecomment-1880727961 - both aggregations returned the same buckets in this case.
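
A minimal Console sketch of that kind of comparison (assuming the same metrics-elastic_agent.* data as the query in the description; the exact request in the linked comment may differ):

// Compare the buckets produced by the two candidate fields side by side
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "elastic_agent.id": "<agentId>"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_process": {
      "terms": {
        "field": "elastic_agent.process"
      }
    },
    "by_component": {
      "terms": {
        "field": "component.id"
      }
    }
  }
}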

@cmacknz
Member

cmacknz commented Jan 11, 2024

I see the expected 5 component ids: log-default, system/metrics-default, filestream-monitoring, beat/metrics-monitoring, http/metrics-monitoring.

However, I can only see 3 component ids when querying for the agent memory, and aggregating over these yields the same results as aggregating over processes.

We are missing the monitoring components. http/metrics-monitoring would have to be reporting the metrics collected from itself, since it is doing the reporting for the others. It is possible we are not collecting stats for filestream-monitoring and beat/metrics-monitoring, which would be incorrect because they aren't free from a resource usage perspective. I will see if I can confirm that the agent is omitting these.
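
One way to check this from the Console is to list which component ids actually carry memory documents (a sketch reusing the indices, fields and time window from the query in the description, not necessarily the exact request used here):

// List the component.id values that have documents with a memory reading in the last 5 minutes
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "elastic_agent.id": "<agentId>"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        },
        {
          "exists": {
            "field": "system.process.memory.size"
          }
        }
      ]
    }
  },
  "aggs": {
    "components_reporting_memory": {
      "terms": {
        "field": "component.id"
      }
    }
  }
}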

@jen-huang
Contributor

@cmacknz Were you able to confirm that these are omitted?
