
[Fleet] Fix agent memory query #174458

Open
4 tasks
jillguyonnet opened this issue Jan 8, 2024 · 6 comments
Labels
bug (Fixes for quality problems that affect the customer experience), Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@jillguyonnet
Contributor

jillguyonnet commented Jan 8, 2024

Context

Following the investigation carried out for https://github.com/elastic/sdh-beats/issues/4209, the agent memory reported in Fleet's agent table and agent details appears to be roughly 3 to 4 times lower than its actual value. One comparison point is the memory reported by running systemctl status elastic-agent.

The first round of analysis (see details below) suggests that the current query used to calculate the total memory for the agent incorrectly aggregates separate Beat instances together.

Furthermore, the agent memory displayed in the [Elastic Agent] Agent metrics dashboard appears to be similarly under-reported (this is the original issue raised in https://github.com/elastic/sdh-beats/issues/4209). Since that query should be very similar, it should be fixed as well.

It is likely that the agent CPU, which is calculated from the same query, should also be corrected. Note that this metric has also been reported to show unrealistic values (https://github.com/elastic/sdh-beats/issues/3834), and there is an ongoing effort to document how it works (elastic/elastic-agent#4005). It would make sense to do the same for agent memory (either as part of this issue or in a follow-up documentation issue).

Details

Steps to reproduce

  1. Run an Elastic stack with a Fleet server and enroll an agent (easiest might be to use Multipass):
    multipass launch --name agent1 --disk 10G
    multipass shell agent1
    # enroll agent with commands listed in Kibana (replace x86_64 with arm64 if needed)
    
  2. Once the agent is started, measure its memory with systemctl (from Multipass shell):
    systemctl status elastic-agent
    
  3. Compare the value with the one reported in Fleet's agent table and the agent's details page: the value reported by systemctl should be between 3 and 4 times higher than the one shown in Fleet.

Analysis

The issue seems to arise from the query used to calculate the agent's memory and CPU. This query computes, for each agent, two values called memory_size_byte_avg and cpu_avg.

In plain words, this query aggregates over the processes of the Elastic Agent (elastic-agent, filebeat and metricbeat), takes the average of system.process.memory.size for each process, and then sums these averages together.

The problem is that elastic_agent.process is not unique per Beat. For example, with the setup described in the steps above, running sudo elastic-agent status --output=full shows that the system integration and agent monitoring run 3 metricbeat instances (system/metrics-default, http/metrics-monitoring, beat/metrics-monitoring) and 2 filebeat instances (filestream-monitoring, log-default). Because these instances share the same elastic_agent.process value (metricbeat or filebeat), their documents land in a single terms bucket, so the average of system.process.memory.size reflects roughly the memory of one instance rather than the sum across all of them:

Output of elastic-agent status --output=full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 8d0b2d8a-b3b2-4fa1-8ca5-db5179bd856c
   │  ├─ version: 8.11.3
   │  └─ commit: f4f6fbb3e6c81f37cec57a3c244f009b14abd74f
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1739'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1731'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1744'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1719'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-b2274470-459c-4c26-ade3-7ddce7f1c614
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1724'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-b2274470-459c-4c26-ade3-7ddce7f1c614
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

See also this comment for added context and details.

It is possible (and helpful) to play with the query in the Kibana Console in order to tweak the aggregation. Here is a simplified version (memory only):

Agent memory query
GET metrics-elastic_agent.*/_search
{
  "size": 0, 
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "_tier": "data_hot"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        },
        {
          "term": {
            "elastic_agent.id": "<agentId>"
          }
        },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    {
                      "term": {
                        "data_stream.dataset": "elastic_agent.elastic_agent"
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "agents": {
      "terms": {
        "field": "elastic_agent.id"
      },
      "aggs": {
        "sum_memory_size": {
          "sum_bucket": {
            "buckets_path": "processes>avg_memory_size"
          }
        },
        "processes": {
          "terms": {
            "field": "elastic_agent.process"
          },
          "aggs": {
            "avg_memory_size": {
              "avg": {
                "field": "system.process.memory.size"
              }
            }
          }
        }
      }
    }
  }
}
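
Replacing <agentId> with a real agent id and running this in the Kibana Console makes it easy to inspect the per-process buckets and to try out different fields in the processes terms aggregation.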

Acceptance criteria

Tasks

jillguyonnet added the bug (Fixes for quality problems that affect the customer experience) and Team:Fleet (Team label for Observability Data Collection Fleet team) labels on Jan 8, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@ycombinator
Contributor

As mentioned in elastic/elastic-agent#4005 (comment), the processes aggregation should use the field component.id instead of elastic_agent.process.
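
For illustration, here is a minimal sketch of that change, based on the simplified query from the issue description (the query filters are trimmed to the elastic_agent.id term for brevity, and <agentId> remains a placeholder):

// Same aggregation as the simplified query above, but bucketing on component.id
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "term": {
      "elastic_agent.id": "<agentId>"
    }
  },
  "aggs": {
    "agents": {
      "terms": {
        "field": "elastic_agent.id"
      },
      "aggs": {
        "sum_memory_size": {
          "sum_bucket": {
            "buckets_path": "processes>avg_memory_size"
          }
        },
        "processes": {
          "terms": {
            "field": "component.id"
          },
          "aggs": {
            "avg_memory_size": {
              "avg": {
                "field": "system.process.memory.size"
              }
            }
          }
        }
      }
    }
  }
}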

@jillguyonnet
Contributor Author

As mentioned in elastic/elastic-agent#4005 (comment), the processes aggregation should use the field component.id instead of elastic_agent.process.

👍 FYI I reported a quick comparison of the terms aggregation on component.id vs. elastic_agent.process in https://github.com/elastic/sdh-beats/issues/4209#issuecomment-1880727961 - both aggregations returned the same buckets in this case.
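
A minimal Console sketch of that kind of comparison (assuming the same metrics-elastic_agent.* data as the query in the description; the exact request in the linked comment may differ):

// Compare the buckets produced by the two candidate fields side by side
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "elastic_agent.id": "<agentId>"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_process": {
      "terms": {
        "field": "elastic_agent.process"
      }
    },
    "by_component": {
      "terms": {
        "field": "component.id"
      }
    }
  }
}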

@cmacknz
Member

cmacknz commented Jan 11, 2024

I see the expected 5 component ids: log-default, system/metrics-default, filestream-monitoring, beat/metrics-monitoring, http/metrics-monitoring.

However, I can only see 3 component ids when querying for the agent memory, and aggregating over these yields the same results as aggregating over processes.

We are missing the monitoring components. http/metrics-monitoring would have to be reporting the metrics collected from itself, since it is doing the reporting for the others. It is possible we are not collecting stats for filestream-monitoring and beat/metrics-monitoring, which would be incorrect because they aren't free from a resource usage perspective. I will see if I can confirm that the agent is omitting these.
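
One way to check this from the Console is to list which component ids actually carry memory documents (a sketch reusing the indices, fields and time window from the query in the description, not necessarily the exact request used here):

// List the component.id values that have documents with a memory reading in the last 5 minutes
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "elastic_agent.id": "<agentId>"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        },
        {
          "exists": {
            "field": "system.process.memory.size"
          }
        }
      ]
    }
  },
  "aggs": {
    "components_reporting_memory": {
      "terms": {
        "field": "component.id"
      }
    }
  }
}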

@jen-huang
Contributor

@cmacknz Were you able to confirm that these are omitted?
