Kubernetes metadata overwhelms memory limits in the Agent process #4729

Closed · faec opened this issue May 9, 2024 · 35 comments
Labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

faec (Contributor) commented May 9, 2024

Diagnostics from production Agents running on Kubernetes show:

  • The elastic-agent process itself uses more memory than all its configured inputs combined.
  • Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers: roughly 70% from elastic-agent-autodiscover and another 20% from helpers internal to elastic-agent.

We need to understand why the Kubernetes helpers are using so much memory, and find a way to mitigate it.

Definition of done

  • Provide steps for a reproducible setup that can demonstrate the aforementioned memory usage with an Agent diagnostic
  • Attach Agent diagnostic to this issue to use as a baseline, so we can compare against it when improvements are made
  • Reduce memory use by Kubernetes helpers from 90% to a target TBD once more investigation has been done
faec added the bug and Team:Elastic-Agent-Control-Plane labels on May 9, 2024
elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz (Member) commented May 9, 2024

Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests, see #4730.

faec (Contributor, Author) commented May 16, 2024

> Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests

FWIW the diagnostics described by this issue were from 8.13.3.

jlind23 added the Team:Elastic-Agent-Data-Plane label and removed the Team:Elastic-Agent-Control-Plane label on May 21, 2024
elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

jlind23 (Contributor) commented May 21, 2024

After chatting with @cmacknz and @pierrehilbert, assigning this to you @faec and making it a high priority for the next sprint.

bturquet commented

cc @gizas

faec (Contributor, Author) commented May 22, 2024

Agent's variable provider API is very opaque, which is probably a big part of this. Agent's Coordinator doesn't provide any constraints on what variables might be requested, so the Kubernetes helpers make (and cache) very large, verbose state queries. #2887 is related -- a possible Agent-side solution is to implement better policy parsing to validate the full configuration and give variable providers like Kubernetes a list of the variables that are actually used.

@bturquet / @gizas, if we add hooks to the variable provider API for the Coordinator to give a list of possible variables, what work would be needed to restrict Kubernetes queries to those variables?
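
For illustration only, a minimal sketch of what such a hook could look like; the package, type, and method names here are hypothetical and do not exist in elastic-agent today:

    // Hypothetical addition to the variable provider API: the Coordinator
    // passes in the variable paths the parsed policy actually references,
    // so a provider like Kubernetes can watch and cache only those fields.
    package composable

    // KeyAwareProvider is a hypothetical optional interface a provider could
    // implement to receive the referenced variable paths, e.g.
    // "kubernetes.pod.name" or "kubernetes.labels.app".
    type KeyAwareProvider interface {
        // SetObservedKeys is called whenever the parsed policy changes.
        SetObservedKeys(keys []string)
    }

A Kubernetes provider that only ever sees pod-level keys could then, for example, skip starting node- or service-level watchers entirely.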

gizas (Contributor) commented May 23, 2024

@faec I'm trying to understand how we can combine those pieces. Let's say the parsing changes and there is a list of variables that the provider needs to populate. In the kubernetes provider here, we start the watchers with general arguments.

The other metadata enrichment we do with enrichers is unrelated to the flow you describe here.

Maybe we can sync offline so I can understand more about this?

cc @MichaelKatsoulis

alexsapran (Contributor) commented

Hi all,

I was looking at this, and I wanted to know if we are applying any filtering on the data we receive from the k8s metadata.
Does all of it need to be cached in the local k8s cache? I'd like to know if we can apply a transformation to nullify some of the fields and keep only the ones we care about; that way, the RSS memory of the Elastic Agent would hold only the data we care about and would not be influenced by the size of the k8s cluster.

neiljbrookes commented

Hello all @faec @ycombinator
Is there any update on this issue? I am planning an upgrade to 8.14.1 this week; do we anticipate any improvements?

pierrehilbert (Contributor) commented

Fae is currently on PTO and unfortunately hasn't had time to investigate this yet.
It is planned for the current sprint (which started today).

rgarcia89 commented

We are facing this issue too: we see the elastic agents hitting their current memory limit of 1200Mi. I would greatly appreciate it if this topic could be given higher priority, as it is quite frustrating to see the agents using that much memory.

pierrehilbert (Contributor) commented

Hello @rgarcia89,
This topic has a high priority but, as you can imagine, it is not the only one.
@faec will start looking at this soon, so I hope we will be able to share good news shortly.

nimdanitro commented

FWIW, I think we could apply some meaningful transformers in the informers. We did something very similar in our mki-cost-exporter project: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/costmeter/meter.go#L124C14-L124C26
Here is an example of the cache.TransformFunc we set on our informers: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/informers/transformer/transform.go#L34

Obviously, we could ignore a large portion of the information for our specific use case.
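
For reference, a minimal sketch of the client-go transform approach suggested here; which fields to strip is an assumption for illustration, not what elastic-agent actually does:

    package transform

    import (
        "k8s.io/apimachinery/pkg/api/meta"
        "k8s.io/client-go/tools/cache"
    )

    // stripUnusedFields is a cache.TransformFunc applied before objects are
    // stored in the informer cache, so the cached copy keeps only what we read.
    func stripUnusedFields(obj interface{}) (interface{}, error) {
        accessor, err := meta.Accessor(obj)
        if err != nil {
            // Not a regular Kubernetes object (e.g. a tombstone); pass it through.
            return obj, nil
        }
        // Managed fields are often the largest chunk of metadata and are not
        // needed for autodiscovery or metadata enrichment.
        accessor.SetManagedFields(nil)
        return obj, nil
    }

    // Attaching it to an informer (sketch):
    //   informer := factory.Core().V1().Pods().Informer()
    //   _ = informer.SetTransform(stripUnusedFields)
    var _ cache.TransformFunc = stripUnusedFields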

yuvielastic commented

Hey team, any update on this issue? It has been acknowledged as a high priority, but the lack of updates for months is worrisome.

Can we please prioritize this? We need to get the agent footprint down as much as possible, since provisioning 4 GB of memory reduces the overall RAM available for customer workloads.

amitkanfer (Contributor) commented

It's still prioritized. Unfortunately there were other more urgent matters that we're still wrapping up.

zez3 commented Aug 22, 2024

@faec any updates from your part?

blakerouse (Contributor) commented

@faec There is one issue I filed a while ago that I think would help reduce memory usage when a specific provider is not being used at all: #3609. With that change, unless the policy references a provider, there is no reason for it to even be running.

Using the same logic, it could build on your idea of recording exactly which variables are referenced from the policy. The variable storage used by the composable module could then use that information to store only what is needed, without even changing the providers (it could just drop the fields that aren't needed).

The tricky case is when a policy starts referencing a new variable whose data has already been dropped, even though the provider originally supplied it. This is why I believe the providers will need to be given the list of variables referenced in the policy: that lets them do only the minimal work required, and notice when a new variable is added so they can push an update to the variable storage and make that information available.
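
As a rough sketch of the "record which variables the policy references" step; the regex and function are illustrative, not elastic-agent's actual policy parser:

    package main

    import (
        "fmt"
        "regexp"
    )

    // varRef matches the start of ${provider.some.path} references; forms with
    // defaults, e.g. ${kubernetes.pod.name|'unknown'}, still capture the path.
    var varRef = regexp.MustCompile(`\$\{([A-Za-z0-9_.]+)`)

    // referencedVars returns the set of variable paths used in a policy, which
    // the Coordinator could hand to providers so they collect only those fields.
    func referencedVars(policy string) map[string]struct{} {
        refs := make(map[string]struct{})
        for _, m := range varRef.FindAllStringSubmatch(policy, -1) {
            refs[m[1]] = struct{}{}
        }
        return refs
    }

    func main() {
        policy := `paths: ["/var/log/containers/*${kubernetes.pod.uid}*.log"]`
        fmt.Println(referencedVars(policy)) // map[kubernetes.pod.uid:{}]
    }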

EvelienSchellekens commented

I’m running into some memory issues with Elastic Agent 8.15. It’s running on Kubernetes, and we limit the memory to 700Mi in the manifest file in Kibana. However, when enabling the system metrics + Kubernetes integration, the process keeps crashing and I get almost no data in. When I raise the limit to 800Mi, it runs stable. This seems related to this GH issue.

Here are my test results:

Elastic Agent 8.15.0 (only system metrics integration), limit 700Mi:

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-hkfsw   21m          442Mi 

Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 700Mi:
-> keeps crashing, no data

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-hkfsw   236m         699Mi

Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 800Mi:
-> runs stable

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-dbzzm   52m          703Mi 

This setup is being used for (marketing) workshops and it's not a great look to ask attendees to increase the memory limit when the Elastic Agent only uses 2 integrations.

gizas (Contributor) commented Sep 12, 2024

We ran some scaling tests in the past that propose resource configurations (based on 8.7) to use as a reference point for comparison.

At the moment the @elastic/obs-ds-hosted-services focus is the OTel-native Kubernetes collection of logs/metrics, and we have no plans to run scaling tests for elastic agent + integrations (cc @mlunadia) in the current iteration.

We can wait and see the OTel elastic agent memory consumption with the latest config, and also check the current resourcing of elastic agent with the system + k8s integrations.

ycombinator assigned swiatekm and unassigned faec on Sep 12, 2024
ycombinator added the Team:Elastic-Agent-Control-Plane label and removed the Team:Elastic-Agent-Data-Plane label on Sep 12, 2024
LucaWintergerst commented Sep 16, 2024

This issue occurs even with very small workloads, so it's not really about scale testing.

It is reproducible on a single-node k8s cluster with 26 total pods running.

swiatekm (Contributor) commented Sep 17, 2024

Posting the results of my initial investigation. For now, I'm inclined to agree with Michael's conclusion in https://github.com/elastic/sdh-beats/issues/5148#issuecomment-2352771442 that there isn't a regression here. Still, the increase in memory usage from adding more Pods to the Node seems excessive, and it's not clear where it's coming from.

Test setup

  • Single node KiND cluster, default settings.
  • Fleet-managed Agent installed as per the official instructions.
  • System and Kubernetes integrations with default settings (at least initially).
  • 98 Nginx Pods running in the cluster, producing no logs.

Findings

  • The memory increase does seem primarily related to the kubernetes variable provider. It can be reproduced even with all the data collection disabled in the Kubernetes integration.
  • Memory usage does appear to scale with the number of Pods running on the Node, even if those Pods aren't actually logging anything.
  • Since the amount of metadata from a single Node shouldn't be enough to cause this effect, I thought that maybe we were getting unnecessary var updates from the provider. But tweaking the debounce delay value didn't provide a measurable improvement.

MichaelKatsoulis (Contributor) commented Sep 17, 2024

I would also like to post some results here based on Luca's comment about the OOM in small workloads. I ran some tests on multiple versions of Elastic Agent and want to share the results.

I used a single-node cluster in GKE with 38 pods running. Here is Elastic Agent's memory consumption per version:

Version 8.15.1

| Integration | Memory Consumption |
| --- | --- |
| no integration | 280-330 Mb |
| system | 450-500 Mb |
| Kubernetes | 550-600 Mb |
| Kubernetes & system | 740-790 Mb (restarts) |

Version 8.14.0

| Integration | Memory Consumption |
| --- | --- |
| no integration | 260-290 Mb |
| system | 410-430 Mb |
| Kubernetes | 550-570 Mb |
| Kubernetes & system | 700-730 Mb |

Version 8.13.0

| Integration | Memory Consumption |
| --- | --- |
| no integration | 200-210 Mb |
| system | 320-330 Mb |
| Kubernetes | 500-510 Mb |
| Kubernetes & system | 630-650 Mb |

Version 8.12.0

| Integration | Memory Consumption |
| --- | --- |
| no integration | 180-185 Mb |
| system | 300-330 Mb |
| Kubernetes | 480-520 Mb |
| Kubernetes & system | 630-680 Mb |

Version 8.11.0

| Integration | Memory Consumption |
| --- | --- |
| no integration | 169-190 Mb |
| system | 300-310 Mb |
| Kubernetes | 520-550 Mb |
| Kubernetes & system | 660-720 Mb (restart) |

The easy thing to notice here is that the memory increase the Kubernetes integration causes in Elastic Agent is almost constant across these versions, at around 300-350 Mb. It actually got better after some improved handling of metadata enrichment from 8.14.0 onwards.
Elastic Agent's memory consumption with no integration at all increased over the version bumps, and with the Kubernetes and System integrations installed (System comes by default) it reached the configured limit of 700 Mb.
I don't know whether the ~300 Mb the Kubernetes integration adds is a lot or not, but considering that the System integration, which does far less (no constant API calls to k8s), adds around 150 Mb, I could argue it is reasonable.

Another thing to note is that even without the Kubernetes integration installed, the Kubernetes provider and the add_kubernetes_metadata processor are still enabled by default. I took a look at the heap.pprof of such an agent, and Kubernetes-related functions seem to account for around 10% of memory use.

I would like to understand @faec's comment more:

> Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers

How was this measured? With or without Kubernetes Integration? Which version?

swiatekm (Contributor) commented

@MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods.

MichaelKatsoulis (Contributor) commented

> @MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods.

Yes it is enabled. I kept all the defaults. If disabled, memory consumption with just the binary running is around what you mentioned.

cmacknz (Member) commented Sep 17, 2024

> Elastic Agent's memory consumption with no integration at all increased over the version bumps

The jump in 8.14.0 is because of agentbeat, see #4730

henrikno commented Sep 20, 2024

The elastic-agent pod is using 4GB of RAM. Pods on that host: https://gist.github.com/henrikno/27c4165cd7eec7b3a24c424d8a8dad23, ps aux: https://gist.github.com/henrikno/92634f31dd8a3795ff1ec81b34dc1bf8. elastic-agent is using 2.2GB, and the largest metricbeat (kubernetes-metrics) 1.6GB.

It sounds a bit similar to topfreegames/maestro#473, where updates from k8s come in faster than they can be processed, so they get buffered somewhere in memory.

swiatekm (Contributor) commented Sep 20, 2024

Looking at the profile supplied by @henrikno, this anomalous memory consumption is caused by storing ReplicaSet data. @neiljbrookes confirmed on Slack that the K8s clusters in question have a lot of Deployments, and consequently ReplicaSets. For example, one particularly troublesome cluster has ~7000 Deployments and ~75000 ReplicaSets. The heap profile shows ~700 MB of steady-state memory usage, which comes out to around 10KB per ReplicaSet, which is a reasonable value.

[heap profile screenshot]

The Agent OOMs were mitigated by setting GOGC to 25, which suggests that churn from excessive updates from the API Server is part of the problem as well.
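
For context, GOGC=25 tells the Go runtime to trigger garbage collection once the heap has grown 25% beyond the live set (the default is 100), trading extra CPU for a lower peak heap; setting it via the environment is equivalent to this in code:

    package main

    import "runtime/debug"

    func init() {
        // Same effect as exporting GOGC=25 in the container environment.
        debug.SetGCPercent(25)
    }

    func main() {}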

I'm planning to shortly submit a fix that will make us store only the necessary data. Stopping the churn is going to be a bit more challenging, but I think we should be able to solve it by subscribing only to metadata changes for these ReplicaSets. That will be harder to integrate into our autodiscovery framework, but it is also less urgent.

Worth noting that I don't believe this is the problem causing unexpected agent memory consumption on Nodes with a lot of Pods, even in small clusters.
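
For illustration, a minimal client-go sketch of the metadata-only watch described above; the function name and resync interval are arbitrary, not the actual elastic-agent implementation:

    package kube

    import (
        "time"

        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/metadata"
        "k8s.io/client-go/metadata/metadatainformer"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    // newReplicaSetMetadataInformer returns an informer that receives only
    // PartialObjectMetadata for ReplicaSets (names, labels, annotations, owner
    // references) rather than full objects, so far less data is deserialized,
    // processed, and kept in memory.
    func newReplicaSetMetadataInformer(cfg *rest.Config) (cache.SharedIndexInformer, error) {
        client, err := metadata.NewForConfig(cfg)
        if err != nil {
            return nil, err
        }
        factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
        gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
        return factory.ForResource(gvr).Informer(), nil
    }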

MichaelKatsoulis (Contributor) commented

@swiatekm is the replicasetWatcher enabled explicitly in the kubernetes provider you are using?
By default it is disabled by this setting, as part of the add_resource_metadata configuration.

The only way the replicasetWatcher is enabled by default is if you are using the state_replicaset integration for metrics.

swiatekm (Contributor) commented

@MichaelKatsoulis The SRE team has deployment metadata enabled in the kubernetes provider:

    providers:
      kubernetes:
        node: ${NODE_NAME}
        scope: node
        hints.enabled: false
        kubernetes_secrets:
          enabled: true
        add_resource_metadata:
          deployment: true

This enables the ReplicaSet watcher.

Like I said earlier, I don't think this is the root cause of the increased memory utilization on Nodes with large numbers of Pods.

swiatekm (Contributor) commented

I moved the ReplicaSet problem to #5623, as it's confirmed and relatively straightforward to address. Will keep troubleshooting the excess memory usage on Nodes with lots of Pods in this issue.

lepouletsuisse commented

I have the same issue in my small K8s cluster with 2 of my agents (I have ~10 agents in total). One agent runs alone (as a Deployment) for the integrations that don't need to run on all nodes, and I also have a DaemonSet of agents for other purposes (metrics, logs, etc.). This is Elastic Agent version 8.15.2.
Note that the single pod from the Deployment has the memory issue, but so does only 1 of the pods created by the DaemonSet (not all of them; this is probably related to the node workload).
I bumped the memory request to 1Gb and the memory limit to 4Gb to debug, and found it interesting that memory increased a lot at the beginning but came back down to a normal ~500Mb after ~10 minutes.
[memory usage graph]

I restarted the pod to check whether I would observe the same memory behaviour, and it behaved the same way.
[memory usage graph]

I hope this helps to debug the issue!

swiatekm added a commit to elastic/elastic-agent-autodiscover that referenced this issue Oct 3, 2024
…cts (#109)

We only use metadata from Jobs and ReplicaSets, but require that full
resources are supplied. This change relaxes this requirement, allowing
PartialObjectMetadata resources to be used. This allows callers to use
metadata informers and avoid having to receive and deserialize
non-metadata updates from the API Server.

See elastic/elastic-agent#5580 for an example of
how this could be used. I'm planning to add the metadata informer from
that PR to this library as well. Together, these will allow us to
greatly reduce memory used for processing and storing ReplicaSets and
Jobs in beats and elastic-agent.

This will help elastic/elastic-agent#5580 and
elastic/elastic-agent#4729 specifically, and
elastic/elastic-agent#3801 in general.
swiatekm (Contributor) commented Oct 22, 2024

Several different problems impacting agent memory consumption have been discussed in this issue and some of the linked issues. I'd like to summarize the current state and work towards closing this in favor of more specific sub-issues.

  1. Agent and beats store too much ReplicaSet data, leading to high memory consumption in large clusters. Addressed by Agent and beats store too much ReplicaSet data in K8s #5623.
  2. General memory consumption increase between 8.14 and 8.15. I'm confident this is Queue keeps stale event data in memory in 8.15 beats#41355.
  3. Agent itself using too much memory when there are a lot of Pods running on the Node. Will be split into its own issue. EDIT: Elastic agent uses too much memory per Pod in k8s #5835

If there's anything I missed, please let me know. Once I open an issue for 3, I'd like to close this one.

cmacknz (Member) commented Oct 22, 2024

Sounds good to me, thanks for getting to the bottom of this.

swiatekm (Contributor) commented

I've moved the per-Pod memory issue to #5835. I'm going to close this one to keep the discussion focused. Feel free to reopen if you believe you're facing an issue different from the ones listed in #4729 (comment). If you want to verify whether the fixes address your specific problem, you can use the current snapshot build for any branch.
