
Baseline agent memory usage has increased in ECK integration tests due to agentbeat #4730

Open
cmacknz opened this issue May 9, 2024 · 7 comments
Labels
Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@cmacknz
Member

cmacknz commented May 9, 2024

See elastic/cloud-on-k8s#7790 and the following comments. The agentbeat instance implementing the filestream-monitoring component was being OOMKilled.

A jump of at least 85Mi in memory usage occurs in 8.14.0 and later but not in 8.13.0, causing the ECK fleet tests to fail. ECK uses a 350Mi memory limit, which is lower than the default 700Mi provided in the agent reference configuration for Kubernetes.

8.13.0

kubectl top pod test-agent-system-int-sf6b-agent-6n7vm -n e2e-mercury
NAME                                     CPU(cores)   MEMORY(bytes)
test-agent-system-int-sf6b-agent-6n7vm   83m          265Mi

8.14.0+

kubectl top pod test-agent-system-int-vlpc-agent-28vhc -n e2e-mercury
NAME                                     CPU(cores)   MEMORY(bytes)
test-agent-system-int-vlpc-agent-28vhc   171m         349Mi

The heap profiles from agent diagnostics when the process was being OOMKilled were not revealing, but they may not have been captured at the ideal time.

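For reference, here is a minimal sketch of dumping an in-use heap profile on demand; this is only an illustration, not how the agent's diagnostics actually capture profiles, but something like it could be triggered closer to the point where the container approaches its limit.

// Minimal sketch: write an in-use heap profile on demand. Illustration only,
// not the agent's actual diagnostics code.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // flush recently freed objects so the in_use numbers are current
	return pprof.Lookup("heap").WriteTo(f, 0)
}

func main() {
	if err := dumpHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
}

The resulting file can be inspected with go tool pprof -inuse_space heap.pprof.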
@cmacknz added the Team:Elastic-Agent-Control-Plane label May 9, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member Author

cmacknz commented May 10, 2024

I have been poking around looking for the root cause of this and I haven't found a super obvious one yet. It might be more of a "death by 1000 papercuts" situation.

The heap sizes I've looked at are ~10 MB higher each, which partially explains this. I think the largest contributor is the increased number of .init sections, given that all of the Beat modules are now present in agentbeat.

Here is an 8.14.0 agentbeat in_use heap:
[screenshot: 8.14.0 agentbeat in_use heap profile]

Here is an 8.13.4 metricbeat in_use heap:
[screenshot: 8.13.4 metricbeat in_use heap profile]

Both of these processes were instances of the http-metrics-monitoring component.
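To make the .init observation above concrete, the registration pattern looks roughly like the hypothetical sketch below (this is not actual Beats code): each module registers itself from a package-level init(), so merely importing the module, as agentbeat does for every Beat, keeps those allocations live in every process.

// Hypothetical sketch (not actual Beats code) of why init()-time registration
// shows up in every agentbeat heap profile.
package moduleregistry

// Global registry shared by every module compiled into the binary.
var modules = map[string]func() any{}

// Register is called from each module's init(), so importing the module is
// enough to populate this map and allocate any package-level state.
func Register(name string, factory func() any) {
	modules[name] = factory
}

// In each Beat module package the pattern is roughly:
//
//	var lookupTables = buildLookupTables() // allocated at init time
//
//	func init() {
//	    moduleregistry.Register("gcp/billing", newMetricSet)
//	}
//
// agentbeat imports every such package, so all of these init() allocations are
// present in every agent process, whichever subcommand it is running.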

@blakerouse
Contributor

We can look at reducing the number of func init() calls by switching to a func InitializeModule that is only called when the corresponding agentbeat subcommand is actually being run.

Another option is to see if we can reduce the heap usage of each func init() and make the improvement across the board. Or do both.
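A rough sketch of the first option, with hypothetical names apart from InitializeModule: move the work out of init() into an exported function that only the relevant agentbeat subcommand calls.

// Hypothetical sketch of lazy module initialization; not the actual Beats layout.
package gcpmodule

import "sync"

var initOnce sync.Once

// InitializeModule performs the registration and allocations that previously
// happened in init(). Only the metricbeat subcommand's setup would call it, so
// an agentbeat process running filebeat never pays for it.
func InitializeModule() {
	initOnce.Do(func() {
		registerMetricSets()
		buildLookupTables()
	})
}

func registerMetricSets() { /* register with the central registry */ }
func buildLookupTables()  { /* former init-time allocations, now deferred */ }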

@cmacknz
Member Author

cmacknz commented May 13, 2024

Looking at the worst offender, github.com/goccy/go-json at 9.4MB and 4.6MB of heap, it is only used in the Filebeat cel input as a dependency of a dependency. This previously only affected Filebeat processes, but now it affects every agent process because it is always imported into agentbeat.

❯ go mod why github.com/goccy/go-json
# github.com/goccy/go-json
github.com/elastic/beats/v7/x-pack/filebeat/input/cel
github.com/lestrrat-go/jwx/v2/jwt
github.com/lestrrat-go/jwx/v2/internal/json
github.com/goccy/go-json

There isn't an easy fix for this. We'd have to improve it upstream, or move that input to a different JWT library. For example, https://github.com/golang-jwt/jwt has no dependencies, but I have no idea if it covers all the necessary use cases.
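For illustration only, here is a minimal sketch assuming the github.com/golang-jwt/jwt/v5 Parse API; whether it covers what the cel input actually needs from lestrrat-go/jwx is exactly the open question above.

// Minimal sketch of token verification with the dependency-free golang-jwt/jwt
// library. Illustration only; not a drop-in replacement for the cel input's use
// of lestrrat-go/jwx.
package jwtexample

import (
	"fmt"

	"github.com/golang-jwt/jwt/v5"
)

func parseHMACToken(tokenString string, secret []byte) (jwt.MapClaims, error) {
	token, err := jwt.Parse(tokenString, func(t *jwt.Token) (interface{}, error) {
		// Reject unexpected signing algorithms before handing back the key.
		if _, ok := t.Method.(*jwt.SigningMethodHMAC); !ok {
			return nil, fmt.Errorf("unexpected signing method: %v", t.Header["alg"])
		}
		return secret, nil
	})
	if err != nil {
		return nil, err
	}
	claims, ok := token.Claims.(jwt.MapClaims)
	if !ok || !token.Valid {
		return nil, fmt.Errorf("invalid token")
	}
	return claims, nil
}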

@cmacknz
Member Author

cmacknz commented May 13, 2024

Interestingly, goccy/go-json is supposed to be optional: https://github.com/lestrrat-go/jwx/blob/develop/v2/docs/20-global-settings.md#switching-to-a-faster-json-library

I don't see us explicitly opting in to that, hmm.

@cmacknz
Member Author

cmacknz commented May 13, 2024

Ah, go mod why is only telling me about the module requiring the newest version. Looking at go mod graph shows we have more things depending on goccy, which explains why it is compiled in.

If I just delete the httpjson and cel inputs from the tree, for example, I then get:

go mod why github.com/goccy/go-json
# github.com/goccy/go-json
github.com/elastic/beats/v7/x-pack/metricbeat/module/gcp/billing
cloud.google.com/go/bigquery
github.com/apache/arrow/go/v12/arrow/array
github.com/goccy/go-json

Anyway, that is the worst offender but it has always been there.

@cmacknz changed the title from "Baseline agent memory usage has increased in ECK integration tests" to "Baseline agent memory usage has increased in ECK integration tests due to agentbeat" Jun 19, 2024
@pebrc

pebrc commented Aug 22, 2024

We have another uptick in memory usage (I am not sure it is related to agentbeat); elastic/cloud-on-k8s#8021 now raises the memory requests/limits to 640Mi.
