Filebeat still has memory leak? #9302
Any update on this issue?
I have multiple Filebeat agents running on three Kubernetes clusters, and on all of them the daemonset's pods keep being OOMKilled. I've already increased the memory limit of the daemonset, but the problem still occurs, so I'm pretty sure this is a memory leak.
This is the CPU / memory / disk usage for today alone from our staging environment:
This is the memory for the past week:
The logs at the time the agent is killed by the OOMKiller show the following:
In almost all of our clusters the messages are similar to this one, and most of them show up in the log. The job itself does not log that much information; most of the time only the following is logged:
And this can increase by a few lines at most.
This is my current config: https://gist.github.com/JCMais/1ec9ebea58eca5d31eb0da9d78c7da91
I'm running filebeat from the Docker image.
There is also this other issue, which is probably related: #9078
Is there anything that can be done to help move this issue forward? cc @ruflin @jsoriano @andrewkroh @ycombinator @cwurm
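For readers without access to the gist: raising the DaemonSet memory limit only postpones the OOMKill when memory keeps leaking. A minimal sketch of the kind of resource section involved (all values are illustrative, not the reporter's actual settings):

```yaml
# Fragment of a Filebeat DaemonSet pod spec; values are illustrative.
containers:
  - name: filebeat
    resources:
      limits:
        memory: 200Mi    # the limit the OOMKiller enforces; raising it only delays the kill if memory leaks
      requests:
        cpu: 100m
        memory: 100Mi
```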
I wonder if this is related to #7820. Any chance you could test 6.6 to see if it still happens there?
Update: I just looked at #9078 and it seems some users there are already using 6.6 and still see the same issue :-( Overall I think there are 2 variables here:
@jsoriano @odacremolbap Could one of you dig deeper into this one?
#10476 addressed an issue that could leave some configurations running forever even if the pods originating these configurations were already stopped. This could be a source of leaks, especially in dynamic clusters. This fix is included in 6.6.1; @JCMais @guot626, could you give this version a try?
Thanks for taking some time to look into this, guys. I will be able to test it.
@jsoriano I will upgrade the agents and let you know if the leak stops.
@jsoriano it looks like the leak still exists. I've updated the agents on our dev cluster, which consists of two nodes, each running one agent.
This is the agent on the node that does not run the Kubernetes cronjob that is triggered every 10m:
This is the agent on the same node as the cronjob's pods:
Latest logs on the agent with the memory leak:
The one that does not show the leak:
If you need anything let me know. In case you want the full logs, you can give me an address or some other way to share them; I cannot paste them here because they contain sensitive information. They are running with debug levels enabled.
@JCMais Do we have many IO errors? What about open files? I have a hunch that back-pressure from the outputs (or IO errors) causes filebeat to not close the inputs yet. This might cause errors in autodiscovery trying to shut down the inputs over and over again, accumulating another set of goroutines. I will try to simulate this behavior with the autodiscovery module in isolation over the next few days.
@urso there are no IO errors or disk pressure.
I have a similar issue with filebeat 6.6.2 and kubernetes autodiscover. Please find attached the pprof graph for one filebeat instance in our environment. If needed, I can do other debugging or some testing.
@olivierboudet how did you create this graph?
@JCMais I started filebeat with the argument --httpprof :6060, retrieved the profile file on my laptop, and then ran the pprof tool on it to generate the graph.
Hi @exekias, it's no better for me with the 6.7-based image either.
Goroutines dump: https://gist.github.com/olivierboudet/28bdc806866a3cfd5d27a6a7ef6cdb3c
Grafana chart of memory for the pod:
Hi folks, I see you both are using the add_kubernetes_metadata processor together with Kubernetes autodiscover. Could you try removing it and check whether that helps?
@exekias, yeah it is much better without add_kubernetes_metadata.
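For context, running "without add_kubernetes_metadata" means relying on the metadata that the kubernetes autodiscover provider already attaches to events. A minimal sketch of such a setup (details will differ from the configs in the gists linked above):

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - config:
            - type: docker
              containers.ids:
                - "${data.kubernetes.container.id}"

# Note: no add_kubernetes_metadata processor is configured; the autodiscover
# provider already enriches each event with kubernetes.* fields.

output.elasticsearch:
  hosts: ["elasticsearch:9200"]   # illustrative output
```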
@exekias can also confirm that it looks more stable, I will report back in a few hours.
@exekias that helped immensely, this is the memory usage for the past 6 hours:
Agent on the node with the cronjob that is created every 10 minutes:
The memory is still growing, but much slower. I've updated the gist above with the new goroutines dump.
OK, so it seems there is still another leak, one with less impact:
We will need to look into this too.
We just merged another fix that may help with that (#12125); I will prepare a new image with all fixes together.
Just pushed the new image.
@exekias here are the goroutines for the past 4 hours using the image you mentioned (goroutines.log):
goroutine profile: total 797
558 @ 0x993a1a 0x993ace 0x96a4c2 0x96a17b 0x1346444 0x9c12d1 0x1346443
github.com/elastic/beats/filebeat/channel.CloseOnSignal.func1+0x33 /home/exekias/go/src/github.com/elastic/beats/filebeat/channel/util.go:117
Memory is still growing, slowly, but growing; I don't know if it is still leaking or if that will be garbage collected later on.
@exekias But how do I override the config defaults for the Kubernetes metadata when add_kubernetes_metadata is removed?
Follows the strategy used in #10850 to check for goroutine leaks in tests in some places related to the leaks investigated as part of #9302. Several things here:
* New helper to reuse the goroutine checker; it now also prints the goroutine dump on failure.
* Add goroutine checks for autodiscover tests.
* Add goroutine checks for CloseOnSignal and SubOutlet tests (they detect the issues solved by #11263).
* Add goroutine checks for log input tests (they detect issues solved by #12125 and #12164).
@cypherfox, the kubernetes autodiscover provider accepts the same parameters as the add_kubernetes_metadata processor.
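A sketch of what passing such parameters directly to the provider can look like; the option names used below (namespace, include_annotations) are taken from the Beats documentation of that era, and all values are purely illustrative:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      # Enrichment options that add_kubernetes_metadata also understands can be
      # set on the provider itself (names per the docs; values illustrative).
      namespace: staging
      include_annotations: ["app.kubernetes.io/name"]
      templates:
        - config:
            - type: docker
              containers.ids:
                - "${data.kubernetes.container.id}"
```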
Hi @exekias your patch seems to work for us too. For a week the filebeat pods in our production cluster kept using more and more memory and then stopped shipping logs (without crashing despite the memory limit setting). With your patch we haven't restarted the pod since yesterday evening. |
Thank you everyone for your feedback; we are closing this issue as it seems most of the problems are now solved. All fixes were backported down to 6.8 and will be released with the next minor/patch version for anything above that one.
Thanks @exekias. Also, please do not remove your image in the meantime.
The fix already made it to an official release, so anyone using my image should move to that.
I'm still seeing the same.
@exekias the example config file (https://github.com/elastic/examples/blob/master/MonitoringKubernetes/filebeat-kubernetes.yaml) also has the same. Probably a good idea to get that fixed.
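For anyone editing that manifest, the change discussed above amounts to removing (or commenting out) the processor entry in the Filebeat ConfigMap. A sketch, not the exact contents of the linked file:

```yaml
processors:
  # - add_kubernetes_metadata:    # redundant with autodiscover and implicated in the leak above
  #     in_cluster: true
  - add_cloud_metadata: ~         # unrelated processors can stay
```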
Hey there, I am using filebeat 7.2.0 and I still hit this problem.
Hi filebeat experts,
We still have the memory leak problem on filebeat 6.5.1. Can anyone help us resolve it or give us some suggestions?
We once used filebeat 6.0 in our production environment but found that there was a memory leak problem. We then upgraded to version 6.5.1, in which the memory leak is said to be fixed according to the filebeat community. Sadly, the results of our pressure test on version 6.5.1 are not satisfying, and the memory leak does not seem to be fully fixed. Below are the steps of how we run our tests:
1) Every 10 seconds, a 1.6 MB log file is generated on each of two nodes; one node runs filebeat 6.0, the other runs filebeat 6.5.1.
2) We only keep the latest 200 log files on each node and delete the files generated earlier to free some disk space.
After running for a whole day, both filebeat 6.5.1 and filebeat 6.0 consume a noticeable amount of memory, and filebeat 6.5.1 consumes more than filebeat 6.0 does.
We highly suspect that this memory leak has something to do with the logs that are cleared in step 2). It seems that some filebeat harvesters are created to gather logs that are unluckily cleared before they are harvested. Those harvesters, with no logs left to harvest, throw exceptions and cause the memory leak.
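Not a confirmed fix, but the log input has documented options that control how harvesters and registry state behave when files are removed; a minimal sketch with illustrative paths and values:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/pressure-test/*.log   # illustrative path, not the actual test layout
    close_removed: true      # stop the harvester as soon as the file is deleted
    clean_removed: true      # drop registry entries for files that no longer exist
    close_timeout: 5m        # hard cap on a harvester's lifetime
    harvester_limit: 100     # bound the number of concurrent harvesters
```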
Could anyone help us look into this issue?
Thanks in advance.
filebeat 6.5.1 yml: