Potential memory leak in Windows (v1.4.3) #3485
Is this a new issue after an upgrade, and if so, what version did you upgrade from? It may be useful to start telegraf with the --pprof-addr option. You may also want to enable the internal input and watch the memory statistics it reports.
I would focus on the heap page. Save the output of the heap page ~10 minutes after you first start up, and then again after a day when the memory has gone up. If you have Go installed somewhere you can try creating an image of the current memory use with go tool pprof.
Here is the documentation for that. I'm not sure if it would affect memory usage; I would think it would only increase the amount of logging. Please send me the pprof data when you can, hopefully it will contain a clue. It is probably not necessary to wait for it to get to 2GB, so long as it is approximately double the normal memory usage or more.
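For reference, a minimal sketch of grabbing the heap page programmatically; it assumes telegraf was started with --pprof-addr localhost:6060, and the address and output filename are arbitrary choices, not from this thread:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=1 asks for the human-readable heap profile; drop it to get the
	// binary form that `go tool pprof` consumes.
	resp, err := http.Get("http://localhost:6060/debug/pprof/heap?debug=1")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("saved heap profile to heap.txt")
}
```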
Looks like the issue wasn't resolved after all. Running a single instance on a server with debugging on, and removing some configs, to see what's happening.
This is a bit old, but I'm experiencing a similar memory leak on Windows Server 2016. After a few weeks, Telegraf is using gigabytes of RAM. Was there any resolution? In my case I am running 1.6.1 with the MSSQL input and a few Windows OS metrics.
@etlweather Can you try running telegraf with the additional --pprof-addr option?
Telegraf started around 15Mb of RAM, now up to 241Mb. Attached is the output from the pprof page.
Can you get the full goroutine stack dump?
I thought it was a bit small, but hey, what do I know about Go! Here is the full dump. The count on the main page for goroutines has increased by about 3,000 since.
Looks like there are lots of leaking goroutines for connections waiting to receive:
Looking through the docs for github.com/denisenkom/go-mssqldb, I think we may need to use a …
Trying... after running for a few seconds, the goroutine count is 34. Will come back in a couple of hours with an update.
Still growing, from 13Mb to 80+Mb of RAM.
Too bad. I think the next thing to check is if this is fixed in the latest Telegraf release. In 1.8.3 we have moved from a fork of the driver back to the main version, which has a fair number of bug fixes.
I should have tried that first... trying now. Started 1.8.3 with 14Mb of RAM or so. After a few seconds the goroutine count is at 55. Will come back in a couple of hours.
Already at 70Mb 😞
Maybe I had some configuration error which was preventing the plugin from working (when I tried today, I found I didn't have a user in SQL Server for telegraf; I don't know if I lost the user as I have been messing around...). Anyway, right now with the current 1.9.3 version, I don't see this problem after a few hours of it running; it has been stable at 20-23Mb of memory.
@absolutejam Could you take another look with 1.9.3?
We experienced the same with version 1.9.1. As long as the sqlserver input is configured correctly and it can get the data, it seems to work OK and there is no memory pressure. When there is a problem with the SQL connection we get the logs as below (redacted the IPs) and the memory keeps rising. Ran it for a while now with 1.10.4 and that gives the same result: when misconfigured, the errors occur and the memory keeps rising.
I ran it with the --pprof-addr setting. This gist shows a "full goroutine stack dump" after 10 minutes of running. I have no experience in Go but I think it could help.
Thanks, that should indeed be helpful. It looks like we will need to dig into Go's sql package to understand the creation of these goroutines and why they are not completing.
Great. If I need to try or run something, I'd be happy to.
Could it be that the defer db.Close() is placed below the db.Ping() call, so that when the ping fails and the function returns early, the connection pool is never closed?
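For illustration, a simplified sketch of the pattern being described; the function, driver, and variable names are illustrative, not the actual plugin source. When the ping fails, the function returns before the defer is ever registered, so the pool created by sql.Open is never closed and its goroutines linger:

```go
package example

import (
	"database/sql"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver
)

// gather is an illustrative sketch of the suspected pattern.
func gather(dsn string) error {
	db, err := sql.Open("mssql", dsn)
	if err != nil {
		return err
	}

	if err := db.Ping(); err != nil {
		// Early return: the defer below was never registered, so the pool
		// created by sql.Open is never closed and its goroutines leak.
		return err
	}
	defer db.Close() // registered too late to cover the error path above

	// ... run the stat queries here ...
	return nil
}
```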
That could be it, can you test it? I can make you a build if it would help, or it's pretty easy to set up Go as well.
I will test it today. It was indeed easy to set up Go.
Putting the defer db.Close() above the ping fixed it. How can we proceed with this?
Yes. Looking at the code in question and the goroutine list, this is indeed the issue. Putting defer db.Close above the ping will fix it.
However: why is telegraf apparently opening a new connection on every stat collection? Why are you creating a new connection pool, opening a connection from it, then closing it all down every time (every 30 seconds?)? This is a complete waste of resources.
Also, don't use ping like this. That is a complete waste of time and additional resources. Just try the query and if it works, then it works. All Ping does is add yet another round trip.
You want three phases to working with this: in init, create the connection pool (don't use ping, it doesn't help you in this context); then periodically use this connection pool (sql.DB) and call Query; and only close the pool on shutdown. Also, naming the variable as if it were a single connection is misleading: a sql.DB is a connection pool, not a connection.
This will be an issue for any database stats, not just SQL Server. This will be an issue for any OS, not just Windows.
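To make that structure concrete, a rough sketch under the assumption of a hypothetical plugin shape (newPlugin, gather, and close are illustrative names, not telegraf's actual interface): create the sql.DB once, reuse it on every interval without pinging, and close it only on shutdown:

```go
package example

import (
	"database/sql"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver
)

// plugin is a hypothetical shape, not telegraf's actual plugin interface.
type plugin struct {
	db *sql.DB
}

// Init phase: create the connection pool once. sql.Open does not dial the
// server, so there is no need to Ping; the first Query surfaces any error.
func newPlugin(dsn string) (*plugin, error) {
	db, err := sql.Open("mssql", dsn)
	if err != nil {
		return nil, err
	}
	return &plugin{db: db}, nil
}

// Collection phase: reuse the same pool on every interval.
func (p *plugin) gather(query string) error {
	rows, err := p.db.Query(query)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		// ... scan the metrics ...
	}
	return rows.Err()
}

// Shutdown phase: close the pool exactly once.
func (p *plugin) close() error {
	return p.db.Close()
}
```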
@kardianos All good points, it would be good to make all of those changes. We could look at the postgresql plugins as an example for using the connection pool. For this issue though, let's only put the defer in the right spot, and optionally remove the unneeded ping. @arnodenuijl Would you be able to open a PR for that change?
I created a pull request. It's the second one in my life, so if I need to do anything differently, just say so.
Looks perfect, thanks! I opened #5898 for the connection pool work; it isn't something that I am going to work on right now, but it is good to have it tracked.
This week, we had an issue wherein at least 6 customer servers' Telegraf processes were consuming 5GB of RAM all the time. Looking at the metrics Telegraf had collected, it looks to have started creeping up gradually from the 1st of the month until memory usage was sitting at 99%. Restarting the Telegraf service has released this memory pressure, but I'm observing it at the moment, in anticipation of it happening again.
Basically, I want to know what information I can provide to help tackle this if it recurs. Can I enable some verbose debugging and catch it in the act again, or is this related to the metrics I'm collecting? There should be no change in series, though, so I would have thought the usage would be steady.
Bug report
Relevant telegraf.conf:
Just collecting a config dump now. Will attach shortly...
System info:
Mostly Windows Server 2012 R2, with possibly one 2008 R2.
Steps to reproduce:
Run telegraf for a few weeks.