
Potential memory leak in Windows (v1.4.3) #3485

Closed
absolutejam opened this issue Nov 17, 2017 · 32 comments · Fixed by #5897
Labels: area/sqlserver, bug (unexpected problem or unintended behavior)

@absolutejam

This week we had an issue where the Telegraf process on at least 6 customer servers was constantly consuming 5 GB of RAM. Looking at the metrics Telegraf had collected, memory usage appears to have crept up gradually from the 1st of the month until used memory sat at 99%. Restarting the Telegraf service released the memory pressure, but I'm keeping an eye on it in anticipation of it happening again.

Basically, I want to know what information I can provide to help tackle this if it recurs. Can I enable some verbose debugging and catch it in the act again, or is this related to the metrics I'm collecting? There should be no change in the number of series, though, so I would have expected the usage to be steady.

Bug report

Relevant telegraf.conf:

Just collecting a config dump now. Will attach shortly...

System info:

Mostly Windows Server 2012 R2 with one possible 2008 R2.

Steps to reproduce:

Run telegraf for a few weeks.

@danielnelson
Contributor

Is this a new issue after an upgrade, if so what version did you upgrade from?

It may be useful to start Telegraf with the --pprof-addr localhost:6060 flag. You can then browse to http://localhost:6060/debug/pprof/. It may be helpful to compare the files there right after Telegraf starts with how they look after the memory has increased.

You may want to enable the internal input and watch the internal_memstats measurement; this won't tell us the cause, but it may be useful for tracking.

danielnelson added the bug (unexpected problem or unintended behavior) label on Nov 17, 2017
@absolutejam
Author

These servers never had Telegraf before; it's a new deployment.

I can see that the issue is happening again, so I'm going to do as you suggested and report back in a few days to see if I can find anything. Thanks.

[screenshot]

@absolutejam
Author

Sorry for the delay. I've re-enabled Telegraf with the debug flag bound to 0.0.0.0, and I'll try to scrape that info periodically. I've also enabled the internal stats.

Is there a specific value I should keep an eye out for?
[screenshot]

I'm going to set up some alerts for high memory usage and get back to you in a couple of weeks.

@danielnelson
Contributor

I would focus on the heap page. Save the output of the heap page ~10 minutes after you first start up, and then again after a day once the memory has gone up.

If you have Go installed somewhere you can try creating an image of the current memory use:

go tool pprof -png http://localhost:6060/debug/pprof/heap > out.png
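
If Go is available, another option is a small helper that snapshots the pprof endpoints on a schedule, so the before/after profiles are easy to compare. This is only a rough sketch: the endpoint URLs match the --pprof-addr setting above, but the file naming and the hourly interval are arbitrary choices.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Standard net/http/pprof paths exposed by Telegraf's --pprof-addr flag;
	// debug=1/debug=2 return text output that is easy to diff later.
	endpoints := map[string]string{
		"heap":      "http://localhost:6060/debug/pprof/heap?debug=1",
		"goroutine": "http://localhost:6060/debug/pprof/goroutine?debug=2",
	}
	for {
		stamp := time.Now().Format("20060102-150405")
		for name, url := range endpoints {
			resp, err := http.Get(url)
			if err != nil {
				fmt.Println("fetch failed:", err)
				continue
			}
			out, err := os.Create(fmt.Sprintf("pprof-%s-%s.txt", name, stamp))
			if err == nil {
				io.Copy(out, resp.Body) // save the raw profile for later comparison
				out.Close()
			}
			resp.Body.Close()
		}
		time.Sleep(time.Hour) // snapshot hourly; adjust as needed
	}
}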

@absolutejam
Author

absolutejam commented Dec 11, 2017

Alright, I didn't manage to set up anything to scrape when I left on Friday, but I checked on it today:
[screenshot]

Checking across some of my servers with Salt, some of them are up to 2 GB of memory used by Telegraf - and they all seem to have one thing in common: SQL. My SQL servers automatically get a .conf file which contains:

[[inputs.sqlserver]]
servers = [ "Server=localhost;Port=1433;User Id=telegraf;Password=redacted;app name=telegraf;log=1;" ]

and as far as I know, these servers haven't had a user set up for telegraf yet. Would the log=1 be storing SQL connection failure errors in memory?

I can send some of the /debug/pprof data across from one of the servers where it's up to 2GB RAM used if needed.

@danielnelson
Contributor

Here is the documentation for the log=1 bit of the connection string.

I'm not sure if this would affect memory usage; I would think it would only increase the amount of logging.

Please send me the pprof data when you can; hopefully it will contain a clue. It is probably not necessary to wait for it to get to 2GB, so long as it is approximately double the normal memory usage or more.

@absolutejam
Author

Looks like the issue wasn't resolved after all. I'm running a single instance on a server with debugging on and some configs removed to see what's happening.

@etlweather

etlweather commented Nov 11, 2018

This is a bit old, but I'm experiencing a similar memory leak on Windows Server 2016. After a few weeks, Telegraf is using gigabytes of RAM.

Was there any resolution?

In my case I am running 1.6.1 with MSSQL Input and a few Windows OS metrics.

@danielnelson
Contributor

@etlweather Can you try running Telegraf with the additional --pprof-addr localhost:6060 command line option, and after it has run for long enough to see an increase in memory, collect http://localhost:6060/debug/pprof/goroutine?debug=1 and attach it here along with your Telegraf config file?

@etlweather

Telegraf started at around 15 MB of RAM, now up to 241 MB.

Attached is the output of goroutine and the config.

goroutine.txt

mssql.txt
telegraf.txt

@danielnelson
Contributor

Can you get http://localhost:6060/debug/pprof/goroutine?debug=2 as well? It turned out the first link I had you get doesn't contain the full list of goroutines.

@etlweather

I thought it was a bit small, but hey, what do I know about Go!

Here is the full dump. The count on the main page for goroutines has increased by about 3,000 since.

goroutine2.txt

@danielnelson
Contributor

Looks like there are lots of leaking goroutines for connections waiting to receive:

goroutine 18237 [chan receive, 879 minutes]:
database/sql.(*DB).connectionOpener(0xc042f8a460)
	/usr/local/go/src/database/sql/sql.go:871 +0x5a
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:609 +0x1f5

Looking through the docs for github.com/denisenkom/go-mssqldb, I think we may need to use a NewConnector with a context when we open connections. However, a possible workaround might be to add a connection timeout to your DSN; could you try that out?

Server=192.168.1.10;Port=1433;User Id=<user>;Password=<pw>;app name=telegraf;log=1;connection timeout=15
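
For reference, here is a rough sketch of what opening through the driver's NewConnector and bounding each query with a context could look like. This is illustrative only, not the plugin's actual code; it assumes the upstream github.com/denisenkom/go-mssqldb package and an arbitrary 15 second deadline.

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	mssql "github.com/denisenkom/go-mssqldb"
)

// openPool builds the *sql.DB through the driver's NewConnector instead of sql.Open.
func openPool(dsn string) (*sql.DB, error) {
	connector, err := mssql.NewConnector(dsn)
	if err != nil {
		return nil, err
	}
	return sql.OpenDB(connector), nil
}

// collect runs one query with a deadline so a dead server cannot pin a goroutine forever.
func collect(db *sql.DB, query string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		// ... scan each row into metrics ...
	}
	return rows.Err()
}

func main() {
	db, err := openPool("Server=192.168.1.10;Port=1433;User Id=telegraf;Password=redacted;app name=telegraf")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer db.Close()
	if err := collect(db, "SELECT 1"); err != nil {
		fmt.Println("collect:", err)
	}
}

Either way, the idea is that no collection can block indefinitely waiting on a broken connection.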

@etlweather

Trying... after running for a few seconds, the goroutine count is 34. Will come back in a couple of hours with an update.

@etlweather

etlweather commented Nov 14, 2018

Still growing, from 13 MB to 80+ MB of RAM. The goroutine count is now 7054. Attached is the debug output.

goroutine.txt

@danielnelson
Contributor

Too bad. I think the next thing to check is whether this is fixed in the latest Telegraf release. In 1.8.3 we moved from a fork of the driver back to the main version, which has a fair number of bug fixes.

@etlweather

I should have tried that first... trying now. Started 1.8.3; it started with around 14 MB of RAM. After a few seconds the goroutine count is at 55. Will come back in a couple of hours.

@etlweather

Already at 70 MB 😞

@etlweather

Maybe I had some configuration error which was preventing the plugin from working (when I tried today, I found I didn't have a user in SQL Server for telegraf; I don't know if I lost the user as I have been messing around...).

Anyway, right now with the current 1.9.3 version I don't see this problem; after a few hours of running it's been stable at between 20-23 MB of memory.

@danielnelson
Contributor

@absolutejam Could you take another look with 1.9.3?

@arnodenuijl
Contributor

arnodenuijl commented May 21, 2019

We experienced the same with version 1.9.1.

As long as the sqlserver input is configured correctly and it can get the data, it seems to work OK and there is no memory pressure. When there is a problem with the SQL connection we get the logs below (IPs redacted) and the memory keeps rising.

I've now run it for a while with 1.10.4 and that gives the same result: when misconfigured, the errors occur and the memory keeps rising.

2019-05-21T18:58:50Z I! ERROR: BeginRead failed read tcp [<IPV6 ip address>%Ethernet 2]:59491->[<IPV6 ip address>]:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! ERROR: BeginRead failed read tcp <Local IPV4 address >:59487-><Local IPV4 address >:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: read tcp <Local IPV4 address >:59487-><Local IPV4 address >:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: read tcp [<IPV6 ip address>%Ethernet 2]:59491->[<IPV6 ip address>]:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp [<IPV6 ip address>%Ethernet 2]:59490->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp [<IPV6 ip address>%Ethernet 2]:59490->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp [<IPV6 ip address>%Ethernet 2]:59494->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp [<IPV6 ip address>%Ethernet 2]:59494->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp <Local IPV4 address >:59486-><Local IPV4 address >:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp <Local IPV4 address >:59486-><Local IPV4 address >:1433: wsasend: An existing connection was forcibly closed by the remote host.

@arnodenuijl
Contributor

I ran it with the --pprof-addr setting. This gist shows a "full goroutine stack dump" after 10 minutes of running. I have no experience in Go, but I think it could help.
https://gist.github.com/arnodenuijl/7e0d31be5da6ac6f8e816b0fd8f9600e

@danielnelson
Contributor

Thanks, that should indeed be helpful. It looks like we will need to dig into Go's sql package to understand the creation of these goroutines and why they are not completing:

goroutine 5896 [select]:
database/sql.(*DB).connectionOpener(0xc003c22480, 0x26dbf40, 0xc000139b00)
	/usr/local/go/src/database/sql/sql.go:1001 +0xef
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:671 +0x164

goroutine 5897 [select]:
database/sql.(*DB).connectionResetter(0xc003c22480, 0x26dbf40, 0xc000139b00)
	/usr/local/go/src/database/sql/sql.go:1014 +0x102
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:672 +0x19a

@arnodenuijl
Contributor

Great. If I need to try or run something, I'd be happy to.

@arnodenuijl
Contributor

Could it be that the defer conn.Close() should come before the ping (around line 154), so that we also clean up if the ping fails?

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/sqlserver/sqlserver.go#L160
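
In other words, something like the following sketch of the ordering being discussed (illustrative only, not the exact plugin source):

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver name
)

func gatherServer(dsn string) error {
	conn, err := sql.Open("mssql", dsn) // creates a pool and its background goroutine(s)
	if err != nil {
		return err
	}
	defer conn.Close() // moved above the Ping so it also runs on the error path below

	if err := conn.Ping(); err != nil {
		// With the defer placed after the Ping, returning here skipped Close
		// and leaked the pool's goroutines on every failed collection.
		return err
	}

	// ... run the metric queries against conn ...
	return nil
}

func main() {
	if err := gatherServer("Server=localhost;Port=1433;User Id=telegraf;Password=redacted"); err != nil {
		fmt.Println(err)
	}
}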

@danielnelson
Contributor

That could be it; can you test it? I can make you a build if it would help, or it's pretty easy to set up Go as well.

@arnodenuijl
Contributor

I will test it today. It was indeed easy to set up Go.

@arnodenuijl
Contributor

arnodenuijl commented May 22, 2019

Putting the defer conn.Close() before the ping seems to resolve the issue. The number of goroutines stays at about 16 (without the change it quickly goes into the 1000s).
On the library's GitHub I asked whether the library authors see any problems with this. I saw the same pattern in the sample (open -> ping -> defer) and I wonder if there is a specific reason for it.

How can we proceed with this?

@kardianos

Yes, looking at the code in question and the goroutine list, this is indeed the issue. Putting defer db.Close above the ping will fix it.

However: why is Telegraf apparently opening a new connection on every stat collection? Why are you creating a new connection pool, opening a connection from it, then closing it all down every time (every 30 seconds?)? This is a complete waste of resources. Also, don't use Ping like this; that is a complete waste of time and additional resources. Just try the query, and if it works, it works. All Ping does is add yet another round trip doing a pointless select 1;.

You want three phases when working with this. Init: create the connection pool; don't use Ping, it doesn't help you in this context. Then periodically use this connection pool (sql.DB) and call Query. Also, naming the variable conn in conn, err := sql.Open(...) is misleading at best. This is NOT A CONNECTION. It is a connection pool, and at open time no network connection has yet been established.

This will be an issue for any database stats, not just SQL Server. This will be an issue for any OS, not just Windows.
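
As a sketch of that shape (illustrative only; the struct, the example query, and the 30 second interval are assumptions, not Telegraf's code):

package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/denisenkom/go-mssqldb"
)

// collector holds the pool for the lifetime of the process.
type collector struct {
	db *sql.DB
}

func newCollector(dsn string) (*collector, error) {
	// sql.Open only parses the DSN and creates the pool;
	// no network connection is made here.
	db, err := sql.Open("mssql", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(1) // a metrics collector rarely needs more
	return &collector{db: db}, nil
}

func (c *collector) gather() error {
	// No Ping: a broken connection surfaces as an error from the query itself.
	rows, err := c.db.Query("SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters")
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		// ... scan into metrics ...
	}
	return rows.Err()
}

func main() {
	c, err := newCollector("Server=localhost;Port=1433;User Id=telegraf;Password=redacted")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer c.db.Close()
	for range time.Tick(30 * time.Second) {
		if err := c.gather(); err != nil {
			fmt.Println("collection error:", err)
		}
	}
}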

@danielnelson
Contributor

@kardianos All good points; it would be good to make all of those changes. We could look at the postgresql plugins as an example of using the connection pool.

For this issue, though, let's only put the defer in the right spot and optionally remove the unneeded ping. @arnodenuijl Would you be able to open a PR for that change?

@arnodenuijl
Contributor

I created a pull request. It's the second one of my life, so if I need to do anything differently, just say so.

@danielnelson
Contributor

Looks perfect, thanks!

I opened issue #5898 for the connection pool work; it isn't something that I am going to work on right now, but it is good to have it tracked.
