
Potential memory leak in Windows (v1.4.3) #3485

Closed
absolutejam opened this issue Nov 17, 2017 · 32 comments · Fixed by #5897
Labels: area/sqlserver, bug (unexpected problem or unintended behavior)

@absolutejam

This week we had an issue where the Telegraf process on at least 6 customer servers was constantly consuming 5 GB of RAM. Looking at the metrics Telegraf had collected, memory usage appears to have crept up gradually from the 1st of the month until used memory sat at 99%. Restarting the Telegraf service released the memory pressure, but I'm keeping an eye on it in anticipation of it happening again.

Basically, I want to know what information I can provide to help tackle this if it recurs. Can I enable some verbose debugging and catch it in the act again, or is this related to the metrics I'm collecting? There should be no change in the number of series, though, so I would have expected the usage to be steady.

Bug report

Relevant telegraf.conf:

Just collecting a config dump now. Will attach shortly...

System info:

Mostly Windows Server 2012 R2 with one possible 2008 R2.

Steps to reproduce:

Run telegraf for a few weeks.

@danielnelson
Contributor

Is this a new issue after an upgrade, if so what version did you upgrade from?

It may be useful to start Telegraf with the --pprof-addr localhost:6060 flag. You can then browse to http://localhost:6060/debug/pprof/. It may be helpful to compare the files there right after Telegraf starts with how they look after the memory has increased.

You may want to enable the internal input and watch the internal_memstats measurement; this won't tell us the cause, but it may be useful for tracking.

danielnelson added the bug (unexpected problem or unintended behavior) label on Nov 17, 2017
@absolutejam
Author

These servers never had Telegraf before; it's a new deployment.

I can see that the issue is happening again, so I'm going to do as you suggested and report back in a few days to see if I can find anything. Thanks.

[screenshot]

@absolutejam
Author

Sorry for the delay. I've re-enabled Telegraf with the debug flag bound to 0.0.0.0, and I'll try to scrape that info periodically. I've also enabled the internal stats.

Is there a specific value I should keep an eye out for?
[screenshot]

I'm going to set up some alerts for high memory usage and get back to you in a couple of weeks.

@danielnelson
Contributor

I would focus on the heap page. Save the output of the heap page ~10 minutes after you first start up, and then again after a day once the memory has gone up.

If you have Go installed somewhere you can try creating an image of the current memory use:

go tool pprof -png http://localhost:6060/debug/pprof/heap > out.png
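
If Go is available, another option is a small helper that snapshots the pprof endpoints on a schedule, so the before/after profiles are easy to compare. This is only a rough sketch: the endpoint URLs match the --pprof-addr setting above, but the file naming and the hourly interval are arbitrary choices.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Standard net/http/pprof paths exposed by Telegraf's --pprof-addr flag;
	// debug=1/debug=2 return text output that is easy to diff later.
	endpoints := map[string]string{
		"heap":      "http://localhost:6060/debug/pprof/heap?debug=1",
		"goroutine": "http://localhost:6060/debug/pprof/goroutine?debug=2",
	}
	for {
		stamp := time.Now().Format("20060102-150405")
		for name, url := range endpoints {
			resp, err := http.Get(url)
			if err != nil {
				fmt.Println("fetch failed:", err)
				continue
			}
			out, err := os.Create(fmt.Sprintf("pprof-%s-%s.txt", name, stamp))
			if err == nil {
				io.Copy(out, resp.Body) // save the raw profile for later comparison
				out.Close()
			}
			resp.Body.Close()
		}
		time.Sleep(time.Hour) // snapshot hourly; adjust as needed
	}
}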

@absolutejam
Author

absolutejam commented Dec 11, 2017

Alright, I didn't manage to set up anything to scrape when I left on Friday, but I checked on it today:
[screenshot]

Checking across some of my servers with Salt, some of them are up to 2 GB of memory used by Telegraf - and they all seem to have one thing in common: SQL. My SQL servers automatically get a .conf file which contains:

[[inputs.sqlserver]]
servers = [ "Server=localhost;Port=1433;User Id=telegraf;Password=redacted;app name=telegraf;log=1;" ]

and as far as I know, these servers haven't had a user set up for telegraf yet. Would the log=1 be storing SQL connection failure errors in memory?

I can send some of the /debug/pprof data across from one of the servers where it's up to 2GB RAM used if needed.

@danielnelson
Contributor

Here is the documentation for the log=1 bit of the connection string.

I'm not sure if this would affect memory usage; I would think it would only increase the amount of logging.

Please send me the pprof data when you can; hopefully it will contain a clue. It is probably not necessary to wait for it to get to 2GB, so long as it is approximately double the normal memory usage or more.

@absolutejam
Author

Looks like the issue wasn't resolved after all. I'm running a single instance on a server with debugging on and some configs removed to see what's happening.

@etlweather

etlweather commented Nov 11, 2018

This is a bit old, but I'm experiencing a similar memory leak on Windows Server 2016. After a few weeks, Telegraf is using gigabytes of RAM.

Was there any resolution?

In my case I am running 1.6.1 with MSSQL Input and a few Windows OS metrics.

@danielnelson
Contributor

@etlweather Can you try running Telegraf with the additional --pprof-addr localhost:6060 command line option, and after it has run for long enough to see an increase in memory, collect http://localhost:6060/debug/pprof/goroutine?debug=1 and attach it here along with your Telegraf config file?

@etlweather

Telegraf started at around 15 MB of RAM, now up to 241 MB.

Attached is the output of goroutine and the config.

goroutine.txt

mssql.txt
telegraf.txt

@danielnelson
Contributor

Can you get http://localhost:6060/debug/pprof/goroutine?debug=2 as well? It turned out the first link I had you get doesn't contain the full list of goroutines.

@etlweather

I thought it was a bit small, but hey, what do I know about Go!

Here is the full dump. The count on the main page for goroutines has increased by about 3,000 since.

goroutine2.txt

@danielnelson
Contributor

Looks like there are lots of leaking goroutines for connections waiting to receive:

goroutine 18237 [chan receive, 879 minutes]:
database/sql.(*DB).connectionOpener(0xc042f8a460)
	/usr/local/go/src/database/sql/sql.go:871 +0x5a
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:609 +0x1f5

Looking through the docs for github.com/denisenkom/go-mssqldb, I think we may need to use a NewConnector with a context when we open connections. However, a possible workaround might be to add a connection timeout to your DSN; could you try that out?

Server=192.168.1.10;Port=1433;User Id=<user>;Password=<pw>;app name=telegraf;log=1;connection timeout=15
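
For reference, here is a rough sketch of what opening through the driver's NewConnector and bounding each query with a context could look like. This is illustrative only, not the plugin's actual code; it assumes the upstream github.com/denisenkom/go-mssqldb package and an arbitrary 15 second deadline.

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	mssql "github.com/denisenkom/go-mssqldb"
)

// openPool builds the *sql.DB through the driver's NewConnector instead of sql.Open.
func openPool(dsn string) (*sql.DB, error) {
	connector, err := mssql.NewConnector(dsn)
	if err != nil {
		return nil, err
	}
	return sql.OpenDB(connector), nil
}

// collect runs one query with a deadline so a dead server cannot pin a goroutine forever.
func collect(db *sql.DB, query string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		// ... scan each row into metrics ...
	}
	return rows.Err()
}

func main() {
	db, err := openPool("Server=192.168.1.10;Port=1433;User Id=telegraf;Password=redacted;app name=telegraf")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer db.Close()
	if err := collect(db, "SELECT 1"); err != nil {
		fmt.Println("collect:", err)
	}
}

Either way, the idea is that no collection can block indefinitely waiting on a broken connection.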

@etlweather

Trying... after running for a few seconds, the goroutine count is 34. Will come back in a couple of hours with an update.

@etlweather

etlweather commented Nov 14, 2018

Still growing, from 13 MB to 80+ MB of RAM. The goroutine count is now 7054. Attached is the debug output.

goroutine.txt

@danielnelson
Contributor

Too bad. I think the next thing to check is whether this is fixed in the latest Telegraf release. In 1.8.3 we moved from a fork of the driver back to the main version, which has a fair number of bug fixes.

@etlweather

I should have tried that first... trying now. Started 1.8.3; it started with around 14 MB of RAM. After a few seconds the goroutine count is at 55. Will come back in a couple of hours.

@etlweather

Already at 70 MB 😞

@etlweather

Maybe I had some configuration error which was preventing the plugin from working (when I tried today, I found I didn't have a user in SQL Server for telegraf; I don't know if I lost the user as I have been messing around...).

Anyway, right now with the current 1.9.3 version I don't see this problem; after a few hours of running it's been stable at between 20-23 MB of memory.

@danielnelson
Contributor

@absolutejam Could you take another look with 1.9.3?

@arnodenuijl
Contributor

arnodenuijl commented May 21, 2019

We experienced the same with version 1.9.1.

As long as the sqlserver input is configured correctly and it can get the data, it seems to work OK and there is no memory pressure. When there is a problem with the SQL connection we get the logs below (IPs redacted) and the memory keeps rising.

I've now run it for a while with 1.10.4 and that gives the same result: when misconfigured, the errors occur and the memory keeps rising.

2019-05-21T18:58:50Z I! ERROR: BeginRead failed read tcp [<IPV6 ip address>%Ethernet 2]:59491->[<IPV6 ip address>]:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! ERROR: BeginRead failed read tcp <Local IPV4 address >:59487-><Local IPV4 address >:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: read tcp <Local IPV4 address >:59487-><Local IPV4 address >:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: read tcp [<IPV6 ip address>%Ethernet 2]:59491->[<IPV6 ip address>]:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp [<IPV6 ip address>%Ethernet 2]:59490->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp [<IPV6 ip address>%Ethernet 2]:59490->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp [<IPV6 ip address>%Ethernet 2]:59494->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp [<IPV6 ip address>%Ethernet 2]:59494->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp <Local IPV4 address >:59486-><Local IPV4 address >:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp <Local IPV4 address >:59486-><Local IPV4 address >:1433: wsasend: An existing connection was forcibly closed by the remote host.

@arnodenuijl
Contributor

I ran it with the --pprof-addr setting. This gist shows a "full goroutine stack dump" after 10 minutes of running. I have no experience in Go, but I think it could help.
https://gist.github.com/arnodenuijl/7e0d31be5da6ac6f8e816b0fd8f9600e

@danielnelson
Contributor

Thanks, that should indeed be helpful. It looks like we will need to dig into Go's sql package to understand the creation of these goroutines and why they are not completing:

goroutine 5896 [select]:
database/sql.(*DB).connectionOpener(0xc003c22480, 0x26dbf40, 0xc000139b00)
	/usr/local/go/src/database/sql/sql.go:1001 +0xef
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:671 +0x164

goroutine 5897 [select]:
database/sql.(*DB).connectionResetter(0xc003c22480, 0x26dbf40, 0xc000139b00)
	/usr/local/go/src/database/sql/sql.go:1014 +0x102
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:672 +0x19a

@arnodenuijl
Contributor

Great. If I need to try or run something, I'd be happy to.

@arnodenuijl
Contributor

Could it be that the defer conn.Close() should come before the ping (around line 154), so that we also clean up if the ping fails?

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/sqlserver/sqlserver.go#L160
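
In other words, something like the following sketch of the ordering being discussed (illustrative only, not the exact plugin source):

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver name
)

func gatherServer(dsn string) error {
	conn, err := sql.Open("mssql", dsn) // creates a pool and its background goroutine(s)
	if err != nil {
		return err
	}
	defer conn.Close() // moved above the Ping so it also runs on the error path below

	if err := conn.Ping(); err != nil {
		// With the defer placed after the Ping, returning here skipped Close
		// and leaked the pool's goroutines on every failed collection.
		return err
	}

	// ... run the metric queries against conn ...
	return nil
}

func main() {
	if err := gatherServer("Server=localhost;Port=1433;User Id=telegraf;Password=redacted"); err != nil {
		fmt.Println(err)
	}
}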

@danielnelson
Contributor

That could be it; can you test it? I can make you a build if it would help, or it's pretty easy to set up Go as well.

@arnodenuijl
Contributor

I will test it today. It was indeed easy to set up Go.

@arnodenuijl
Contributor

arnodenuijl commented May 22, 2019

Putting the defer conn.Close() before the ping seems to resolve the issue. The number of goroutines stays at about 16 (without the change it quickly goes into the 1000s).
On the library's GitHub I asked whether the library authors see any problems with this. I saw the same pattern in the sample (open -> ping -> defer) and I wonder if there is a specific reason for it.

How can we proceed with this?

@kardianos

Yes, looking at the code in question and the goroutine list, this is indeed the issue. Putting defer db.Close above the ping will fix it.

However: why is Telegraf apparently opening a new connection on every stat collection? Why are you creating a new connection pool, opening a connection from it, then closing it all down every time (every 30 seconds?)? This is a complete waste of resources. Also, don't use Ping like this; that is a complete waste of time and additional resources. Just try the query, and if it works, it works. All Ping does is add yet another round trip doing a pointless select 1;.

You want three phases when working with this. Init: create the connection pool; don't use Ping, it doesn't help you in this context. Then periodically use this connection pool (sql.DB) and call Query. Also, naming the variable conn in conn, err := sql.Open(...) is misleading at best. This is NOT A CONNECTION. It is a connection pool, and at open time no network connection has yet been established.

This will be an issue for any database stats, not just SQL Server. This will be an issue for any OS, not just Windows.
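
As a sketch of that shape (illustrative only; the struct, the example query, and the 30 second interval are assumptions, not Telegraf's code):

package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/denisenkom/go-mssqldb"
)

// collector holds the pool for the lifetime of the process.
type collector struct {
	db *sql.DB
}

func newCollector(dsn string) (*collector, error) {
	// sql.Open only parses the DSN and creates the pool;
	// no network connection is made here.
	db, err := sql.Open("mssql", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(1) // a metrics collector rarely needs more
	return &collector{db: db}, nil
}

func (c *collector) gather() error {
	// No Ping: a broken connection surfaces as an error from the query itself.
	rows, err := c.db.Query("SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters")
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		// ... scan into metrics ...
	}
	return rows.Err()
}

func main() {
	c, err := newCollector("Server=localhost;Port=1433;User Id=telegraf;Password=redacted")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer c.db.Close()
	for range time.Tick(30 * time.Second) {
		if err := c.gather(); err != nil {
			fmt.Println("collection error:", err)
		}
	}
}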

@danielnelson
Contributor

@kardianos All good points; it would be good to make all of those changes. We could look at the postgresql plugins as an example of using the connection pool.

For this issue, though, let's only put the defer in the right spot and optionally remove the unneeded ping. @arnodenuijl Would you be able to open a PR for that change?

@arnodenuijl
Contributor

I created a pull request. It's the second one of my life, so if I need to do anything differently, just say so.

@danielnelson
Contributor

Looks perfect, thanks!

I opened issue #5898 for the connection pool work; it isn't something that I am going to work on right now, but it is good to have it tracked.
