SQL Server Input Error: Took Longer To Collect #3048
Comments
When the hang occurs, how many connections are open to SQL Server?
Nine connections are open from the telegraf user (10 are expected) when the hang occurs. There are dozens of connections from other users during the same time, which is normal behavior. Since these queries select from system DMVs, the standard users don't have access to query the same tables. I initially thought blocking could be the issue, but after running another trace and implementing some blocking alert jobs, that was not the case.
I actually expected 0 or 1 connections (from the hung query); it looks from the code that we close the connections after every collection.
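(For context, a minimal sketch of the open-query-close pattern described above, assuming the go-mssqldb driver that the sqlserver input uses; this is an illustration, not the plugin's actual code. Note that there is no timeout on the query call, so a statement that never returns keeps the collection blocked.)

package main

import (
	"database/sql"
	"log"

	_ "github.com/denisenkom/go-mssqldb" // SQL Server driver, registered as "mssql"
)

// collect runs one query for one collection pass: open a connection,
// execute the statement, then close the connection again. There is no
// timeout on conn.Query, so a statement that never returns blocks here.
func collect(dsn, query string) error {
	conn, err := sql.Open("mssql", dsn)
	if err != nil {
		return err
	}
	defer conn.Close() // the connection is closed after every collection

	rows, err := conn.Query(query)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		// convert each result row into metrics here
	}
	return rows.Err()
}

func main() {
	dsn := "Server=SERVERNAME;User Id=telegraf;Password=XXXXXXXX;app name=telegraf" // placeholder values
	if err := collect(dsn, "SELECT 1 AS ok"); err != nil {
		log.Print(err)
	}
}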
Can you try to collect a trace while Telegraf is hung? If you run with
I misinterpreted the question; Telegraf is opening and closing each connection every collection interval. So to answer your original question, there are 0 connections open during the hang. Here is an excerpt combining the SQL Profiler trace and the Telegraf debug log:
2017-07-23T17:55:44Z ---------- 10 SQL connections are opened
2017-07-23T17:55:59Z ---------- Only 9 SQL connections opened
2017-07-23T17:56:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
Let's try to collect a trace when it is hung, then.
Alright, that sounds like a plan. I downloaded curl on the Windows 7 machine I am working on, and I am receiving the error: curl: (6) Could not resolve host:
Never mind, I figured it out; it was a syntax error and I needed to remove the quotes. I will post back when it hangs. I appreciate the help.
I have attached the trace. The "hang" happened about 4 hours before I ran the 60-second trace. Once again, only 9 connections opened and closed to SQL Server.
Please identify which query is waiting on the lock.
After reviewing multiple SQL Server traces, it seems to be the query in the attached file that is never kicked off.
Tested in multiple small VMs (2 vCPU & 4 GB RAM), and the execution time is less than 60 ms.
The query completes in under 1 second each time it runs; however, the query is not being executed at all when the "hang" occurs. I have implemented a blocking capture job and also a long-running query alert, and I will post results after the next hang. You mentioned in a comment above to "identify which query is waiting the lock". Can you please elaborate on how you can tell a query is waiting? Thanks.
For queries that are blocked or are blocking others, the WaitingSessionID and BlockingSessionID columns will give you the related queries.
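(As an illustration of that approach, a sketch only: the DMV query below is generic blocking detection against sys.dm_os_waiting_tasks using the same column aliases, not necessarily the exact statement the plugin ships, and the connection string values are placeholders.)

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver
)

// Generic blocking check: every task that is waiting on another session,
// together with the text of the statement it is running.
const blockingQuery = `
SELECT wt.session_id          AS WaitingSessionID,
       wt.blocking_session_id AS BlockingSessionID,
       wt.wait_type,
       wt.wait_duration_ms,
       st.text                AS WaitingStatement
FROM sys.dm_os_waiting_tasks wt
JOIN sys.dm_exec_requests r ON r.session_id = wt.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE wt.blocking_session_id IS NOT NULL;`

func main() {
	db, err := sql.Open("mssql", "Server=SERVERNAME;User Id=telegraf;Password=XXXXXXXX") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(blockingQuery)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var waiting, blocking, waitMs int64
		var waitType string
		var stmt sql.NullString
		if err := rows.Scan(&waiting, &blocking, &waitType, &waitMs, &stmt); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("session %d is waiting on session %d (%s, %d ms): %s\n",
			waiting, blocking, waitType, waitMs, stmt.String)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}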
I'm running in a Docker container and using the Prometheus Client output plugin. Here's the health check I use to detect the condition (so Docker can restart Telegraf):
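(Purely as an illustrative sketch, not the commenter's actual snippet: one way such a health check can work is to scrape the prometheus_client listener, assumed here to be on :9273, and treat the absence of any sqlserver metric as unhealthy; the prometheus_client output's metric expiration drops stale series once the input stops collecting.)

package main

import (
	"bufio"
	"net/http"
	"os"
	"strings"
	"time"
)

// Exit 0 (healthy) if the prometheus_client endpoint still exposes any
// sqlserver metric, exit 1 (unhealthy) otherwise. The address and metric
// name prefix are assumptions made for this sketch.
func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:9273/metrics") // assumed listen address
	if err != nil {
		os.Exit(1)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "sqlserver_") { // assumed metric name prefix
			os.Exit(0)
		}
	}
	os.Exit(1)
}

Compiled to a small binary, a check along these lines can be wired in with a Dockerfile HEALTHCHECK instruction so Docker restarts the container when it fails.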
I believe the root cause of this issue is #2919 |
Bug report
While running Telegraf to collect SQL Server metrics, an error is encountered, sometimes after a few hours and sometimes after a few days. The error stops all future collection as well.
Error Message: Input Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
I am aware of one other post about this.
Link 1: #2051 - not relevant, because the queries are finishing in under 6 seconds.
Relevant telegraf.conf:
[agent]
interval = "15s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "15s"
flush_jitter = "5s"
debug = true
quiet = false
logfile = "C:\Program Files\Telegraf\telegraf.log"
[[inputs.sqlserver]]
servers = [
"Server=SERVERNAME\SQL_INSTANCENAME;User Id=telegraf;Password=XXXXXXXX;app name=telegraf;log=1;"
]
[[outputs.prometheus_client]]
listen = ":9273"
System info:
Telegraf Version: 1.4.0
Collection Server OS: Windows 7 (2 CPU, 4 GB RAM)
SQL Server Details: Windows Server 2008 R2 (4 CPU, 24 GB RAM), SQL Server 2008 R2
No CPU or memory bottleneck when Telegraf is running.
Steps to reproduce:
Expected behavior:
Runs smoothly and without error while collecting SQL Server metrics.
Actual behavior:
The SQL Server input error is encountered at random.
Additional info:
After running multiple SQL Server Profiler traces, I noticed that 10 different database connections open up and each run their own query every interval. In my case they always complete in under 6 seconds, with the majority of them completing in under 2 seconds. When the error is encountered, only 9 connections are opened and finish. The query that is missing differs and doesn't seem to repeat. Please see the trace excerpt attached.
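(To make the failure mode concrete, here is a minimal sketch of the pattern, not Telegraf's actual agent or plugin code: one worker goroutine per query, a wait on all of them, and an interval loop that logs an overrun while the previous collection is still running. The query names and the runQuery helper are placeholders; "query-7" stands in for the one statement that never returns.)

package main

import (
	"log"
	"sync"
	"time"
)

// runQuery stands in for "open a connection, run one DMV query, close
// the connection"; query-7 simulates the one statement that never returns.
func runQuery(name string) {
	if name == "query-7" {
		select {} // blocks forever
	}
	time.Sleep(500 * time.Millisecond)
}

// gather launches one worker per query and waits for all of them, so a
// single hung worker keeps the whole collection from ever finishing.
func gather(queries []string) {
	var wg sync.WaitGroup
	for _, q := range queries {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			runQuery(name)
		}(q)
	}
	wg.Wait()
}

func main() {
	queries := []string{"query-1", "query-2", "query-3", "query-7"}
	interval := 15 * time.Second // matches the report's collection interval

	done := make(chan struct{})
	go func() {
		gather(queries)
		close(done)
	}()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		select {
		case <-done:
			log.Print("collection finished in time")
			return
		default:
			log.Printf("E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (%s)", interval)
		}
	}
}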
Here is the relevant Log info from debug mode:
2017-07-23T21:53:32Z D! Output [prometheus_client] wrote batch of 1000 metrics in 43.0043ms
2017-07-23T21:53:33Z D! Output [prometheus_client] buffer fullness: 789 / 10000 metrics.
2017-07-23T21:53:33Z D! Output [prometheus_client] wrote batch of 789 metrics in 13.0013ms
2017-07-23T21:53:47Z D! Output [prometheus_client] buffer fullness: 24 / 10000 metrics.
2017-07-23T21:53:47Z D! Output [prometheus_client] wrote batch of 24 metrics in 8.0008ms
2017-07-23T21:54:02Z D! Output [prometheus_client] wrote batch of 1000 metrics in 37.0037ms
2017-07-23T21:54:03Z D! Output [prometheus_client] buffer fullness: 766 / 10000 metrics.
2017-07-23T21:54:03Z D! Output [prometheus_client] wrote batch of 766 metrics in 18.0018ms
2017-07-23T21:54:20Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:20Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:54:34Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:34Z D! Output [prometheus_client] wrote batch of 895 metrics in 28.0028ms
2017-07-23T21:54:50Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:50Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:55:06Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:55:06Z D! Output [prometheus_client] wrote batch of 895 metrics in 29.0029ms
2017-07-23T21:55:16Z D! Output [prometheus_client] buffer fullness: 13 / 10000 metrics.
2017-07-23T21:55:16Z D! Output [prometheus_client] wrote batch of 13 metrics in 4.0004ms
2017-07-23T21:55:32Z D! Output [prometheus_client] wrote batch of 1000 metrics in 41.0041ms
2017-07-23T21:55:36Z D! Output [prometheus_client] buffer fullness: 777 / 10000 metrics.
2017-07-23T21:55:36Z D! Output [prometheus_client] wrote batch of 777 metrics in 13.0013ms
2017-07-23T21:55:48Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:55:48Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:56:04Z D! Output [prometheus_client] buffer fullness: 887 / 10000 metrics.
2017-07-23T21:56:04Z D! Output [prometheus_client] wrote batch of 887 metrics in 25.0025ms
2017-07-23T21:56:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:20Z D! Output [prometheus_client] buffer fullness: 10 / 10000 metrics.
2017-07-23T21:56:20Z D! Output [prometheus_client] wrote batch of 10 metrics in 4.0004ms
2017-07-23T21:56:31Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:32Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:56:46Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:49Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:57:01Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:57:03Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:57:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
Trace file is attached: Trace.txt