Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL Server Input Error: Took Longer To Collect #3048

Closed
dulrich14 opened this issue Jul 24, 2017 · 16 comments
Closed

SQL Server Input Error: Took Longer To Collect #3048

dulrich14 opened this issue Jul 24, 2017 · 16 comments

Comments

@dulrich14
Copy link

Bug report

While running Telegraf to collect SQL Server metrics an Error is encountered. Sometimes it is after a few hours, sometimes it is after a few days. The error stops all future collection as well.

Error Message: Input Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)

I am aware of one other posts about this.
Link 1: #2051 - Not relevant because the queries are finishing under 6 secs.

Relevant telegraf.conf:

[agent]
interval = "15s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "15s"
flush_jitter = "5s"
debug = true
quiet = false
logfile = "C:\Program Files\Telegraf\telegraf.log"

[[inputs.sqlserver]]
servers = [
"Server=SERVERNAME\SQL_INSTANCENAME;User Id=telegraf;Password=XXXXXXXX;app name=telegraf;log=1;"
]

[[outputs.prometheus_client]]
listen = ":9273"

System info:

Telegraf Version: 1.4.0
Collection Server OS: Windows 7 (2cpu , 4gb of RAM)
SQL Server Details: Server 2008R2 (4cpu 24GB RAM) SQL Server 2008R2

No CPU or Memory bottle neck when Telegraf is running.

Steps to reproduce:

  1. Set up Prometheus Database
  2. Set up SQL Server Connection
  3. Start Trace and let it run

Expected behavior:

Run smoothly without error while collecting the best SQL Server metrics.

Actual behavior:

Random encounter of Input SQL Server error.

Additional info:

After running multiple SQL Server Profile Traces I noticed that 10 different database connections open up and run their own query each interval. In my case they always complete under 6seconds with the majority of them completing under 2seconds. When the error is encountered only 9 connections are open and finish. The query that is missing differs and doesn't seem to repeat. Please see trace excerpt attached.

Here is the relevant Log info from debug mode:
2017-07-23T21:53:32Z D! Output [prometheus_client] wrote batch of 1000 metrics in 43.0043ms
2017-07-23T21:53:33Z D! Output [prometheus_client] buffer fullness: 789 / 10000 metrics.
2017-07-23T21:53:33Z D! Output [prometheus_client] wrote batch of 789 metrics in 13.0013ms
2017-07-23T21:53:47Z D! Output [prometheus_client] buffer fullness: 24 / 10000 metrics.
2017-07-23T21:53:47Z D! Output [prometheus_client] wrote batch of 24 metrics in 8.0008ms
2017-07-23T21:54:02Z D! Output [prometheus_client] wrote batch of 1000 metrics in 37.0037ms
2017-07-23T21:54:03Z D! Output [prometheus_client] buffer fullness: 766 / 10000 metrics.
2017-07-23T21:54:03Z D! Output [prometheus_client] wrote batch of 766 metrics in 18.0018ms
2017-07-23T21:54:20Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:20Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:54:34Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:34Z D! Output [prometheus_client] wrote batch of 895 metrics in 28.0028ms
2017-07-23T21:54:50Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:50Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:55:06Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:55:06Z D! Output [prometheus_client] wrote batch of 895 metrics in 29.0029ms
2017-07-23T21:55:16Z D! Output [prometheus_client] buffer fullness: 13 / 10000 metrics.
2017-07-23T21:55:16Z D! Output [prometheus_client] wrote batch of 13 metrics in 4.0004ms
2017-07-23T21:55:32Z D! Output [prometheus_client] wrote batch of 1000 metrics in 41.0041ms
2017-07-23T21:55:36Z D! Output [prometheus_client] buffer fullness: 777 / 10000 metrics.
2017-07-23T21:55:36Z D! Output [prometheus_client] wrote batch of 777 metrics in 13.0013ms
2017-07-23T21:55:48Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:55:48Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:56:04Z D! Output [prometheus_client] buffer fullness: 887 / 10000 metrics.
2017-07-23T21:56:04Z D! Output [prometheus_client] wrote batch of 887 metrics in 25.0025ms
2017-07-23T21:56:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:20Z D! Output [prometheus_client] buffer fullness: 10 / 10000 metrics.
2017-07-23T21:56:20Z D! Output [prometheus_client] wrote batch of 10 metrics in 4.0004ms
2017-07-23T21:56:31Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:32Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:56:46Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:49Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:57:01Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:57:03Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:57:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)

Trace file is [attached]
Trace.txt

@danielnelson
Copy link
Contributor

When the hang occurs, how many connections are open to sql server?

@dulrich14
Copy link
Author

9 connections are open from the telegraf user (10 are expected) when the hang occurs. There are dozens of connections from other users during the same time which is normal behavior. Since these queries are selecting from system DMV's the standard users don't have access to query the same tables. I initially thought blocking could be the issue but after running another trace and implementing some blocking alert jobs that was not the case.

@danielnelson
Copy link
Contributor

I actually expected 0 or 1 connections (from the hung query), it looks from the code that we close the connections after every collection.

@danielnelson
Copy link
Contributor

danielnelson commented Jul 25, 2017

Can you try to collect a trace while Telegraf is hung?

If you run with --pprof-addr :6060 you can create a trace with curl '127.0.0.1:6060/debug/pprof/trace?seconds=60' > trace.out. You can also look at the trace file with go tool trace -http=':7070' trace trace.out, or just add the file here.

@dulrich14
Copy link
Author

I misinterpreted the question, telegraf is opening and closing each connection every collection interval. So to answer your original question there are 0 connections open during the hang.

Here is an excerpt combining sql profile trace and debug log from telegraf.

2017-07-23T17:55:44Z ---------- 10 SQL connections are opened
2017-07-23T17:55:48Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T17:55:48Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T17:55:50Z ----- All queries completed (less than 6 seconds) , 10 SQL connections Closed--

2017-07-23T17:55:59Z ----------- Only 9 SQL connections opened
2017-07-23T17:56:04Z D! Output [prometheus_client] buffer fullness: 887 / 10000 metrics.
2017-07-23T17:56:04Z D! Output [prometheus_client] wrote batch of 887 metrics in 25.0025ms
2017-07-23T17:56:04Z ----- All queries completed (less than 6 seconds) , 9 SQL connections Closed--
--------------------No more SQL Connections after this time ---------------------------------------

2017-07-23T17:56:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T17:56:20Z D! Output [prometheus_client] buffer fullness: 10 / 10000 metrics.
2017-07-23T17:56:20Z D! Output [prometheus_client] wrote batch of 10 metrics in 4.0004ms
2017-07-23T17:56:31Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)

@danielnelson
Copy link
Contributor

Let's try to collect a trace when it is hung then?

@dulrich14
Copy link
Author

Alright that sounds like a plan. I downloaded curl on the windows 7 machine I am working on
I am trying to run the trace command with my machine name and ipaddress in the hostname spot:
curl 'hostname:6060/debug/pprof/trace?seconds=60' > trace.out

I am receiving the error: curl (6) could not resolve host:
The file trace.out is being created (empty) and I can browse the http://localhost:6060/debug/pprof/ site with trace metrics. Any suggestions? I am new to curl.

@dulrich14
Copy link
Author

Never mind, I figured it out, syntax error, I needed to remove the quotes. I will post back when it hangs. I appreciate the help.
curl http://localhost:6060/debug/pprof/trace?seconds=60 > trace.out

@dulrich14
Copy link
Author

I have attached the trace. The "hang" happened about 4 hours before I ran the 60 second trace. Once again only 9 connections opened and closed to sql server.
trace.zip

@zensqlmonitor
Copy link
Contributor

please identify which query is waiting the lock

@dulrich14
Copy link
Author

After reviewing multiple SQL Server traces it seems to be the query in the file attached that is never kicked off.
Ghost_Perfmon_Query.txt

@zensqlmonitor
Copy link
Contributor

zensqlmonitor commented Jul 31, 2017

Tested in multiple small VMs (2 vCPU & 4 GB RAM), and the execution time is less than 60 ms.
Could you identify what is causing this scenario in SQL Server ?
You can have a look there, a good start:
https://www.mssqltips.com/sqlservertip/2927/identify-the-cause-of-sql-server-blocking/

@dulrich14
Copy link
Author

The query completes under 1 second each time it runs, the query however is not being executed when the "hang" occurs. I have implemented a blocking capture job and also a long running query alert. I will post results after the next hang.

You mentioned in a comment above "identify which query is waiting the lock" . Can you please elaborate how you can tell a query is waiting? Thanks

@zensqlmonitor
Copy link
Contributor

For queries that are blocked or are blocking others, with WaitingSessionID and BlockingSessionID, you will get the related queries.

@lukewpatterson
Copy link

I'm running in a Docker container and using Prometheus Client output plugin. Here's the health check I use to detect the condition (so Docker can restart telegraf):

wget -O - http://localhost:9273/metrics | grep State_ONLINE_Total

@danielnelson
Copy link
Contributor

I believe the root cause of this issue is #2919

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants