SQL Server Input Error: Took Longer To Collect #3048
Comments
When the hang occurs, how many connections are open to SQL Server?
Nine connections are open from the telegraf user (10 are expected) when the hang occurs. There are dozens of connections from other users during the same time, which is normal behavior. Since these queries select from system DMVs, the standard users don't have access to query the same tables. I initially thought blocking could be the issue, but after running another trace and implementing some blocking alert jobs, that was not the case.
I actually expected 0 or 1 connections (from the hung query); it looks from the code that we close the connections after every collection.
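(For context, a minimal sketch of the open-query-close pattern described above, assuming the go-mssqldb driver that the sqlserver input uses; this is an illustration, not the plugin's actual code. Note that there is no timeout on the query call, so a statement that never returns keeps the collection blocked.)

package main

import (
	"database/sql"
	"log"

	_ "github.com/denisenkom/go-mssqldb" // SQL Server driver, registered as "mssql"
)

// collect runs one query for one collection pass: open a connection,
// execute the statement, then close the connection again. There is no
// timeout on conn.Query, so a statement that never returns blocks here.
func collect(dsn, query string) error {
	conn, err := sql.Open("mssql", dsn)
	if err != nil {
		return err
	}
	defer conn.Close() // the connection is closed after every collection

	rows, err := conn.Query(query)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		// convert each result row into metrics here
	}
	return rows.Err()
}

func main() {
	dsn := "Server=SERVERNAME;User Id=telegraf;Password=XXXXXXXX;app name=telegraf" // placeholder values
	if err := collect(dsn, "SELECT 1 AS ok"); err != nil {
		log.Print(err)
	}
}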
Can you try to collect a trace while Telegraf is hung? If you run with
I misinterpreted the question; Telegraf is opening and closing each connection every collection interval. So to answer your original question, there are 0 connections open during the hang. Here is an excerpt combining the SQL Profiler trace and the Telegraf debug log:
2017-07-23T17:55:44Z ---------- 10 SQL connections are opened
2017-07-23T17:55:59Z ---------- Only 9 SQL connections opened
2017-07-23T17:56:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
Let's try to collect a trace when it is hung, then.
Alright, that sounds like a plan. I downloaded curl on the Windows 7 machine I am working on, and I am receiving the error: curl: (6) Could not resolve host:
Never mind, I figured it out; it was a syntax error and I needed to remove the quotes. I will post back when it hangs. I appreciate the help.
I have attached the trace. The "hang" happened about 4 hours before I ran the 60-second trace. Once again, only 9 connections opened and closed to SQL Server.
Please identify which query is waiting on the lock.
After reviewing multiple SQL Server traces, it seems to be the query in the attached file that is never kicked off.
Tested in multiple small VMs (2 vCPU & 4 GB RAM), and the execution time is less than 60 ms.
The query completes in under 1 second each time it runs; however, the query is not being executed at all when the "hang" occurs. I have implemented a blocking capture job and also a long-running query alert, and I will post results after the next hang. You mentioned in a comment above to "identify which query is waiting the lock". Can you please elaborate on how you can tell a query is waiting? Thanks.
For queries that are blocked or are blocking others, the WaitingSessionID and BlockingSessionID columns will give you the related queries.
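(As an illustration of that approach, a sketch only: the DMV query below is generic blocking detection against sys.dm_os_waiting_tasks using the same column aliases, not necessarily the exact statement the plugin ships, and the connection string values are placeholders.)

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver
)

// Generic blocking check: every task that is waiting on another session,
// together with the text of the statement it is running.
const blockingQuery = `
SELECT wt.session_id          AS WaitingSessionID,
       wt.blocking_session_id AS BlockingSessionID,
       wt.wait_type,
       wt.wait_duration_ms,
       st.text                AS WaitingStatement
FROM sys.dm_os_waiting_tasks wt
JOIN sys.dm_exec_requests r ON r.session_id = wt.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE wt.blocking_session_id IS NOT NULL;`

func main() {
	db, err := sql.Open("mssql", "Server=SERVERNAME;User Id=telegraf;Password=XXXXXXXX") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(blockingQuery)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var waiting, blocking, waitMs int64
		var waitType string
		var stmt sql.NullString
		if err := rows.Scan(&waiting, &blocking, &waitType, &waitMs, &stmt); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("session %d is waiting on session %d (%s, %d ms): %s\n",
			waiting, blocking, waitType, waitMs, stmt.String)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}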
I'm running in a Docker container and using the Prometheus Client output plugin. Here's the health check I use to detect the condition (so Docker can restart Telegraf):
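(Purely as an illustrative sketch, not the commenter's actual snippet: one way such a health check can work is to scrape the prometheus_client listener, assumed here to be on :9273, and treat the absence of any sqlserver metric as unhealthy; the prometheus_client output's metric expiration drops stale series once the input stops collecting.)

package main

import (
	"bufio"
	"net/http"
	"os"
	"strings"
	"time"
)

// Exit 0 (healthy) if the prometheus_client endpoint still exposes any
// sqlserver metric, exit 1 (unhealthy) otherwise. The address and metric
// name prefix are assumptions made for this sketch.
func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:9273/metrics") // assumed listen address
	if err != nil {
		os.Exit(1)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "sqlserver_") { // assumed metric name prefix
			os.Exit(0)
		}
	}
	os.Exit(1)
}

Compiled to a small binary, a check along these lines can be wired in with a Dockerfile HEALTHCHECK instruction so Docker restarts the container when it fails.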
I believe the root cause of this issue is #2919 |
Bug report
While running Telegraf to collect SQL Server metrics, an error is encountered, sometimes after a few hours and sometimes after a few days. The error stops all future collection as well.
Error Message: Input Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
I am aware of one other post about this.
Link 1: #2051 - not relevant, because the queries are finishing in under 6 seconds.
Relevant telegraf.conf:
[agent]
interval = "15s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "15s"
flush_jitter = "5s"
debug = true
quiet = false
logfile = "C:\Program Files\Telegraf\telegraf.log"
[[inputs.sqlserver]]
servers = [
"Server=SERVERNAME\SQL_INSTANCENAME;User Id=telegraf;Password=XXXXXXXX;app name=telegraf;log=1;"
]
[[outputs.prometheus_client]]
listen = ":9273"
System info:
Telegraf Version: 1.4.0
Collection Server OS: Windows 7 (2 CPU, 4 GB RAM)
SQL Server Details: Windows Server 2008 R2 (4 CPU, 24 GB RAM), SQL Server 2008 R2
No CPU or memory bottleneck when Telegraf is running.
Steps to reproduce:
Expected behavior:
Runs smoothly and without error while collecting SQL Server metrics.
Actual behavior:
The SQL Server input error is encountered at random.
Additional info:
After running multiple SQL Server Profiler traces, I noticed that 10 different database connections open up and each run their own query every interval. In my case they always complete in under 6 seconds, with the majority of them completing in under 2 seconds. When the error is encountered, only 9 connections are opened and finish. The query that is missing differs and doesn't seem to repeat. Please see the trace excerpt attached.
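(To make the failure mode concrete, here is a minimal sketch of the pattern, not Telegraf's actual agent or plugin code: one worker goroutine per query, a wait on all of them, and an interval loop that logs an overrun while the previous collection is still running. The query names and the runQuery helper are placeholders; "query-7" stands in for the one statement that never returns.)

package main

import (
	"log"
	"sync"
	"time"
)

// runQuery stands in for "open a connection, run one DMV query, close
// the connection"; query-7 simulates the one statement that never returns.
func runQuery(name string) {
	if name == "query-7" {
		select {} // blocks forever
	}
	time.Sleep(500 * time.Millisecond)
}

// gather launches one worker per query and waits for all of them, so a
// single hung worker keeps the whole collection from ever finishing.
func gather(queries []string) {
	var wg sync.WaitGroup
	for _, q := range queries {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			runQuery(name)
		}(q)
	}
	wg.Wait()
}

func main() {
	queries := []string{"query-1", "query-2", "query-3", "query-7"}
	interval := 15 * time.Second // matches the report's collection interval

	done := make(chan struct{})
	go func() {
		gather(queries)
		close(done)
	}()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		select {
		case <-done:
			log.Print("collection finished in time")
			return
		default:
			log.Printf("E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (%s)", interval)
		}
	}
}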
Here is the relevant Log info from debug mode:
2017-07-23T21:53:32Z D! Output [prometheus_client] wrote batch of 1000 metrics in 43.0043ms
2017-07-23T21:53:33Z D! Output [prometheus_client] buffer fullness: 789 / 10000 metrics.
2017-07-23T21:53:33Z D! Output [prometheus_client] wrote batch of 789 metrics in 13.0013ms
2017-07-23T21:53:47Z D! Output [prometheus_client] buffer fullness: 24 / 10000 metrics.
2017-07-23T21:53:47Z D! Output [prometheus_client] wrote batch of 24 metrics in 8.0008ms
2017-07-23T21:54:02Z D! Output [prometheus_client] wrote batch of 1000 metrics in 37.0037ms
2017-07-23T21:54:03Z D! Output [prometheus_client] buffer fullness: 766 / 10000 metrics.
2017-07-23T21:54:03Z D! Output [prometheus_client] wrote batch of 766 metrics in 18.0018ms
2017-07-23T21:54:20Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:20Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:54:34Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:34Z D! Output [prometheus_client] wrote batch of 895 metrics in 28.0028ms
2017-07-23T21:54:50Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:54:50Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:55:06Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:55:06Z D! Output [prometheus_client] wrote batch of 895 metrics in 29.0029ms
2017-07-23T21:55:16Z D! Output [prometheus_client] buffer fullness: 13 / 10000 metrics.
2017-07-23T21:55:16Z D! Output [prometheus_client] wrote batch of 13 metrics in 4.0004ms
2017-07-23T21:55:32Z D! Output [prometheus_client] wrote batch of 1000 metrics in 41.0041ms
2017-07-23T21:55:36Z D! Output [prometheus_client] buffer fullness: 777 / 10000 metrics.
2017-07-23T21:55:36Z D! Output [prometheus_client] wrote batch of 777 metrics in 13.0013ms
2017-07-23T21:55:48Z D! Output [prometheus_client] buffer fullness: 895 / 10000 metrics.
2017-07-23T21:55:48Z D! Output [prometheus_client] wrote batch of 895 metrics in 31.0031ms
2017-07-23T21:56:04Z D! Output [prometheus_client] buffer fullness: 887 / 10000 metrics.
2017-07-23T21:56:04Z D! Output [prometheus_client] wrote batch of 887 metrics in 25.0025ms
2017-07-23T21:56:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:20Z D! Output [prometheus_client] buffer fullness: 10 / 10000 metrics.
2017-07-23T21:56:20Z D! Output [prometheus_client] wrote batch of 10 metrics in 4.0004ms
2017-07-23T21:56:31Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:32Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:56:46Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:56:49Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:57:01Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
2017-07-23T21:57:03Z D! Output [prometheus_client] buffer fullness: 0 / 10000 metrics.
2017-07-23T21:57:16Z E! Error in plugin [inputs.sqlserver]: took longer to collect than collection interval (15s)
Trace file is attached: Trace.txt