Use timeout smaller than 10 seconds #959

PierreF · 2016-04-04T10:18:28Z

Mongo and Prometheus use a timeout of 10 seconds. At least with Mongo, a 10-second timeout causes strange behavior, probably because metric collection then takes longer than the metric collection interval (10 seconds in the default config):

Data written by Telegraf is no longer "rounded" to 10 seconds, which creates a hole when you round it using group by time(10s), as in the second query:

> select usage_idle from cpu where cpu='cpu-total' and time >= '2016-04-04T09:56:37Z' and time <= '2016-04-04T09:57:23Z'
name: cpu
---------
time                    usage_idle
2016-04-04T09:56:37Z     88.7699366396944
2016-04-04T09:56:49Z     84.20123565754106
2016-04-04T09:57:00Z     86.28820960698596
2016-04-04T09:57:12Z     85.76434515993311
2016-04-04T09:57:23Z     82.51209854815667

> select mean(usage_idle) from cpu where cpu='cpu-total' and time >= '2016-04-04T09:56:37Z' and time <= '2016-04-04T09:57:23Z'  group by time(10s)
name: cpu
---------
time                    mean
2016-04-04T09:56:30Z     88.7699366396944
2016-04-04T09:56:40Z     84.20123565754106
2016-04-04T09:56:50Z
2016-04-04T09:57:00Z     86.28820960698596
2016-04-04T09:57:10Z     85.76434515993311
2016-04-04T09:57:20Z     82.51209854815667

Telegraf logs:

2016/04/04 11:55:31 Starting Telegraf (version 0.11.1-75-g357849c)
[...]
2016/04/04 11:56:40 Wrote 35 metrics to output influxdb in 5.402099ms
error dialing over ssl, no reachable servers
2016/04/04 11:56:49 Error in input [mongodb]: Unable to connect to MongoDB, no reachable servers
2016/04/04 11:56:49 Gathered metrics, (10s interval), from 11 inputs in 11.507262321s
2016/04/04 11:56:50 Wrote 35 metrics to output influxdb in 4.578914ms
error dialing over ssl, no reachable servers
2016/04/04 11:57:00 Error in input [mongodb]: Unable to connect to MongoDB, no reachable servers
2016/04/04 11:57:00 Gathered metrics, (10s interval), from 11 inputs in 11.513853869s
[NOTE: no data written at 11:57:00]
2016/04/04 11:57:10 Wrote 35 metrics to output influxdb in 4.51493ms

This PR uses a slightly smaller timeout (8 seconds), which makes Telegraf behave as usual: a datapoint is sent every 10 seconds, with timestamps rounded to multiples of 10 seconds (0, 10, 20, ...).

sebito91 · 2016-04-04T12:16:32Z

Isn't this handled in the [agent] section of the config?

https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md

PierreF · 2016-04-04T12:36:49Z

I don't know which option in the [agent] section you are referring to.

The timeouts for Mongo and Prometheus (and all other timeouts) are hard-coded values in the .go files. My issue is that when the metric gathering timeout is the same as the metric interval, I get the strange behavior described above.

I do not want to increase the metric interval; I want to keep the default of 10 seconds.

sparrc · 2016-04-04T16:56:31Z

maybe we should set it to 5s? since that's what we have all of our http/tcp dial timeouts set to

Use timeout smaller than 10 seconds

b51473e

PierreF force-pushed the timeout2 branch from e676a9b to b51473e April 4, 2016 11:35

sparrc closed this in 5fe8903 Apr 4, 2016

PierreF deleted the timeout2 branch August 4, 2018 13:20
