
Ping Input Not Updating InfluxDB if Server Down #4772

Closed
lizaoreo opened this issue Sep 28, 2018 · 7 comments
@lizaoreo

Relevant telegraf.conf:

[[inputs.ping]]
urls = ["192.168.10.1"] # required
interval = "1m"
count = 2 # required
ping_interval = 0.0
timeout = 0.0

System info:

Telegraf 1.8
CentOS 7

Steps to reproduce:

  1. Configure ping input
  2. Take server being monitored offline
  3. Observe that, rather than writing a point to InfluxDB showing that pings are no longer being returned, Telegraf only throws an error in its log. Grafana therefore never updates, because InfluxDB still shows the last good ping as the most recent point.

Expected behavior:

Telegraf should still report the stats to InfluxDB when the target is down. For instance, a row like this:

time average_response_ms host packets_received packets_transmitted percent_packet_loss url
2018-09-28T20:13:00Z 80.824 "mec-itlinuxlab" 2 2 0 "192.168.10.1"

Should be more like:

time average_response_ms host packets_received packets_transmitted percent_packet_loss url
2018-09-28T20:13:00Z 80.824 "mec-itlinuxlab" 0 2 100 "192.168.10.1"

Then I can use my Grafana dashboard to report a service as down whenever percent_packet_loss is greater than 25 percent.
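
For illustration, a Grafana panel could drive that kind of alert with an InfluxQL query along these lines (measurement and field names are taken from the tables above; the target IP, time range, and grouping interval are just example values, not from the original report):

SELECT mean("percent_packet_loss") FROM "ping" WHERE "url" = '192.168.10.1' AND time > now() - 5m GROUP BY time(1m) fill(null)

With rows like the second table above being written during an outage, this series would jump to 100 and a threshold above 25 would fire.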

Actual behavior:

I get the below:

time average_response_ms host packets_received packets_transmitted percent_packet_loss url
2018-09-28T20:13:00Z 80.824 "mec-itlinuxlab" 2 2 0 "192.168.10.1"

Additional info:

This used to work, so at some point something must have changed in how the results are returned. I honestly don't know when, as I don't always keep Telegraf updated and we don't often have services go down. There have been a few instances recently where we didn't know a service was down until someone reported it, because the dashboard never reflects the outage, which is what prompted me to start researching the issue now.

@glinton glinton added the bug unexpected problem or unintended behavior label Oct 1, 2018
@danielnelson danielnelson added this to the 1.8.2 milestone Oct 4, 2018
@david-guenault

Hi, I can confirm this. Only the first error is written to the time series database, then nothing until the service comes back.

@david-guenault

By the way, packet loss is more about quality than availability; you can't use a quality metric to build an SLA metric. Bad quality does not always mean a loss of availability (except for 100% packet loss), just bad quality.
But maybe I'm wrong.

@glinton glinton self-assigned this Oct 11, 2018
@glinton

glinton commented Oct 11, 2018

What is the query you are using? I'm getting the following:

  • telegraf 1.8.0
  • influxdb 1.6.1
> select * from ping 
name: ping
time                average_response_ms maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms url
----                ------------------- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- ---
1539281101000000000 0.082               0.126               0.038               2                2                   0                   0           0.044                 172.17.0.2
1539281111000000000 0.071               0.089               0.053               2                2                   0                   0           0.018                 172.17.0.2
1539281121000000000 0.089               0.094               0.084               2                2                   0                   0           0.005                 172.17.0.2
# shut off node at 172.17.0.2
1539281138000000000                                                                                                                      2                                 172.17.0.2
1539281143000000000                                                             0                4                   100                 0                                 172.17.0.2
1539281153000000000                                                             0                4                   100                 0                                 172.17.0.2
1539281163000000000                                                             0                4                   100                 0                                 172.17.0.2

@david-guenault

Exactly the same here: one point with result code 2, then 100% packet loss with a result code of 0.

time                average_response_ms host  maximum_response_ms minimum_response_ms name      packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms tag1      tag2         tag3          url
----                ------------------- ----  ------------------- ------------------- ----      ---------------- ------------------- ------------------- ----------- --------------------- ----      ----         ----          ---
1539085200000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085210000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085220000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085230000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085240000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085250000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085260000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085271000000000                     host                                          name 0                2                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085280000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0

@glinton

glinton commented Oct 12, 2018

What version of InfluxDB are you using? And can you get the query that the dashboard is using? I wonder if it is only selecting rows where the average response time isn't empty.
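
For example, a dashboard query shaped like the sketch below (measurement and field names from the output above; the time range and grouping are arbitrary) would drop the outage points entirely, because average_response_ms is simply not written when every ping fails:

SELECT mean("average_response_ms") FROM "ping" WHERE time > now() - 1h GROUP BY time(1m)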

@david-guenault

Here is the data. By the way, the problem is not with the query but with metric storage: it starts with an error code > 0 and should keep the same error code until the target is available again.

LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.5.1804 (Core)
Release:        7.5.1804
Codename:       Core
Influx DB Package: influxdb-1.5.2-1.x86_64

For the queries, I'm first counting error codes and then counting all the metrics for a specific tag within a specific time range, then just doing the math to get a rough availability figure (count_ko / count_total).

SELECT count("result_code") FROM "ping" WHERE ("name" =~ /^name$/ AND "result_code" > 0) AND time >= 1538344800000ms;
SELECT count("result_code") FROM "ping" WHERE ("name" =~ /^name$/) AND time >= 1538344800000ms
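
Since plain InfluxQL can't divide the results of those two queries in one statement, the count_ko / count_total ratio has to be computed in the dashboard. An alternative sketch, assuming the 100% packet loss points are written reliably as in the output above, would be to derive an availability percentage from percent_packet_loss directly (tag filter and time range copied from the queries above):

SELECT 100 - mean("percent_packet_loss") AS availability_percent FROM "ping" WHERE ("name" =~ /^name$/) AND time >= 1538344800000ms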

@glinton

glinton commented Oct 17, 2018

OK! Now it makes sense. So the result code for a ping that times out is still 0. I believe #4550 was intended to address that, but it instead masked the exit code 1 as a 0.
