
Ping Input Not Updating InfluxDB if Server Down #4772

Closed
lizaoreo opened this issue Sep 28, 2018 · 7 comments
@lizaoreo

Relevant telegraf.conf:

[[inputs.ping]]
urls = ["192.168.10.1"] # required
interval = "1m"
count = 2 # required
ping_interval = 0.0
timeout = 0.0

System info:

Telegraf 1.8
CentOS 7

Steps to reproduce:

  1. Configure ping input
  2. Take server being monitored offline
  3. Observe that, rather than writing a point to InfluxDB showing that pings are no longer being returned, Telegraf only throws an error in its log. Grafana therefore never updates, because InfluxDB still shows the last good ping as the most recent point.

Expected behavior:

Telegraf should still report the stats to InfluxDB when the target is down. For instance, a row like this:

time average_response_ms host packets_received packets_transmitted percent_packet_loss url
2018-09-28T20:13:00Z 80.824 "mec-itlinuxlab" 2 2 0 "192.168.10.1"

Should be more like:

time average_response_ms host packets_received packets_transmitted percent_packet_loss url
2018-09-28T20:13:00Z 80.824 "mec-itlinuxlab" 0 2 100 "192.168.10.1"

Then I can use my Grafana dashboard to report a service as down whenever percent_packet_loss is greater than 25 percent.
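
For illustration, a Grafana panel could drive that kind of alert with an InfluxQL query along these lines (measurement and field names are taken from the tables above; the target IP, time range, and grouping interval are just example values, not from the original report):

SELECT mean("percent_packet_loss") FROM "ping" WHERE "url" = '192.168.10.1' AND time > now() - 5m GROUP BY time(1m) fill(null)

With rows like the second table above being written during an outage, this series would jump to 100 and a threshold above 25 would fire.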

Actual behavior:

I get the below:

time average_response_ms host packets_received packets_transmitted percent_packet_loss url
2018-09-28T20:13:00Z 80.824 "mec-itlinuxlab" 2 2 0 "192.168.10.1"

Additional info:

This used to work, so at some point something must have changed in how the results are returned. I honestly don't know when, as I don't always keep Telegraf updated and we don't often have services go down. There have been a few instances recently where we didn't know a service was down until someone reported it, because the dashboard never reflects the outage, which is what prompted me to start researching the issue now.

@glinton glinton added the bug unexpected problem or unintended behavior label Oct 1, 2018
@danielnelson danielnelson added this to the 1.8.2 milestone Oct 4, 2018
@david-guenault

Hi, I can confirm this. Only the first error is written to the time series database, then nothing until the service comes back.

@david-guenault

By the way, packet loss is more about quality than availability; you can't use a quality metric to build an SLA metric. Bad quality does not always mean a loss of availability (except for 100% packet loss), just bad quality.
But maybe I'm wrong.

@glinton glinton self-assigned this Oct 11, 2018
@glinton

glinton commented Oct 11, 2018

What is the query you are using? I'm getting the following:

  • telegraf 1.8.0
  • influxdb 1.6.1
> select * from ping 
name: ping
time                average_response_ms maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms url
----                ------------------- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- ---
1539281101000000000 0.082               0.126               0.038               2                2                   0                   0           0.044                 172.17.0.2
1539281111000000000 0.071               0.089               0.053               2                2                   0                   0           0.018                 172.17.0.2
1539281121000000000 0.089               0.094               0.084               2                2                   0                   0           0.005                 172.17.0.2
# shut off node at 172.17.0.2
1539281138000000000                                                                                                                      2                                 172.17.0.2
1539281143000000000                                                             0                4                   100                 0                                 172.17.0.2
1539281153000000000                                                             0                4                   100                 0                                 172.17.0.2
1539281163000000000                                                             0                4                   100                 0                                 172.17.0.2

@david-guenault

Exactly the same here: one point with result code 2, then 100% packet loss with a result code of 0.

time                average_response_ms host  maximum_response_ms minimum_response_ms name      packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms tag1      tag2         tag3          url
----                ------------------- ----  ------------------- ------------------- ----      ---------------- ------------------- ------------------- ----------- --------------------- ----      ----         ----          ---
1539085200000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085210000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085220000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085230000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085240000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085250000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085260000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085271000000000                     host                                          name 0                2                   100                 0                                 tag1 tag2 tag3 0.0.0.0
1539085280000000000                     host                                          name 0                1                   100                 0                                 tag1 tag2 tag3 0.0.0.0

@glinton

glinton commented Oct 12, 2018

What version of InfluxDB are you using? And can you get the query that the dashboard is using? I wonder if it is only selecting rows where the average response time isn't empty.
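
For example, a dashboard query shaped like the sketch below (measurement and field names from the output above; the time range and grouping are arbitrary) would drop the outage points entirely, because average_response_ms is simply not written when every ping fails:

SELECT mean("average_response_ms") FROM "ping" WHERE time > now() - 1h GROUP BY time(1m)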

@david-guenault

Here is the data. By the way, the problem is not with the query but with metric storage: it starts with an error code > 0 and should keep the same error code until the target is available again.

LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.5.1804 (Core)
Release:        7.5.1804
Codename:       Core
Influx DB Package: influxdb-1.5.2-1.x86_64

For the queries, I'm first counting error codes and then counting all the metrics for a specific tag within a specific time range, then just doing the math to get a rough availability figure (count_ko / count_total).

SELECT count("result_code") FROM "ping" WHERE ("name" =~ /^name$/ AND "result_code" > 0) AND time >= 1538344800000ms;
SELECT count("result_code") FROM "ping" WHERE ("name" =~ /^name$/) AND time >= 1538344800000ms
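
Since plain InfluxQL can't divide the results of those two queries in one statement, the count_ko / count_total ratio has to be computed in the dashboard. An alternative sketch, assuming the 100% packet loss points are written reliably as in the output above, would be to derive an availability percentage from percent_packet_loss directly (tag filter and time range copied from the queries above):

SELECT 100 - mean("percent_packet_loss") AS availability_percent FROM "ping" WHERE ("name" =~ /^name$/) AND time >= 1538344800000ms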

@glinton

glinton commented Oct 17, 2018

OK! Now it makes sense. So the result code for a ping that times out is still 0. I believe #4550 was intended to address that, but it instead masked the exit code 1 as a 0.
