Ping Input Not Updating InfluxDB if Server Down #4772
Comments
Hi, I can confirm this. Only the first error is written to the time series database, then nothing until the service comes back.
By the way, packet loss is more about quality than availability; you can't use a quality metric to build an SLA metric. Bad quality does not always imply a loss of availability (except for 100% packet loss); it is just bad quality.
What is the query you are using? I'm getting the following:
Exactly the same: one result code of 2, then 100% packet loss with a return code of 0.
What version of InfluxDB are you using? And can you get the query that the dashboard is using? I wonder if it is selecting only where the average response time isn't empty.
Here is the data. By the way, the problem is not in the query but in metric storage: it starts with an error code > 0, and the error code should stay the same until the target is available again.
For the queries, I first count the error codes and then count all the metrics for a specific tag within a specific time range. Then I just do the math to get a sort of availability (count_ko/count_total).
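The count_ko/count_total calculation described above can be sketched in Python as follows. This is only an illustration, not the actual dashboard query: the function name `failure_ratio` and the sample data are hypothetical, and the field names `result_code` and `percent_packet_loss` are assumed to match what the ping input writes. Because a timed-out ping can still carry a result code of 0, the sketch also counts 100% packet loss as a failure, which is an assumption rather than what the original query does.

```python
from typing import Iterable, Mapping


def failure_ratio(samples: Iterable[Mapping[str, float]]) -> float:
    """Return count_ko / count_total for a set of ping samples."""
    count_total = 0
    count_ko = 0
    for sample in samples:
        count_total += 1
        # Assumption: a sample is "ko" if the result code is non-zero
        # OR every packet was lost (result_code may still be 0 then).
        failed_code = sample.get("result_code", 0) > 0
        total_loss = sample.get("percent_packet_loss", 0.0) >= 100.0
        if failed_code or total_loss:
            count_ko += 1
    if count_total == 0:
        raise ValueError("no samples in the selected time range")
    return count_ko / count_total


# Hypothetical samples, not data taken from this issue.
samples = [
    {"result_code": 0, "percent_packet_loss": 0.0},
    {"result_code": 2, "percent_packet_loss": 100.0},
    {"result_code": 0, "percent_packet_loss": 100.0},
    {"result_code": 0, "percent_packet_loss": 0.0},
]
ratio = failure_ratio(samples)
print(f"failure ratio: {ratio:.2f}, availability: {1 - ratio:.2f}")
```

Note that count_ko/count_total is the fraction of failed samples, so the availability figure is one minus that ratio.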
Ok! Now it makes sense. So, the result code for a ping that times out is still 0. I believe that #4550 was intended to address that, but instead masked the |
Relevant telegraf.conf:
[[inputs.ping]]
urls = ["192.168.10.1"] # required
interval = "1m"
count = 2 # required
ping_interval = 0.0
timeout = 0.0
System info:
Telegraf 1.8
CentOS 7
Steps to reproduce:
Expected behavior:
Telegraf should report the stats to InfluxDB. For instance, below:
Should be more like:
Then I can use my Grafana dashboard to report a service as down if I get a percent_packet_loss greater than 25 percent.
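The alert rule described above amounts to a simple threshold check. A minimal Python sketch follows, with a hypothetical helper name `is_down`. Treating a missing data point as "down" is an assumption added here for illustration, since the problem in this issue is that no point arrives while the host is unreachable.

```python
from typing import Optional


def is_down(percent_packet_loss: Optional[float], threshold: float = 25.0) -> bool:
    """Report a service as down when packet loss exceeds the threshold."""
    # Assumption: no data point at all (None) is treated as down,
    # because the input may stop reporting while the target is unreachable.
    if percent_packet_loss is None:
        return True
    return percent_packet_loss > threshold


print(is_down(0.0))    # healthy sample
print(is_down(50.0))   # half of the pings lost
print(is_down(None))   # nothing was written for this interval
```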
Actual behavior:
I get the below:
Additional info:
This used to work, and at some point something must have changed how the results are returned. I honestly don't know when, as I don't always keep Telegraf updated and we don't often have services go down. There have been a few recent instances where we didn't know a service was down until it was eventually reported, because the dashboard never reflects the problem; that has prompted me to start researching the issue now.