
ICMP probes fail continually after some short DNS outages, until manual restart of blackbox-exporter container #591

Closed
itsx opened this issue Apr 1, 2020 · 20 comments

itsx commented Apr 1, 2020

Host operating system: output of uname -a

Linux a382643a1270 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 GNU/Linux

Docker version:
Docker version 18.09.7, build 2d0083d

OS running docker:
Linux prometheus1 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter -version

blackbox_exporter, version 0.16.0 (branch: HEAD, revision: 991f89846ae10db22a3933356a7d196642fcb9a9)
  build user:       root@64f600555645
  build date:       20191111-16:27:24
  go version:       go1.13.4

Docker image:
prom/blackbox-exporter:v0.16.0

What is the blackbox.yml module config.

modules:
  icmp:
    prober: icmp

  icmp-ip4:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
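
A probe with this module can also be triggered directly against the exporter, e.g. with something like the following (host and port as used in the relabeling below; &debug=true returns the per-probe debug log quoted later in this report):

curl 'http://prometheus1:9115/probe?module=icmp-ip4&target=8.8.8.8&debug=true'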

What is the prometheus.yml scrape config.

  - job_name: 'blackbox-ping'
    scrape_interval: 1s
    params:
      module: [icmp-ip4]
    static_configs:
      - targets:
        - 8.8.8.8
        labels:
          blackbox_job: 'ping'
    metrics_path: /probe
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - source_labels: [instance]
      regex: "[192.168|8.8].+"
      target_label: ping_type
      replacement: 'ip'
    - target_label: __address__
      replacement: prometheus1:9115
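
Note: the regex "[192.168|8.8].+" above is technically a character class, not an alternation; the intent is to tag 192.168.* and 8.8.* instances, for which something like the following would be more precise (this is unrelated to the ICMP problem itself):

    - source_labels: [instance]
      regex: '(192\.168|8\.8).+'
      target_label: ping_type
      replacement: 'ip'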

What logging output did you get from adding &debug=true to the probe URL?

Logs for the probe:
ts=2020-04-01T15:59:41.659129069Z caller=main.go:304 module=icmp target=8.8.8.8 level=info msg="Beginning probe" probe=icmp timeout_seconds=119.5
ts=2020-04-01T15:59:41.659349737Z caller=icmp.go:82 module=icmp target=8.8.8.8 level=info msg="Resolving target address" ip_protocol=ip6
ts=2020-04-01T15:59:41.6594102Z caller=icmp.go:82 module=icmp target=8.8.8.8 level=info msg="Resolved target address" ip=8.8.8.8
ts=2020-04-01T15:59:41.659430967Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Creating socket"
ts=2020-04-01T15:59:41.660190595Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Creating ICMP packet" seq=62367 id=33313
ts=2020-04-01T15:59:41.660223508Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Writing out packet"
ts=2020-04-01T15:59:41.660358188Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Waiting for reply packets"
ts=2020-04-01T16:01:41.159246759Z caller=main.go:119 module=icmp target=8.8.8.8 level=warn msg="Timeout reading from socket" err="read ip4 0.0.0.0: i/o timeout"
ts=2020-04-01T16:01:41.159308704Z caller=main.go:304 module=icmp target=8.8.8.8 level=error msg="Probe failed" duration_seconds=119.500030153



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.2626e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 119.500030153
# HELP probe_icmp_duration_seconds Duration of icmp request by phase
# TYPE probe_icmp_duration_seconds gauge
probe_icmp_duration_seconds{phase="resolve"} 2.2626e-05
probe_icmp_duration_seconds{phase="rtt"} 0
probe_icmp_duration_seconds{phase="setup"} 0.000792348
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0



Module configuration:
prober: icmp
http:
    ip_protocol_fallback: true
tcp:
    ip_protocol_fallback: true
icmp:
    ip_protocol_fallback: true
dns:
    ip_protocol_fallback: true

What did you do that produced an error?

We run blackbox-exporter inside a Docker container. Suddenly, without any changes to the machine or the container, the ping probe starts failing for one or more of the targets we monitor, while the other targets remain OK. When I run the ping tool manually, both inside the Docker container and on the host OS outside the container, it succeeds.

So far we have experienced this behavior for two of our internal IP targets simultaneously (both in the same datacenter), and later just for the 8.8.8.8 target.

I examined the problem with tcpdump, and it shows only request packets (no reply packets):

tcpdump -i eth0 -nn -s0 -X icmp and host 8.8.8.8
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:28:48.734661 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41979, length 36
	0x0000:  4500 0038 f40e 4000 4001 8a90 ac11 0005  E..8..@.@.......
	0x0010:  0808 0808 0800 7648 8221 a3fb 5072 6f6d  ......vH.!..Prom
	0x0020:  6574 6865 7573 2042 6c61 636b 626f 7820  etheus.Blackbox.
	0x0030:  4578 706f 7274 6572                      Exporter
15:28:48.977456 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41982, length 36
	0x0000:  4500 0038 f41d 4000 4001 8a81 ac11 0005  E..8..@.@.......
	0x0010:  0808 0808 0800 7645 8221 a3fe 5072 6f6d  ......vE.!..Prom
	0x0020:  6574 6865 7573 2042 6c61 636b 626f 7820  etheus.Blackbox.
	0x0030:  4578 706f 7274 6572                      Exporter
15:28:49.735066 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41990, length 36
	0x0000:  4500 0038 f475 4000 4001 8a29 ac11 0005  E..8.u@.@..)....
	0x0010:  0808 0808 0800 763d 8221 a406 5072 6f6d  ......v=.!..Prom
	0x0020:  6574 6865 7573 2042 6c61 636b 626f 7820  etheus.Blackbox.
	0x0030:  4578 706f 7274 6572                      Exporter
15:28:50.735053 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 42001, length 36
	0x0000:  4500 0038 f497 4000 4001 8a07 ac11 0005  E..8..@.@.......
	0x0010:  0808 0808 0800 7632 8221 a411 5072 6f6d  ......v2.!..Prom
	0x0020:  6574 6865 7573 2042 6c61 636b 626f 7820  etheus.Blackbox.
	0x0030:  4578 706f 7274 6572                      Exporter

I also checked whether there is a zero-filled ID field in the IP header, as discussed in the very similar issue #360, but that is not our case.
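
For anyone checking the same thing: a verbose capture prints the IP header ID directly, e.g.

tcpdump -i eth0 -nn -v icmp and host 8.8.8.8

In the hex dump above it is the third 16-bit word of each packet (f40e, f41d, f475, f497), so clearly non-zero in our case.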

The only correlation we found in Grafana is very short connection outages from the blackbox-exporter machine to some of our internal DNS servers (the spikes occur at the same time the probes start failing), monitored with the same blackbox-exporter.

What did you expect to see?

Maybe some failed probes during a potential outage, but then successful probes again.

What did you see instead?

Probes fail continually, for hours, until I manually restart the Docker container.

@brian-brazil (Contributor)

If the pings are going out and nothing is coming back, that's not a problem with the blackbox exporter.

itsx commented Apr 3, 2020

I would also suspect a broken network, but I can't square that with the fact that the ping tool inside the blackbox-exporter container works without problems, and so does traceroute. Outside the container, ping and traceroute also work fine (for the same target).

This is tcpdump output from inside the container (just blackbox-exporter running):

root @ /
 [5] 🐳  →  tcpdump -i eth0 icmp and host 1.1.1.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:51:59.599447 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 53067, length 36
14:52:00.599849 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 53081, length 36
14:52:01.599661 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 53095, length 36
14:52:02.599350 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 53109, length 36
14:52:03.599603 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 53123, length 36

And this is the tcpdump output when I start ping manually inside the container, alongside the running blackbox-exporter (blackbox-exporter id == 33313):

root @ /
 [4] 🐳  →  tcpdump -i eth0 icmp and host 1.1.1.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:48:50.599214 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50421, length 36
14:48:51.384085 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 5, length 64
14:48:51.392669 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 5, length 64
14:48:51.599289 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50435, length 36
14:48:52.384292 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 6, length 64
14:48:52.393031 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 6, length 64
14:48:52.599559 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50449, length 36
14:48:53.384517 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 7, length 64
14:48:53.396626 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 7, length 64
14:48:53.599622 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50463, length 36
14:48:54.384748 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 8, length 64
14:48:54.392914 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 8, length 64
14:48:54.599671 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50477, length 36
14:48:55.384969 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 9, length 64
14:48:55.393031 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 9, length 64
14:48:55.599634 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50491, length 36

The tcpdump output outside the container, on the host machine, looks the same.

We have added a couple of other public IP targets (like 1.1.1.1), and after a day they started to fail too.

I would dig deeper into what is going on, but I have no idea where to look next. Do you have any suggestions on what else to check or how to debug this further?

@brian-brazil (Contributor)

I'm afraid I can't really help with issues that are not with the blackbox exporter itself.


tlinhart commented Apr 3, 2020

Well, if Blackbox exporter is the only tool that sends packets and doesn't receive a reply, I guess it actually might be a problem with Blackbox itself. Maybe its payload is bad? Or maybe the other side's reply payload is, and Blackbox is the only tool that can't handle it.

@brian-brazil (Contributor)

If tcpdump isn't showing a packet coming back, it can't be a problem with the blackbox exporter.


tlinhart commented Apr 3, 2020

Are you sure? Blackbox crafts the ICMP packets itself, and should they be invalid in some way, I wouldn't be surprised not to get a reply...

@brian-brazil (Contributor)

You have not demonstrated that, however. As it stands, this appears to be a network issue on your end. It makes more sense to ask questions like this on the prometheus-users mailing list rather than in a GitHub issue. On the mailing list, more people are available to potentially respond to your question, and the whole community can benefit from the answers provided.


tlinhart commented Apr 3, 2020

Agreed, the mailing list might be a better place for further discussion. Thanks for the hint.

@jeremybz

We are also being hit by this issue in EC2.

@jeremybz

We confirm the same behavior: once the blackbox ping encounters trouble, it keeps returning failures, even when pings run from the command line succeed. Blackbox does not recover until it is restarted.

@brian-brazil (Contributor)

I'm still seeing nothing here to indicate a blackbox exporter issue; this looks like an EC2 networking issue.

@jeremybz

If it were a networking issue, why would blackbox fail when ping works?

@brian-brazil (Contributor)

If the echo request is being sent but echo replies aren't making it back to the machine, that's not a blackbox issue.

@jeremybz

Not trying to be obtuse here, it just comes naturally to me.
If ping is getting replies just fine, but blackbox isn't reporting that it receives any, I think that points to a blackbox issue. It's also suggestive that the problem goes away every time the blackbox process is restarted.

@brian-brazil (Contributor)

Per the above tcpdump, there are no replies coming back - so the blackbox exporter doesn't even have the opportunity to receive them.

@jeremybz

Good answer! Thank you.


jeremybz commented Jul 1, 2020

Update for people who hit the same issue: for us this was caused by a firewall which started blocking ICMP packets with the same ID. This explains why restarting blackbox temporarily fixed the issue.
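
For context, the fixed ID is visible in the captures earlier in this thread: every blackbox request carries id 33313 (0x8221 in the hex dumps) for hours, while a manually started ping picks a fresh id each run. If you want to confirm the same thing on your side, a capture filtered on the echo identifier works, e.g. (the filter offsets assume plain IPv4 echo requests):

tcpdump -i eth0 -nn 'icmp[icmptype] == icmp-echo and icmp[4:2] == 33313'

A restart gives the exporter process a new ID, which is presumably why restarting temporarily cleared the firewall's block.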

@brian-brazil (Contributor)

Thanks for the update. I presume it's similar for the other users, so I'm going to close this.

@gmintoco

@jeremybz We are having the same issues. If you can share, I'd be interested in hearing:

What was your firewall? What policy/change did you make to resolve the issue?

thanks


gmintoco commented May 3, 2021

For us this was caused by our router's ICMP session TTL. For some reason, packets from the Blackbox exporter were not creating a new session on the router (a FortiGate in this case), most likely due to the unchanging ICMP ID, and were being routed incorrectly. We resolved this by increasing the scrape interval to 90s (above our 60s session TTL), and that seems to have resolved the issue for now.
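
In terms of the scrape config from the original report, the change amounts to something like this (90s is our value; the right number depends on the session TTL of the firewall or router in the path):

  - job_name: 'blackbox-ping'
    scrape_interval: 90s  # longer than the router's 60s ICMP session TTL
    params:
      module: [icmp-ip4]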
