Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICMP probes fails continually after down and up of several target hosts, until manual restart of blackbox-exporter. #360

Closed
fessmage opened this issue Sep 19, 2018 · 21 comments

Comments

@fessmage
Copy link

fessmage commented Sep 19, 2018

Host operating system: output of uname -a

Linux prometheus 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter -version

blackbox_exporter, version 0.12.0 (branch: HEAD, revision: 4a22506)
build user: root@634195974c8e
build date: 20180227-11:50:29
go version: go1.10

What is the blackbox.yml module config.

modules:
  icmp:
    prober: icmp
    timeout: 2s
    icmp:
      preferred_ip_protocol: ip4

What is the prometheus.yml scrape config.

scrape_configs:
  - job_name: 'icmp-ping'
    metrics_path: /probe
    params:
      module: [icmp]
    scrape_interval: 5s
    scrape_timeout: 2s
    file_sd_configs:
      - files:
        - '/etc/prometheus/targets/ping-hosts.yml'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'prometheus.domain.zz:9115'

What logging output did you get from adding &debug=true to the probe URL?

Logs for the probe:
ts=2018-09-19T10:54:04.147594552Z caller=main.go:116 module=icmp target=probed-host.domain.zz level=info msg="Beginning probe" probe=icmp timeout_seconds=1.5
ts=2018-09-19T10:54:04.147696813Z caller=utils.go:42 module=icmp target=probed-host.domain.zz level=info msg="Resolving target address" preferred_ip_protocol=ip4
ts=2018-09-19T10:54:04.148567095Z caller=utils.go:65 module=icmp target=probed-host.domain.zz level=info msg="Resolved target address" ip=192.168.100.49
ts=2018-09-19T10:54:04.148667279Z caller=icmp.go:71 module=icmp target=probed-host.domain.zz level=info msg="Creating socket"
ts=2018-09-19T10:54:04.14885651Z caller=icmp.go:117 module=icmp target=probed-host.domain.zz level=info msg="Creating ICMP packet" seq=61478 id=7522
ts=2018-09-19T10:54:04.148950165Z caller=icmp.go:129 module=icmp target=probed-host.domain.zz level=info msg="Writing out packet"
ts=2018-09-19T10:54:04.149184806Z caller=icmp.go:157 module=icmp target=probed-host.domain.zz level=info msg="Waiting for reply packets"
ts=2018-09-19T10:54:05.647899261Z caller=icmp.go:162 module=icmp target=probed-host.domain.zz level=warn msg="Timeout reading from socket" err="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout"
ts=2018-09-19T10:54:05.648033921Z caller=main.go:129 module=icmp target=probed-host.domain.zz level=error msg="Probe failed" duration_seconds=1.5003776850000001



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.000923902
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 1.5003776850000001
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0



Module configuration:
prober: icmp
timeout: 2s
icmp:
  preferred_ip_protocol: ip4

What did you do that produced an error?

Restarting openvpn client on hypervisor host, which run virtual machine with prometheus and blackbox-exporter. Blackbox-exporter target file has around 70 entries, more than 50 behind that vpn connection.

What did you expect to see?

Some failed probes during vpn restart on hypervisor and then successfull probes again.

What did you see instead?

Probes was continually fails, for ten's of minutes, just until i manually restarted blackbox-exporter.

@brian-brazil
Copy link
Contributor

What does tcpdump show when this happens?

@fessmage
Copy link
Author

In tcpdump i see only icmp requests and replies for other hosts, which was not trapped in this error. There is no activity for hosts with probe_success=0.

I think error happen somewhere on deeper level, before actual network activity - because of this string from failed probe log: level=warn msg="Timeout reading from socket" err="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout"

@brian-brazil
Copy link
Contributor

The logs you shared show that icmp requests are being sent, which doesn't agree with that. What does strace show?

@fessmage
Copy link
Author

You right, i doesn't saw that in tcpdump by the first look. I rechecked tcpdump and grep from it one particular host by domain name and ip address:

root@prometheus:~# grep -e 'probed-host' -e '192.168.100.49' tcpdump2.txt
11:57:02.267636 IP prometheus.domain.zz.46648 > 192.168.11.1.domain: 46261+ AAAA? probed-host.domain.zz. (32)
11:57:02.267845 IP prometheus.domain.zz.42413 > 192.168.111.1.domain: 5240+ A? probed-host.domain.zz. (32)
11:57:02.268144 IP 192.168.111.1.domain > prometheus.domain.zz.42413: 5240 1/0/0 A 192.168.100.49 (48)
11:57:02.268385 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 7992, seq 1778, length 36
11:57:07.265888 IP prometheus.domain.zz.53428 > 192.168.111.1.domain: 25246+ AAAA? probed-host.domain.zz. (32)
11:57:07.267532 IP prometheus.domain.zz.57825 > 192.168.111.1.domain: 33728+ A? probed-host.domain.zz. (32)
11:57:07.326568 IP 192.168.111.1.domain > prometheus.domain.zz.57825: 33728 1/1/1 A 192.168.100.49 (81)
11:57:07.327386 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 7992, seq 1850, length 36
11:57:08.214424 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [P.], seq 3203130557:3203130758, ack 3515674937, win 915, options [nop,nop,TS val 281362417 ecr 1144254527], length 201
11:57:08.303561 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 1:4097, ack 201, win 147, options [nop,nop,TS val 1144269457 ecr 281362417], length 4096
11:57:08.303587 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 4097, win 979, options [nop,nop,TS val 281362439 ecr 1144269457], length 0
11:57:08.303758 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 4097:5453, ack 201, win 147, options [nop,nop,TS val 1144269457 ecr 281362417], length 1356
11:57:08.303774 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 5453, win 1001, options [nop,nop,TS val 281362439 ecr 1144269457], length 0
11:57:08.303835 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 5453:6809, ack 201, win 147, options [nop,nop,TS val 1144269457 ecr 281362417], length 1356
11:57:08.303855 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 6809, win 1024, options [nop,nop,TS val 281362439 ecr 1144269457], length 0
11:57:08.303885 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 6809:8165, ack 201, win 147, options [nop,nop,TS val 1144269457 ecr 281362417], length 1356
11:57:08.303895 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 8165, win 1047, options [nop,nop,TS val 281362439 ecr 1144269457], length 0
11:57:08.303950 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 8165:9521, ack 201, win 147, options [nop,nop,TS val 1144269457 ecr 281362417], length 1356
11:57:08.303961 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 9521, win 1069, options [nop,nop,TS val 281362439 ecr 1144269457], length 0
11:57:08.304043 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 9521:10877, ack 201, win 147, options [nop,nop,TS val 1144269457 ecr 281362417], length 1356
11:57:08.304057 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 10877, win 1092, options [nop,nop,TS val 281362439 ecr 1144269457], length 0
11:57:08.304136 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 10877:12233, ack 201, win 147, options [nop,nop,TS val 1144269457 ecr 281362417], length 1356
11:57:08.304149 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 12233, win 1115, options [nop,nop,TS val 281362439 ecr 1144269457], length 0
11:57:08.363779 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 12233:14176, ack 201, win 147, options [nop,nop,TS val 1144269517 ecr 281362439], length 1943
11:57:08.363795 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 14176, win 1145, options [nop,nop,TS val 281362454 ecr 1144269517], length 0
11:57:27.268508 IP prometheus.domain.zz.54776 > 192.168.111.1.domain: 45772+ AAAA? probed-host.domain.zz. (32)
11:57:27.268940 IP prometheus.domain.zz.39229 > 192.168.111.1.domain: 42525+ A? probed-host.domain.zz. (32)
11:57:27.269296 IP 192.168.111.1.domain > prometheus.domain.zz.39229: 42525 1/0/0 A 192.168.100.49 (48)
11:57:27.271119 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 7992, seq 2138, length 36
11:57:32.265530 IP prometheus.domain.zz.52436 > 192.168.111.1.domain: 2138+ AAAA? probed-host.domain.zz. (32)
11:57:32.265895 IP prometheus.domain.zz.41574 > 192.168.111.1.domain: 41426+ A? probed-host.domain.zz. (32)
11:57:32.266297 IP 192.168.111.1.domain > prometheus.domain.zz.41574: 41426 1/0/0 A 192.168.100.49 (48)
11:57:32.266929 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 7992, seq 2210, length 36

@fessmage
Copy link
Author

Addition info, if that can help - i run blackbox-exporter under separate non-root user, with help of systemd service, and with that capability: AmbientCapabilities=CAP_NET_RAW

@brian-brazil
Copy link
Contributor

I don't see any icmp responses there, it looks like there's a network issue on your end.

@fessmage
Copy link
Author

But systemctl restart blackbox-exporter.service - resolves it:

root@prometheus:~# systemctl restart blackbox_exporter.service
root@prometheus:~# tcpdump > tcpdump3.txt
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
^C5003 packets captured
5531 packets received by filter
476 packets dropped by kernel
root@prometheus:~# grep -e 'probed-host' -e '192.168.100.49' tcpdump3.txt
12:23:32.265419 IP prometheus.domain.zz.59785 > 192.168.111.1.domain: 54999+ AAAA? probed-host.domain.zz. (32)
12:23:32.265670 IP prometheus.domain.zz.38876 > 192.168.111.1.domain: 38571+ A? probed-host.domain.zz. (32)
12:23:32.325900 IP 192.168.111.1.domain > prometheus.domain.zz.38876: 38571 1/1/1 A 192.168.100.49 (81)
12:23:32.326569 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 143, length 36
12:23:32.386139 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 143, length 36
12:23:37.265432 IP prometheus.domain.zz.54099 > 192.168.111.1.domain: 24124+ AAAA? probed-host.domain.zz. (32)
12:23:37.265657 IP prometheus.domain.zz.47681 > 192.168.111.1.domain: 31907+ A? probed-host.domain.zz. (32)
12:23:37.265925 IP 192.168.111.1.domain > prometheus.domain.zz.47681: 31907 1/0/0 A 192.168.100.49 (48)
12:23:37.266534 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 215, length 36
12:23:37.326138 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 215, length 36
12:23:38.214358 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [P.], seq 3203151863:3203152064, ack 3517176056, win 1444, options [nop,nop,TS val 281759917 ecr 1145844592], length 201
12:23:38.274308 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], ack 201, win 501, options [nop,nop,TS val 1145859429 ecr 281759917], length 0
12:23:38.322059 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 1:2713, ack 201, win 501, options [nop,nop,TS val 1145859476 ecr 281759917], length 2712
12:23:38.322081 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 2713, win 1438, options [nop,nop,TS val 281759944 ecr 1145859476], length 0
12:23:38.322245 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 2713:4097, ack 201, win 501, options [nop,nop,TS val 1145859476 ecr 281759917], length 1384
12:23:38.322258 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 4097, win 1428, options [nop,nop,TS val 281759944 ecr 1145859476], length 0
12:23:38.322400 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 4097:6809, ack 201, win 501, options [nop,nop,TS val 1145859476 ecr 281759917], length 2712
12:23:38.322414 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 6809, win 1412, options [nop,nop,TS val 281759944 ecr 1145859476], length 0
12:23:38.322489 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 6809:8165, ack 201, win 501, options [nop,nop,TS val 1145859476 ecr 281759917], length 1356
12:23:38.322503 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 8165, win 1404, options [nop,nop,TS val 281759944 ecr 1145859476], length 0
12:23:38.322561 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 8165:9521, ack 201, win 501, options [nop,nop,TS val 1145859476 ecr 281759917], length 1356
12:23:38.322572 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 9521, win 1396, options [nop,nop,TS val 281759944 ecr 1145859476], length 0
12:23:38.322646 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 9521:10877, ack 201, win 501, options [nop,nop,TS val 1145859476 ecr 281759917], length 1356
12:23:38.322659 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 10877, win 1388, options [nop,nop,TS val 281759944 ecr 1145859476], length 0
12:23:38.322787 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 10877:12233, ack 201, win 501, options [nop,nop,TS val 1145859476 ecr 281759917], length 1356
12:23:38.322798 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 12233, win 1379, options [nop,nop,TS val 281759944 ecr 1145859476], length 0
12:23:38.383109 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 12233:14152, ack 201, win 501, options [nop,nop,TS val 1145859537 ecr 281759944], length 1919
12:23:38.383124 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 14152, win 1438, options [nop,nop,TS val 281759959 ecr 1145859537], length 0
12:23:47.266407 IP prometheus.domain.zz.35129 > 192.168.111.1.domain: 1644+ AAAA? probed-host.domain.zz. (32)
12:23:47.266666 IP prometheus.domain.zz.46044 > 192.168.111.1.domain: 34194+ A? probed-host.domain.zz. (32)
12:23:47.266976 IP 192.168.111.1.domain > prometheus.domain.zz.46044: 34194 1/0/0 A 192.168.100.49 (48)
12:23:47.267342 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 359, length 36
12:23:47.326937 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 359, length 36
12:23:52.266159 IP prometheus.domain.zz.39688 > 192.168.111.1.domain: 52232+ AAAA? probed-host.domain.zz. (32)
12:23:52.266524 IP prometheus.domain.zz.50519 > 192.168.111.1.domain: 56848+ A? probed-host.domain.zz. (32)
12:23:52.266864 IP 192.168.111.1.domain > prometheus.domain.zz.50519: 56848 1/0/0 A 192.168.100.49 (48)
12:23:52.267263 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 431, length 36
12:23:52.328748 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 431, length 36
12:23:53.214389 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [P.], seq 201:402, ack 14152, win 1444, options [nop,nop,TS val 281763667 ecr 1145859537], length 201
12:23:53.274863 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], ack 402, win 501, options [nop,nop,TS val 1145874429 ecr 281763667], length 0
12:23:53.346103 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 14152:16864, ack 402, win 501, options [nop,nop,TS val 1145874500 ecr 281763667], length 2712
12:23:53.346125 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 16864, win 1438, options [nop,nop,TS val 281763700 ecr 1145874500], length 0
12:23:53.346238 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 16864:18248, ack 402, win 501, options [nop,nop,TS val 1145874501 ecr 281763667], length 1384
12:23:53.346248 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 18248, win 1428, options [nop,nop,TS val 281763700 ecr 1145874501], length 0
12:23:53.346545 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 18248:20960, ack 402, win 501, options [nop,nop,TS val 1145874501 ecr 281763667], length 2712
12:23:53.346559 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 20960, win 1438, options [nop,nop,TS val 281763700 ecr 1145874501], length 0
12:23:53.346650 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 20960:23672, ack 402, win 501, options [nop,nop,TS val 1145874501 ecr 281763667], length 2712
12:23:53.346664 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 23672, win 1422, options [nop,nop,TS val 281763700 ecr 1145874501], length 0
12:23:53.346698 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 23672:25028, ack 402, win 501, options [nop,nop,TS val 1145874501 ecr 281763667], length 1356
12:23:53.346705 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 25028, win 1414, options [nop,nop,TS val 281763700 ecr 1145874501], length 0
12:23:53.346788 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 25028:26384, ack 402, win 501, options [nop,nop,TS val 1145874501 ecr 281763667], length 1356
12:23:53.346798 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 26384, win 1406, options [nop,nop,TS val 281763700 ecr 1145874501], length 0
12:23:53.406958 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 26384:28329, ack 402, win 501, options [nop,nop,TS val 1145874561 ecr 281763700], length 1945
12:23:53.406974 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 28329, win 1438, options [nop,nop,TS val 281763715 ecr 1145874561], length 0
12:23:57.267846 IP prometheus.domain.zz.39885 > 192.168.111.1.domain: 38234+ AAAA? probed-host.domain.zz. (32)
12:23:57.268175 IP prometheus.domain.zz.56903 > 192.168.111.1.domain: 3756+ A? probed-host.domain.zz. (32)
12:23:57.268491 IP 192.168.111.1.domain > prometheus.domain.zz.56903: 3756 1/0/0 A 192.168.100.49 (48)
12:23:57.269143 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 503, length 36
12:23:57.328733 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 503, length 36
12:24:02.265630 IP prometheus.domain.zz.48338 > 192.168.111.1.domain: 44678+ AAAA? probed-host.domain.zz. (32)
12:24:02.265837 IP prometheus.domain.zz.52843 > 192.168.111.1.domain: 50580+ A? probed-host.domain.zz. (32)
12:24:02.266200 IP 192.168.111.1.domain > prometheus.domain.zz.52843: 50580 1/0/0 A 192.168.100.49 (48)
12:24:02.266575 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 575, length 36
12:24:02.326056 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 575, length 36
12:24:07.265596 IP prometheus.domain.zz.60104 > 192.168.111.1.domain: 14992+ AAAA? probed-host.domain.zz. (32)
12:24:07.265817 IP prometheus.domain.zz.58376 > 192.168.111.1.domain: 21418+ A? probed-host.domain.zz. (32)
12:24:07.266102 IP 192.168.111.1.domain > prometheus.domain.zz.58376: 21418 1/0/0 A 192.168.100.49 (48)
12:24:07.266727 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 647, length 36
12:24:07.326159 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 647, length 36
12:24:08.214329 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [P.], seq 402:603, ack 28329, win 1444, options [nop,nop,TS val 281767417 ecr 1145874561], length 201
12:24:08.274422 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], ack 603, win 501, options [nop,nop,TS val 1145889429 ecr 281767417], length 0
12:24:08.303869 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 28329:31041, ack 603, win 501, options [nop,nop,TS val 1145889458 ecr 281767417], length 2712
12:24:08.303894 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 31041, win 1438, options [nop,nop,TS val 281767439 ecr 1145889458], length 0
12:24:08.304030 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 31041:32425, ack 603, win 501, options [nop,nop,TS val 1145889458 ecr 281767417], length 1384
12:24:08.304044 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 32425, win 1428, options [nop,nop,TS val 281767439 ecr 1145889458], length 0
12:24:08.304049 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 32425:33781, ack 603, win 501, options [nop,nop,TS val 1145889458 ecr 281767417], length 1356
12:24:08.304057 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 33781, win 1420, options [nop,nop,TS val 281767439 ecr 1145889458], length 0
12:24:08.304359 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 33781:36493, ack 603, win 501, options [nop,nop,TS val 1145889458 ecr 281767417], length 2712
12:24:08.304378 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 36493, win 1404, options [nop,nop,TS val 281767439 ecr 1145889458], length 0
12:24:08.304570 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 36493:37849, ack 603, win 501, options [nop,nop,TS val 1145889458 ecr 281767417], length 1356
12:24:08.304616 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 37849, win 1396, options [nop,nop,TS val 281767439 ecr 1145889458], length 0
12:24:08.304699 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 37849:39205, ack 603, win 501, options [nop,nop,TS val 1145889458 ecr 281767417], length 1356
12:24:08.304714 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 39205, win 1388, options [nop,nop,TS val 281767439 ecr 1145889458], length 0
12:24:08.304801 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [.], seq 39205:40561, ack 603, win 501, options [nop,nop,TS val 1145889458 ecr 281767417], length 1356
12:24:08.304815 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 40561, win 1379, options [nop,nop,TS val 281767439 ecr 1145889458], length 0
12:24:08.364593 IP 192.168.100.49.9100 > prometheus.domain.zz.40402: Flags [P.], seq 40561:42484, ack 603, win 501, options [nop,nop,TS val 1145889519 ecr 281767439], length 1923
12:24:08.364608 IP prometheus.domain.zz.40402 > 192.168.100.49.9100: Flags [.], ack 42484, win 1438, options [nop,nop,TS val 281767454 ecr 1145889519], length 0
12:24:12.265564 IP prometheus.domain.zz.54086 > 192.168.111.1.domain: 46437+ AAAA? probed-host.domain.zz. (32)
12:24:12.265783 IP prometheus.domain.zz.57286 > 192.168.111.1.domain: 29634+ A? probed-host.domain.zz. (32)
12:24:12.266118 IP 192.168.111.1.domain > prometheus.domain.zz.57286: 29634 1/0/0 A 192.168.100.49 (48)
12:24:12.266855 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 719, length 36
12:24:12.326374 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 719, length 36
12:24:17.265939 IP prometheus.domain.zz.50376 > 192.168.111.1.domain: 22542+ AAAA? probed-host.domain.zz. (32)
12:24:17.266352 IP prometheus.domain.zz.43801 > 192.168.111.1.domain: 18193+ A? probed-host.domain.zz. (32)
12:24:17.266610 IP 192.168.111.1.domain > prometheus.domain.zz.43801: 18193 1/0/0 A 192.168.100.49 (48)
12:24:17.266899 IP prometheus.domain.zz > 192.168.100.49: ICMP echo request, id 8136, seq 791, length 36
12:24:17.326485 IP 192.168.100.49 > prometheus.domain.zz: ICMP echo reply, id 8136, seq 791, length 36

And new probe's log:

Logs for the probe:
ts=2018-09-19T12:24:07.265316599Z caller=main.go:116 module=icmp target=probed-host.domain.zz level=info msg="Beginning probe" probe=icmp timeout_seconds=1.5
ts=2018-09-19T12:24:07.265406463Z caller=utils.go:42 module=icmp target=probed-host.domain.zz level=info msg="Resolving target address" preferred_ip_protocol=ip4
ts=2018-09-19T12:24:07.266258678Z caller=utils.go:65 module=icmp target=probed-host.domain.zz level=info msg="Resolved target address" ip=192.168.100.49
ts=2018-09-19T12:24:07.266309392Z caller=icmp.go:71 module=icmp target=probed-host.domain.zz level=info msg="Creating socket"
ts=2018-09-19T12:24:07.266495811Z caller=icmp.go:117 module=icmp target=probed-host.domain.zz level=info msg="Creating ICMP packet" seq=647 id=8136
ts=2018-09-19T12:24:07.266588111Z caller=icmp.go:129 module=icmp target=probed-host.domain.zz level=info msg="Writing out packet"
ts=2018-09-19T12:24:07.266956041Z caller=icmp.go:157 module=icmp target=probed-host.domain.zz level=info msg="Waiting for reply packets"
ts=2018-09-19T12:24:07.326266234Z caller=icmp.go:177 module=icmp target=probed-host.domain.zz level=info msg="Found matching reply packet"
ts=2018-09-19T12:24:07.326344048Z caller=main.go:127 module=icmp target=probed-host.domain.zz level=info msg="Probe succeeded" duration_seconds=0.060974965



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.000857973
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.060974965
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1



Module configuration:
prober: icmp
timeout: 2s
icmp:
  preferred_ip_protocol: ip4

@fessmage
Copy link
Author

That is my systemd service:

root@prometheus:~# cat /etc/systemd/system/blackbox_exporter.service 
# Ansible managed

[Unit]
Description=prometheus blackbox_exporter
After=syslog.target
After=network.target

[Service]
Type=simple
User=blackbox
Group=blackbox
AmbientCapabilities=CAP_NET_RAW
ExecStart=/opt/blackbox_exporter-0.12.0.linux-amd64/blackbox_exporter --config.file=/etc/blackbox/blackbox.yml
StandardOutput=syslog
StandardError=syslog
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

@brian-brazil
Copy link
Contributor

That'd imply some bad state in your network stack to me, likely something to do with connection tracking.

@xealot
Copy link

xealot commented Oct 9, 2018

I have a similar error, though it may be caused by something else. For me, the only way I can get the icmp probe to succeed is by trying it against a target of 127.0.0.1 or localhost.

Any other IP address, either within the local LAN or without seems to fail.

I've also added the capability with:
sudo setcap cap_net_raw+ep /usr/local/bin/blackbox_exporter

And I've instructed systemd to run it as root.

[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/local/bin/blackbox_exporter --config.file /etc/blackbox_exporter/blackbox.yml

[Install]
WantedBy=multi-user.target

uname -a

Linux nuc 4.13.0-1024-oem #27-Ubuntu SMP Fri Apr 13 08:27:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter -version

blackbox_exporter, version 0.12.0 (branch: HEAD, revision: 4a22506cf0cf139d9b2f9cde099f0012d9fcabde)
  build user:       root@634195974c8e
  build date:       20180227-11:50:29
  go version:       go1.10

What is the blackbox.yml module config.

modules:
    http_2xx:
        prober: http
        timeout: 5s
        http:
            valid_status_codes: []
            method: GET
    icmp_ipv4:
        prober: icmp
        timeout: 15s
        icmp:
            preferred_ip_protocol: "ip4"
            source_ip_address: "127.0.0.1"

What is the prometheus.yml scrape config.

global:
    scrape_interval: 15s
    evaluation_interval: 30s
    # scrape_timeout is set to the global default (10s).
    external_labels:
        monitor: nuc

rule_files:
    - "rules.d/*.rules"

scrape_configs:
    - job_name: "prometheus"
      scrape_interval: 5s
      static_configs:
          - targets: ["localhost:9090"]

    - job_name: "node_exporter"
      scrape_interval: 5s
      static_configs:
          - targets: ["localhost:9100"]

    - job_name: "blackbox"
      scrape_interval: 5s
      metrics_path: /probe
      params:
          module: [http_2xx]
      static_configs:
          - targets:
                - google.com
      relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115 # The blackbox exporter's real hostname:port.

    - job_name: "pingtime"
      scrape_interval: 1s
      metrics_path: /probe
      params:
          module: [icmp_ipv4]
      static_configs:
          - targets:
                - 127.0.0.1
                - 192.168.10.1
                - google.com
      relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115 # The blackbox exporter's real hostname:port.

What logging output did you get from adding &debug=true to the probe URL?

Logs for the probe:
ts=2018-10-09T21:30:58.758299096Z caller=main.go:116 module=icmp_ipv4 target=192.168.10.1 level=info msg="Beginning probe" probe=icmp timeout_seconds=9.5
ts=2018-10-09T21:30:58.758346367Z caller=utils.go:42 module=icmp_ipv4 target=192.168.10.1 level=info msg="Resolving target address" preferred_ip_protocol=ip4
ts=2018-10-09T21:30:58.758360112Z caller=utils.go:65 module=icmp_ipv4 target=192.168.10.1 level=info msg="Resolved target address" ip=192.168.10.1
ts=2018-10-09T21:30:58.758368729Z caller=icmp.go:68 module=icmp_ipv4 target=192.168.10.1 level=info msg="Using source address" srcIP=127.0.0.1
ts=2018-10-09T21:30:58.758378392Z caller=icmp.go:71 module=icmp_ipv4 target=192.168.10.1 level=info msg="Creating socket"
ts=2018-10-09T21:30:58.758408948Z caller=icmp.go:117 module=icmp_ipv4 target=192.168.10.1 level=info msg="Creating ICMP packet" seq=875 id=3048
ts=2018-10-09T21:30:58.758419928Z caller=icmp.go:129 module=icmp_ipv4 target=192.168.10.1 level=info msg="Writing out packet"
ts=2018-10-09T21:30:58.758453495Z caller=icmp.go:157 module=icmp_ipv4 target=192.168.10.1 level=info msg="Waiting for reply packets"
ts=2018-10-09T21:31:08.258410105Z caller=icmp.go:162 module=icmp_ipv4 target=192.168.10.1 level=warn msg="Timeout reading from socket" err="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout"
ts=2018-10-09T21:31:08.25850825Z caller=main.go:129 module=icmp_ipv4 target=192.168.10.1 level=error msg="Probe failed" duration_seconds=9.500172787



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 1.1342e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 9.500172787
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0



Module configuration:
prober: icmp
timeout: 15s
icmp:
  preferred_ip_protocol: ip4
  source_ip_address: 127.0.0.1

@brian-brazil
Copy link
Contributor

That's unrelated to this issue, 127.0.0.1 won't have a route to elsewhere.

@fessmage
Copy link
Author

fessmage commented Oct 9, 2018

xealot, i think you have problem because of this string in blackbox config:
source_ip_address: "127.0.0.1"
Change it to real network interface ip, or delete completely.

@xealot
Copy link

xealot commented Oct 10, 2018

@fessmage you're right, sorry for the tracker noise and thanks for the tip

@bva
Copy link

bva commented Oct 18, 2018

I see similar problem with AWS environment.
Blackbox exporter configured to ping external ip (elastic), located in AWS also. After awhile, ICMP probes starts to fail. Reboot of blackbox_exporter helps, but problem always appears later.

tcpdump shows requests, but there were no answers from target machine. I tracked down problem to zero-filled ID field in IP header of packet, generated by blackbox_exporter. I attached patch, that fixes problem, in my case. Not sure though, will this fix will work for original issue reporter.
add_ipv4_id.txt

@brian-brazil
Copy link
Contributor

I don't think that patch is in line with RFC 6864, and if this is the issue then the problem is with your network rather than the blackbox exporter.

@bva
Copy link

bva commented Oct 19, 2018

Well, we don't have too much control to fix Amazon networking, i think. I you compare packets emitted by standard linux ping utility and blackbox_exporter, you will surely see difference in id header field.

Question, whether ping utility is RFC compliant or no, remains open:)

@wtip
Copy link

wtip commented Feb 27, 2019

I'm experiencing the same issue running the exporter on Ubuntu 16.04 hosts. Is there anything I can do to help troubleshoot?

@candlerb
Copy link
Contributor

I see nothing in RFC 6864 which says you may not set differing / random IDs on packets - only that they are no longer required to be unique.

Interestingly, if I ping 8.8.8.8 from Linux, I see the Linux sender putting different values in this field, but the responses have zero ID. So at least Google thinks that zero ID is acceptable.

root@prometheus:~# tcpdump -i eth0 -nn -s0 -X icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:55:58.843065 IP 10.12.255.33 > 8.8.8.8: ICMP echo request, id 7075, seq 1, length 64
	0x0000:  4500 0054 9c1f 4000 4001 854c 0a0c ff21  E..T..@[email protected]...!
                           ^^^^
	0x0010:  0808 0808 0800 f439 1ba3 0001 0e15 055d  .......9.......]
	0x0020:  0000 0000 09dd 0c00 0000 0000 1011 1213  ................
	0x0030:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
	0x0040:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
	0x0050:  3435 3637                                4567
15:55:58.850547 IP 8.8.8.8 > 10.12.255.33: ICMP echo reply, id 7075, seq 1, length 64
	0x0000:  4500 0054 0000 0000 3601 6b6c 0808 0808  E..T....6.kl....
                           ^^^^
	0x0010:  0a0c ff21 0000 fc39 1ba3 0001 0e15 055d  ...!...9.......]
	0x0020:  0000 0000 09dd 0c00 0000 0000 1011 1213  ................
	0x0030:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
	0x0040:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
	0x0050:  3435 3637                                4567

But pinging 9.9.9.9 I see the responses with different IDs in each packet.

@fernandocarletti
Copy link

Did anyone find a workaround for this?

@brian-brazil
Copy link
Contributor

None of the various different issues reported here appear to be issues in the blackbox exporter, all look like true positives due to bugs in the network or configuration errors. There's nothing the blackbox exporter can do if pings aren't making it back to the machine, so I'm going to close this.

@gmintoco
Copy link

Update for us here in case it helps other people googling for it:

This was caused by the payload for the blackbox icmp probe being 36 bytes. When we increased it to 64 bytes our probes were successful (using the payload_size parameter)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants