ICMP RTT not accurate #315

Closed
mancubus77 opened this issue Apr 28, 2018 · 22 comments · Fixed by #346

Comments

@mancubus77

Host operating system: output of uname -a

Linux basic 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter -version

master 
commit 969228a28772a448591e37ca7fbac5aad708b57c

What is the blackbox.yml module config.

modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

What is the prometheus.yml scrape config.

  - job_name: 'blackbox-icmp'
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
        - 8.8.8.8    # DNS1
        - 1.1.1.1    # DNS2
        - 127.0.0.1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.17.0.1:9115

What logging output did you get from adding &debug=true to the probe URL?

--log.level=debug

Issue

ICMP RTT for localhost:
[root@ise-basic ~]# ping 127.0.0.1 -c 5

PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.030 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.049 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.032 ms
64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=5 ttl=64 time=0.034 ms

Meanwhile blackbox returns:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.0513e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.000377028
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

0.000377028 s equals 0.377028 ms, which looks wrong to me.
Nevertheless, in debug mode I noticed the timestamps are nearly the same:

ts=2018-04-28T12:42:14.747479787Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Beginning probe" probe=icmp timeout_seconds=4.5
ts=2018-04-28T12:42:14.747534755Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Resolving target address" preferred_ip_protocol=ip4
ts=2018-04-28T12:42:14.747557102Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Resolved target address" ip=127.0.0.1
ts=2018-04-28T12:42:14.747573825Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Creating socket"
ts=2018-04-28T12:42:14.747633676Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Creating ICMP packet" seq=41 id=46168
ts=2018-04-28T12:42:14.74765198Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Writing out packet"
ts=2018-04-28T12:42:14.747746052Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Waiting for reply packets"
ts=2018-04-28T12:42:14.747779592Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Found matching reply packet"
ts=2018-04-28T12:42:14.747812433Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Probe succeeded" duration_seconds=0.000301551

Based on the timestamps between "Waiting for reply packets" and "Found matching reply packet", the wait takes 747779592 - 747746052 = 33540 ns, which equals about 0.034 ms.

I also noticed a gap between these two timestamps:
ts=2018-04-28T12:42:14.74765198Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Writing out packet"
ts=2018-04-28T12:42:14.747746052Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Waiting for reply packets"

Based on the source code I don't believe this is the cause of the issue, but who knows.
Any advice on this?

@brian-brazil
Contributor

probe_duration_seconds is how long the probe as a whole takes.

@SuperQ
Member

SuperQ commented Apr 28, 2018

I looked into this previously. I can confirm that the exporter is very bad at handling very low latency results for ICMP.

It seems like a problem with Go just taking a long time to get the packet off the network. I need to do some more careful testing to see if there is something we can improve here.

@mancubus77
Author

@SuperQ
I read the code and made some measurements. It looks like icmp.go has overhead in its execution-time measurement, so when debug logging is enabled the reported response times soar:

[Graph: reported probe duration over time; red zone = debug log verbosity enabled, green zone = log verbosity disabled]

Measurement happens here:
main.go

        start := time.Now() // <-- measurement starts here
        registry := prometheus.NewRegistry()
        registry.MustRegister(probeSuccessGauge)
        registry.MustRegister(probeDurationGauge)
        success := prober(ctx, target, module, registry, sl)
        duration := time.Since(start).Seconds()
        probeDurationGauge.Set(duration) // <-- the whole prober call is measured

The application measures the execution time of the whole prober function, not the socket round trip. I think a solution might be to move the measurement from main.go into icmp.go and take the time difference between the socket write and the matching read:

        // Reply should be the same except for the message type.
        wm.Type = replyType
        wb, err = wm.Marshal(nil)
        if err != nil {
                level.Error(logger).Log("msg", "Error marshalling packet", "err", err)
                return
        }

        rb := make([]byte, 65536)
        if err := socket.SetReadDeadline(deadline); err != nil {
                level.Error(logger).Log("msg", "Error setting socket deadline", "err", err)
                return
        }
        start := time.Now() // >>> proposed: start the RTT clock here
        level.Info(logger).Log("msg", "Waiting for reply packets")
        for {
...
                if bytes.Compare(rb[:n], wb) == 0 {
                        rttTime := time.Since(start).Seconds() // >>> proposed: stop on the matching reply
                        return true, rttTime                   // >>> proposed: also return the RTT
                }

I think this could be a solution for precise timings and would get rid of the measurement overhead.
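
To make the idea concrete, here is a minimal, self-contained sketch of the proposed measurement placement (the pingRTT helper and its details are illustrative, not the exporter's actual code; it matches replies by type and sequence number rather than raw bytes, since unprivileged sockets rewrite the echo ID):

package main

import (
	"fmt"
	"net"
	"os"
	"time"

	"golang.org/x/net/icmp"
	"golang.org/x/net/ipv4"
)

// pingRTT sends one ICMP echo request and returns whether a matching
// reply arrived, plus the time between the socket write and the
// matching read, i.e. the RTT excluding any setup overhead.
func pingRTT(addr string, timeout time.Duration) (bool, time.Duration) {
	// Unprivileged ICMP socket; on Linux this may require
	// net.ipv4.ping_group_range to include the process's group.
	socket, err := icmp.ListenPacket("udp4", "0.0.0.0")
	if err != nil {
		fmt.Fprintln(os.Stderr, "listen:", err)
		return false, 0
	}
	defer socket.Close()

	wm := icmp.Message{
		Type: ipv4.ICMPTypeEcho, Code: 0,
		Body: &icmp.Echo{ID: os.Getpid() & 0xffff, Seq: 1, Data: []byte("ping")},
	}
	wb, err := wm.Marshal(nil)
	if err != nil {
		return false, 0
	}

	rb := make([]byte, 1500)
	if err := socket.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return false, 0
	}

	start := time.Now() // >>> clock starts at the socket write
	if _, err := socket.WriteTo(wb, &net.UDPAddr{IP: net.ParseIP(addr)}); err != nil {
		return false, 0
	}
	for {
		n, _, err := socket.ReadFrom(rb)
		if err != nil {
			return false, 0 // deadline reached or read error
		}
		rm, err := icmp.ParseMessage(1, rb[:n]) // 1 = ICMPv4 protocol number
		if err != nil {
			continue
		}
		if echo, ok := rm.Body.(*icmp.Echo); ok &&
			rm.Type == ipv4.ICMPTypeEchoReply && echo.Seq == 1 {
			return true, time.Since(start) // >>> clock stops at the matching read
		}
	}
}

func main() {
	ok, rtt := pingRTT("127.0.0.1", 2*time.Second)
	fmt.Println("success:", ok, "rtt:", rtt)
}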

@SuperQ
Member

SuperQ commented Apr 29, 2018

Yes, I'm thinking we could adjust the interface of the prober functions slightly so that they return success, duration := prober(...). This would give a more accurate result for the probe duration without measuring so much of the blackbox_exporter itself.
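
Roughly something like this (a sketch only; the import paths and the simplified parameter list are assumptions, and the real signature also carries a logger):

package prober

import (
	"context"

	"github.com/prometheus/blackbox_exporter/config"
	"github.com/prometheus/client_golang/prometheus"
)

// Current shape (simplified): main.go wraps the entire call in its own timer.
type ProbeFn func(ctx context.Context, target string, module config.Module,
	registry *prometheus.Registry) bool

// Suggested shape: the prober reports its own duration in seconds, so
// exporter setup and teardown are excluded from probe_duration_seconds.
type TimedProbeFn func(ctx context.Context, target string, module config.Module,
	registry *prometheus.Registry) (success bool, durationSeconds float64)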

@brian-brazil
Contributor

I don't think we should be changing the interface in that way, probe_duration_seconds should remain the length of the entire probe. That isn't to say the icmp module couldn't get a new metric.

@SuperQ
Member

SuperQ commented Apr 29, 2018

No, the probe duration needs to be the duration of the probe, not the blackbox_exporter. We shouldn't be measuring the exporter itself.

@brian-brazil
Contributor

probe_duration_seconds is in line with how every other exporter does it.

@mancubus77
Author

@SuperQ I'm with you, mate.
@brian-brazil Based on the code, all modules have the same behaviour. Would you consider one more gauge, something like real_probe_duration_seconds, to represent the time spent waiting for data from the socket?

@brian-brazil
Contributor

An icmp specific metric would be probe_icmp_something. reply_seconds maybe?

@SuperQ
Member

SuperQ commented Apr 29, 2018

I don't see how this is comparable to other exporters. Most other exporters expose raw, pre-measured data from some other system. This one creates a new measurement, which should be taken as close to its origin as possible: the duration of the probe, not the duration of the exporter.

We have scrape_duration_seconds for that.

@brian-brazil
Contributor

scrape_duration_seconds is what Prometheus sees, not what the exporter sees. The exporter guidelines also recommend a metric from the exporter's perspective, as differences between the two help measure network latency to the exporter and spot exporter overload.

Keep in mind also that this is the blackbox exporter, not the icmp exporter. What you are suggesting makes no sense for the other probes.

@brian-brazil
Contributor

Also, probe_duration_seconds is the duration of the probe.

@SuperQ
Member

SuperQ commented Apr 29, 2018

Exporter overload/overhead should come from /metrics, not /probe. For the probe itself, we don't care much about the latency to the exporter; we care about the probe itself, whether it's ICMP, HTTP, or whatever.

Yes, it does make sense for all prober functions, as they should not include exporter overhead in their measurements either.

@brian-brazil
Contributor

Look, I'm not going to argue this further. probe_duration_seconds is doing exactly what it's meant to be doing.

If you want to measure a sub-part of a probe, you need a new metric.

@mancubus77
Author

mancubus77 commented Apr 29, 2018

@brian-brazil I think this is a general issue for blackbox_exporter. How about implementing it across all modules? return true, rttTime might work fine for this, and a metric named probe_<module_name>_reply_seconds would fit perfectly here.

@brian-brazil
Contributor

might work fine for this, and a metric named probe_<module_name>_reply_seconds would fit perfectly here.

HTTP already has its own thing for this, so I wouldn't worry much about making it generic. Propose something that makes sense for icmp.

@mancubus77
Author

mancubus77 commented Apr 29, 2018

@brian-brazil I'm ok with probe_icmp_rtt_seconds

# HELP probe_icmp_rtt_seconds Returns Round Trip Time (RTT) for ICMP query. 
# TYPE probe_icmp_rtt_seconds gauge

@brian-brazil
Contributor

Sounds good, though I'd make it a bit clearer in the description exactly what duration is being measured.

@mancubus77
Author

@brian-brazil do you need any help implementing this, mate?

@brian-brazil
Contributor

A PR would be welcome.

@mancubus77
Author

@brian-brazil Any recommendations on how to implement it in line with the existing design?

@brian-brazil
Contributor

It's just another metric; look at how it's done in the other probes.
