ICMP RTT not accurate #315

Closed
mancubus77 opened this issue Apr 28, 2018 · 22 comments · Fixed by #346

Comments

@mancubus77

Host operating system: output of uname -a

Linux basic 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter -version

master 
commit 969228a28772a448591e37ca7fbac5aad708b57c

What is the blackbox.yml module config.

modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

What is the prometheus.yml scrape config.

  - job_name: 'blackbox-icmp'
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
        - 8.8.8.8    # DNS1
        - 1.1.1.1    # DNS2
        - 127.0.0.1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.17.0.1:9115

What logging output did you get from adding &debug=true to the probe URL?

--log.level=debug

Issue

ICMP RTT for localhost:
[root@ise-basic ~]# ping 127.0.0.1 -c 5

PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.030 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.049 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.032 ms
64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=5 ttl=64 time=0.034 ms

Meanwhile blackbox returns:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.0513e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.000377028
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

0.000377028 s equals 0.377028 ms, which looks wrong to me.
Nevertheless, in debug mode I noticed the timestamps are nearly the same:

ts=2018-04-28T12:42:14.747479787Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Beginning probe" probe=icmp timeout_seconds=4.5
ts=2018-04-28T12:42:14.747534755Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Resolving target address" preferred_ip_protocol=ip4
ts=2018-04-28T12:42:14.747557102Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Resolved target address" ip=127.0.0.1
ts=2018-04-28T12:42:14.747573825Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Creating socket"
ts=2018-04-28T12:42:14.747633676Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Creating ICMP packet" seq=41 id=46168
ts=2018-04-28T12:42:14.74765198Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Writing out packet"
ts=2018-04-28T12:42:14.747746052Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Waiting for reply packets"
ts=2018-04-28T12:42:14.747779592Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Found matching reply packet"
ts=2018-04-28T12:42:14.747812433Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Probe succeeded" duration_seconds=0.000301551

Based on the timestamps between "Waiting for reply packets" and "Found matching reply packet", the wait takes 747779592 - 747746052 = 33540 ns, which equals about 0.034 ms.

I also noticed a gap between these two timestamps:
ts=2018-04-28T12:42:14.74765198Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Writing out packet"
ts=2018-04-28T12:42:14.747746052Z caller=main.go:176 module=icmp target=127.0.0.1 level=debug msg="Waiting for reply packets"

Based on the source code I don't believe this is the cause of the issue, but who knows.
Any advice on this?

@brian-brazil
Contributor

probe_duration_seconds is how long the probe as a whole takes.

@SuperQ
Member

SuperQ commented Apr 28, 2018

I looked into this previously. I can confirm that the exporter is very bad at handling very low latency results for ICMP.

It seems like a problem with Go just taking a long time to get the packet off the network. I need to do some more careful testing to see if there is something we can improve here.

@mancubus77
Author

@SuperQ
I read the code and made some measurements. It looks like icmp.go has overhead in its execution-time measurement, so when debug logging is enabled the reported response times soar:

[Graph: reported probe duration over time; red zone = debug log verbosity enabled, green zone = log verbosity disabled]

Measurement happens here:
main.go

        start := time.Now() // <-- measurement starts here
        registry := prometheus.NewRegistry()
        registry.MustRegister(probeSuccessGauge)
        registry.MustRegister(probeDurationGauge)
        success := prober(ctx, target, module, registry, sl)
        duration := time.Since(start).Seconds()
        probeDurationGauge.Set(duration) // <-- the whole prober call is measured

The application measures the execution time of the whole prober function, not the socket round trip. I think a solution might be to move the measurement from main.go into icmp.go and take the time difference between the socket write and the matching read:

        // Reply should be the same except for the message type.
        wm.Type = replyType
        wb, err = wm.Marshal(nil)
        if err != nil {
                level.Error(logger).Log("msg", "Error marshalling packet", "err", err)
                return
        }

        rb := make([]byte, 65536)
        if err := socket.SetReadDeadline(deadline); err != nil {
                level.Error(logger).Log("msg", "Error setting socket deadline", "err", err)
                return
        }
        start := time.Now() // >>> proposed: start the RTT clock here
        level.Info(logger).Log("msg", "Waiting for reply packets")
        for {
...
                if bytes.Compare(rb[:n], wb) == 0 {
                        rttTime := time.Since(start).Seconds() // >>> proposed: stop on the matching reply
                        return true, rttTime                   // >>> proposed: also return the RTT
                }

I think this could be a solution for precise timings and would get rid of the measurement overhead.
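
To make the idea concrete, here is a minimal, self-contained sketch of the proposed measurement placement (the pingRTT helper and its details are illustrative, not the exporter's actual code; it matches replies by type and sequence number rather than raw bytes, since unprivileged sockets rewrite the echo ID):

package main

import (
	"fmt"
	"net"
	"os"
	"time"

	"golang.org/x/net/icmp"
	"golang.org/x/net/ipv4"
)

// pingRTT sends one ICMP echo request and returns whether a matching
// reply arrived, plus the time between the socket write and the
// matching read, i.e. the RTT excluding any setup overhead.
func pingRTT(addr string, timeout time.Duration) (bool, time.Duration) {
	// Unprivileged ICMP socket; on Linux this may require
	// net.ipv4.ping_group_range to include the process's group.
	socket, err := icmp.ListenPacket("udp4", "0.0.0.0")
	if err != nil {
		fmt.Fprintln(os.Stderr, "listen:", err)
		return false, 0
	}
	defer socket.Close()

	wm := icmp.Message{
		Type: ipv4.ICMPTypeEcho, Code: 0,
		Body: &icmp.Echo{ID: os.Getpid() & 0xffff, Seq: 1, Data: []byte("ping")},
	}
	wb, err := wm.Marshal(nil)
	if err != nil {
		return false, 0
	}

	rb := make([]byte, 1500)
	if err := socket.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return false, 0
	}

	start := time.Now() // >>> clock starts at the socket write
	if _, err := socket.WriteTo(wb, &net.UDPAddr{IP: net.ParseIP(addr)}); err != nil {
		return false, 0
	}
	for {
		n, _, err := socket.ReadFrom(rb)
		if err != nil {
			return false, 0 // deadline reached or read error
		}
		rm, err := icmp.ParseMessage(1, rb[:n]) // 1 = ICMPv4 protocol number
		if err != nil {
			continue
		}
		if echo, ok := rm.Body.(*icmp.Echo); ok &&
			rm.Type == ipv4.ICMPTypeEchoReply && echo.Seq == 1 {
			return true, time.Since(start) // >>> clock stops at the matching read
		}
	}
}

func main() {
	ok, rtt := pingRTT("127.0.0.1", 2*time.Second)
	fmt.Println("success:", ok, "rtt:", rtt)
}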

@SuperQ
Member

SuperQ commented Apr 29, 2018

Yes, I'm thinking we could adjust the interface of the prober functions slightly so that they return success, duration := prober(...). This would give a more accurate result for the probe duration without measuring so much of the blackbox_exporter itself.
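
Roughly something like this (a sketch only; the import paths and the simplified parameter list are assumptions, and the real signature also carries a logger):

package prober

import (
	"context"

	"github.com/prometheus/blackbox_exporter/config"
	"github.com/prometheus/client_golang/prometheus"
)

// Current shape (simplified): main.go wraps the entire call in its own timer.
type ProbeFn func(ctx context.Context, target string, module config.Module,
	registry *prometheus.Registry) bool

// Suggested shape: the prober reports its own duration in seconds, so
// exporter setup and teardown are excluded from probe_duration_seconds.
type TimedProbeFn func(ctx context.Context, target string, module config.Module,
	registry *prometheus.Registry) (success bool, durationSeconds float64)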

@brian-brazil
Contributor

I don't think we should be changing the interface in that way, probe_duration_seconds should remain the length of the entire probe. That isn't to say the icmp module couldn't get a new metric.

@SuperQ
Member

SuperQ commented Apr 29, 2018

No, the probe duration needs to be the duration of the probe, not the blackbox_exporter. We shouldn't be measuring the exporter itself.

@brian-brazil
Contributor

probe_duration_seconds is in line with how every other exporter does it.

@mancubus77
Author

@SuperQ I'm with you, mate.
@brian-brazil Based on the code, all modules have the same behaviour. Would you consider one more gauge, something like real_probe_duration_seconds, to represent the time spent waiting for data from the socket?

@brian-brazil
Contributor

An icmp specific metric would be probe_icmp_something. reply_seconds maybe?

@SuperQ
Member

SuperQ commented Apr 29, 2018

I don't see how this is comparable to other exporters. Most other exporters expose raw, pre-measured data from some other system. This one creates a new measurement, which should be taken as close to its origin as possible: the duration of the probe, not the duration of the exporter.

We have scrape_duration_seconds for that.

@brian-brazil
Contributor

scrape_duration_seconds is what Prometheus sees, not what the exporter sees. The exporter guidelines also recommend a metric from the exporter's perspective, as differences between the two help measure network latency to the exporter and spot exporter overload.

Keep in mind also that this is the blackbox exporter, not the icmp exporter. What you are suggesting makes no sense for the other probes.

@brian-brazil
Contributor

Also, probe_duration_seconds is the duration of the probe.

@SuperQ
Member

SuperQ commented Apr 29, 2018

Exporter overload/overhead should come from /metrics, not /probe. For the probe itself, we don't care much about the latency to the exporter; we care about the probe itself, whether it's ICMP, HTTP, or whatever.

Yes, it does make sense for all prober functions, as they should not include exporter overhead in their measurements either.

@brian-brazil
Contributor

Look, I'm not going to argue this further. probe_duration_seconds is doing exactly what it's meant to be doing.

If you want to measure a sub-part of a probe, you need a new metric.

@mancubus77
Author

mancubus77 commented Apr 29, 2018

@brian-brazil I think this is a general issue for blackbox_exporter. How about implementing it across all modules? return true, rttTime might work fine for this, and a metric named probe_<module_name>_reply_seconds would fit perfectly here.

@brian-brazil
Contributor

might work fine for this, and a metric named probe_<module_name>_reply_seconds would fit perfectly here.

HTTP already has its own thing for this, so I wouldn't worry much about making it generic. Propose something that makes sense for icmp.

@mancubus77
Author

mancubus77 commented Apr 29, 2018

@brian-brazil I'm ok with probe_icmp_rtt_seconds

# HELP probe_icmp_rtt_seconds Returns Round Trip Time (RTT) for ICMP query. 
# TYPE probe_icmp_rtt_seconds gauge

@brian-brazil
Contributor

Sounds good, though I'd make it a bit clearer in the description exactly what duration is being measured.

@mancubus77
Author

@brian-brazil do you need any help implementing this, mate?

@brian-brazil
Contributor

A PR would be welcome.

@mancubus77
Author

@brian-brazil Any recommendations on how to implement it in line with the existing design?

@brian-brazil
Contributor

It's just another metric; look at how it's done in the other probes.
