ICMP probes fail when running multiple blackbox_exporters behind NAT #411

Closed
BenoitKnecht opened this issue Jan 29, 2019 · 1 comment

@BenoitKnecht
Contributor

Let me first describe my setup. I have two hosts on the same subnet, each running its own Prometheus instance and blackbox_exporter in Docker containers. Each Prometheus instance scrapes the corresponding blackbox_exporter on localhost. The two hosts reach the internet through a firewall that performs NAT.

I use Ansible to manage those hosts, and whenever I change the blackbox_exporter configuration, both instances are restarted to pick up the new parameters. But after the restart, the ICMP probes from one of the hosts would fail while the same probes on the other host were fine; and only probes to public IP addresses were affected, probes to IP addresses on the LAN kept working (so it's potentially related to #360, though I don't know if the author of that bug also has multiple blackbox_exporters running in parallel).

After banging my head on my keyboard for a few days, I finally figured out what's going on.

The blackbox_exporter sets the ICMP echo ID to os.Getpid() & 0xffff, which in a Docker container is typically 1, and after a simultaneous restart, int(getICMPSequence()) will be in sync on both blackbox_exporter instances:

body := &icmp.Echo{
    ID:   os.Getpid() & 0xffff,
    Seq:  int(getICMPSequence()),
    Data: data,
}
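
To make the collision concrete: after a simultaneous restart, both instances end up marshaling byte-for-byte identical echo requests. Here's a trimmed-down illustration using the same golang.org/x/net/icmp package (the payload is just a placeholder, not what the exporter actually sends):

package main

import (
    "fmt"
    "log"

    "golang.org/x/net/icmp"
    "golang.org/x/net/ipv4"
)

func main() {
    // Both exporters build an echo request with ID 1 (their PID inside the
    // container) and the same sequence number after a simultaneous restart.
    msg := icmp.Message{
        Type: ipv4.ICMPTypeEcho,
        Code: 0,
        Body: &icmp.Echo{ID: 1, Seq: 1, Data: []byte("placeholder")},
    }
    wb, err := msg.Marshal(nil)
    if err != nil {
        log.Fatal(err)
    }
    // The marshaled packets are identical on both hosts; only the outer IP
    // source address differs, and that's exactly the part the NAT rewrites.
    fmt.Printf("% x\n", wb)
}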

When the firewall NATs those packets, it creates a session based on the ID and sequence number, so when the target replies, it can't tell which reply should be translated back to which source IP address, and all the replies end up going to a single blackbox_exporter. Restarting one of the blackbox_exporters fixes the issue because it resets that instance's getICMPSequence() counter, so the two instances are no longer in sync.

To be fair, it's technically a bug in the way this particular firewall (a Palo Alto) NATs ICMP traffic; a Linux firewall not only NATs the source IP address, it also rewrites the ICMP echo ID so that it can keep track of the two sessions separately.

However, that's not something that can easily be fixed on the firewall side, and my guess is that many more network devices would misbehave when faced with such ICMP traffic.

I see several ways in which the blackbox_exporter could help work around this issue:

  1. Set body.ID to a random integer instead of os.Getpid(); maybe only do that if os.Getpid() == 1, so that it only applies to instances running in a container.
  2. Initialize icmpSequence with a random offset, to make sure two instances started at the same time aren't in sync (a sketch combining options 1 and 2 follows this list).
  3. Set both body.ID and body.Seq to random integers; if we don't care about the sequence number increasing monotonically, that would make the likelihood of multiple blackbox_exporters sending identical ICMP echo packets at the same time as small as possible.
  4. Make the ICMP ID configurable, so that it can be set by the user if needed.
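
To sketch what options 1 and 2 combined might look like (icmpID and the mutex are just illustrative names, not the actual blackbox_exporter code):

import (
    "math/rand"
    "os"
    "sync"
    "time"
)

var (
    icmpID            int
    icmpSequence      uint16
    icmpSequenceMutex sync.Mutex
)

func init() {
    rnd := rand.New(rand.NewSource(time.Now().UnixNano()))
    // Option 1: only randomize the ID when the PID is 1, i.e. when running
    // in a container; otherwise keep the current behaviour.
    icmpID = os.Getpid() & 0xffff
    if os.Getpid() == 1 {
        icmpID = rnd.Intn(1 << 16)
    }
    // Option 2: start the sequence counter at a random offset so that two
    // instances restarted at the same time don't stay in sync.
    icmpSequence = uint16(rnd.Intn(1 << 16))
}

func getICMPSequence() uint16 {
    icmpSequenceMutex.Lock()
    defer icmpSequenceMutex.Unlock()
    icmpSequence++
    return icmpSequence
}

body.ID would then be read from icmpID instead of being computed from os.Getpid() on every probe.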

What are your thoughts? Is this something you would be willing to address? If so, do you have a preferred solution? I'd be happy to submit a PR if there's a general consensus on the best approach.

@brian-brazil
Contributor

We should probably go for 1 if it's pid 1, and 2 to be safe.

BenoitKnecht added a commit to BenoitKnecht/blackbox_exporter that referenced this issue Jan 30, 2019
This should help prevent issues with some network devices that have
trouble NATing ICMP packets with the same ID and sequence number but a
different source IP address.

Currently, this can happen if the blackbox_exporter runs in a container
(the ID is set to the PID, which is typically 1 in a container), and
several blackbox_exporters are restarted at the same time (the sequence
numbers are reset to zero and stay in sync).

This commit sets the ICMP echo ID to a random value if the PID is 1, and
initializes the sequence number at a random offset.

See prometheus#411 for details.

Signed-off-by: Benoît Knecht <[email protected]>
brian-brazil pushed a commit that referenced this issue Jan 30, 2019