ICMP probes fail when running multiple blackbox_exporters behind NAT #411
Comments
We should probably go for 1 if it's pid 1, and 2 to be safe.
BenoitKnecht added a commit to BenoitKnecht/blackbox_exporter that referenced this issue on Jan 30, 2019
This should help prevent issues with some network devices that have trouble NATing ICMP packets with the same ID and sequence number but a different source IP address. Currently, this can happen if the blackbox_exporter runs in a container (the ID is set to the PID, which is typically 1 in a container), and several blackbox_exporters are restarted at the same time (the sequence numbers are reset to zero and stay in sync). This commit sets the ICMP echo ID to a random value if the PID is 1, and initializes the sequence number at a random offset. See prometheus#411 for details. Signed-off-by: Benoît Knecht <[email protected]>
brian-brazil pushed a commit that referenced this issue on Jan 30, 2019
This should help prevent issues with some network devices that have trouble NATing ICMP packets with the same ID and sequence number but a different source IP address. Currently, this can happen if the blackbox_exporter runs in a container (the ID is set to the PID, which is typically 1 in a container), and several blackbox_exporters are restarted at the same time (the sequence numbers are reset to zero and stay in sync). This commit sets the ICMP echo ID to a random value if the PID is 1, and initializes the sequence number at a random offset. See #411 for details. Signed-off-by: Benoît Knecht <[email protected]>
Let me first describe my setup. I have two hosts on the same subnet, each running their own instance of Prometheus and a blackbox_exporter, running in Docker containers. Each Prometheus instance scrapes the corresponding blackbox_exporter on localhost. The two hosts have internet access through a firewall that performs NAT.
I use Ansible to manage those hosts, and whenever I change the blackbox_exporter configuration, both instances are restarted to pick up the new parameters. But after the restart, the ICMP probes from one of the hosts would fail, while the same probes on the other host were fine. Only the probes to public IP addresses were affected; probes to IP addresses on the LAN were fine (so it's potentially related to #360, although I don't know if the author of that bug also has multiple blackbox_exporters running in parallel).
After banging my head on my keyboard for a few days, I finally figured out what's going on.
The blackbox_exporter sets the ICMP echo ID to `os.Getpid() & 0xffff`, which in a Docker container is always `1`, and after a simultaneous restart, `int(getICMPSequence())` will be in sync on both blackbox_exporter instances.
When the firewall NATs those packets, it creates a session based on the ID and sequence number, and when the target replies, it doesn't know which packet should be NATed back to which source IP address, so all the replies go to a single blackbox_exporter. Restarting one of the blackbox_exporters fixes the issue because it resets `int(getICMPSequence())` so that they're out of sync between the two instances.
To be fair, it's technically a bug in the way this particular firewall (a Palo Alto) NATs ICMP traffic; a Linux firewall not only NATs the source IP address, it also NATs the ICMP echo ID in such a way that it can keep track of the two sessions separately.
However, it's not something that can easily be fixed, and my guess is that many other network devices would misbehave when faced with such ICMP traffic.
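To make the failure mode concrete, here is a toy model of a NAT that tracks ICMP echo sessions without rewriting the echo ID. It is only a sketch of the behaviour described above, not of how any real firewall is implemented: the `naiveNAT` type, its methods, the session key (destination, ID, sequence number), and the IP addresses are all assumptions made for illustration.

```go
package main

import "fmt"

// sessionKey is how this toy NAT identifies an ICMP echo session.
// Assumption: destination, echo ID and sequence number, with no ID rewriting.
type sessionKey struct {
	dst string
	id  uint16
	seq uint16
}

// naiveNAT maps each session key to the internal host that created it.
type naiveNAT struct {
	sessions map[sessionKey]string
}

func newNaiveNAT() *naiveNAT {
	return &naiveNAT{sessions: make(map[sessionKey]string)}
}

// Outbound records an echo request. If a session with the same key already
// exists, the packet is folded into it and the original owner is kept.
func (n *naiveNAT) Outbound(src, dst string, id, seq uint16) {
	key := sessionKey{dst, id, seq}
	if _, exists := n.sessions[key]; !exists {
		n.sessions[key] = src
	}
}

// ReplyTo returns the internal host that receives the echo reply,
// or "" if no session matches.
func (n *naiveNAT) ReplyTo(dst string, id, seq uint16) string {
	return n.sessions[sessionKey{dst, id, seq}]
}

func main() {
	const target = "192.0.2.1" // documentation address standing in for a public target
	nat := newNaiveNAT()

	// Both exporters run as PID 1 and were restarted together:
	// same echo ID, same sequence number.
	nat.Outbound("10.0.0.1", target, 1, 1)
	nat.Outbound("10.0.0.2", target, 1, 1)
	fmt.Println("in sync, reply delivered to:", nat.ReplyTo(target, 1, 1))

	// Once the sequence numbers differ, each exporter gets its own session.
	nat.Outbound("10.0.0.1", target, 1, 500)
	nat.Outbound("10.0.0.2", target, 1, 2)
	fmt.Println("out of sync, replies delivered to:",
		nat.ReplyTo(target, 1, 500), "and", nat.ReplyTo(target, 1, 2))
}
```

In this model the second exporter never sees a reply while the two are in sync, which matches the symptom of probes failing on only one host.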
I see several ways in which the blackbox_exporter could help work around this issue:
1. Set `body.ID` to a random integer instead of `os.Getpid()`; maybe only do that if `os.Getpid() == 1`, so that it only applies to instances running in a container.
2. Initialize `icmpSequence` with a random offset, to make sure two instances started at the same time aren't in sync.
3. Set both `body.ID` and `body.Seq` to random integers; if we don't care about the sequence number increasing monotonically, that would make the likelihood of multiple blackbox_exporters sending identical ICMP echo packets at the same time as small as possible.
What are your thoughts? Is this something you would be willing to address? If so, do you have a preferred solution? I'd be happy to submit a PR if there's a general consensus on the best approach.