feat(serf) override serf probe interval and timeout w/ environment variables #102
This pull request implements an optional mechanism for overriding serf probe interval (default: 1 s) and timeout (default: 0.5 s) via environment variables AGENT_PROBE_INTERVAL and AGENT_PROBE_TIMEOUT.
A little background: I'm running Docker Swarm workers on a network of Raspberry Pis scattered all around the site. Network connectivity to the Pis is reliable enough for a swarm with a longer dispatcher-heartbeat setting to survive regular network slowdowns, but Portainer Agent, in my case, relies too heavily on refuted suspect messages to keep the other agents connected. The default timeout for receiving acknowledgments from peers is simply too low for such a network (bad Wi-Fi reception plus underpowered hardware). Moreover, sometimes even the refutations get lost, and the fallback TCP ping mechanism does not wait for successful TCP retransmits before deciding that the peer is dead. As a consequence, I'm experiencing all the symptoms described in portainer/portainer#2535.

I tracked the instability down to the Agent's default serf probe configuration, which is sane for a LAN, namely the probe interval and timeout. By running a custom build of Agent in which I replaced the Agent's default settings with serf's WAN presets, I got rid of the endpoint instability completely, and my agent logs are much cleaner now. I'm therefore submitting this pull request, in which the probe interval and timeout can be overridden by two environment variables. It may come in handy to people running a swarm on poor networks.