memberlist reconnect #1384
Conversation
Signed-off-by: stuart nelson <[email protected]>
DefaultProbeTimeout      = 500 * time.Millisecond
DefaultProbeInterval     = 1 * time.Second
DefaultReconnectInterval = 10 * time.Second
DefaultReconnectTimeout  = 6 * time.Hour
what should these values be?
Reconnection should probably be indefinite, for as long as SD returns the AM.
Peers only exist from starting the binary (one of the `--cluster.peer` flags), or as an instance that connects to a running AM later. AM doesn't do any form of SD lookup to find its peers, so I think there needs to be some form of timeout since we have no way of knowing if a former peer is unreachable or has ceased to exist.
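For illustration, here is a minimal sketch of how such a reconnect timeout could prune long-dead peers, assuming an illustrative `peer` record that stores when the node was marked failed (the type and method names here are mine, not necessarily the PR's):

```go
package cluster

import "time"

// Illustrative types for this sketch; the PR's actual structs differ.
type peer struct {
	address   string
	leaveTime time.Time // when the peer was marked as failed
}

type Peer struct {
	failedPeers []peer
}

// removeFailedPeers drops peers that have been unreachable for longer
// than timeout (e.g. DefaultReconnectTimeout), so reconnection attempts
// don't continue forever for nodes that may have ceased to exist.
func (p *Peer) removeFailedPeers(timeout time.Duration) {
	keep := p.failedPeers[:0]
	for _, fp := range p.failedPeers {
		if time.Since(fp.leaveTime) <= timeout {
			keep = append(keep, fp)
		}
	}
	p.failedPeers = keep
}
```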
Signed-off-by: stuart nelson <[email protected]>
force-pushed from eb955b2 to 80d831f
Signed-off-by: stuart nelson <[email protected]>
func (p *Peer) reconnect() {
	p.peerLock.RLock()
	failedPeers := p.failedPeers
	p.peerLock.RUnlock()
The reconnect test was locking up when this was doing the normal `Lock(); defer Unlock()`. I don't know *why*, though, and couldn't see any obvious reason.
Thanks, I'll have a look!
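For context, a sketch of the pattern in the diff above: the slice is copied under a read lock and the lock is released before any memberlist calls, since holding the mutex across `Join` (which can fire delegate callbacks that also take the lock) is one plausible cause of a lock-up. The `mlist` field name and the loop body are assumptions, not the exact PR code:

```go
// Snapshot failedPeers under RLock, release the lock, and only then
// attempt the joins so the mutex is never held across memberlist calls.
func (p *Peer) reconnect() {
	p.peerLock.RLock()
	failedPeers := make([]peer, len(p.failedPeers))
	copy(failedPeers, p.failedPeers)
	p.peerLock.RUnlock()

	for _, fp := range failedPeers {
		// memberlist's Join takes a list of addresses to try to contact.
		if _, err := p.mlist.Join([]string{fp.address}); err != nil {
			continue // leave the peer in failedPeers; retry on the next tick
		}
	}
}
```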
Looks good overall.
AFAICT there are still a few situations that aren't handled properly (see the sketch after this list):
- when a peer restarts, the other peer still tries to reconnect even after the successful rejoin of the first one, because the name (ULID) of the first peer has changed. One solution would be to use the peer's address instead of its name as the key.
- with asymmetric configurations (e.g. peer A is started without `--cluster.peer`, peer B with `--cluster.peer=<peer A>`), peer B will never try to reconnect if A is down when B starts.
cluster/cluster.go
Outdated
@@ -143,9 +198,188 @@ func Join(
	if n > 0 {
		go p.warnIfAlone(l, 10*time.Second)
IMO we can get rid of this goroutine.
		return float64(len(p.failedPeers))
	})
	p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
Instead of failed/successful reconnections, it could be failed/total reconnections. Also a `_total` suffix for counters?
sounds good to me.
Ah, so checking the result of the initial
If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed.
Signed-off-by: stuart nelson <[email protected]>
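A sketch of the join side of that fix, matching on address so a rejoining node with a fresh ULID is still cleared from the failed list (names are illustrative, not the exact PR code):

```go
// peerJoin removes a node from failedPeers when anything rejoins from the
// same address, even if its memberlist name (ULID) has changed.
func (p *Peer) peerJoin(n *memberlist.Node) {
	p.peerLock.Lock()
	defer p.peerLock.Unlock()

	addr := n.Address()
	p.peers[addr] = peer{address: addr}

	for i, fp := range p.failedPeers {
		if fp.address == addr {
			p.failedPeers = append(p.failedPeers[:i], p.failedPeers[i+1:]...)
			break
		}
	}
}
```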
Updated. Feel free to suggest a better way to grab the nodes that we failed to connect to.
cluster/cluster.go
Outdated
	p.reconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_reconnections_total",
		Help: "A counter of the number of successful cluster peer reconnections.",
`successful` to be removed.
cluster/cluster.go
Outdated
		return float64(len(p.failedPeers))
	})
	p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_failed_reconnections_total",
`alertmanager_cluster_reconnections_failed_total` might be more idiomatic?
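Taken together, the naming suggested in this thread would register something like the following; a sketch only, the final names and help strings are whatever the PR settles on:

```go
package cluster

import "github.com/prometheus/client_golang/prometheus"

var (
	reconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_reconnections_total",
		Help: "A counter of the number of cluster peer reconnection attempts.",
	})
	reconnectionsFailedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		// Keeps the failure qualifier next to the _total suffix.
		Name: "alertmanager_cluster_reconnections_failed_total",
		Help: "A counter of the number of failed cluster peer reconnection attempts.",
	})
)
```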
cluster/cluster.go
Outdated
		peers: map[string]peer{},
	}

	if reconnectInterval != 0 {
It could be done at the very end of the function once it is certain that it will return without error.
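A sketch of that suggestion, with a ticker-driven loop whose shape (and the `stopc` channel) is assumed rather than copied from the PR:

```go
// handleReconnect retries failed peers on every tick until the (assumed)
// stopc channel is closed.
func (p *Peer) handleReconnect(d time.Duration) {
	tick := time.NewTicker(d)
	defer tick.Stop()

	for {
		select {
		case <-p.stopc:
			return
		case <-tick.C:
			p.reconnect()
		}
	}
}
```

Starting it would then move to the last lines of `Join`, e.g. `go p.handleReconnect(reconnectInterval)` just before the final `return p, nil`, once nothing else can fail.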
cluster/cluster.go
Outdated
	if n > 0 {
		go p.warnIfAlone(l, 10*time.Second)
	}
	p.setInitialFailed(resolvedPeers)
IMO it would be simpler to initialize `p.failedPeers` with all known peers before calling `ml.Join(...)`.
that's WAAAY better :)
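A sketch of that simplification inside `Join`: seed `failedPeers` from the resolved peer addresses before joining, and let the join callback clear whatever actually answers. Variable names follow the diff where visible; the logging call assumes the go-kit `level` helper used elsewhere in the file:

```go
// Seed the failed-peer list with every known peer before joining; the
// memberlist join callback then removes the ones that respond.
for _, addr := range resolvedPeers {
	pr := peer{address: addr, leaveTime: time.Now()}
	p.peers[addr] = pr
	p.failedPeers = append(p.failedPeers, pr)
}

if _, err := ml.Join(resolvedPeers); err != nil {
	level.Warn(l).Log("msg", "initial join failed; reconnect loop will retry", "err", err)
}
```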
	pr, ok := p.peers[n.Address()]
	if !ok {
		// Why are we receiving an update from a node that never
		// joined?
Could be that the process has restarted and receives an out-of-band notification?
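One defensive way to handle that case is to record the unknown node instead of dropping the update; a sketch, assuming the handler shape from the diff above:

```go
// peerUpdate tolerates updates from nodes we have no record of, which can
// happen when this process restarted and receives a notification for a
// member it has not (re)learned about yet.
func (p *Peer) peerUpdate(n *memberlist.Node) {
	p.peerLock.Lock()
	defer p.peerLock.Unlock()

	pr, ok := p.peers[n.Address()]
	if !ok {
		// Unknown node: likely an out-of-band notification after a restart,
		// so add it rather than ignoring it.
		pr = peer{address: n.Address()}
	}
	p.peers[n.Address()] = pr
}
```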
force-pushed from 7657bc7 to 36d80ab
Signed-off-by: stuart nelson <[email protected]>
@simonpasquier I think I've addressed all your comments, let me know what you think.
Going to merge this so we can get another rc out.
@stuartnelson3 Let me know if you want me to follow up with #1363 and #1364.
As soon as I feel confident about #1389, let's push this out. That's the only thing (in my mind) blocking 0.15.0. There were issues with message queueing (we were generating more messages than could be gossiped) that I need to resolve; I'll be working on it tomorrow.
EDIT: Thanks for your patience, I know this has been a realllllllllly long release cycle.
add reconnection support for dead peers
todo: `DefaultReconnectInterval` and `DefaultReconnectTimeout`
edit: I also included the `logWriter{}` wrapper I was using to expose memberlist logging. It's very verbose, and doesn't really conform to how we've been logging, so I'm not sure how best to expose it (or if I should just remove it from this PR).
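For reference, the `logWriter{}` wrapper described here can be as small as an `io.Writer` that forwards memberlist's log lines to the existing go-kit logger. A sketch of one possible shape, not necessarily the PR's exact implementation:

```go
package cluster

import (
	"log"

	kitlog "github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// logWriter adapts memberlist's standard-library logging to a go-kit
// logger. Every line memberlist writes becomes one debug-level event,
// which is why the output is so verbose.
type logWriter struct {
	l kitlog.Logger
}

func (lw *logWriter) Write(b []byte) (int, error) {
	return len(b), level.Debug(lw.l).Log("memberlist", string(b))
}

// newMemberlistLogger builds a *log.Logger that memberlist's config can
// use, backed by the adapter above.
func newMemberlistLogger(l kitlog.Logger) *log.Logger {
	return log.New(&logWriter{l: l}, "", 0)
}
```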