
memberlist reconnect #1384

Merged: 14 commits from stn/memberlist-reconnect into master on Jun 5, 2018
Conversation

@stuartnelson3 (Contributor, Author) commented on May 15, 2018:

add reconnection support for dead peers

todo:

edit:

I also included the logWriter{} wrapper I was using to expose memberlist logging. It's very verbose, and doesn't really conform to how we've been logging, so I'm not sure how best to expose it (or if I should just remove it from this PR).

Signed-off-by: stuart nelson <[email protected]>
DefaultProbeTimeout      = 500 * time.Millisecond
DefaultProbeInterval     = 1 * time.Second
DefaultReconnectInterval = 10 * time.Second
DefaultReconnectTimeout  = 6 * time.Hour
@stuartnelson3 (Contributor, Author):

what should these values be?

Contributor:

Reconnection should probably be indefinite, for as long as service discovery (SD) returns the Alertmanager (AM).

@stuartnelson3 (Contributor, Author):

Peers are only known at startup (via the --cluster.peer flags) or when an instance connects to a running AM later. AM doesn't do any form of SD lookup to find its peers, so I think there needs to be some form of timeout, since we have no way of knowing whether a former peer is temporarily unreachable or has ceased to exist.
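
For illustration, such a timeout could be enforced on the failed-peer list roughly as below; the failedPeer and peerState types, their fields, and the removeFailedPeers helper are assumptions for this sketch, not the PR's code.

package cluster

import (
	"sync"
	"time"
)

// Illustrative types only; the PR's real structs differ.
type failedPeer struct {
	address string
	since   time.Time // when the peer was first seen as failed
}

type peerState struct {
	mtx         sync.Mutex
	failedPeers []failedPeer
}

// removeFailedPeers drops peers that have been unreachable for longer than
// the reconnect timeout, so reconnection attempts don't run forever for
// peers that have ceased to exist.
func (p *peerState) removeFailedPeers(timeout time.Duration) {
	p.mtx.Lock()
	defer p.mtx.Unlock()

	cutoff := time.Now().Add(-timeout)
	kept := p.failedPeers[:0]
	for _, fp := range p.failedPeers {
		if fp.since.After(cutoff) {
			kept = append(kept, fp)
		}
	}
	p.failedPeers = kept
}

A reconnect ticker (for example at DefaultReconnectInterval) would call this before re-joining the peers still on the list, with DefaultReconnectTimeout as the cutoff.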

Signed-off-by: stuart nelson <[email protected]>
@stuartnelson3 force-pushed the stn/memberlist-reconnect branch from eb955b2 to 80d831f on May 15, 2018
Signed-off-by: stuart nelson <[email protected]>
@stuartnelson3 changed the title from "[wip] memberlist reconnect" to "memberlist reconnect" on May 16, 2018
func (p *Peer) reconnect() {
	p.peerLock.RLock()
	failedPeers := p.failedPeers
	p.peerLock.RUnlock()
@stuartnelson3 (Contributor, Author):

The reconnect test was locking up when this was doing the normal Lock(); defer Unlock(). I don't know *why*, though, and couldn't see any obvious reason.
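
The pattern above is the likely reason it avoids the hang: snapshot the failed-peer slice under the read lock and release the lock before dialing, so a slow or blocking connection attempt never holds the lock while memberlist callbacks need it for writing. A minimal, self-contained sketch (the peer type, the join stand-in, and the field names are assumptions, not the PR's code):

package cluster

import "sync"

// Illustrative stand-ins; the PR's real types differ.
type peer struct{ address string }

type Peer struct {
	peerLock    sync.RWMutex
	failedPeers []peer
}

// join is a stand-in for the actual memberlist re-join call.
func (p *Peer) join(addr string) error { return nil }

// reconnect copies the failed-peer list under a read lock and releases the
// lock before dialing, so that a slow connection attempt can never hold the
// lock while another goroutine needs it for writing.
func (p *Peer) reconnect() {
	p.peerLock.RLock()
	failedPeers := make([]peer, len(p.failedPeers))
	copy(failedPeers, p.failedPeers)
	p.peerLock.RUnlock()

	for _, fp := range failedPeers {
		// Errors are ignored here; a peer that fails to reconnect simply
		// stays on the list for the next attempt.
		_ = p.join(fp.address)
	}
}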

@stuartnelson3 requested review from fabxc, grobie and brancz on May 16, 2018
@stuartnelson3 (Contributor, Author):

@simonpasquier

@simonpasquier (Member):

Thanks, I'll have a look!

@simonpasquier (Member) left a comment:

Looks good overall.

AFAICT there are still a few situations that aren't handled properly:

  • when a peer restarts, the other peer keeps trying to reconnect even after the first one has successfully rejoined, because the name (ULID) of the first peer has changed. One solution would be to use the peer's address instead of its name as the key (see the sketch after this list).
  • with asymmetric configurations (e.g. peer A is started without --cluster.peer, peer B with --cluster.peer=<peer A>), peer B will never try to reconnect if A is down when B starts.
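
A sketch of the first suggestion: key the failed-peer set by the node's address, which survives a restart, rather than by its memberlist name (a ULID, which does not). The Peer fields and the peerJoin callback here are assumptions for illustration:

package cluster

import (
	"sync"

	"github.com/hashicorp/memberlist"
)

// Illustrative fields; the PR's real Peer struct differs.
type Peer struct {
	peerLock    sync.Mutex
	failedPeers map[string]struct{} // keyed by "ip:port", not by ULID name
}

// peerJoin is called from the memberlist event delegate when a node joins.
// Keying by n.Address() rather than n.Name means a restarted peer (same
// address, new ULID) is recognized and no longer treated as failed.
func (p *Peer) peerJoin(n *memberlist.Node) {
	p.peerLock.Lock()
	defer p.peerLock.Unlock()
	delete(p.failedPeers, n.Address())
}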

@@ -143,9 +198,188 @@ func Join(
	if n > 0 {
		go p.warnIfAlone(l, 10*time.Second)
Member:

IMO we can get rid of this goroutine.


	return float64(len(p.failedPeers))
})
p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
Member:

Instead of failed/successful reconnections, it could be failed/total reconnections. Also, a _total suffix for the counters?

@stuartnelson3 (Contributor, Author):

> when a peer restarts, the other peer still tries to reconnect even after the successful rejoin of the first one because the name (ULID) of the first peer has changed. One solution would be to use the peer's address instead of its name as the key.

sounds good to me.

> with asymmetric configurations (e.g. peer A is started without --cluster.peer, peer B with --cluster.peer=<peer A>), peer B will never try to reconnect if A is down when B starts.

Ah, so checking the result of the initial memberlist.Join and adding any non-connected nodes to the failedPeers list. Makes sense.

If a peer is restarted, it will rejoin with the
same IP but different ULID. So the node will
rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes
because its ULID has changed.

Signed-off-by: stuart nelson <[email protected]>
@stuartnelson3 (Contributor, Author):

updated. feel free to suggest a better way to grab the nodes that we failed to connect to.


p.reconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "alertmanager_cluster_reconnections_total",
	Help: "A counter of the number of successful cluster peer reconnections.",
Member:

"successful" to be removed from the Help text.

	return float64(len(p.failedPeers))
})
p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "alertmanager_cluster_failed_reconnections_total",
Member:

alertmanager_cluster_reconnections_failed_total might be more idiomatic?
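
For reference, the suggested failed/total pair with the conventional _total suffix could look like the sketch below; the names follow the review suggestions and Prometheus naming conventions, not necessarily the code as merged.

package cluster

import "github.com/prometheus/client_golang/prometheus"

// Metric names here follow the review suggestions; they are illustrative,
// not necessarily identical to the merged code.
var (
	reconnectionsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_reconnections_total",
		Help: "Total number of cluster peer reconnection attempts.",
	})
	reconnectionsFailedTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_reconnections_failed_total",
		Help: "Total number of failed cluster peer reconnection attempts.",
	})
)

func init() {
	prometheus.MustRegister(reconnectionsTotal, reconnectionsFailedTotal)
}

Every attempt increments the total counter and a failed attempt additionally increments the failed counter, so the success rate can be derived as total minus failed.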

	peers: map[string]peer{},
}

if reconnectInterval != 0 {
Member:

It could be done at the very end of the function once it is certain that it will return without error.

if n > 0 {
	go p.warnIfAlone(l, 10*time.Second)
}
p.setInitialFailed(resolvedPeers)
Member:

IMO it would be simpler to initialize p.failedPeers with all known peers before calling ml.Join(...).

@stuartnelson3 (Contributor, Author):

that's WAAAY better :)
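
A minimal sketch of that simpler approach, assuming illustrative field names: seed failedPeers with every resolved peer before calling ml.Join, and let the join callback clear entries (by address, as in the earlier sketch) for peers that actually answer.

package cluster

import (
	"sync"
	"time"
)

// Illustrative fields and signature; the merged code may differ.
type Peer struct {
	peerLock    sync.Mutex
	failedPeers map[string]time.Time // address -> first time seen as failed
}

// setInitialFailed seeds the failed-peer set with every resolved peer
// before ml.Join runs. Peers that answer the join are then cleared by the
// join callback, leaving only the genuinely unreachable ones for the
// reconnect loop to retry.
func (p *Peer) setInitialFailed(addrs []string) {
	p.peerLock.Lock()
	defer p.peerLock.Unlock()

	now := time.Now()
	for _, addr := range addrs {
		p.failedPeers[addr] = now
	}
}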

pr, ok := p.peers[n.Address()]
if !ok {
	// Why are we receiving an update from a node that never
	// joined?
Member:

Could it be that the process has restarted and receives an out-of-band notification?

Signed-off-by: stuart nelson <[email protected]>
@stuartnelson3 (Contributor, Author):

@simonpasquier I think I've addressed all your comments; let me know what you think.

@stuartnelson3 (Contributor, Author):

Going to merge this so we can get another RC out.

@stuartnelson3 merged commit db4af95 into master on Jun 5, 2018
@stuartnelson3 deleted the stn/memberlist-reconnect branch on June 5, 2018
@mxinden (Member) commented on Jun 5, 2018:

@stuartnelson3 Let me know if you want me to follow up with #1363 and #1364.

@stuartnelson3 (Contributor, Author) commented on Jun 5, 2018:

As soon as I feel confident about #1389, let's push this out. That's the only thing (in my mind) blocking 0.15.0.

There were issues with message queueing (we were generating more messages than could be gossiped) that I need to resolve; I'll be working on it tomorrow.

EDIT: Thanks for your patience, I know this has been a realllllllllly long release cycle
