Consistent-hash based broadcaster #26
base: master
Conversation
In order for FastPaxos to also have its own set of settings, MembershipService needs to be passed a common Settings class that implements both FastPaxos.ISettings and MembershipService.ISettings.
It can happen that a node misses part of the consensus messages while still being able to send out its own vote (unidirectional network partition, message overload, ...). In this case, the rest of the group will see this node as being part of the group and the monitoring mechanism will still work as expected, but the stale node will run an old configuration. In order to enforce consistency in this case, the following new anti-entropy mechanism is used:
- each node maintains a set of configurations it has been part of
- probe messages now contain the configuration ID of the observer
- when a node receives a probe message with a configuration ID it does not know, it starts a background task to check again after a configured timeout (1 minute by default)
- if the configuration ID is still unknown after the timeout has elapsed, the node leaves (using the LEAVE protocol, sketched below)
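A minimal sketch of the recheck step described above, using hypothetical names (`knownConfigurations`, `leaveGracefully()`) rather than the actual classes touched by this PR:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: names and structure are illustrative, not the classes from this PR.
class StaleConfigurationDetector {
    // Configurations this node has been part of (hypothetical field).
    private final Set<Long> knownConfigurations = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long recheckTimeoutMinutes = 1; // 1 minute by default

    // Called when a probe message arrives carrying the observer's configuration ID.
    void onProbe(long observerConfigurationId) {
        if (knownConfigurations.contains(observerConfigurationId)) {
            return; // Nothing to do: we have seen this configuration.
        }
        // Unknown configuration: check again after the timeout, and leave if it is still unknown.
        scheduler.schedule(() -> {
            if (!knownConfigurations.contains(observerConfigurationId)) {
                leaveGracefully(); // e.g. trigger the LEAVE protocol
            }
        }, recheckTimeoutMinutes, TimeUnit.MINUTES);
    }

    private void leaveGracefully() {
        // Placeholder for initiating the LEAVE protocol.
    }
}
```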
Instead of a two-phase protocol, joins now happen in one phase, in which the seed node broadcasts K alerts (one for each observer). This should drastically reduce the number of messages exchanged in the cluster when a large number of nodes join.
Clients cannot join through two separate seeds at the same time
In the rare case that a node attempts to join at the exact same time that a view change is in progress, it is now told to retry
Broadcasting nodes are discovered automatically via node metadata. Fast Paxos votes are batched together. Currently, message ordering is _not_ preserved.
- Introduce a consistent hash implementation
- Implement ConsistentHashBroadcaster
- Adjust the broadcaster API to enable the new broadcasting strategy
- Batch Fast Paxos vote messages (see the sketch after this list)
- Only log at TRACE level if TRACE is enabled
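A rough illustration of the vote batching mentioned above, with hypothetical types and methods (`pendingVotes`, `sendBatch`); the actual PR batches rapid's Fast Paxos messages, which are not reproduced here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: batches outgoing votes and flushes them periodically or when the batch is full.
class VoteBatcher {
    private static final int MAX_BATCH_SIZE = 100;     // illustrative threshold
    private static final long FLUSH_INTERVAL_MS = 50;  // illustrative flush period

    private final List<String> pendingVotes = new ArrayList<>(); // hypothetical vote payloads
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    VoteBatcher() {
        flusher.scheduleAtFixedRate(this::flush, FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    synchronized void enqueue(String vote) {
        pendingVotes.add(vote);
        if (pendingVotes.size() >= MAX_BATCH_SIZE) {
            flush();
        }
    }

    synchronized void flush() {
        if (pendingVotes.isEmpty()) {
            return;
        }
        sendBatch(new ArrayList<>(pendingVotes)); // hypothetical network call
        pendingVotes.clear();
    }

    private void sendBatch(List<String> batch) {
        // Placeholder: broadcast the batched votes in a single message.
    }
}
```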
I assume this patch mixes in the ByteString changeset, which is making it somewhat bloated. I left some feedback on the broadcaster implementation.
 *
 * https://tom-e-white.com/2007/11/consistent-hashing.html
 */
public class ConsistentHash<T> {
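For context, the article linked in the Javadoc describes a ring built on a sorted map. A minimal sketch of that pattern (not necessarily the exact class in this diff; the hash function, key type, and replica count are illustrative choices):

```java
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a TreeMap-based consistent hash ring, in the spirit of the linked article.
class SimpleConsistentHash<T> {
    private final SortedMap<Integer, T> ring = new TreeMap<>();
    private final int replicas;

    SimpleConsistentHash(int replicas, Collection<T> nodes) {
        this.replicas = replicas;
        nodes.forEach(this::add);
    }

    void add(T node) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(node.toString() + i), node);
        }
    }

    void remove(T node) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(node.toString() + i));
        }
    }

    // Returns the node responsible for the given key: the first ring entry at or after its hash.
    T get(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        SortedMap<Integer, T> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        return s.hashCode(); // illustrative; a stronger hash (e.g. MD5 or Murmur) is typical
    }
}
```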
What is the rationale for adding another consistent hash based ring implementation? Wouldn't any of the K rings in the MembershipView data structure suffice?
 *
 */
@ExperimentalApi
public class ConsistentHashBroadcaster implements IBroadcaster {
The broadcaster implementation looks like a hierarchical broadcaster, where the broadcaster nodes organize themselves along a ring. I'm not sure why the ring is required given that the implementation requires broadcasters to have a special field set in their metadata (it assumes configuration)?
The main concern however is: what happens if the number of broadcasters is less than N/4 (or N/2)? What are the resiliency properties here?
I assumed where this was going was that you'd use the 0th ring in MembershipView to deterministically organize the broadcasters. For example, deterministically pick B% out of N nodes from the 0th ring to be broadcasters and assign the remaining nodes as broadcastees (again deterministically) to a constant number of broadcasters for redundancy (so that even if a number of nodes fail, we can still make progress). Doing so has the advantage that the set of broadcasters in every membership round rotates.
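A hedged sketch of the selection scheme described above, assuming the 0th ring is available as an ordered list of node endpoints (types, names, and parameters here are illustrative, not rapid's API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: deterministically pick a fraction of the 0th ring as broadcasters and assign the
// remaining nodes to a constant number of broadcasters each, for redundancy.
class BroadcasterSelection {
    // ringZero: the nodes of the 0th ring, in ring order (identical on every node).
    static Map<String, List<String>> assign(List<String> ringZero,
                                            double broadcasterFraction,
                                            int redundancy) {
        int numBroadcasters = Math.max(1, (int) Math.ceil(ringZero.size() * broadcasterFraction));
        List<String> broadcasters = new ArrayList<>();
        // Spread broadcasters evenly along the ring, so the choice is deterministic and rotates
        // as the ring changes between membership rounds.
        for (int i = 0; i < numBroadcasters; i++) {
            broadcasters.add(ringZero.get(i * ringZero.size() / numBroadcasters));
        }

        Map<String, List<String>> broadcastees = new HashMap<>();
        broadcasters.forEach(b -> broadcastees.put(b, new ArrayList<>()));

        // Assign every non-broadcaster node to `redundancy` broadcasters, again deterministically.
        for (int i = 0; i < ringZero.size(); i++) {
            String node = ringZero.get(i);
            if (broadcasters.contains(node)) {
                continue;
            }
            for (int r = 0; r < redundancy; r++) {
                broadcastees.get(broadcasters.get((i + r) % numBroadcasters)).add(node);
            }
        }
        return broadcastees;
    }
}
```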
Note that the expander graph has strong connectivity properties, which make it so that you can simply broadcast along it to get most of the benefits with less complexity. It also bounds the number of connections on the graph to 2K per node.
This is a hierarchical approach indeed - it isn't the gossip-based variant from the paper.
The reason is that the broadcaster nodes are deployed on more powerful (and expensive) instances that are able to handle more concurrent connections.
I'm entirely aware of the resiliency concerns and properties of this approach.
Hi @manuelbernhardt: I'd like to merge the API changes in rapid that you need to build such a broadcaster (for example, the changes to …
@lalithsuresh yes, the changes to the …
@manuelbernhardt yes, a PR with only those changes would be great. Thanks!
This PR is based on the `simplify-join` and `anti-entropy` branches. It introduces an experimental broadcasting mechanism based on consistent hashing of recipients. This approach allows scaling to a large number of nodes (at least 10000) under the right conditions. Note, however, that this strategy does not preserve message ordering, which is an issue because the protocol expects alerts to be received before consensus messages (there is no re-ordering mechanism in place).
There is a change to the `IBroadcaster` API related to the performance of updating the list of recipients when many nodes are part of the cluster (it would be wasteful to pass in the entire membership list rather than add / remove the affected nodes for each round).
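To illustrate the kind of change described above (this is not the actual `IBroadcaster` signature in rapid, which is not reproduced here), an incremental style of recipient management might look like:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: instead of replacing the full recipient list every round, the broadcaster is told
// only about the nodes that joined or left. Names are illustrative, not rapid's actual API.
interface IncrementalRecipients {
    void onNodeAdded(String endpoint);
    void onNodeRemoved(String endpoint);
}

class IncrementalRecipientSet implements IncrementalRecipients {
    private final Set<String> recipients = ConcurrentHashMap.newKeySet();

    @Override
    public void onNodeAdded(String endpoint) {
        recipients.add(endpoint); // O(1) per affected node instead of O(N) per round
    }

    @Override
    public void onNodeRemoved(String endpoint) {
        recipients.remove(endpoint);
    }

    Set<String> currentRecipients() {
        return recipients;
    }
}
```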