Consistent-hash based broadcaster #26
base: master
Conversation
In order for FastPaxos to also have its own set of settings, MembershipService needs to be passed a common Settings class that implements both FastPaxos.ISettings and MembershipService.ISettings.
It can happen that a node misses part of the consensus messages while still being able to send out its own vote (unidirectional network partition, message overload, ...). In this case, the rest of the group will see this node as being part of the group and the monitoring mechanism will still work as expected, but the stale node will run an old configuration. In order to enforce consistency in this case, the following new anti-entropy mechanism is used:
- each node maintains a set of configurations it has been part of
- probe messages now contain the configuration ID of the observer
- when a node receives a probe message with a configuration ID it does not know, it starts a background task to check again after a configured timeout (1 minute by default)
- if the configuration ID is still unknown after the timeout has elapsed, the node leaves (using the LEAVE protocol, sketched below)
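A minimal sketch of the recheck step described above, using hypothetical names (`knownConfigurations`, `leaveGracefully()`) rather than the actual classes touched by this PR:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: names and structure are illustrative, not the classes from this PR.
class StaleConfigurationDetector {
    // Configurations this node has been part of (hypothetical field).
    private final Set<Long> knownConfigurations = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long recheckTimeoutMinutes = 1; // 1 minute by default

    // Called when a probe message arrives carrying the observer's configuration ID.
    void onProbe(long observerConfigurationId) {
        if (knownConfigurations.contains(observerConfigurationId)) {
            return; // Nothing to do: we have seen this configuration.
        }
        // Unknown configuration: check again after the timeout, and leave if it is still unknown.
        scheduler.schedule(() -> {
            if (!knownConfigurations.contains(observerConfigurationId)) {
                leaveGracefully(); // e.g. trigger the LEAVE protocol
            }
        }, recheckTimeoutMinutes, TimeUnit.MINUTES);
    }

    private void leaveGracefully() {
        // Placeholder for initiating the LEAVE protocol.
    }
}
```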
Instead of a two-phase protocol, joins now happen in one phase, in which the seed node broadcasts K alerts (one for each observer). This should drastically reduce the number of messages exchanged in the cluster when a large number of nodes join.
Clients cannot join through two separate seeds at the same time
In the rare case that a node attempts to join at the exact same time that a view change is in progress, it is now told to retry
Broadcasting nodes are discovered automatically via node metadata. Fast Paxos votes are batched together. Currently, message ordering is _not_ preserved.
- Introduce a consistent hash implementation
- Implement ConsistentHashBroadcaster
- Adjust the broadcaster API to enable the new broadcasting strategy
- Batch Fast Paxos vote messages (see the sketch after this list)
- Only log at TRACE level if TRACE is enabled
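A rough illustration of the vote batching mentioned above, with hypothetical types and methods (`pendingVotes`, `sendBatch`); the actual PR batches rapid's Fast Paxos messages, which are not reproduced here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: batches outgoing votes and flushes them periodically or when the batch is full.
class VoteBatcher {
    private static final int MAX_BATCH_SIZE = 100;     // illustrative threshold
    private static final long FLUSH_INTERVAL_MS = 50;  // illustrative flush period

    private final List<String> pendingVotes = new ArrayList<>(); // hypothetical vote payloads
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    VoteBatcher() {
        flusher.scheduleAtFixedRate(this::flush, FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    synchronized void enqueue(String vote) {
        pendingVotes.add(vote);
        if (pendingVotes.size() >= MAX_BATCH_SIZE) {
            flush();
        }
    }

    synchronized void flush() {
        if (pendingVotes.isEmpty()) {
            return;
        }
        sendBatch(new ArrayList<>(pendingVotes)); // hypothetical network call
        pendingVotes.clear();
    }

    private void sendBatch(List<String> batch) {
        // Placeholder: broadcast the batched votes in a single message.
    }
}
```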
I assume this patch mixes in the ByteString changeset, which is making it somewhat bloated. I left some feedback on the broadcaster implementation.
 *
 * https://tom-e-white.com/2007/11/consistent-hashing.html
 */
public class ConsistentHash<T> {
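For context, the article linked in the Javadoc describes a ring built on a sorted map. A minimal sketch of that pattern (not necessarily the exact class in this diff; the hash function, key type, and replica count are illustrative choices):

```java
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a TreeMap-based consistent hash ring, in the spirit of the linked article.
class SimpleConsistentHash<T> {
    private final SortedMap<Integer, T> ring = new TreeMap<>();
    private final int replicas;

    SimpleConsistentHash(int replicas, Collection<T> nodes) {
        this.replicas = replicas;
        nodes.forEach(this::add);
    }

    void add(T node) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(node.toString() + i), node);
        }
    }

    void remove(T node) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(node.toString() + i));
        }
    }

    // Returns the node responsible for the given key: the first ring entry at or after its hash.
    T get(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        SortedMap<Integer, T> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        return s.hashCode(); // illustrative; a stronger hash (e.g. MD5 or Murmur) is typical
    }
}
```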
What is the rationale for adding another consistent hash based ring implementation? Wouldn't any of the K rings in the MembershipView data structure suffice?
 *
 */
@ExperimentalApi
public class ConsistentHashBroadcaster implements IBroadcaster {
The broadcaster implementation looks like a hierarchical broadcaster, where the broadcaster nodes organize themselves along a ring. I'm not sure why the ring is required given that the implementation requires broadcasters to have a special field set in their metadata (it assumes configuration)?
The main concern however is: what happens if the number of broadcasters is less than N/4 (or N/2)? What are the resiliency properties here?
I assumed where this was going was that you'd use the 0th ring in MembershipView to deterministically organize the broadcasters. For example, deterministically pick B% out of N nodes from the 0th ring to be broadcasters and assign the remaining nodes as broadcastees (again deterministically) to a constant number of broadcasters for redundancy (so that even if a number of nodes fail, we can still make progress). Doing so has the advantage that the set of broadcasters in every membership round rotates.
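A hedged sketch of the selection scheme described above, assuming the 0th ring is available as an ordered list of node endpoints (types, names, and parameters here are illustrative, not rapid's API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: deterministically pick a fraction of the 0th ring as broadcasters and assign the
// remaining nodes to a constant number of broadcasters each, for redundancy.
class BroadcasterSelection {
    // ringZero: the nodes of the 0th ring, in ring order (identical on every node).
    static Map<String, List<String>> assign(List<String> ringZero,
                                            double broadcasterFraction,
                                            int redundancy) {
        int numBroadcasters = Math.max(1, (int) Math.ceil(ringZero.size() * broadcasterFraction));
        List<String> broadcasters = new ArrayList<>();
        // Spread broadcasters evenly along the ring, so the choice is deterministic and rotates
        // as the ring changes between membership rounds.
        for (int i = 0; i < numBroadcasters; i++) {
            broadcasters.add(ringZero.get(i * ringZero.size() / numBroadcasters));
        }

        Map<String, List<String>> broadcastees = new HashMap<>();
        broadcasters.forEach(b -> broadcastees.put(b, new ArrayList<>()));

        // Assign every non-broadcaster node to `redundancy` broadcasters, again deterministically.
        for (int i = 0; i < ringZero.size(); i++) {
            String node = ringZero.get(i);
            if (broadcasters.contains(node)) {
                continue;
            }
            for (int r = 0; r < redundancy; r++) {
                broadcastees.get(broadcasters.get((i + r) % numBroadcasters)).add(node);
            }
        }
        return broadcastees;
    }
}
```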
Note that the expander graph has strong connectivity properties, which make it so that you can simply broadcast along it to get most of the benefits with less complexity. It also bounds the number of connections on the graph to 2K per node.
This is a hierarchical approach indeed - it isn't the gossip-based variant from the paper.
The reason is that the broadcaster nodes are deployed on more powerful (and expensive) instances that are able to handle more concurrent connections.
I'm entirely aware of the resiliency concerns and properties of this approach.
Hi @manuelbernhardt: I'd like to merge the API changes in rapid that you need to build such a broadcaster (for example, the changes to …
@lalithsuresh yes, the changes to the …
@manuelbernhardt yes, a PR with only those changes would be great. Thanks!
This PR is based on the `simplify-join` and `anti-entropy` branches. It introduces an experimental broadcasting mechanism based on consistent hashing of recipients. This approach allows scaling to a large number of nodes (at least 10000) under the right conditions. Note, however, that this strategy does not preserve message ordering, which is an issue because the protocol expects alerts to be received before consensus messages (there is no re-ordering mechanism in place).
There is a change to the `IBroadcaster` API related to the performance of updating the list of recipients when many nodes are part of the cluster (it would be wasteful to pass in the entire membership list rather than add / remove the affected nodes for each round).
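To illustrate the kind of change described above (this is not the actual `IBroadcaster` signature in rapid, which is not reproduced here), an incremental style of recipient management might look like:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: instead of replacing the full recipient list every round, the broadcaster is told
// only about the nodes that joined or left. Names are illustrative, not rapid's actual API.
interface IncrementalRecipients {
    void onNodeAdded(String endpoint);
    void onNodeRemoved(String endpoint);
}

class IncrementalRecipientSet implements IncrementalRecipients {
    private final Set<String> recipients = ConcurrentHashMap.newKeySet();

    @Override
    public void onNodeAdded(String endpoint) {
        recipients.add(endpoint); // O(1) per affected node instead of O(N) per round
    }

    @Override
    public void onNodeRemoved(String endpoint) {
        recipients.remove(endpoint);
    }

    Set<String> currentRecipients() {
        return recipients;
    }
}
```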