
Consistent-hash based broadcaster #26

Draft: wants to merge 10 commits into master

Conversation

manuelbernhardt (Contributor) commented:

This PR is based on the simplify-join and anti-entropy branches.

It introduces an experimental broadcasting mechanism based on consistent hashing of recipients. This approach makes it possible to scale to a large number of nodes (at least 10,000), under the right conditions: this strategy does not preserve message ordering, which is an issue because the protocol expects alerts to be received before consensus messages (there is no re-ordering mechanism in place).

There is a change to the IBroadcaster API related to the performance of updating the list of recipients when many nodes are part of the cluster: it would be wasteful to pass in the entire membership list each round rather than add / remove only the affected nodes.
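
A minimal sketch of what such an incremental API could look like; the method names and signatures here are assumptions for illustration, not the actual diff:

```java
// Hypothetical sketch of an incremental recipient-update API for IBroadcaster.
// Endpoint and RapidRequest stand in for Rapid's existing message types; the
// incremental methods are illustrative and may not match the PR.
public interface IBroadcaster {
    /** Broadcast a message to all current recipients. */
    void broadcast(RapidRequest request);

    /** Apply a view change incrementally instead of replacing the whole
     *  recipient list, avoiding O(N) updates per round in large clusters. */
    void onNodeAdded(Endpoint node);
    void onNodeRemoved(Endpoint node);
}
```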

In order for FastPaxos to also have its own set of settings, MembershipService needs to be passed a common Settings class that implements both FastPaxos.ISettings and MembershipService.ISettings.
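
A minimal sketch of such a shared settings class, assuming both interfaces expose plain getters (the getter names below are illustrative):

```java
// Hypothetical sketch: a single Settings object satisfying both interfaces,
// so one instance can be handed to MembershipService and FastPaxos alike.
// The getter names are assumptions for illustration.
public class SharedSettings implements FastPaxos.ISettings, MembershipService.ISettings {
    @Override
    public long getConsensusFallbackTimeoutBaseDelayInMs() {
        return 1_000; // base delay before falling back to classic consensus
    }

    @Override
    public int getFailureDetectorIntervalInMs() {
        return 1_000; // probe interval used by the failure detector
    }
}
```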
It can happen that a node misses part of the consensus messages while still being able to send out its own vote (unidirectional network partition, message overload, ...). In this case, the rest of the group will see this node as part of the group and the monitoring mechanism will keep working as expected, but the stale node will run an old configuration.

In order to enforce consistency in this case, the following new anti-entropy mechanism is used (a sketch follows the list):

- each node maintains a set of configurations it has been part of
- probe messages now contain the configuration ID of the observer
- when a node receives a probe message with a configuration ID it does not know, it will start a background task to check again after a configured timeout (1 minute by default)
- if the configuration ID is still unknown once the timeout has elapsed, the node leaves (using the LEAVE protocol)
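
A rough sketch of that background check, assuming a scheduled executor and a leave() helper that triggers the LEAVE protocol (all names here are hypothetical):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the anti-entropy check described above;
// names and scheduling details are assumptions, not the PR's code.
public class StaleConfigurationDetector {
    private final Set<Long> knownConfigurationIds = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Record every configuration this node has been part of. */
    public void onConfigurationApplied(long configurationId) {
        knownConfigurationIds.add(configurationId);
    }

    /** Called when a probe arrives carrying the observer's configuration ID. */
    public void onProbe(long observerConfigurationId) {
        if (!knownConfigurationIds.contains(observerConfigurationId)) {
            // Re-check after the configured timeout (1 minute by default);
            // if the configuration is still unknown, this node is stale.
            scheduler.schedule(() -> {
                if (!knownConfigurationIds.contains(observerConfigurationId)) {
                    leave(); // hypothetical: initiates the LEAVE protocol
                }
            }, 1, TimeUnit.MINUTES);
        }
    }

    private void leave() {
        // Hand off to the membership service's LEAVE path; omitted here.
    }
}
```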
- Instead of a two-phase protocol, joins now happen in one phase, where the seed node broadcasts K alerts (one per observer). This should drastically reduce the number of messages exchanged in the cluster when a large number of nodes join.
- Clients cannot join through two separate seeds at the same time.
- In the rare case that a node attempts to join at the exact moment a view change is in progress, it is now told to retry.
- Broadcasting nodes are discovered automatically via node metadata.
- Fast Paxos votes are batched together (a sketch follows this list).
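
A minimal sketch of vote batching, assuming votes are buffered briefly and flushed as a single message (all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of batching Fast Paxos votes; trades a little latency
// for far fewer messages on the wire. Names do not match the PR's code.
public class VoteBatcher {
    private final List<Vote> pending = new ArrayList<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public VoteBatcher() {
        // Flush the accumulated votes at a fixed interval as one message.
        scheduler.scheduleAtFixedRate(this::flush, 100, 100, TimeUnit.MILLISECONDS);
    }

    public synchronized void enqueue(Vote vote) {
        pending.add(vote);
    }

    private synchronized void flush() {
        if (!pending.isEmpty()) {
            broadcastBatch(new ArrayList<>(pending)); // hypothetical send helper
            pending.clear();
        }
    }

    private void broadcastBatch(List<Vote> batch) {
        // Serialize the batch into a single broadcast message; omitted here.
    }

    /** Placeholder for the vote payload type. */
    public static final class Vote {}
}
```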

Currently, message ordering is _not_ preserved

- Introduce Consistent Hash implementation
- Implement ConsistentHashBroadcaster
- Adjust the broadcaster API to enable new broadcasting strategy
- Batching fast paxos vote messages
- Only log TRACE if TRACE enabled
@lalithsuresh (Owner) left a comment:

I assume this patch mixes in the ByteString changeset, which is making it somewhat bloated. I left some feedback on the broadcaster implementation.

*
* https://tom-e-white.com/2007/11/consistent-hashing.html
*/
public class ConsistentHash<T> {
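
For context, the classic ring described in the linked article keys hashed (virtual) nodes in a sorted map; a minimal sketch of that technique, not the implementation under review:

```java
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// Minimal consistent-hash ring in the style of the linked article;
// a sketch for context, not the PR's actual implementation.
public class ConsistentHashRing<T> {
    private final ToIntFunction<Object> hashFunction;
    private final int numberOfReplicas; // virtual nodes per real node
    private final SortedMap<Integer, T> circle = new TreeMap<>();

    public ConsistentHashRing(ToIntFunction<Object> hashFunction,
                              int numberOfReplicas, Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;
        nodes.forEach(this::add);
    }

    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunction.applyAsInt(node.toString() + i), node);
        }
    }

    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunction.applyAsInt(node.toString() + i));
        }
    }

    /** Map a key to the first node clockwise from its hash on the ring. */
    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        int hash = hashFunction.applyAsInt(key);
        SortedMap<Integer, T> tail = circle.tailMap(hash);
        return circle.get(tail.isEmpty() ? circle.firstKey() : tail.firstKey());
    }
}
```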
@lalithsuresh (Owner) commented:

What is the rationale for adding another consistent hash based ring implementation? Wouldn't any of the K rings in the MembershipView data structure suffice?

*
*/
@ExperimentalApi
public class ConsistentHashBroadcaster implements IBroadcaster {
@lalithsuresh (Owner) commented on May 13, 2020:

The broadcaster implementation looks like a hierarchical broadcaster, where the broadcaster nodes organize themselves along a ring. I'm not sure why the ring is required given that the implementation requires broadcasters to have a special field set in their metadata (it assumes configuration)?

The main concern however is: what happens if the number of broadcasters is less than N/4 (or N/2)? What are the resiliency properties here?

I assumed where this was going was that you'd use the 0th ring in MembershipView to deterministically organize the broadcasters. For example, deterministically pick B% out of N nodes from the 0th ring to be broadcasters and assign the remaining nodes as broadcastees (again deterministically) to a constant number of broadcasters for redundancy (so that even if a number of nodes fail, we can still make progress). Doing so has the advantage that the set of broadcasters in every membership round rotates.
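
A rough sketch of the suggested scheme, assuming the 0th ring is available as a deterministically ordered list (all names and parameters here are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: deterministically pick B% of the 0th ring as
// broadcasters and assign every remaining node to R of them for redundancy.
public final class BroadcasterSelection {
    public static <T> Map<T, List<T>> assign(List<T> ring,
                                             int percentBroadcasters,
                                             int redundancy) {
        int numBroadcasters = Math.max(1, ring.size() * percentBroadcasters / 100);
        List<T> broadcasters = new ArrayList<>(ring.subList(0, numBroadcasters));
        Map<T, List<T>> assignment = new HashMap<>();
        broadcasters.forEach(b -> assignment.put(b, new ArrayList<>()));
        // Round-robin the remaining nodes onto `redundancy` distinct
        // broadcasters; since the ring order is deterministic, every member
        // computes the same assignment, and it rotates as the ring changes.
        List<T> broadcastees = ring.subList(numBroadcasters, ring.size());
        int fanout = Math.min(redundancy, broadcasters.size());
        for (int i = 0; i < broadcastees.size(); i++) {
            for (int r = 0; r < fanout; r++) {
                assignment.get(broadcasters.get((i + r) % broadcasters.size()))
                          .add(broadcastees.get(i));
            }
        }
        return assignment;
    }
}
```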

Note that the expander graph has strong connectivity properties, which make it so that you can simply broadcast along it to get most of the benefits with less complexity. It also bounds the number of connections on the graph to 2K per node.

@manuelbernhardt (Contributor, Author) replied:

This is a hierarchical approach indeed - it isn't the gossip-based variant from the paper.
The reason is that the broadcaster nodes are deployed on more powerful (and expensive) instances that are able to handle more concurrent connections.

I'm entirely aware of the resiliency concerns and properties of this approach.

@lalithsuresh (Owner):

Hi @manuelbernhardt: I'd like to merge the API changes in rapid that you need to build such a broadcaster (for example, the changes to IBroadcaster and in the surrounding code). Given the very specific nature of the broadcaster itself (resiliency assumptions and the out-of-band configuration requirement), it might make more sense to keep the broadcaster itself outside Rapid for now.

@manuelbernhardt (Contributor, Author):

@lalithsuresh yes, the changes to the IBroadcaster API would make sense to have in Rapid / master. I could try to make a PR that only includes those changes.

@lalithsuresh (Owner):

@manuelbernhardt yes, a PR with only those changes would be great. Thanks!
