Multiraft Implementation #20

Closed
andybons opened this issue Jun 3, 2014 · 13 comments

andybons (Contributor) commented Jun 3, 2014

No description provided.

philips (Contributor) commented Jul 23, 2014

What do you mean by multiraft? We have a new raft implementation here which might be of interest: etcd-io/etcd#874

bdarnell (Contributor) commented:

Multiraft is the as-yet-unfinished raft implementation in https://github.com/cockroachdb/cockroach/tree/master/multiraft. It differs from most existing raft implementations in that it optimizes for the case where each server is a member of many (partially overlapping) consensus groups by e.g. consolidating heartbeats per pair of nodes.
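Below is a minimal sketch of that idea, with hypothetical names (NodeID, HeartbeatCoalescer, and the send callback are illustrative, not the actual multiraft code): many raft groups sharing the same pair of nodes funnel their heartbeats through one per-pair rate limiter, so at most one heartbeat per node pair goes out per interval.

```go
// Hypothetical sketch of per-node-pair heartbeat coalescing; not the
// actual multiraft implementation.
package coalesce

import (
	"sync"
	"time"
)

type NodeID int

// HeartbeatCoalescer sends at most one heartbeat per remote node per
// interval, regardless of how many raft groups the two nodes share.
type HeartbeatCoalescer struct {
	mu       sync.Mutex
	lastSent map[NodeID]time.Time
	interval time.Duration
	send     func(to NodeID) // underlying transport, assumed to exist
}

// MaybeHeartbeat is called on behalf of any raft group that wants to
// heartbeat `to`; it suppresses the send if that node was contacted recently.
func (h *HeartbeatCoalescer) MaybeHeartbeat(to NodeID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	now := time.Now()
	if now.Sub(h.lastSent[to]) < h.interval {
		return // another group already heartbeated this node recently
	}
	h.lastSent[to] = now
	h.send(to)
}
```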

philips (Contributor) commented Jul 23, 2014

@bdarnell It would be great if we could work on a shared raft implementation with this feature in mind.

/cc @bmizerany @xiangli @unihorn

bmizerany commented:

Agreed. Please let me know what we can do to work together, if possible.

xiang90 (Contributor) commented Jul 23, 2014

@bdarnell Why do you want to have overlapping raft clusters rather than well-partitioned ones?
Consolidating heartbeats can only happen if one machine is the leader of more than two raft clusters and they have a significant number of overlapping nodes, right?

spencerkimball (Member) commented:

@xiangli, "overlapped" in this context means that each node in the cluster is likely participating in many raft consensus groups with any other given node.

With cockroach, we cache rpc connections between nodes and use the periodic heartbeat to compute clock skew and link latency. Individual raft consensus groups communicate over these established links as necessary, but they don't originate the heartbeats themselves. Followers in consensus groups examine the health of connections (i.e. whether or not a heartbeat was received within the last heartbeat interval) after their timeouts expire to determine whether to become candidates.
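A rough sketch of that follower-side check, with hypothetical names (the real cockroach code differs): when a group's election timeout fires, the follower consults the shared node-level connection's last-heartbeat timestamp before becoming a candidate.

```go
// Hypothetical sketch of the follower-side check: only campaign if the
// shared node-level connection has actually gone quiet.
package health

import "time"

type Conn struct {
	LastHeartbeat time.Time // updated by the node-level heartbeat loop
}

// ShouldCampaign is called when a raft group's election timeout expires.
// If the connection to the leader's node received a heartbeat within the
// last heartbeat interval, the link is healthy and the leader is presumed
// alive, so the follower stays a follower.
func ShouldCampaign(leaderConn *Conn, heartbeatInterval time.Duration) bool {
	return time.Since(leaderConn.LastHeartbeat) > heartbeatInterval
}
```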


xiang90 (Contributor) commented Jul 23, 2014

@spencerkimball That sounds like exactly what we did in our new raft. We do not trigger heartbeats or elections inside raft, and there is no network layer inside raft. You can send virtual heartbeats from any source you like.
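For reference, a minimal loop driving etcd's raft from the outside, as a single-node cluster for brevity. The API shown is the later public go.etcd.io/etcd/raft/v3 interface, which may differ in detail from the 2014 code discussed in this thread; the point is that ticks and message delivery are supplied by the application.

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

func main() {
	// raft owns no timers and no network: the application supplies ticks
	// and ships the messages that come out of Ready().
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10, // election timeout, in ticks
		HeartbeatTick:   1,  // heartbeat interval, in ticks
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Logical time advances only when the caller says so; this is
			// the hook that lets an external, node-level heartbeat scheme
			// drive election timing.
			n.Tick()
		case rd := <-n.Ready():
			// Persist hard state and entries, then hand rd.Messages to
			// whatever transport the application chooses.
			if !raft.IsEmptyHardState(rd.HardState) {
				if err := storage.SetHardState(rd.HardState); err != nil {
					log.Fatal(err)
				}
			}
			if err := storage.Append(rd.Entries); err != nil {
				log.Fatal(err)
			}
			n.Advance()
		}
	}
}
```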

bdarnell (Contributor) commented:

Thanks for the interest; I'll definitely take a closer look at etcd's raft implementation (hashicorp's is on my list too). It would be great to collaborate on a single high-quality implementation.

philips (Contributor) commented Sep 15, 2014

@bdarnell An update on this. We have merged our raft implementation and the separate WAL that is used by etcd over here: http://godoc.org/github.com/coreos/etcd/raft and here http://godoc.org/github.com/coreos/etcd/wal

It would be great to get some feedback and perhaps work together.

bdarnell (Contributor) commented:

@philips Thanks for letting me know the etcd raft implementation has been merged. I've looked it over and I like a lot of the abstractions (especially the way you batch up different kinds of updates in a single Ready struct; it took me a while to convince myself that that was safe, but I think it's simpler than having separate update channels for different types of events). That said, there are a few big things we need that aren't there yet:

  1. Online membership change. It looks like the only way to change the membership of the cluster is to stop and restart all the nodes; we need the ability to add and remove nodes on the fly for rebalancing, etc.
  2. Storing raftLog state on disk. We can't afford to keep full snapshots in memory, and we probably can't even afford all log entries since the last snapshot. We'd like to be able to drop entries from the in-memory log as soon as they are applied, and have some sort of interface for raft to reach back out to the application to retrieve older log entries and snapshots as needed to catch up out-of-date peers (a rough interface sketch follows this comment).
  3. Coalesced heartbeats. We need to keep track of our last successful communication with each peer and only send heartbeats to those nodes we haven't heard from in a while (and we need to be able to update this per-peer timestamp from the outside so a response in one raft cluster can serve as a heartbeat for all other clusters involving that pair of nodes).

This adds up to a sizable amount of complexity, although at least some of it would be useful for etcd and other projects as well.
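On point 2, here is a hypothetical sketch of the kind of interface meant there (names are illustrative, not a real API; etcd's raft package later grew a similar Storage abstraction): raft keeps only recent entries in memory and reaches back to the application for older entries or a snapshot when catching up a slow peer.

```go
// Hypothetical log-storage interface: raft drops applied entries from
// memory and asks the application for older data when it needs it.
package storage

type LogStorage interface {
	// Entries returns log entries in [lo, hi), reading from disk if they
	// have already been dropped from the in-memory log.
	Entries(lo, hi uint64) ([]Entry, error)
	// Snapshot returns the most recent snapshot, used when a peer is so
	// far behind that the needed entries are no longer available.
	Snapshot() (Snapshot, error)
	// Compact tells the storage that entries up to index have been applied
	// and may be dropped from memory.
	Compact(index uint64) error
}

type Entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

type Snapshot struct {
	Index uint64
	Term  uint64
	Data  []byte
}
```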

philips (Contributor) commented Sep 16, 2014

@bdarnell We are absolutely looking at online membership change and plan on doing that in the next week or so. The other two are things we would like to do but haven't yet. Working together on either of those two would be great.

xiang90 (Contributor) commented Sep 16, 2014

@bdarnell

  1. We have the code ready; we just need to think through the interface.
  2. A storage/log interface is doable (you control how to get/truncate entries); right now it is just a wrapped slice.
  3. That is doable, and we have a plan.

bdarnell (Contributor) commented:

I'm closing this now that we've incorporated etcd's raft implementation; I'll open new issues for tracking the remaining work.

craig bot pushed a commit that referenced this issue Oct 30, 2018
32007: c-deps: bump RocksDB to pick up perf fix r=benesch a=petermattis

Pick up #20 which fixes a performance regression
that caused range tombstones to be added to some sstables unnecessarily
which in turn could cause compactions that are larger than necessary.

Revert the workaround to `TestRocksDBDeleteRangeCompaction` which was made
due to the now fixed bug.

Release note: None

Co-authored-by: Peter Mattis <[email protected]>