Multiraft Implementation #20

Closed
andybons opened this issue Jun 3, 2014 · 13 comments

andybons (Contributor) commented Jun 3, 2014

No description provided.

philips (Contributor) commented Jul 23, 2014

What do you mean by multiraft? We have a new raft implementation here which might be of interest: etcd-io/etcd#874

bdarnell (Contributor) commented:

Multiraft is the as-yet-unfinished raft implementation in https://github.com/cockroachdb/cockroach/tree/master/multiraft. It differs from most existing raft implementations in that it optimizes for the case where each server is a member of many (partially overlapping) consensus groups by e.g. consolidating heartbeats per pair of nodes.
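Below is a minimal sketch of that idea, with hypothetical names (NodeID, HeartbeatCoalescer, and the send callback are illustrative, not the actual multiraft code): many raft groups sharing the same pair of nodes funnel their heartbeats through one per-pair rate limiter, so at most one heartbeat per node pair goes out per interval.

```go
// Hypothetical sketch of per-node-pair heartbeat coalescing; not the
// actual multiraft implementation.
package coalesce

import (
	"sync"
	"time"
)

type NodeID int

// HeartbeatCoalescer sends at most one heartbeat per remote node per
// interval, regardless of how many raft groups the two nodes share.
type HeartbeatCoalescer struct {
	mu       sync.Mutex
	lastSent map[NodeID]time.Time
	interval time.Duration
	send     func(to NodeID) // underlying transport, assumed to exist
}

// MaybeHeartbeat is called on behalf of any raft group that wants to
// heartbeat `to`; it suppresses the send if that node was contacted recently.
func (h *HeartbeatCoalescer) MaybeHeartbeat(to NodeID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	now := time.Now()
	if now.Sub(h.lastSent[to]) < h.interval {
		return // another group already heartbeated this node recently
	}
	h.lastSent[to] = now
	h.send(to)
}
```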

philips (Contributor) commented Jul 23, 2014

@bdarnell It would be great if we could work on a shared raft implementation with this feature in mind.

/cc @bmizerany @xiangli @unihorn

bmizerany commented:

Agreed. Please let me know what we can do to work together, if possible.

xiang90 (Contributor) commented Jul 23, 2014

@bdarnell Why do you want to have overlapping raft clusters rather than well-partitioned ones?
Consolidating heartbeats can only happen if one machine is the leader of more than two raft clusters and they have a significant number of overlapping nodes, right?

spencerkimball (Member) commented:

@xiangli, "overlapped" in this context means that each node in the cluster is likely participating in many raft consensus groups with any other given node.

With cockroach, we cache rpc connections between nodes and use the periodic heartbeat to compute clock skew and link latency. Individual raft consensus groups communicate over these established links as necessary, but they don't originate the heartbeats themselves. Followers in consensus groups examine the health of connections (i.e. whether or not a heartbeat was received within the last heartbeat interval) after their timeouts expire to determine whether to become candidates.
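A rough sketch of that follower-side check, with hypothetical names (the real cockroach code differs): when a group's election timeout fires, the follower consults the shared node-level connection's last-heartbeat timestamp before becoming a candidate.

```go
// Hypothetical sketch of the follower-side check: only campaign if the
// shared node-level connection has actually gone quiet.
package health

import "time"

type Conn struct {
	LastHeartbeat time.Time // updated by the node-level heartbeat loop
}

// ShouldCampaign is called when a raft group's election timeout expires.
// If the connection to the leader's node received a heartbeat within the
// last heartbeat interval, the link is healthy and the leader is presumed
// alive, so the follower stays a follower.
func ShouldCampaign(leaderConn *Conn, heartbeatInterval time.Duration) bool {
	return time.Since(leaderConn.LastHeartbeat) > heartbeatInterval
}
```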


xiang90 (Contributor) commented Jul 23, 2014

@spencerkimball That sounds like exactly what we did in our new raft. We do not trigger heartbeats or elections inside raft, and there is no network layer inside raft. You can send virtual heartbeats from any source you like.
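For reference, a minimal loop driving etcd's raft from the outside, as a single-node cluster for brevity. The API shown is the later public go.etcd.io/etcd/raft/v3 interface, which may differ in detail from the 2014 code discussed in this thread; the point is that ticks and message delivery are supplied by the application.

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

func main() {
	// raft owns no timers and no network: the application supplies ticks
	// and ships the messages that come out of Ready().
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10, // election timeout, in ticks
		HeartbeatTick:   1,  // heartbeat interval, in ticks
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Logical time advances only when the caller says so; this is
			// the hook that lets an external, node-level heartbeat scheme
			// drive election timing.
			n.Tick()
		case rd := <-n.Ready():
			// Persist hard state and entries, then hand rd.Messages to
			// whatever transport the application chooses.
			if !raft.IsEmptyHardState(rd.HardState) {
				if err := storage.SetHardState(rd.HardState); err != nil {
					log.Fatal(err)
				}
			}
			if err := storage.Append(rd.Entries); err != nil {
				log.Fatal(err)
			}
			n.Advance()
		}
	}
}
```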

bdarnell (Contributor) commented:

Thanks for the interest; I'll definitely take a closer look at etcd's raft implementation (hashicorp's is on my list too). It would be great to collaborate on a single high-quality implementation.

philips (Contributor) commented Sep 15, 2014

@bdarnell An update on this. We have merged our raft implementation and the separate WAL that is used by etcd over here: http://godoc.org/github.com/coreos/etcd/raft and here http://godoc.org/github.com/coreos/etcd/wal

It would be great to get some feedback and perhaps work together.

bdarnell (Contributor) commented:

@philips Thanks for letting me know the etcd raft implementation has been merged. I've looked it over and I like a lot of the abstractions (especially the way you batch up different kinds of updates in a single Ready struct; it took me a while to convince myself that that was safe, but I think it's simpler than having separate update channels for different types of events). That said, there are a few big things we need that aren't there yet:

  1. Online membership change. It looks like the only way to change the membership of the cluster is to stop and restart all the nodes; we need the ability to add and remove nodes on the fly for rebalancing, etc.
  2. Storing raftLog state on disk. We can't afford to keep full snapshots in memory, and we probably can't even afford all log entries since the last snapshot. We'd like to be able to drop entries from the in-memory log as soon as they are applied, and have some sort of interface for raft to reach back out to the application to retrieve older log entries and snapshots as needed to catch up out-of-date peers (a rough interface sketch follows this comment).
  3. Coalesced heartbeats. We need to keep track of our last successful communication with each peer and only send heartbeats to those nodes we haven't heard from in a while (and we need to be able to update this per-peer timestamp from the outside so a response in one raft cluster can serve as a heartbeat for all other clusters involving that pair of nodes).

This adds up to a sizable amount of complexity, although at least some of it would be useful for etcd and other projects as well.
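On point 2, here is a hypothetical sketch of the kind of interface meant there (names are illustrative, not a real API; etcd's raft package later grew a similar Storage abstraction): raft keeps only recent entries in memory and reaches back to the application for older entries or a snapshot when catching up a slow peer.

```go
// Hypothetical log-storage interface: raft drops applied entries from
// memory and asks the application for older data when it needs it.
package storage

type LogStorage interface {
	// Entries returns log entries in [lo, hi), reading from disk if they
	// have already been dropped from the in-memory log.
	Entries(lo, hi uint64) ([]Entry, error)
	// Snapshot returns the most recent snapshot, used when a peer is so
	// far behind that the needed entries are no longer available.
	Snapshot() (Snapshot, error)
	// Compact tells the storage that entries up to index have been applied
	// and may be dropped from memory.
	Compact(index uint64) error
}

type Entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

type Snapshot struct {
	Index uint64
	Term  uint64
	Data  []byte
}
```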

philips (Contributor) commented Sep 16, 2014

@bdarnell We are absolutely looking at online membership change and plan on doing that in the next week or so. The other two are things we would like to do but haven't yet. Working together on either of those two would be great.

xiang90 (Contributor) commented Sep 16, 2014

@bdarnell

  1. We have the code ready; we just need to think through the interface.
  2. A storage/log interface is doable (you control how to get/truncate entries); right now it is just a wrapped slice.
  3. That is doable, and we have a plan.

bdarnell (Contributor) commented:

I'm closing this now that we've incorporated etcd's raft implementation; I'll open new issues for tracking the remaining work.

craig bot pushed a commit that referenced this issue Oct 30, 2018
32007: c-deps: bump RocksDB to pick up perf fix r=benesch a=petermattis

Pick up #20 which fixes a performance regression
that caused range tombstones to be added to some sstables unnecessarily
which in turn could cause compactions that are larger than necessary.

Revert the workaround to `TestRocksDBDeleteRangeCompaction` which was made
due to the now fixed bug.

Release note: None

Co-authored-by: Peter Mattis <[email protected]>