storage: schedule Raft log catch-up of behind nodes #12485
Comments
Good writeup.
Another option would be to extend etcd/raft's current per-follower in-flight message limit (MaxInflightMsgs) into something more global, such as a semaphore shared across Raft groups.
Providing more global control over the number of in-flight messages is interesting, though I think we would need something more complex than a shared semaphore. I don't think Raft could block waiting for a slot to send a message, as doing so seems deadlock-prone given our limited number of Raft worker goroutines. So we'd need an interface to tell Raft to try to send more messages for a Raft group, but that seems to get back to the same sort of scheduling as in my proposal. It would be worthwhile to sketch out your suggestion if you had something concrete in mind.
No concrete proposal. I was thinking that the current in-flight restriction works fine without blocking, but that's because the RawNode always wakes itself up when it gets a MsgAppResp. Here, we'd need some way to wake up some other RawNode when a slot is freed up. That's certainly doable, but I don't see a particularly elegant solution.
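For illustration, here is a minimal sketch of one non-blocking shape this could take: a store-wide quota where a Raft group that finds no free slot registers a wake-up callback instead of blocking, and releasing a slot invokes the next callback. All names here (`SendQuota`, `TryAcquire`, etc.) are hypothetical; nothing like this exists in etcd/raft or CockroachDB today.

```go
package quota

import "sync"

// SendQuota is a hypothetical store-wide limit on in-flight MsgApp
// traffic. Callers never block on it: TryAcquire either grants a slot
// or records a wake-up callback to run when a slot frees up. This
// sidesteps the deadlock risk of blocking a limited pool of Raft
// worker goroutines.
type SendQuota struct {
	mu      sync.Mutex
	slots   int      // remaining in-flight slots
	waiters []func() // wake-up callbacks for groups that found no free slot
}

func NewSendQuota(maxInflight int) *SendQuota {
	return &SendQuota{slots: maxInflight}
}

// TryAcquire grants a slot if one is free. Otherwise it registers wake
// (e.g. a closure that re-enqueues the Raft group with the scheduler)
// and returns false; the group should stop sending until woken.
func (q *SendQuota) TryAcquire(wake func()) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.slots > 0 {
		q.slots--
		return true
	}
	q.waiters = append(q.waiters, wake)
	return false
}

// Release returns a slot and wakes one waiting group, which then
// retries TryAcquire. Another group may steal the slot in between, in
// which case the woken group simply re-registers itself.
func (q *SendQuota) Release() {
	q.mu.Lock()
	q.slots++
	var wake func()
	if len(q.waiters) > 0 {
		wake = q.waiters[0]
		q.waiters = q.waiters[1:]
	}
	q.mu.Unlock()
	if wake != nil {
		wake()
	}
}
```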
I'm going to experiment with this over the next couple of days. |
Closing as outdated. |
Per discussion with @spencerkimball. When a node goes down, replicas on the down node will fall behind as activity occurs on the associated ranges. When the node comes back up, Raft will notice that the replicas are behind and send Raft log entries via `MsgApp` requests. This catch-up work is performed by Raft as quickly as possible across all replicas on the down node, and it is triggered by heartbeats, which means it starts happening almost immediately.

Consider a 3-node cluster with 1k ranges performing 1k ops/sec evenly spread across those ranges. On average we should see 1 op/range/sec. If a node is down for 5m, we'd expect each replica on that down node to be 300 ops behind the current commit index for the range. As mentioned above, when the down node restarts, heartbeats will very quickly determine that the replicas are behind, and Raft will merrily generate a ton of `MsgApp` traffic which can overwhelm the node, making it perform slowly enough that replicas take a long time to catch up.

Rather than the current free-for-all, we should schedule Raft catch-up. See #12238 for a proposal for how to schedule this catch-up for Raft snapshots. If etcd-io/etcd#7037 is addressed, once a follower enters `ProgressStateProbe` it will only exit that state when it receives a `MsgHeartbeatResp`. Or we could see about introducing a new API to `raft.RawNode` to explicitly mark a follower as "application paused". Regardless of the precise mechanism, once a follower enters `ProgressStateProbe` we could let `replicateQueue` take care of sending either a Raft snapshot or Raft logs when the node the follower is on becomes live again (see the sketch below).

Cc @cockroachdb/stability, @spencerkimball, @bdarnell