storage: schedule Raft log catch-up of behind nodes #12485

Closed
petermattis opened this issue Dec 19, 2016 · 6 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination.
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@petermattis
Collaborator

Per discussion with @spencerkimball. When a node goes down, replicas on the down node will fall behind as activity occurs on the associated ranges. When the node comes back up, Raft will notice that the replicas are behind and send Raft log entries via MsgApp requests. This catch-up work is performed by Raft as quickly as possible across all replicas on the down node, and because it is triggered by heartbeats it starts almost immediately.

Consider a 3-node cluster with 1k ranges performing 1k ops/sec evenly spread across those ranges. On average we should see 1 op/range/sec. If a node is down for 5m, we'd expect each replica on that down node to be 300 ops behind the current commit index for its range. As mentioned above, when the down node restarts, heartbeats will very quickly determine that the replicas are behind, and Raft will merrily generate a ton of MsgApp traffic which can overwhelm the node, making it perform slowly enough that replicas take a long time to catch up.
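
A back-of-envelope check of those numbers (the helper below is purely illustrative, not CockroachDB code):

```go
package main

import (
	"fmt"
	"time"
)

// expectedLag estimates how many log entries a replica on a down node falls
// behind, assuming writes are spread evenly across ranges.
func expectedLag(opsPerSec, ranges float64, downtime time.Duration) float64 {
	return opsPerSec / ranges * downtime.Seconds()
}

func main() {
	// 1k ops/sec over 1k ranges with the node down for 5 minutes:
	// 1 op/range/sec * 300s = 300 entries behind per replica.
	fmt.Println(expectedLag(1000, 1000, 5*time.Minute))
}
```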

Rather than the current free-for-all, we should schedule Raft catch-up. See #12238 for a proposal for how to schedule this catch-up for Raft snapshots. If etcd-io/etcd#7037 is addressed, once a follower enters ProgressStateProbe it will only exit that state when it receives a MsgHeartbeatResp. Or we could see about introducing a new API to raft.RawNode to explicitly mark a follower as "application paused". Regardless of the precise mechanism, once a follower enters ProgressStateProbe we could let replicateQueue take care of sending either a Raft snapshot or Raft logs when the node the follower is on becomes live again.
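
A rough sketch of what that replicateQueue-driven scheduling could look like, assuming the era's `github.com/coreos/etcd/raft` package; `isNodeLive` and `enqueueForCatchUp` are hypothetical helpers, not existing functions:

```go
package storage

import "github.com/coreos/etcd/raft"

// Hypothetical helpers, assumed for this sketch only.
func isNodeLive(replicaID uint64) bool            { return true }
func enqueueForCatchUp(rangeID, replicaID uint64) {}

// scheduleCatchUp walks the followers of a range and, for any follower stuck
// in ProgressStateProbe whose node has come back up, hands the catch-up work
// to the replicate queue instead of letting Raft blast MsgApps the moment a
// heartbeat response arrives.
func scheduleCatchUp(rangeID uint64, status *raft.Status) {
	for replicaID, pr := range status.Progress {
		if pr.State != raft.ProgressStateProbe {
			continue // replicating normally or already receiving a snapshot
		}
		if !isNodeLive(replicaID) {
			continue // don't spend any effort until the node is back
		}
		// The replicate queue decides between a Raft snapshot and a paced
		// stream of log entries, and rate-limits across ranges.
		enqueueForCatchUp(rangeID, replicaID)
	}
}
```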

Cc @cockroachdb/stability, @spencerkimball, @bdarnell

@spencerkimball
Member

Good writeup.

@bdarnell
Contributor

Another option would be to extend etcd/raft's current MaxInflightMsgs mechanism. Currently, we configure each raft group with an allowance of 4 in-flight MsgApps (other messages are not affected by this limit). If we could pass in a semaphore or similar shared object instead of just a count, then we could have global control over the number of in-flight messages. (This would also let us have more than 4 messages in flight for workloads that are concentrated on a small number of ranges.)
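
A minimal sketch of the shared, non-blocking budget this implies (the `inflightBudget` type is an assumption, not an existing etcd/raft API):

```go
package storage

import "sync/atomic"

// inflightBudget is a sketch of a shared quota that several raft groups could
// consult instead of each group's private MaxInflightMsgs allowance.
type inflightBudget struct {
	used  int64
	limit int64
}

// tryAcquire reserves one in-flight MsgApp slot if any are free. It never
// blocks, since blocking inside Raft handling could deadlock the limited pool
// of Raft worker goroutines.
func (b *inflightBudget) tryAcquire() bool {
	if atomic.AddInt64(&b.used, 1) <= b.limit {
		return true
	}
	atomic.AddInt64(&b.used, -1)
	return false
}

// release frees a slot when the corresponding MsgAppResp arrives.
func (b *inflightBudget) release() {
	atomic.AddInt64(&b.used, -1)
}
```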

@petermattis
Collaborator Author

Providing more global control over the number of in-flight messages is interesting, though I think we would need something more complex than a shared semaphore. I don't think Raft could block waiting for a slot to send a message, as doing so seems deadlock-prone given our limited number of Raft worker goroutines. So we'd need an interface to tell Raft to try to send more messages for a Raft group, but that seems to get back to the same sort of scheduling as in my proposal. It would be worthwhile to sketch out your suggestion if you have something concrete in mind.

@bdarnell
Contributor

No concrete proposal. I was thinking that the current in-flight restriction works fine without blocking, but that's because the RawNode always wakes itself up when it gets a MsgAppResp. Here, we'd need some way to wake up some other RawNode when a slot is freed up. That's certainly doable, but I don't see a particularly elegant solution.
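
One way that wakeup could be wired up, sketched with assumed types (`enqueue` stands in for however the store pokes a range's Raft processing; nothing here is an existing API):

```go
package storage

import "sync"

// waiterQueue is a sketch: ranges that failed to acquire an in-flight slot
// register here, and when a slot frees up one of them is handed back to the
// Raft scheduler so its RawNode gets another chance to send.
type waiterQueue struct {
	mu      sync.Mutex
	waiting []uint64           // range IDs blocked on the shared budget
	enqueue func(rangeID uint64) // assumed hook into the store's Raft scheduler
}

// wait records that a range could not get a slot and should be woken later.
func (q *waiterQueue) wait(rangeID uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.waiting = append(q.waiting, rangeID)
}

// onSlotFreed is called when a MsgAppResp returns a slot to the budget.
func (q *waiterQueue) onSlotFreed() {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.waiting) == 0 {
		return
	}
	rangeID := q.waiting[0]
	q.waiting = q.waiting[1:]
	q.enqueue(rangeID) // wake that range's RawNode to try sending again
}
```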

@petermattis
Collaborator Author

I'm going to experiment with this over the next couple of days.

@petermattis petermattis added this to the Later milestone Feb 23, 2017
@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 21, 2018
@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@tbg
Member

tbg commented Oct 11, 2018

Closing as outdated.

@tbg tbg closed this as completed Oct 11, 2018