storage: schedule Raft log catch-up of behind nodes #12485
Comments
Good writeup.
Another option would be to extend etcd/raft's current per-follower in-flight message limit (MaxInflightMsgs) into something more global, such as a semaphore shared across Raft groups.
Providing more global control over the number of in-flight messages is interesting, though I think we would need something more complex than a shared semaphore. I don't think Raft could block waiting for a slot to send a message, as doing so seems deadlock-prone given our limited number of Raft worker goroutines. So we'd need an interface to tell Raft to try to send more messages for a Raft group, but that seems to get back to the same sort of scheduling as in my proposal. It would be worthwhile to sketch out your suggestion if you had something concrete in mind.
No concrete proposal. I was thinking that the current in-flight restriction works fine without blocking, but that's because the RawNode always wakes itself up when it gets a MsgAppResp. Here, we'd need some way to wake up some other RawNode when a slot is freed up. That's certainly doable, but I don't see a particularly elegant solution.
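For illustration, here is a minimal sketch of one non-blocking shape this could take: a store-wide quota where a Raft group that finds no free slot registers a wake-up callback instead of blocking, and releasing a slot invokes the next callback. All names here (`SendQuota`, `TryAcquire`, etc.) are hypothetical; nothing like this exists in etcd/raft or CockroachDB today.

```go
package quota

import "sync"

// SendQuota is a hypothetical store-wide limit on in-flight MsgApp
// traffic. Callers never block on it: TryAcquire either grants a slot
// or records a wake-up callback to run when a slot frees up. This
// sidesteps the deadlock risk of blocking a limited pool of Raft
// worker goroutines.
type SendQuota struct {
	mu      sync.Mutex
	slots   int      // remaining in-flight slots
	waiters []func() // wake-up callbacks for groups that found no free slot
}

func NewSendQuota(maxInflight int) *SendQuota {
	return &SendQuota{slots: maxInflight}
}

// TryAcquire grants a slot if one is free. Otherwise it registers wake
// (e.g. a closure that re-enqueues the Raft group with the scheduler)
// and returns false; the group should stop sending until woken.
func (q *SendQuota) TryAcquire(wake func()) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.slots > 0 {
		q.slots--
		return true
	}
	q.waiters = append(q.waiters, wake)
	return false
}

// Release returns a slot and wakes one waiting group, which then
// retries TryAcquire. Another group may steal the slot in between, in
// which case the woken group simply re-registers itself.
func (q *SendQuota) Release() {
	q.mu.Lock()
	q.slots++
	var wake func()
	if len(q.waiters) > 0 {
		wake = q.waiters[0]
		q.waiters = q.waiters[1:]
	}
	q.mu.Unlock()
	if wake != nil {
		wake()
	}
}
```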
I'm going to experiment with this over the next couple of days. |
Closing as outdated. |
Per discussion with @spencerkimball. When a node goes down, replicas on the down node will fall behind as activity occurs on the associated ranges. When the node comes back up, Raft will notice that the replicas are behind and send Raft log entries via `MsgApp` requests. This catch-up work is performed by Raft as quickly as possible across all replicas on the down node, and it is triggered by heartbeats, which means it starts happening almost immediately.

Consider a 3-node cluster with 1k ranges performing 1k ops/sec evenly spread across those ranges. On average we should see 1 op/range/sec. If a node is down for 5m, we'd expect each replica on that down node to be 300 ops behind the current commit index for the range. As mentioned above, when the down node restarts, heartbeats will very quickly determine that the replicas are behind, and Raft will merrily generate a ton of `MsgApp` traffic which can overwhelm the node, making it perform slowly enough that replicas take a long time to catch up.

Rather than the current free-for-all, we should schedule Raft catch-up. See #12238 for a proposal for how to schedule this catch-up for Raft snapshots. If etcd-io/etcd#7037 is addressed, once a follower enters `ProgressStateProbe` it will only exit that state when it receives a `MsgHeartbeatResp`. Or we could see about introducing a new API to `raft.RawNode` to explicitly mark a follower as "application paused". Regardless of the precise mechanism, once a follower enters `ProgressStateProbe` we could let `replicateQueue` take care of sending either a Raft snapshot or Raft logs when the node the follower is on becomes live again (see the sketch below).

Cc @cockroachdb/stability, @spencerkimball, @bdarnell