-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: harden artifical quiesce heartbeat #26908
storage: harden artifical quiesce heartbeat #26908
Conversation
Review status: complete! 0 of 0 LGTMs obtained pkg/storage/replica.go, line 4168 at r1 (raw file):
Leave a comment above this line pointing into Raft at Comments from Reviewable |
Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down. This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination. Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage. Note that this PR requires cockroachdb#26908 to function properly Fixes cockroachdb#9446
pkg/storage/replica.go
Outdated
@@ -4157,7 +4165,7 @@ func (r *Replica) quiesceAndNotifyLocked(ctx context.Context, status *raft.Statu | |||
To: id, | |||
Type: raftpb.MsgHeartbeat, | |||
Term: status.Term, | |||
Commit: status.Commit, | |||
Commit: min(prog.Match, status.Commit), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment such as this one:
// In common operation, we only quiesce when all followers are up-to-date, and
// so the status.Commit field is safe to send. However, when we quiesce in the
// presence of dead nodes, a follower which is behind and counted as dead may
// not have the log entry referenced by status.Commit
and would explode if it
// were told to commit up to that point. So reduce the commit index to an index
// we know the follower has acknowledged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've incorporated this into the new mechanism suggested by @bdarnell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bdarnell I think when I ran into this during my quiesce experiments, you mentioned that it was odd (?) to send a Commit update here in the first place. Is that correct? ISTM that if we didn't do that we could sometimes end up quiescing a follower that has the latest log index but not letting it know it's committed.
Yeah, I think this is the wrong fix. If we quiesce a replica that doesn't have the up-to-date commit information, it has no way of catching up later unless the replica unquiesces. If we're knowingly quiescing with a downed replica, we should drop that replica's heartbeat on the floor instead of giving it a stale commit index (which will be dropped by the network in the common case but might make it through if things are flaky but not completely down). Review status: complete! 0 of 0 LGTMs obtained Comments from Reviewable |
This makes sense to me, but I want to double check that I understand the desired behavior here. We don't want to send the heartbeat to the replica with a stale commit index because we never want that replica to quiesce, right? We either want it to be dead or to unquiesce the range until it is caught up? |
Right. If it's dead, it doesn't matter whether we send the heartbeat or not. If it's alive, we want it to remain unquiesced until it catches up. |
And what actions will it take if it's unquiesced and alive while the rest of the range is quiesced to try to catch up? Will it campaign (non-disruptively and unsuccesfully because of pre-vote) and wake up the other replicas, allowing the leader to continue to catch it up with MsgApps? |
Yes, exactly. |
064bd50
to
1eb96b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 4168 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Leave a comment above this line pointing into Raft at
raft.sendHeartbeat
.
Done.
We previously had the assumption when sending quiesce messages that the Commit field could always be set to the Raft group's `status.Commit`. With upcoming changes to quiesce ranges even with replicas that are behind but non-live, this value could be set incorrectly and still received by a supposedly dead replica. This change mirrors the logic in the raft implementation for setting the `raftpb.Message.Commit` field. Release note: None
1eb96b4
to
53dc851
Compare
Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down. This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination. Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage. Note that this PR requires cockroachdb#26908 to function properly Fixes cockroachdb#9446
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
bors r+ |
26908: storage: harden artifical quiesce heartbeat r=spencerkimball a=spencerkimball We previously had the assumption when sending quiesce messages that the Commit field could always be set to the Raft group's `status.Commit`. With upcoming changes to quiesce ranges even with replicas that are behind but non-live, this value could be set incorrectly and still received by a supposedly dead replica. This change mirrors the logic in the raft implementation for setting the `raftpb.Message.Commit` field. Release note: None Co-authored-by: Spencer Kimball <[email protected]>
Build succeeded |
Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down. This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination. Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage. Note that this PR requires cockroachdb#26908 to function properly Fixes cockroachdb#9446
Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down. This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination. Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage. Note that this PR requires cockroachdb#26908 to function properly Fixes cockroachdb#9446
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down. This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination. Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage. Note that this PR requires #26908 to function properly Fixes #9446 Co-authored-by: Spencer Kimball <[email protected]>
@bdarnell the original code here would have avoided the issue we see in #30064 (comment). It was changed because of #26908 (comment). We can't revert to the original code because the hazard mentioned in that comment is very real, but I think the correct fix is to send the heartbeat to the straggler replica (which we think is dead) without specifying that the Replica should quiesce ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
We previously had the assumption when sending quiesce messages
that the Commit field could always be set to the Raft group's
status.Commit
. With upcoming changes to quiesce ranges evenwith replicas that are behind but non-live, this value could be
set incorrectly and still received by a supposedly dead replica.
This change mirrors the logic in the raft implementation for
setting the
raftpb.Message.Commit
field.Release note: None