
storage: harden artifical quiesce heartbeat #26908

Merged (1 commit) Jul 12, 2018

Conversation

@spencerkimball (Member)

We previously assumed when sending quiesce messages that the Commit
field could always be set to the Raft group's status.Commit. With
upcoming changes that quiesce ranges even when some replicas are
behind but non-live, this value could be set incorrectly and still
be received by a supposedly dead replica.

This change mirrors the logic in the raft implementation for
setting the raftpb.Message.Commit field.

Release note: None
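The raft-side logic being mirrored — `raft.sendHeartbeat` clamps the advertised commit index to the follower's acknowledged Match index — can be sketched as follows. The helper name is hypothetical; the real change inlines the `min` call in `Replica.quiesceAndNotifyLocked`:

```go
package main

import "fmt"

// quiesceCommit sketches the clamping that etcd/raft performs in
// sendHeartbeat: never advertise a commit index beyond what the
// follower has acknowledged (its Progress.Match), so a lagging
// follower is never told to commit an entry it does not have.
// Hypothetical helper for illustration only.
func quiesceCommit(match, commit uint64) uint64 {
	if match < commit {
		return match
	}
	return commit
}

func main() {
	fmt.Println(quiesceCommit(3, 10))  // follower behind: clamped to 3
	fmt.Println(quiesceCommit(12, 10)) // follower caught up: leader's 10
}
```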

spencerkimball requested review from nvanbenschoten and a team on June 21, 2018 22:16
@cockroach-teamcity (Member)

This change is Reviewable

@nvanbenschoten (Member)

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 4168 at r1 (raw file):

```go
			Type:   raftpb.MsgHeartbeat,
			Term:   status.Term,
			Commit: min(prog.Match, status.Commit),
```

Leave a comment above this line pointing into Raft at raft.sendHeartbeat.



spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jun 21, 2018
Previously all replicas had to be completely up to date in order to
quiesce ranges. This made the loss of a node in a cluster with many
ranges an expensive proposition, as a significant number of ranges
could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness`
object on every Raft ticker loop and then passes that to
`Replica.tick()` to allow the leader to disregard non-live nodes
when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters
with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires cockroachdb#26908 to function properly

Fixes cockroachdb#9446
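The leader-side determination described above can be sketched as follows; names and shapes are illustrative only, not the actual `Replica.tick` signature:

```go
package main

import "fmt"

// shouldQuiesce sketches the leader's determination: the range may
// quiesce if every follower is either caught up to the leader's
// commit index or resides on a node the liveness map reports as
// not live. Names and shapes are hypothetical.
func shouldQuiesce(match map[int]uint64, commit uint64, live map[int]bool) bool {
	for id, m := range match {
		if m < commit && live[id] {
			return false // a live follower is still behind; keep ticking
		}
	}
	return true
}

func main() {
	live := map[int]bool{1: true, 2: false, 3: true}
	match := map[int]uint64{1: 10, 2: 4, 3: 10}
	fmt.Println(shouldQuiesce(match, 10, live)) // true: only the dead node lags
	live[2] = true
	fmt.Println(shouldQuiesce(match, 10, live)) // false: a live follower lags
}
```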
```diff
@@ -4157,7 +4165,7 @@ func (r *Replica) quiesceAndNotifyLocked(ctx context.Context, status *raft.Status
 				To:     id,
 				Type:   raftpb.MsgHeartbeat,
 				Term:   status.Term,
-				Commit: status.Commit,
+				Commit: min(prog.Match, status.Commit),
```

Please add a comment such as this one:

// In common operation, we only quiesce when all followers are up-to-date, and
// so the status.Commit field is safe to send. However, when we quiesce in the
// presence of dead nodes, a follower which is behind and counted as dead may
// not have the log entry referenced by status.Commit and would explode if it
// were told to commit up to that point. So reduce the commit index to an index
// we know the follower has acknowledged.

@spencerkimball (Member, Author)

I've incorporated this into the new mechanism suggested by @bdarnell.

@tbg (Member) left a comment

@bdarnell I think when I ran into this during my quiesce experiments, you mentioned that it was odd (?) to send a Commit update here in the first place. Is that correct? ISTM that if we didn't do that we could sometimes end up quiescing a follower that has the latest log index but not letting it know it's committed.

@bdarnell (Contributor)

Yeah, I think this is the wrong fix. If we quiesce a replica that doesn't have the up-to-date commit information, it has no way of catching up later unless the replica unquiesces. If we're knowingly quiescing with a downed replica, we should drop that replica's heartbeat on the floor instead of giving it a stale commit index (which will be dropped by the network in the common case but might make it through if things are flaky but not completely down).
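The behavior suggested here — drop the artificial heartbeat for a follower believed dead rather than send it a stale commit index — might look roughly like this; the types and names are hypothetical stand-ins, not the actual storage package API:

```go
package main

import "fmt"

type replicaID int

// heartbeatTargets sketches the suggestion: when quiescing in the
// presence of non-live replicas, skip the followers we believe dead
// instead of sending them a stale commit index. If such a replica is
// in fact alive, it stays unquiesced and will eventually wake the
// range to catch up.
func heartbeatTargets(followers []replicaID, live map[replicaID]bool) []replicaID {
	var out []replicaID
	for _, id := range followers {
		if !live[id] {
			continue // drop the heartbeat on the floor
		}
		out = append(out, id)
	}
	return out
}

func main() {
	live := map[replicaID]bool{1: true, 2: false, 3: true}
	fmt.Println(heartbeatTargets([]replicaID{1, 2, 3}, live)) // [1 3]
}
```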



@nvanbenschoten (Member)

If we quiesce a replica that doesn't have the up-to-date commit information, it has no way of catching up later unless the replica unquiesces.

This makes sense to me, but I want to double check that I understand the desired behavior here. We don't want to send the heartbeat to the replica with a stale commit index because we never want that replica to quiesce, right? We either want it to be dead or to unquiesce the range until it is caught up?

@bdarnell (Contributor)

Right. If it's dead, it doesn't matter whether we send the heartbeat or not. If it's alive, we want it to remain unquiesced until it catches up.

@nvanbenschoten (Member)

If it's alive, we want it to remain unquiesced until it catches up.

And what actions will it take if it's unquiesced and alive while the rest of the range is quiesced to try to catch up? Will it campaign (non-disruptively and unsuccessfully because of pre-vote) and wake up the other replicas, allowing the leader to continue to catch it up with MsgApps?

@bdarnell (Contributor)

Yes, exactly.

spencerkimball force-pushed the correct-quiesce-commit branch from 064bd50 to 1eb96b4 on July 11, 2018 21:46
@spencerkimball (Member, Author) left a comment

OK, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 4168 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Leave a comment above this line pointing into Raft at raft.sendHeartbeat.

Done.

spencerkimball force-pushed the correct-quiesce-commit branch from 1eb96b4 to 53dc851 on July 11, 2018 22:09
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jul 11, 2018
@bdarnell (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@spencerkimball (Member, Author)

bors r+

craig bot pushed a commit that referenced this pull request Jul 12, 2018
26908: storage: harden artifical quiesce heartbeat r=spencerkimball a=spencerkimball


Co-authored-by: Spencer Kimball <[email protected]>
@craig (Contributor)

craig bot commented Jul 12, 2018

Build succeeded

craig bot merged commit 53dc851 into cockroachdb:master on Jul 12, 2018
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jul 12, 2018
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jul 13, 2018
craig bot pushed a commit that referenced this pull request Jul 13, 2018
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball


Co-authored-by: Spencer Kimball <[email protected]>
@nvanbenschoten (Member)

@bdarnell the original code here would have avoided the issue we see in #30064 (comment). It was changed because of #26908 (comment). We can't revert to the original code because the hazard mentioned in that comment is very real, but I think the correct fix is to send the heartbeat to the straggler replica (which we think is dead) without specifying that the Replica should quiesce (Quiesce: false). Doing so means that the straggler Replica will still get a heartbeat if it needs one to join the Raft group (see #30064 (comment)) and that it will wake the Range up if it happens to come back to life so that it can catch up. Thoughts?
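The follow-up proposal — still heartbeat the straggler, but without asking it to quiesce — could be sketched as below; the field and function names are hypothetical, loosely modeled on the heartbeat message rather than copied from pkg/storage:

```go
package main

import "fmt"

// heartbeat loosely models the fields relevant to the proposal; the
// real message type in pkg/storage differs.
type heartbeat struct {
	To      int
	Quiesce bool
}

// makeQuiesceHeartbeat sketches the proposal: every follower still
// receives a heartbeat, but only followers believed live and caught
// up are asked to quiesce. A straggler thought dead gets
// Quiesce: false, so if it is actually alive it will wake the range
// and catch up.
func makeQuiesceHeartbeat(to int, liveAndCaughtUp bool) heartbeat {
	return heartbeat{To: to, Quiesce: liveAndCaughtUp}
}

func main() {
	fmt.Println(makeQuiesceHeartbeat(2, false)) // straggler stays awake
	fmt.Println(makeQuiesceHeartbeat(1, true))  // quiesce as usual
}
```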

@bdarnell (Contributor) left a comment

SGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

spencerkimball deleted the correct-quiesce-commit branch on October 22, 2018 01:10