
storage: harden artifical quiesce heartbeat #26908

Merged (1 commit) Jul 12, 2018

Conversation

@spencerkimball (Member)

We previously assumed when sending quiesce messages that the Commit
field could always be set to the Raft group's status.Commit. With
upcoming changes that quiesce ranges even when some replicas are
behind but non-live, this value could be set incorrectly and still
be received by a supposedly dead replica.

This change mirrors the logic in the raft implementation for
setting the raftpb.Message.Commit field.

Release note: None
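The raft-side logic being mirrored — `raft.sendHeartbeat` clamps the advertised commit index to the follower's acknowledged Match index — can be sketched as follows. The helper name is hypothetical; the real change inlines the `min` call in `Replica.quiesceAndNotifyLocked`:

```go
package main

import "fmt"

// quiesceCommit sketches the clamping that etcd/raft performs in
// sendHeartbeat: never advertise a commit index beyond what the
// follower has acknowledged (its Progress.Match), so a lagging
// follower is never told to commit an entry it does not have.
// Hypothetical helper for illustration only.
func quiesceCommit(match, commit uint64) uint64 {
	if match < commit {
		return match
	}
	return commit
}

func main() {
	fmt.Println(quiesceCommit(3, 10))  // follower behind: clamped to 3
	fmt.Println(quiesceCommit(12, 10)) // follower caught up: leader's 10
}
```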

spencerkimball requested review from nvanbenschoten and a team on June 21, 2018 22:16
@cockroach-teamcity (Member)

This change is Reviewable

@nvanbenschoten (Member)

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 4168 at r1 (raw file):

```go
			Type:   raftpb.MsgHeartbeat,
			Term:   status.Term,
			Commit: min(prog.Match, status.Commit),
```

Leave a comment above this line pointing into Raft at raft.sendHeartbeat.



spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jun 21, 2018
Previously all replicas had to be completely up to date in order to
quiesce ranges. This made the loss of a node in a cluster with many
ranges an expensive proposition, as a significant number of ranges
could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness`
object on every Raft ticker loop and then passes that to
`Replica.tick()` to allow the leader to disregard non-live nodes
when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters
with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires cockroachdb#26908 to function properly

Fixes cockroachdb#9446
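The leader-side determination described above can be sketched as follows; names and shapes are illustrative only, not the actual `Replica.tick` signature:

```go
package main

import "fmt"

// shouldQuiesce sketches the leader's determination: the range may
// quiesce if every follower is either caught up to the leader's
// commit index or resides on a node the liveness map reports as
// not live. Names and shapes are hypothetical.
func shouldQuiesce(match map[int]uint64, commit uint64, live map[int]bool) bool {
	for id, m := range match {
		if m < commit && live[id] {
			return false // a live follower is still behind; keep ticking
		}
	}
	return true
}

func main() {
	live := map[int]bool{1: true, 2: false, 3: true}
	match := map[int]uint64{1: 10, 2: 4, 3: 10}
	fmt.Println(shouldQuiesce(match, 10, live)) // true: only the dead node lags
	live[2] = true
	fmt.Println(shouldQuiesce(match, 10, live)) // false: a live follower lags
}
```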
```diff
@@ -4157,7 +4165,7 @@ func (r *Replica) quiesceAndNotifyLocked(ctx context.Context, status *raft.Status
 				To:     id,
 				Type:   raftpb.MsgHeartbeat,
 				Term:   status.Term,
-				Commit: status.Commit,
+				Commit: min(prog.Match, status.Commit),
```

Please add a comment such as this one:

// In common operation, we only quiesce when all followers are up-to-date, and
// so the status.Commit field is safe to send. However, when we quiesce in the
// presence of dead nodes, a follower which is behind and counted as dead may
// not have the log entry referenced by status.Commit and would explode if it
// were told to commit up to that point. So reduce the commit index to an index
// we know the follower has acknowledged.

@spencerkimball (Member, Author)

I've incorporated this into the new mechanism suggested by @bdarnell.

@tbg (Member) left a comment

@bdarnell I think when I ran into this during my quiesce experiments, you mentioned that it was odd (?) to send a Commit update here in the first place. Is that correct? ISTM that if we didn't do that we could sometimes end up quiescing a follower that has the latest log index but not letting it know it's committed.

@bdarnell (Contributor)

Yeah, I think this is the wrong fix. If we quiesce a replica that doesn't have the up-to-date commit information, it has no way of catching up later unless the replica unquiesces. If we're knowingly quiescing with a downed replica, we should drop that replica's heartbeat on the floor instead of giving it a stale commit index (which will be dropped by the network in the common case but might make it through if things are flaky but not completely down).
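The behavior suggested here — drop the artificial heartbeat for a follower believed dead rather than send it a stale commit index — might look roughly like this; the types and names are hypothetical stand-ins, not the actual storage package API:

```go
package main

import "fmt"

type replicaID int

// heartbeatTargets sketches the suggestion: when quiescing in the
// presence of non-live replicas, skip the followers we believe dead
// instead of sending them a stale commit index. If such a replica is
// in fact alive, it stays unquiesced and will eventually wake the
// range to catch up.
func heartbeatTargets(followers []replicaID, live map[replicaID]bool) []replicaID {
	var out []replicaID
	for _, id := range followers {
		if !live[id] {
			continue // drop the heartbeat on the floor
		}
		out = append(out, id)
	}
	return out
}

func main() {
	live := map[replicaID]bool{1: true, 2: false, 3: true}
	fmt.Println(heartbeatTargets([]replicaID{1, 2, 3}, live)) // [1 3]
}
```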



@nvanbenschoten (Member)

If we quiesce a replica that doesn't have the up-to-date commit information, it has no way of catching up later unless the replica unquiesces.

This makes sense to me, but I want to double check that I understand the desired behavior here. We don't want to send the heartbeat to the replica with a stale commit index because we never want that replica to quiesce, right? We either want it to be dead or to unquiesce the range until it is caught up?

@bdarnell (Contributor)

Right. If it's dead, it doesn't matter whether we send the heartbeat or not. If it's alive, we want it to remain unquiesced until it catches up.

@nvanbenschoten (Member)

If it's alive, we want it to remain unquiesced until it catches up.

And what actions will it take if it's unquiesced and alive while the rest of the range is quiesced to try to catch up? Will it campaign (non-disruptively and unsuccessfully because of pre-vote) and wake up the other replicas, allowing the leader to continue to catch it up with MsgApps?

@bdarnell (Contributor)

Yes, exactly.

spencerkimball force-pushed the correct-quiesce-commit branch from 064bd50 to 1eb96b4 on July 11, 2018 21:46
@spencerkimball (Member, Author) left a comment

OK, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 4168 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Leave a comment above this line pointing into Raft at raft.sendHeartbeat.

Done.

spencerkimball force-pushed the correct-quiesce-commit branch from 1eb96b4 to 53dc851 on July 11, 2018 22:09
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jul 11, 2018
@bdarnell (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@spencerkimball (Member, Author)

bors r+

craig bot pushed a commit that referenced this pull request Jul 12, 2018
26908: storage: harden artifical quiesce heartbeat r=spencerkimball a=spencerkimball


Co-authored-by: Spencer Kimball <[email protected]>
@craig (Contributor)

craig bot commented Jul 12, 2018

Build succeeded

craig bot merged commit 53dc851 into cockroachdb:master on Jul 12, 2018
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jul 12, 2018
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jul 13, 2018
craig bot pushed a commit that referenced this pull request Jul 13, 2018
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball


Co-authored-by: Spencer Kimball <[email protected]>
@nvanbenschoten (Member)

@bdarnell the original code here would have avoided the issue we see in #30064 (comment). It was changed because of #26908 (comment). We can't revert to the original code because the hazard mentioned in that comment is very real, but I think the correct fix is to send the heartbeat to the straggler replica (which we think is dead) without specifying that the Replica should quiesce (Quiesce: false). Doing so means that the straggler Replica will still get a heartbeat if it needs one to join the Raft group (see #30064 (comment)) and that it will wake the Range up if it happens to come back to life so that it can catch up. Thoughts?
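The follow-up proposal — still heartbeat the straggler, but without asking it to quiesce — could be sketched as below; the field and function names are hypothetical, loosely modeled on the heartbeat message rather than copied from pkg/storage:

```go
package main

import "fmt"

// heartbeat loosely models the fields relevant to the proposal; the
// real message type in pkg/storage differs.
type heartbeat struct {
	To      int
	Quiesce bool
}

// makeQuiesceHeartbeat sketches the proposal: every follower still
// receives a heartbeat, but only followers believed live and caught
// up are asked to quiesce. A straggler thought dead gets
// Quiesce: false, so if it is actually alive it will wake the range
// and catch up.
func makeQuiesceHeartbeat(to int, liveAndCaughtUp bool) heartbeat {
	return heartbeat{To: to, Quiesce: liveAndCaughtUp}
}

func main() {
	fmt.Println(makeQuiesceHeartbeat(2, false)) // straggler stays awake
	fmt.Println(makeQuiesceHeartbeat(1, true))  // quiesce as usual
}
```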

@bdarnell (Contributor) left a comment

SGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

spencerkimball deleted the correct-quiesce-commit branch on October 22, 2018 01:10