
storage: quiesce ranges which have non-live replicas #26911

Merged: 1 commit merged into cockroachdb:master from quiesce-non-live on Jul 13, 2018

Conversation

@spencerkimball (Member)

Previously all replicas had to be completely up to date in order to
quiesce ranges. This made the loss of a node in a cluster with many
ranges an expensive proposition, as a significant number of ranges
could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the NodeLiveness
object on every Raft ticker loop and then passes that to
Replica.tick() to allow the leader to disregard non-live nodes
when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters
with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires #26908 to function properly

Fixes #9446
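As a rough illustration of the approach described above (this is a minimal sketch, not the actual CockroachDB code; the types and names are simplified stand-ins), the leader's quiesce decision ignores lagging followers whose nodes are known to be non-live, and falls back to the old conservative behavior when no liveness data is available:

```go
package main

import "fmt"

type NodeID int
type ReplicaID int

type replicaDesc struct {
	NodeID    NodeID
	ReplicaID ReplicaID
}

// shouldQuiesce returns true if every *live* follower is caught up to
// the leader's applied index. A nil livenessMap means liveness data is
// unavailable, so any lagging follower blocks quiescence (the old,
// conservative behavior).
func shouldQuiesce(descs []replicaDesc, match map[ReplicaID]uint64,
	applied uint64, livenessMap map[NodeID]bool) bool {
	for _, rep := range descs {
		m, ok := match[rep.ReplicaID]
		if !ok || m != applied {
			// The follower is behind; ignore it only if its node is non-live.
			if livenessMap != nil && !livenessMap[rep.NodeID] {
				continue
			}
			return false
		}
	}
	return true
}

func main() {
	descs := []replicaDesc{{1, 1}, {2, 2}, {3, 3}}
	match := map[ReplicaID]uint64{1: 10, 2: 10, 3: 7} // replica 3 is behind
	live := map[NodeID]bool{1: true, 2: true, 3: false}

	fmt.Println(shouldQuiesce(descs, match, 10, live)) // true: the lagging replica's node is dead
	fmt.Println(shouldQuiesce(descs, match, 10, nil))  // false: without liveness data, stay awake
}
```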

@cockroach-teamcity (Member)

This change is Reviewable

@bdarnell (Contributor)

LGTM once #26908 is resolved.


Reviewed 3 of 3 files at r1.
Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 3998 at r1 (raw file):

// correctness issues.
//
// TODO(peter): When a node goes down, any range which has a replica on the

Replace this comment with a discussion of the new feature and the livenessMap argument.


pkg/storage/replica.go, line 4101 at r1 (raw file):

		return nil, false
	}
	var foundSelf bool

I don't think the new code that matches the range descriptor's replica map against the raft status can replace the foundSelf check (I can't quite remember what that was for, but it's weird enough that it must have fixed a real bug at some point)


Comments from Reviewable

@spencerkimball force-pushed the quiesce-non-live branch 2 times, most recently from e0cabfe to 10ee80c on July 11, 2018 22:13
@spencerkimball (Member Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 3998 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Replace this comment with a discussion of the new feature and the livenessMap argument.

Done.


pkg/storage/replica.go, line 4101 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think the new code that matches the range descriptor's replica map against the raft status can replace the foundSelf check (I can't quite remember what that was for, but it's weird enough that it must have fixed a real bug at some point)

I was assuming that because we iterate over the range descriptor's replicas, our own ID must be present in the raft status, but that was a shaky assumption, so I've restored the foundSelf check. Changed.
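For context, a tiny illustrative sketch (a hypothetical helper, not the PR's code) of what the restored foundSelf check guards against: the leader should only quiesce if its own replica ID actually shows up in the Raft status' progress map.

```go
package main

import "fmt"

type ReplicaID int

// progressContainsSelf sketches the intent of the foundSelf check: the
// leader only considers quiescing if its own replica ID appears in the
// Raft status' progress map.
func progressContainsSelf(progress map[ReplicaID]uint64, self ReplicaID) bool {
	_, ok := progress[self]
	return ok
}

func main() {
	progress := map[ReplicaID]uint64{1: 10, 2: 10}
	fmt.Println(progressContainsSelf(progress, 1)) // true: safe to consider quiescing
	fmt.Println(progressContainsSelf(progress, 3)) // false: decline to quiesce
}
```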

@bdarnell (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 3 of 3 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 4184 at r2 (raw file):

// node liveness instance invokes a callback which causes all nodes to
// wake up any ranges containing replicas owned by the newly-live
// node, allowing the out-of-date replicas to be brought back up to date.

Mention what it means for the map to be nil.


pkg/storage/replica.go, line 4311 at r2 (raw file):

		}
		if progress, ok := status.Progress[uint64(rep.ReplicaID)]; !ok {
			if log.V(4) {

Add a comment describing when this can happen.


pkg/storage/store.go, line 3679 at r2 (raw file):

}

// nodeIsLiveCallback is invoked when a node transitions from non-live

This is naturally going to be racy, right? It's possible that a node comes back to life right after the liveness map has been .Stored but before the corresponding range has processed the tick. This could create scenarios where the node is already unquiesced, this callback is called and is a no-op, and then replica.maybeQuiesceLocked is called and quiesces even with the newly live node out of date.

This should all be fine I think because the newly live node will eventually force the range to unquiesce, but I would like to see the race explicitly called out and rationalized here.

@spencerkimball (Member Author) left a comment

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained


pkg/storage/replica.go, line 4184 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mention what it means for the map to be nil.

Done.


pkg/storage/replica.go, line 4311 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a comment describing when this can happen.

I don't know of any reason this could happen (@bdarnell?). I took this step because while these two objects should mirror each other, they are separately maintained.


pkg/storage/store.go, line 3679 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is naturally going to be racy, right? It's possible that a node comes back to life right after the liveness map has been .Stored but before the corresponding range has processed the tick. This could create scenarios where the node is already unquiesced, this callback is called and is a no-op, and then replica.maybeQuiesceLocked is called and quiesces even with the newly live node out of date.

This should all be fine I think because the newly live node will eventually force the range to unquiesce, but I would like to see the race explicitly called out and rationalized here.

Done, and I've added code to make that race significantly less likely by updating the liveness map first thing in the callback.
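For illustration only (the real Store uses different names and plumbing), a toy version of the ordering described above: the callback stores the fresh liveness map before waking ranges, so a tick running concurrently already sees the node as live and declines to quiesce with the newly-live replica still behind.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type NodeID int

type replica struct{ quiesced bool }

func (r *replica) unquiesce() { r.quiesced = false }

// store is a toy stand-in for the real Store type.
type store struct {
	livenessMap atomic.Value // holds a map[NodeID]bool, read by the tick loop
	mu          sync.Mutex
	byNode      map[NodeID][]*replica // replicas indexed by the node hosting one of their followers
}

// nodeIsLiveCallback mirrors the ordering described above: store the
// fresh liveness map *before* unquiescing, shrinking the window in
// which a concurrent tick could still see the node as dead.
func (s *store) nodeIsLiveCallback(nodeID NodeID, fresh map[NodeID]bool) {
	s.livenessMap.Store(fresh) // update the map first thing in the callback

	s.mu.Lock()
	defer s.mu.Unlock()
	for _, r := range s.byNode[nodeID] {
		r.unquiesce()
	}
}

func main() {
	s := &store{byNode: map[NodeID][]*replica{3: {{quiesced: true}}}}
	s.nodeIsLiveCallback(3, map[NodeID]bool{1: true, 2: true, 3: true})
	fmt.Println(s.byNode[3][0].quiesced) // false: the range was woken up
}
```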

@bdarnell (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica.go, line 4311 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I don't know of any reason this could happen (@bdarnell?). I took this step because while these two objects should mirror each other, they are separately maintained.

It should never happen (except for the race between the two updates, but that should never happen concurrently with this code). Still good to be defensive here.

@spencerkimball (Member Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica.go, line 4311 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It should never happen (except for the race between the two updates, but that should never happen concurrently with this code). Still good to be defensive here.

OK

// node, allowing the out-of-date replicas to be brought back up to date.
// If livenessMap is nil, liveness data will not be used, meaning no range
// will quiesce if any replicas are behind, whether or not they are live.
//
A reviewer (Member) commented:

Should mention that when the map is not nil, absence of an entry is taken as non-live (?).

The author (Member Author) replied:

Done.

	} else if progress.Match != status.Applied {
		// Skip any node in the descriptor which is not live.
		if livenessMap != nil && !livenessMap[rep.NodeID] {
			log.Infof(ctx, "skipping node %d because not live. Progress=%+v",
A reviewer (Member) commented:

Maybe I'm missing something, but wouldn't you also want to not send a message to that follower?

The author (Member Author) replied:

I don't understand what you're suggesting. What message are you referring to that we shouldn't send to the follower, and what does this bit of code have to do with the message?

But thanks for pointing to this snippet of code, because I realize the log.Infof needs to be inside an if log.V(4) block.
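For reference, a hedged sketch of that fix, assuming CockroachDB's util/log package (`log.V` / `log.Infof`); the helper and package names here are hypothetical, and the actual change simply wraps the existing call site in Replica.maybeQuiesceLocked:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// logSkippedNonLiveNode gates the message behind verbosity level 4 so
// it only appears when verbose logging is requested.
func logSkippedNonLiveNode(ctx context.Context, nodeID int32, progress interface{}) {
	if log.V(4) {
		log.Infof(ctx, "skipping node %d because not live. Progress=%+v", nodeID, progress)
	}
}
```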

@spencerkimball (Member Author)

bors r+

craig bot pushed a commit that referenced this pull request Jul 13, 2018
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball

Co-authored-by: Spencer Kimball <[email protected]>
@craig (Contributor)

craig bot commented Jul 13, 2018

Build succeeded

craig bot merged commit c47ace3 into cockroachdb:master on Jul 13, 2018
@spencerkimball deleted the quiesce-non-live branch on July 13, 2018 18:53
tbg added a commit to tbg/cockroach that referenced this pull request Jul 16, 2018
PR cockroachdb#26911 added a mechanism which during quiescence decisions ignored
nodes which (at some recent point in time) seemed non-live.
Unfortunately, the corresponding mechanism introduced to unquiesce when
the node becomes live is quite racy, at least when tormented with the
wild timings we throw at it in unit testing.  The basic discrepancy is
that unquiescing is triggered by an explicit liveness event (i.e. the kv
record being overwritten), whereas a node may seem non-live for mere
reasons of timing in between. In effect this means that quiescence is
more aggressive than unquiescence.

I explored various avenues to address this (going as far as introducing
an unquiesce message to the Raft scheduler and serializing unquiescence
through the store Raft ticker goroutine), but it's hopeless. It's an
equally idle diversion to try to adjust the test liveness durations.

So, in this commit, the liveness check becomes more lenient: even when a
node seems non-live, we count it as live if there's a healthy connection
established to it. This effectively deals with issues in tests but also
seems like a good addition for real world workloads: when the liveness
range breaks down, we don't want to aggressively quiesce.

Fixes cockroachdb#27545.
Fixes cockroachdb#27607.

Release note: None
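A minimal sketch of the more lenient check described in this commit (names are illustrative; the real code consults connection health from the RPC layer): a node counts as live for quiescence purposes if the liveness map says so or if a healthy connection to it still exists.

```go
package main

import "fmt"

type NodeID int

// nodeCountsAsLive treats a node as live if liveness says so *or* if
// we still have a healthy connection to it, so a stale liveness record
// alone does not let the range quiesce away from a reachable follower.
func nodeCountsAsLive(id NodeID, live map[NodeID]bool, connHealthy func(NodeID) bool) bool {
	return live[id] || connHealthy(id)
}

func main() {
	live := map[NodeID]bool{1: true, 2: false}         // n2's liveness record looks expired
	healthy := func(id NodeID) bool { return id == 2 } // ...but the connection to n2 is fine

	fmt.Println(nodeCountsAsLive(2, live, healthy))                            // true: not treated as dead
	fmt.Println(nodeCountsAsLive(3, live, func(NodeID) bool { return false })) // false: truly unreachable
}
```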
craig bot pushed a commit that referenced this pull request Jul 16, 2018
27617: storage: quiesce less aggressively r=nvanbenschoten,bdarnell a=tschottdorf

Co-authored-by: Tobias Schottdorf <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 24, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node being brought back
on-line, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. In
dong so through a new `lagging_quiescence` set attached to quiescence
requests, it allows all quiesced replicas to agree on which node liveness
events would need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
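A rough sketch of the mechanism this commit message describes (the type and field names are illustrative, not the actual proto definitions): the quiesce request carries the set of followers the leader believed were lagging, so every quiesced replica can decide locally whether a later liveness event requires waking the range.

```go
package main

import "fmt"

type NodeID int
type ReplicaID int

// quiesceRequest is an illustrative stand-in for the quiescence message;
// laggingFollowers records which followers the leader believed were
// behind (and dead) when it decided to quiesce.
type quiesceRequest struct {
	laggingFollowers map[ReplicaID]NodeID
}

// needsUnquiesce returns true only when the newly-live node hosts a
// follower that was recorded as lagging; liveness blips on nodes whose
// replicas were up to date are ignored.
func needsUnquiesce(req quiesceRequest, newlyLive NodeID) bool {
	for _, nodeID := range req.laggingFollowers {
		if nodeID == newlyLive {
			return true
		}
	}
	return false
}

func main() {
	req := quiesceRequest{laggingFollowers: map[ReplicaID]NodeID{3: 3}}
	fmt.Println(needsUnquiesce(req, 3)) // true: the lagging follower's node came back
	fmt.Println(needsUnquiesce(req, 2)) // false: n2 was caught up, ignore the blip
}
```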
nvanbenschoten added revised versions of this commit to nvanbenschoten/cockroach on Aug 7, Aug 18, and Aug 20, 2020 (commit message as above, later renaming the set to `lagging_followers_on_quiesce`).
craig bot pushed a commit that referenced this pull request Aug 20, 2020
51894: kvserver: don't unquiesce on liveness changes of up-to-date replicas r=nvanbenschoten a=nvanbenschoten

Mitigates the runtime regression in #50865.
Implements the proposal from #50865 (comment).

In #26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back online, we began aggressively unquiescing when nodes moved from not-live to live.

This turns out to be a destabilizing behavior. In scenarios like #50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like #51888 and #45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see #50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in #50865), so it is the category we choose to focus on here.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.

@petermattis I'm adding you as a reviewer because you've done more thinking about Range quiescence than anyone else. But I can find a new second reviewer if you don't have time to take a look at this.

Co-authored-by: Nathan VanBenschoten <[email protected]>