storage: quiesce ranges which have non-live replicas #26911
Conversation
LGTM once #26908 is resolved. Reviewed 3 of 3 files at r1.

pkg/storage/replica.go, line 3998 at r1 (raw file):
Replace this comment with a discussion of the new feature and the livenessMap argument.

pkg/storage/replica.go, line 4101 at r1 (raw file):
I don't think the new code that matches the range descriptor's replica map against the raft status can replace the `foundSelf` check (I can't quite remember what that was for, but it's weird enough that it must have fixed a real bug at some point).
Force-pushed from e0cabfe to 10ee80c.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 3998 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Replace this comment with a discussion of the new feature and the livenessMap argument.
Done.
pkg/storage/replica.go, line 4101 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't think the new code that matches the range descriptor's replica map against the raft status can replace the `foundSelf` check (I can't quite remember what that was for, but it's weird enough that it must have fixed a real bug at some point).
I was assuming that because we were going through our replica descriptor, that must contain the ID in the raft status, but that was a shaky assumption, so I've restored the `foundSelf` check. Changed.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Reviewed 3 of 3 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica.go, line 4184 at r2 (raw file):
// node liveness instance invokes a callback which causes all nodes to
// wakes up any ranges containing replicas owned by the newly-live
// node, allowing the out-of-date replicas to be brought back up to date.
Mention what it means for the map to be nil.
pkg/storage/replica.go, line 4311 at r2 (raw file):
}
if progress, ok := status.Progress[uint64(rep.ReplicaID)]; !ok {
	if log.V(4) {
Add a comment describing when this can happen.
pkg/storage/store.go, line 3679 at r2 (raw file):
}

// nodeIsLiveCallback is invoked when a node transitions from non-live
This is naturally going to be racy, right? It's possible that a node comes back to life right after the liveness map has been `Store`d but before the corresponding range has processed the tick. This could create scenarios where the node is already unquiesced, this callback is called and is a no-op, and then `replica.maybeQuiesceLocked` is called and quiesces even with the newly live node out of date.
This should all be fine, I think, because the newly live node will eventually force the range to unquiesce, but I would like to see the race explicitly called out and rationalized here.
Force-pushed from 10ee80c to 7ef0794.
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/storage/replica.go, line 4184 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mention what it means for the map to be nil.
Done.
pkg/storage/replica.go, line 4311 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add a comment describing when this can happen.
I don't know of any reason this could happen (@bdarnell?). I took this step because while these two objects should mirror each other, they are separately maintained.
pkg/storage/store.go, line 3679 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is naturally going to be racy, right? It's possible that a node comes back to life right after the liveness map has been `Store`d but before the corresponding range has processed the tick. This could create scenarios where the node is already unquiesced, this callback is called and is a no-op, and then `replica.maybeQuiesceLocked` is called and quiesces even with the newly live node out of date. This should all be fine, I think, because the newly live node will eventually force the range to unquiesce, but I would like to see the race explicitly called out and rationalized here.
Done, and I've added code to make that race significantly less likely by updating the liveness map first thing in the callback.
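For illustration, here is a minimal Go sketch of the shape of that mitigation, assuming a hypothetical `Store` with an `atomic.Value` liveness map and an `unquiesceRangesWithReplicasOnNode` helper. The names and types here are illustrative stand-ins, not the actual CockroachDB API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// NodeID identifies a node; IsLiveMap maps node IDs to liveness.
// Both are simplified stand-ins for the real cockroachdb types.
type NodeID int
type IsLiveMap map[NodeID]bool

// Store is a simplified stand-in for storage.Store.
type Store struct {
	livenessMap atomic.Value // holds an IsLiveMap
}

// nodeIsLiveCallback sketches the shape of the mitigation: refresh the cached
// liveness map *before* unquiescing, which narrows the window in which a
// racing tick could re-quiesce against a stale map.
func (s *Store) nodeIsLiveCallback(nodeID NodeID, freshMap IsLiveMap) {
	// Update the cached map first thing in the callback.
	s.livenessMap.Store(freshMap)

	// Then wake up ranges with a replica on the newly-live node so that
	// lagging replicas can be caught up.
	s.unquiesceRangesWithReplicasOnNode(nodeID)
}

// unquiesceRangesWithReplicasOnNode is a placeholder for the real wake-up path.
func (s *Store) unquiesceRangesWithReplicasOnNode(nodeID NodeID) {
	fmt.Printf("unquiescing ranges with a replica on n%d\n", nodeID)
}

func main() {
	s := &Store{}
	s.livenessMap.Store(IsLiveMap{1: true, 2: false})
	s.nodeIsLiveCallback(2, IsLiveMap{1: true, 2: true})
}
```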
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/replica.go, line 4311 at r2 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
I don't know of any reason this could happen (@bdarnell?). I took this step because while these two objects should mirror each other, they are separately maintained.
It should never happen (except for the race between the two updates, but that should never happen concurrently with this code). Still good to be defensive here.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/replica.go, line 4311 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
It should never happen (except for the race between the two updates, but that should never happen concurrently with this code). Still good to be defensive here.
OK
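As a rough illustration of the defensive lookup being discussed, here is a minimal sketch. The `Status`, `Progress`, and `ReplicaDescriptor` types are simplified stand-ins for the raft and CockroachDB types, not the real code:

```go
package main

import "fmt"

// Progress mirrors the one field of raft's per-follower progress that matters
// here; a simplified stand-in, not the real raft type.
type Progress struct{ Match uint64 }

// Status is a stand-in for raft.Status: applied index plus per-replica progress.
type Status struct {
	Applied  uint64
	Progress map[uint64]Progress
}

type ReplicaDescriptor struct{ ReplicaID uint64 }

// replicaCaughtUp shows the defensive check: the range descriptor and the raft
// Progress map are maintained separately, so a replica in the descriptor may
// in principle be missing from Progress. Treat that as "not caught up" rather
// than assuming the two always mirror each other.
func replicaCaughtUp(status Status, rep ReplicaDescriptor) bool {
	progress, ok := status.Progress[rep.ReplicaID]
	if !ok {
		// Should never happen outside a race between descriptor and
		// raft-group updates, but be defensive.
		fmt.Printf("replica %d not in raft progress map; not quiescing\n", rep.ReplicaID)
		return false
	}
	return progress.Match == status.Applied
}

func main() {
	st := Status{Applied: 10, Progress: map[uint64]Progress{1: {Match: 10}}}
	fmt.Println(replicaCaughtUp(st, ReplicaDescriptor{ReplicaID: 1})) // true
	fmt.Println(replicaCaughtUp(st, ReplicaDescriptor{ReplicaID: 2})) // false: missing entry
}
```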
pkg/storage/replica.go (Outdated)
// node, allowing the out-of-date replicas to be brought back up to date.
// If livenessMap is nil, liveness data will not be used, meaning no range
// will quiesce if any replicas are behind, whether or not they are live.
//
Should mention that when the map is not nil, absence of an entry is taken as non-live (?).
Done.
pkg/storage/replica.go (Outdated)
} else if progress.Match != status.Applied {
	// Skip any node in the descriptor which is not live.
	if livenessMap != nil && !livenessMap[rep.NodeID] {
		log.Infof(ctx, "skipping node %d because not live. Progress=%+v",
Maybe I'm missing something, but wouldn't you also want to not send a message to that follower?
I don't understand what you're suggesting. What message are you referring to that we shouldn't send to the follower, and what does this bit of code have to do with the message?
But thanks for pointing to this snippet of code, because I realize the `log.Infof` needs to be inside an `if log.V(4)` block.
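To make the should-quiesce logic under discussion concrete, here is a minimal sketch with the verbose-logging gate applied. All names and types are simplified stand-ins for the real `Replica.maybeQuiesceLocked` code, and `verbose` stands in for `log.V(4)`:

```go
package main

import "fmt"

type NodeID int

type Progress struct{ Match uint64 }

type Status struct {
	Applied  uint64
	Progress map[uint64]Progress
}

type ReplicaDescriptor struct {
	ReplicaID uint64
	NodeID    NodeID
}

// verbose is a stand-in for log.V(4).
const verbose = false

// shouldQuiesce mirrors the logic in the diff: a lagging follower only blocks
// quiescence if its node is considered live. A nil livenessMap means liveness
// is not consulted, so any lagging follower blocks quiescence.
func shouldQuiesce(status Status, descriptor []ReplicaDescriptor, livenessMap map[NodeID]bool) bool {
	for _, rep := range descriptor {
		progress, ok := status.Progress[rep.ReplicaID]
		if !ok {
			return false // defensive: see the earlier discussion
		}
		if progress.Match != status.Applied {
			// Skip any lagging replica whose node is not live.
			if livenessMap != nil && !livenessMap[rep.NodeID] {
				if verbose {
					fmt.Printf("skipping node %d because not live. Progress=%+v\n",
						rep.NodeID, progress)
				}
				continue
			}
			return false // a live follower is behind; don't quiesce
		}
	}
	return true
}

func main() {
	status := Status{
		Applied:  10,
		Progress: map[uint64]Progress{1: {Match: 10}, 2: {Match: 10}, 3: {Match: 7}},
	}
	desc := []ReplicaDescriptor{{1, 1}, {2, 2}, {3, 3}}
	live := map[NodeID]bool{1: true, 2: true, 3: false}
	fmt.Println(shouldQuiesce(status, desc, live)) // true: the lagging replica is on a dead node
	fmt.Println(shouldQuiesce(status, desc, nil))  // false: without liveness info, lagging blocks quiescence
}
```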
Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires cockroachdb#26908 to function properly.

Fixes cockroachdb#9446
Force-pushed from 7ef0794 to c47ace3.
bors r+
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball

Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires #26908 to function properly.

Fixes #9446

Co-authored-by: Spencer Kimball <[email protected]>
Build succeeded
PR cockroachdb#26911 added a mechanism which, during quiescence decisions, ignored nodes which (at some recent point in time) seemed non-live. Unfortunately, the corresponding mechanism introduced to unquiesce when the node becomes live is quite racy, at least when tormented with the wild timings we throw at it in unit testing. The basic discrepancy is that unquiescing is triggered by an explicit liveness event (i.e. the kv record being overwritten), whereas a node may seem non-live for mere reasons of timing in between. In effect this means that quiescence is more aggressive than unquiescence.

I explored various avenues to address this (going as far as introducing an unquiesce message to the Raft scheduler and serializing unquiescence through the store Raft ticker goroutine), but it's hopeless. It's an equally idle diversion to try to adjust the test liveness durations.

So, in this commit, the liveness check becomes more lenient: even when a node seems non-live, we count it as live if there's a healthy connection established to it. This effectively deals with issues in tests but also seems like a good addition for real-world workloads: when the liveness range breaks down, we don't want to aggressively quiesce.

Fixes cockroachdb#27545.
Fixes cockroachdb#27607.

Release note: None
27617: storage: quiesce less aggressively r=nvanbenschoten,bdarnell a=tschottdorf

PR #26911 added a mechanism which, during quiescence decisions, ignored nodes which (at some recent point in time) seemed non-live. Unfortunately, the corresponding mechanism introduced to unquiesce when the node becomes live is quite racy, at least when tormented with the wild timings we throw at it in unit testing. The basic discrepancy is that unquiescing is triggered by an explicit liveness event (i.e. the kv record being overwritten), whereas a node may seem non-live for mere reasons of timing in between. In effect this means that quiescence is more aggressive than unquiescence.

I explored various avenues to address this (going as far as introducing an unquiesce message to the Raft scheduler and serializing unquiescence through the store Raft ticker goroutine), but it's hopeless. It's an equally idle diversion to try to adjust the test liveness durations.

So, in this commit, the liveness check becomes more lenient: even when a node seems non-live, we count it as live if there's a healthy connection established to it. This effectively deals with issues in tests but also seems like a good addition for real-world workloads: when the liveness range breaks down, we don't want to aggressively quiesce.

Fixes #27545.
Fixes #27607.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
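A minimal sketch of the more lenient rule described in that follow-up, assuming a hypothetical `connHealthy` helper that stands in for the real RPC-connection health check:

```go
package main

import "fmt"

type NodeID int

// connHealthy is a placeholder for the real check against the node's RPC
// connection health; here it is just a map lookup for illustration.
func connHealthy(healthyConns map[NodeID]bool, id NodeID) bool {
	return healthyConns[id]
}

// consideredLiveForQuiescence implements the lenient rule: treat a node as
// live if either the liveness map says so or a healthy connection to it
// exists, so quiescence is not overly aggressive when liveness is flaky.
func consideredLiveForQuiescence(livenessMap, healthyConns map[NodeID]bool, id NodeID) bool {
	return livenessMap[id] || connHealthy(healthyConns, id)
}

func main() {
	liveness := map[NodeID]bool{1: true, 2: false}
	conns := map[NodeID]bool{2: true} // the liveness record is stale, but the connection is fine
	fmt.Println(consideredLiveForQuiescence(liveness, conns, 2)) // true
}
```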
Mitigates the runtime regression in cockroachdb#50865. Implements the proposal from cockroachdb#50865 (comment). Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back on-line, we began aggressively unquiescing when nodes moved from not-live to live. This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored. This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This commit deliberately does not worry about the migration of this change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.
Mitigates the runtime regression in cockroachdb#50865. Implements the proposal from cockroachdb#50865 (comment). Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back on-line, we began aggressively unquiescing when nodes moved from not-live to live. This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored. This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this change for clarity. The next commit adds in the migration.
Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.
Mitigates the runtime regression in cockroachdb#50865. Implements the proposal from cockroachdb#50865 (comment). Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back on-line, we began aggressively unquiescing when nodes moved from not-live to live. This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_followers_on_quiesce` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored. This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this change for clarity. The next commit adds in the migration.
Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.
51894: kvserver: don't unquiesce on liveness changes of up-to-date replicas r=nvanbenschoten a=nvanbenschoten

Mitigates the runtime regression in #50865. Implements the proposal from #50865 (comment).

In #26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back on-line, we began aggressively unquiescing when nodes moved from not-live to live. This turns out to be a destabilizing behavior. In scenarios like #50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like #51888 and #45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored. This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see #50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in #50865), so it is the category we choose to focus on here.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.

@petermattis I'm adding you as a reviewer because you've done more thinking about Range quiescence than anyone else. But I can find a new second reviewer if you don't have time to take a look at this.

Co-authored-by: Nathan VanBenschoten <[email protected]>
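As a rough illustration of that later mechanism, here is a sketch of a quiescence notification carrying the set of lagging followers. The field name follows the commit message, but the surrounding types are simplified stand-ins rather than the real protobuf or kvserver code:

```go
package main

import "fmt"

type NodeID int

// QuiesceRequest sketches a quiescence notification that carries the set of
// followers known to be lagging (and non-live) at quiescence time, so every
// quiesced replica agrees on which later liveness events matter.
type QuiesceRequest struct {
	LaggingFollowersOnQuiesce map[NodeID]bool
}

// shouldUnquiesceOnLivenessChange: a quiesced replica only needs to wake up
// when a node that was lagging at quiescence time becomes live; liveness
// blips on up-to-date nodes can be ignored.
func shouldUnquiesceOnLivenessChange(req QuiesceRequest, newlyLive NodeID) bool {
	return req.LaggingFollowersOnQuiesce[newlyLive]
}

func main() {
	req := QuiesceRequest{LaggingFollowersOnQuiesce: map[NodeID]bool{3: true}}
	fmt.Println(shouldUnquiesceOnLivenessChange(req, 2)) // false: n2 was up to date at quiescence
	fmt.Println(shouldUnquiesceOnLivenessChange(req, 3)) // true: n3 was lagging, catch it up
}
```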
Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires #26908 to function properly.

Fixes #9446
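For reference, a minimal sketch of the mechanism this description outlines: refresh a liveness map once per Raft ticker iteration and hand it to each replica's tick. All types and method names here are simplified stand-ins for the real `NodeLiveness`, `Store`, and `Replica` APIs:

```go
package main

import (
	"fmt"
	"time"
)

type NodeID int
type IsLiveMap map[NodeID]bool

// nodeLiveness is a stand-in for the real NodeLiveness object.
type nodeLiveness struct{ live IsLiveMap }

func (nl *nodeLiveness) GetIsLiveMap() IsLiveMap { return nl.live }

// replica is a stand-in for a storage Replica.
type replica struct{ id int }

// tick receives the freshly refreshed liveness map so the leader can ignore
// non-live followers when deciding whether to quiesce.
func (r *replica) tick(livenessMap IsLiveMap) {
	fmt.Printf("r%d ticked with liveness %v\n", r.id, livenessMap)
}

// raftTickLoop refreshes the liveness map once per tick interval and passes
// it to every replica, mirroring the shape of the change described above.
func raftTickLoop(nl *nodeLiveness, replicas []*replica, ticks int) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < ticks; i++ {
		<-ticker.C
		livenessMap := nl.GetIsLiveMap() // refreshed once per loop iteration
		for _, r := range replicas {
			r.tick(livenessMap)
		}
	}
}

func main() {
	nl := &nodeLiveness{live: IsLiveMap{1: true, 2: false}}
	raftTickLoop(nl, []*replica{{id: 1}, {id: 2}}, 1)
}
```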