kvserver: nodes flapping on their liveness can stall cluster recovery operations #79266
Comments
I wonder how much of this caution is necessary after @tbg's circuit breaker work.
Even with circuit breakers, proposing a write on a range that has yet to detect unavailability is going to block for 60 seconds. I don't know that we want to open the floodgates for the replicate queue to wait 60s on each unavailable range, indefinitely. Perhaps we should be drawing a distinction between

Concretely, I'm wondering if we want the following change:

```diff
diff --git a/pkg/kv/kvserver/store_pool.go b/pkg/kv/kvserver/store_pool.go
index dc6e6b6c11..bba34c2be1 100644
--- a/pkg/kv/kvserver/store_pool.go
+++ b/pkg/kv/kvserver/store_pool.go
@@ -707,6 +707,9 @@ func (sp *StorePool) liveAndDeadReplicas(
 			// and should be used for up-replication if necessary.
 			liveReplicas = append(liveReplicas, repl)
 		case storeStatusUnknown:
+			if includeSuspectAndDrainingStores {
+				liveReplicas = append(liveReplicas, repl)
+			}
 			// No-op.
 		case storeStatusSuspect, storeStatusDraining:
 			if includeSuspectAndDrainingStores {
```
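For illustration, here is a standalone, simplified Go sketch of the classification this diff proposes; the types and names (`storeStatus`, `replica`, `classifyReplicas`) are stand-ins for this example and not the real `StorePool` code:

```go
package main

import "fmt"

// Simplified store statuses mirroring the names in the diff above.
type storeStatus int

const (
	storeStatusAvailable storeStatus = iota
	storeStatusUnknown
	storeStatusSuspect
	storeStatusDraining
	storeStatusDead
)

type replica struct {
	storeID int
	status  storeStatus
}

// classifyReplicas sketches the behavior the proposed diff changes: with
// includeSuspectAndDrainingStores set, replicas on stores whose liveness is
// merely unknown (e.g. a node flapping on its liveness) are counted as live
// instead of being silently dropped from both lists.
func classifyReplicas(
	replicas []replica, includeSuspectAndDrainingStores bool,
) (live, dead []replica) {
	for _, repl := range replicas {
		switch repl.status {
		case storeStatusAvailable:
			live = append(live, repl)
		case storeStatusUnknown:
			// The proposed change: treat unknown-liveness stores like
			// suspect/draining stores rather than ignoring them.
			if includeSuspectAndDrainingStores {
				live = append(live, repl)
			}
		case storeStatusSuspect, storeStatusDraining:
			if includeSuspectAndDrainingStores {
				live = append(live, repl)
			}
		case storeStatusDead:
			dead = append(dead, repl)
		}
	}
	return live, dead
}

func main() {
	replicas := []replica{
		{1, storeStatusAvailable},
		{2, storeStatusUnknown}, // node flapping on its liveness
		{3, storeStatusDead},
	}
	live, dead := classifyReplicas(replicas, true)
	fmt.Println(len(live), len(dead)) // 2 1
}
```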
That makes sense. I'm on board with the proposal.
Note: This PR is an alternative to, but subsumes, cockroachdb#80695.

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and on their `replicateQueue` to then rebalance them. This had a few issues:

1. Even when decommissioning a mostly empty node, the worst-case lower bound for marking that node fully decommissioned was _one full scanner interval_ (10 minutes by default).
2. If the `replicateQueue` ran into an error while rebalancing a decommissioning replica (see cockroachdb#79266 for instance), it would only retry that replica after either one full scanner interval or the purgatory interval.

This patch improves this behavior by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. This callback spins up an async task that first proactively enqueues all of the decommissioning node's ranges (that have a replica on the local node) into the local node's replicateQueues. The task then periodically nudges the decommissioning node's straggling replicas in order to requeue them (to alleviate issue 2 above). All of this is managed by a lightweight `decommissionMonitor`, which is responsible for managing the lifecycle of these async tasks.

Release note: None
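As a rough, hypothetical Go sketch of the mechanism the PR message above describes: the per-node cancel map is what makes the callback idempotent, and the ticker is what re-nudges stragglers. The interfaces, the one-minute nudge interval, and the helper names here are assumptions for illustration and do not mirror the actual implementation:

```go
package decommission

import (
	"context"
	"sync"
	"time"
)

type rangeID int

// replicateQueue is an illustrative stand-in for the real queue.
type replicateQueue interface {
	// MaybeEnqueue asks the queue to (re)consider a range.
	MaybeEnqueue(ctx context.Context, id rangeID)
}

// decommissionMonitor manages one async task per decommissioning node: the
// task first enqueues all of that node's ranges with a local replica, then
// periodically re-nudges stragglers.
type decommissionMonitor struct {
	mu    sync.Mutex
	tasks map[int]context.CancelFunc // nodeID -> cancel
}

func newDecommissionMonitor() *decommissionMonitor {
	return &decommissionMonitor{tasks: make(map[int]context.CancelFunc)}
}

// OnNodeDecommissioning is the idempotent callback, invoked every time a node
// is observed (e.g. via gossip) to be DECOMMISSIONING.
func (m *decommissionMonitor) OnNodeDecommissioning(
	nodeID int, queue replicateQueue, localRangesOf func(nodeID int) []rangeID,
) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.tasks[nodeID]; ok {
		return // a task is already running for this node
	}
	ctx, cancel := context.WithCancel(context.Background())
	m.tasks[nodeID] = cancel

	go func() {
		ticker := time.NewTicker(time.Minute) // nudge interval: an assumption
		defer ticker.Stop()
		for {
			// Proactively (re)enqueue every range of the decommissioning node
			// that still has a replica on the local node.
			for _, id := range localRangesOf(nodeID) {
				queue.MaybeEnqueue(ctx, id)
			}
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// Loop again to nudge straggling replicas.
			}
		}
	}()
}
```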
Do we expect this to be a problem in v22.1+ with liveness requests bypassing all admission queuing and admission control throttling all other operations? (mod those not yet fully integrated, like follower writes/snapshot ingestions.) In any case, the patch suggested above seems like a reasonable defense-in-depth thing to have.
@AlexTalks Do you think we should address this during stability? It seems like a small enough fix that's not too risky.
When trying to determine what action to take on a range, the allocator first determines whether the range can currently achieve quorum. Ranges found to have a majority of their replicas on non-live nodes (based on the current state of their node liveness records) are deemed unavailable and are not immediately acted upon; they are reconsidered only after a full scanner interval (10 minutes by default). This check exists because we want the replicate queue to avoid processing unavailable ranges, lest it stay blocked on a single such range until its processing timeout is hit.
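As a rough illustration of the quorum check described above (names are illustrative, not the real allocator API), a Go sketch:

```go
package allocatorsketch

// canAchieveQuorum sketches the allocator check described above: a range is
// treated as unavailable when fewer than a majority of its voting replicas
// sit on live nodes.
func canAchieveQuorum(totalVoters, liveVoters int) bool {
	quorum := totalVoters/2 + 1 // majority of voters
	return liveVoters >= quorum
}

// For example, a 3-replica range with only one replica on a live node cannot
// achieve quorum (it needs 2), so the replicate queue declines to process it
// and reconsiders it only after a full scanner interval.
```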
Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It can simply be a symptom of an oversaturated node struggling to heartbeat its liveness record. Furthermore, the presence of such nodes often calls for manual intervention: we typically try to decommission them out of the cluster to avoid instability.
Effectively, this means that operations like decommissioning these struggling nodes often stall or slow down sporadically, depending on the state of these nodes' liveness records at the time their ranges are checked by the replicate queue for processing.
Jira issue: CRDB-14665
Epic: CRDB-39952