kvserver: nodes flapping on their liveness can stall cluster recovery operations #79266

aayushshah15 opened this issue Apr 1, 2022 · 5 comments

@aayushshah15
Contributor

aayushshah15 commented Apr 1, 2022

When trying to determine what action to take on a range, the allocator first checks whether the range can currently achieve quorum. Ranges for which a majority of replicas are on non-live nodes (based on the current state of their node liveness records) are deemed unavailable and are not immediately acted upon; they are only reconsidered after a full scanner interval, which is 10 minutes by default. This check exists because we want the replicate queue to avoid processing unavailable ranges, lest it stay blocked on one such range until its processing timeout is hit.
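
As a rough illustration of the quorum check described above (canAchieveQuorum is a hypothetical helper, not the actual allocator code), a range is only acted on if a majority of its replicas sit on live nodes:

// Hypothetical sketch; not CockroachDB's actual allocator code.
package sketch

// canAchieveQuorum reports whether a range with totalReplicas voters, of
// which liveReplicas are on live nodes, can still reach majority quorum.
func canAchieveQuorum(totalReplicas, liveReplicas int) bool {
    quorum := totalReplicas/2 + 1
    return liveReplicas >= quorum
}

For a 3-replica range with two replicas on non-live nodes, canAchieveQuorum(3, 1) returns false, so the range is skipped and only reconsidered on the next scanner pass.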

Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness. Furthermore, the presence of such nodes often calls for manual intervention, where operators try to decommission them out of the cluster to avoid instability.

Effectively, this means that operations like decommissioning these struggling nodes can stall or slow down sporadically, depending on the state of those nodes' liveness records at the time the replicate queue checks their ranges for processing.

Jira issue: CRDB-14665

Epic CRDB-39952

@aayushshah15 added the C-bug, A-kv-distribution, and A-kv-recovery labels Apr 1, 2022
@aayushshah15
Contributor Author

This check exists because we want the replicate queue to avoid processing unavailable ranges, lest it stay blocked on one such range until its processing timeout is hit.

I wonder how much of this caution is necessary after @tbg's circuit breaker work.

@blathers-crl bot added the T-kv label Apr 3, 2022
@nvanbenschoten
Member

Even with circuit breakers, proposing a write on a range that has yet to detect unavailability is going to block for 60 seconds. I don’t know that we want to open the floodgates for the replicate queue to wait 60s on each unavailable range, indefinitely.

Perhaps we should be drawing a distinction between DEAD nodes and UNAVAILABLE nodes (in the node liveness sense). If a node has been offline for 5+ minutes, there's very little chance that it will contribute to a timely consensus vote. However, if it's been offline for less than server.time_until_store_dead, it could be flapping. By drawing this distinction in liveAndDeadReplicas (which currently ignores storeStatusUnknown), we could cap the degree of disruption that a dead node could cause to the replicateQueue (up to 5 minutes of blocking) while ensuring that it still attempts to remove replicas from flapping nodes.

Concretely, I'm wondering if we want the following change:

diff --git a/pkg/kv/kvserver/store_pool.go b/pkg/kv/kvserver/store_pool.go
index dc6e6b6c11..bba34c2be1 100644
--- a/pkg/kv/kvserver/store_pool.go
+++ b/pkg/kv/kvserver/store_pool.go
@@ -707,6 +707,9 @@ func (sp *StorePool) liveAndDeadReplicas(
                        // and should be used for up-replication if necessary.
                        liveReplicas = append(liveReplicas, repl)
                case storeStatusUnknown:
+                       if includeSuspectAndDrainingStores {
+                               liveReplicas = append(liveReplicas, repl)
+                       }
                // No-op.
                case storeStatusSuspect, storeStatusDraining:
                        if includeSuspectAndDrainingStores {
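
As a rough sketch of the DEAD vs. UNAVAILABLE distinction proposed above (classifyNode and the nodeState values are illustrative names; the real logic lives in liveAndDeadReplicas and uses the store-status values shown in the diff):

// Hypothetical sketch; not CockroachDB's actual store-status code.
package sketch

import "time"

type nodeState int

const (
    nodeLive        nodeState = iota
    nodeUnavailable // liveness expired, but more recently than timeUntilStoreDead; possibly just flapping
    nodeDead        // liveness expired longer ago than timeUntilStoreDead
)

// classifyNode assumes timeSinceExpiration is how long ago the node's
// liveness record expired (zero or negative means it is still live) and that
// timeUntilStoreDead plays the role of server.time_until_store_dead
// (5 minutes by default).
func classifyNode(timeSinceExpiration, timeUntilStoreDead time.Duration) nodeState {
    switch {
    case timeSinceExpiration <= 0:
        return nodeLive
    case timeSinceExpiration < timeUntilStoreDead:
        return nodeUnavailable
    default:
        return nodeDead
    }
}

Under this framing, only dead nodes' replicas would be excluded from the live set unconditionally; unavailable (possibly flapping) nodes' replicas could still be treated as live for removal decisions, capping the disruption a truly dead node can cause the replicateQueue to about 5 minutes of blocking.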

@aayushshah15
Contributor Author

By drawing this distinction in liveAndDeadReplicas (which currently ignores storeStatusUnknown), we could cap the degree of disruption that a dead node could cause to the replicateQueue (up to 5 minutes of blocking)

That makes sense. I'm onboard with the proposal.

aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue May 2, 2022
Note: This PR is an alternative to, but subsumes,
cockroachdb#80695.

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This had a few issues:
1. It meant that even when decommissioning a mostly empty node, our
   worst case lower bound for marking that node fully decommissioned was
   _one full scanner interval_ (which is 10 minutes by default).
2. If the replicateQueue ran into an error while rebalancing a
   decommissioning replica (see cockroachdb#79266 for instance), it would only
   retry that replica after either one full scanner interval or after
   the purgatory interval.

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
This callback spins up an async task that will first proactively enqueue
all of the decommissioning node's ranges (that have a replica on the
local node) into the local node's replicateQueues. Then, this task will
periodically nudge the decommissioning node's straggling replicas in
order to requeue them (to alleviate (2) from above).

All this is managed by a lightweight `decommissionMonitor`, which is
responsible for managing the lifecycle of these async tasks.

Release note: None
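
A minimal, hypothetical sketch of the callback pattern the commit message describes (the decommissionMonitor fields and helper names here are illustrative, not the actual CockroachDB APIs):

// Hypothetical sketch of reacting to a DECOMMISSIONING node by proactively
// enqueueing its ranges instead of waiting for the replica scanner.
package sketch

import (
    "context"
    "time"
)

type nodeID int
type rangeID int

type decommissionMonitor struct {
    // localReplicasFor returns the ranges on this store that also have a
    // replica on the given (decommissioning) node.
    localReplicasFor func(nodeID) []rangeID
    // enqueue hands a range to the local replicate queue.
    enqueue func(rangeID)
    // isDecommissioned reports whether the node has finished decommissioning.
    isDecommissioned func(nodeID) bool
}

// onNodeDecommissioning is invoked when gossip reports that a node has
// entered the DECOMMISSIONING state. It enqueues the node's ranges
// immediately and then periodically re-nudges straggling replicas. (The real
// callback is also made idempotent, which this sketch elides.)
func (dm *decommissionMonitor) onNodeDecommissioning(ctx context.Context, n nodeID) {
    go func() {
        ticker := time.NewTicker(time.Minute) // re-nudge interval; illustrative only
        defer ticker.Stop()
        for {
            for _, r := range dm.localReplicasFor(n) {
                dm.enqueue(r)
            }
            if dm.isDecommissioned(n) {
                return
            }
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
            }
        }
    }()
}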
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue May 3, 2022 (same commit message as above)
@irfansharif
Contributor

Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness.

Do we expect this to be a problem in v22.1+, with liveness requests bypassing all admission queuing and admission control throttling all other operations? (Modulo operations not yet fully integrated, like follower writes/snapshot ingestions.) In any case, the patch suggested above seems like a reasonable defense-in-depth thing to have.

@aayushshah15
Contributor Author

@AlexTalks Do you think we should address this during stability? It seems like a small enough fix that's not too risky.
