kvserver: nodes flapping on their liveness can stall cluster recovery operations #79266

aayushshah15 opened this issue Apr 1, 2022 · 5 comments

@aayushshah15
Contributor

aayushshah15 commented Apr 1, 2022

When trying to determine what action to take on a range, the allocator first checks whether the range can currently achieve quorum. Ranges for which a majority of replicas are on non-live nodes (based on the current state of their node liveness records) are deemed unavailable and are not immediately acted upon; they are only reconsidered after a full scanner interval, which is 10 minutes by default. This check exists because we want the replicate queue to avoid processing unavailable ranges, lest it stay blocked on one such range until its processing timeout is hit.
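
As a rough illustration of the quorum check described above (canAchieveQuorum is a hypothetical helper, not the actual allocator code), a range is only acted on if a majority of its replicas sit on live nodes:

// Hypothetical sketch; not CockroachDB's actual allocator code.
package sketch

// canAchieveQuorum reports whether a range with totalReplicas voters, of
// which liveReplicas are on live nodes, can still reach majority quorum.
func canAchieveQuorum(totalReplicas, liveReplicas int) bool {
    quorum := totalReplicas/2 + 1
    return liveReplicas >= quorum
}

For a 3-replica range with two replicas on non-live nodes, canAchieveQuorum(3, 1) returns false, so the range is skipped and only reconsidered on the next scanner pass.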

Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness. Furthermore, the presence of such nodes often calls for manual intervention, where operators try to decommission them out of the cluster to avoid instability.

Effectively, this means that operations like decommissioning these struggling nodes can stall or slow down sporadically, depending on the state of those nodes' liveness records at the time the replicate queue checks their ranges for processing.

Jira issue: CRDB-14665

Epic CRDB-39952

@aayushshah15 added the C-bug, A-kv-distribution, and A-kv-recovery labels Apr 1, 2022
@aayushshah15
Contributor Author

This check exists because we want the replicate queue to avoid processing unavailable ranges, lest it stay blocked on one such range until its processing timeout is hit.

I wonder how much of this caution is necessary after @tbg's circuit breaker work.

@blathers-crl bot added the T-kv label Apr 3, 2022
@nvanbenschoten
Member

Even with circuit breakers, proposing a write on a range that has yet to detect unavailability is going to block for 60 seconds. I don’t know that we want to open the floodgates for the replicate queue to wait 60s on each unavailable range, indefinitely.

Perhaps we should be drawing a distinction between DEAD nodes and UNAVAILABLE nodes (in the node liveness sense). If a node has been offline for 5+ minutes, there's very little chance that it will contribute to a timely consensus vote. However, if it's been offline for less than server.time_until_store_dead, it could be flapping. By drawing this distinction in liveAndDeadReplicas (which currently ignores storeStatusUnknown), we could cap the degree of disruption that a dead node could cause to the replicateQueue (up to 5 minutes of blocking) while ensuring that it still attempts to remove replicas from flapping nodes.

Concretely, I'm wondering if we want the following change:

diff --git a/pkg/kv/kvserver/store_pool.go b/pkg/kv/kvserver/store_pool.go
index dc6e6b6c11..bba34c2be1 100644
--- a/pkg/kv/kvserver/store_pool.go
+++ b/pkg/kv/kvserver/store_pool.go
@@ -707,6 +707,9 @@ func (sp *StorePool) liveAndDeadReplicas(
                        // and should be used for up-replication if necessary.
                        liveReplicas = append(liveReplicas, repl)
                case storeStatusUnknown:
+                       if includeSuspectAndDrainingStores {
+                               liveReplicas = append(liveReplicas, repl)
+                       }
                // No-op.
                case storeStatusSuspect, storeStatusDraining:
                        if includeSuspectAndDrainingStores {
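
As a rough sketch of the DEAD vs. UNAVAILABLE distinction proposed above (classifyNode and the nodeState values are illustrative names; the real logic lives in liveAndDeadReplicas and uses the store-status values shown in the diff):

// Hypothetical sketch; not CockroachDB's actual store-status code.
package sketch

import "time"

type nodeState int

const (
    nodeLive        nodeState = iota
    nodeUnavailable // liveness expired, but more recently than timeUntilStoreDead; possibly just flapping
    nodeDead        // liveness expired longer ago than timeUntilStoreDead
)

// classifyNode assumes timeSinceExpiration is how long ago the node's
// liveness record expired (zero or negative means it is still live) and that
// timeUntilStoreDead plays the role of server.time_until_store_dead
// (5 minutes by default).
func classifyNode(timeSinceExpiration, timeUntilStoreDead time.Duration) nodeState {
    switch {
    case timeSinceExpiration <= 0:
        return nodeLive
    case timeSinceExpiration < timeUntilStoreDead:
        return nodeUnavailable
    default:
        return nodeDead
    }
}

Under this framing, only dead nodes' replicas would be excluded from the live set unconditionally; unavailable (possibly flapping) nodes' replicas could still be treated as live for removal decisions, capping the disruption a truly dead node can cause the replicateQueue to about 5 minutes of blocking.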

@aayushshah15
Contributor Author

By drawing this distinction in liveAndDeadReplicas (which currently ignores storeStatusUnknown), we could cap the degree of disruption that a dead node could cause to the replicateQueue (up to 5 minutes of blocking)

That makes sense. I'm onboard with the proposal.

aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue May 2, 2022
Note: This PR is an alternative to, but subsumes,
cockroachdb#80695.

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This had a few issues:
1. It meant that even when decommissioning a mostly empty node, our
   worst case lower bound for marking that node fully decommissioned was
   _one full scanner interval_ (which is 10 minutes by default).
2. If the replicateQueue ran into an error while rebalancing a
   decommissioning replica (see cockroachdb#79266 for instance), it would only
   retry that replica after either one full scanner interval or after
   the purgatory interval.

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
This callback spins up an async task that will first proactively enqueue
all of the decommissioning node's ranges (that have a replica on the
local node) into the local node's replicateQueues. Then, this task will
periodically nudge the decommissioning node's straggling replicas in
order to requeue them (to alleviate (2) from above).

All this is managed by a lightweight `decommissionMonitor`, which is
responsible for managing the lifecycle of these async tasks.

Release note: None
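
A minimal, hypothetical sketch of the callback pattern the commit message describes (the decommissionMonitor fields and helper names here are illustrative, not the actual CockroachDB APIs):

// Hypothetical sketch of reacting to a DECOMMISSIONING node by proactively
// enqueueing its ranges instead of waiting for the replica scanner.
package sketch

import (
    "context"
    "time"
)

type nodeID int
type rangeID int

type decommissionMonitor struct {
    // localReplicasFor returns the ranges on this store that also have a
    // replica on the given (decommissioning) node.
    localReplicasFor func(nodeID) []rangeID
    // enqueue hands a range to the local replicate queue.
    enqueue func(rangeID)
    // isDecommissioned reports whether the node has finished decommissioning.
    isDecommissioned func(nodeID) bool
}

// onNodeDecommissioning is invoked when gossip reports that a node has
// entered the DECOMMISSIONING state. It enqueues the node's ranges
// immediately and then periodically re-nudges straggling replicas. (The real
// callback is also made idempotent, which this sketch elides.)
func (dm *decommissionMonitor) onNodeDecommissioning(ctx context.Context, n nodeID) {
    go func() {
        ticker := time.NewTicker(time.Minute) // re-nudge interval; illustrative only
        defer ticker.Stop()
        for {
            for _, r := range dm.localReplicasFor(n) {
                dm.enqueue(r)
            }
            if dm.isDecommissioned(n) {
                return
            }
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
            }
        }
    }()
}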
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue May 3, 2022 (same commit message as above)
@irfansharif
Contributor

Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness.

Do we expect this to be a problem in v22.1+, with liveness requests bypassing all admission queuing and admission control throttling all other operations? (Modulo operations not yet fully integrated, like follower writes/snapshot ingestions.) In any case, the patch suggested above seems like a reasonable defense-in-depth thing to have.

@aayushshah15
Contributor Author

@AlexTalks Do you think we should address this during stability? It seems like a small enough fix that's not too risky.
