server: introduce a decommission monitor task #80695

Conversation

aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Apr 28, 2022

NOTE: This is a WIP. Don't review / look yet.

This commit introduces a `decommissionMonitor` that is responsible for a few
key things:

- When a node begins decommissioning, its `decommissionMonitor` is spun up,
which proactively tells other nodes in the system to enqueue its ranges into
the `replicateQueue`. This means that a decommission process no longer has a
worst-case lower bound of 10 minutes (i.e. the default replica scanner
interval). A minimal sketch of this proactive enqueue step follows this list.

- In a future patch, this `decommissionMonitor` will selectively nudge some of
its straggling replicas' ranges to be re-enqueued into their leaseholder
store's `replicateQueue`s. This will be done with the intention of collecting
and persisting (or dumping, initially) traces from these enqueue operations.
This should help reduce the time it takes to root-cause (RCA) a slow
decommission process.
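
To make the first bullet concrete, here is a minimal, self-contained Go sketch
of the proactive enqueue step. Everything in it (`store`, `replicateQueue`,
`maybeAddAsync`, `rangesWithReplicaOn`, `onDecommissioning`) is a placeholder
invented for illustration, not CockroachDB's actual types or APIs; the only
point it shows is that the enqueue happens eagerly, rather than waiting for the
replica scanner to rediscover the affected ranges.

```go
package main

import (
	"context"
	"fmt"
)

// replicateQueue stands in for a store's replicate queue.
type replicateQueue struct{}

// maybeAddAsync asynchronously considers a range for rebalancing.
// In this sketch it just logs the range ID.
func (q *replicateQueue) maybeAddAsync(ctx context.Context, rangeID int64) {
	fmt.Printf("enqueued r%d into the local replicate queue\n", rangeID)
}

// store is a placeholder for the small slice of store state this sketch needs.
type store struct {
	queue *replicateQueue
	// rangesWithReplicaOn returns the IDs of locally held ranges that also
	// have a replica on the given node.
	rangesWithReplicaOn func(nodeID int32) []int64
}

// onDecommissioning is the proactive reaction to learning (e.g. via gossip)
// that a node is decommissioning: enqueue every affected range into the local
// replicate queue instead of waiting for the periodic replica scanner.
func onDecommissioning(ctx context.Context, s *store, decomNodeID int32) {
	for _, rangeID := range s.rangesWithReplicaOn(decomNodeID) {
		s.queue.maybeAddAsync(ctx, rangeID)
	}
}

func main() {
	s := &store{
		queue: &replicateQueue{},
		rangesWithReplicaOn: func(nodeID int32) []int64 {
			return []int64{12, 47, 103} // pretend these ranges overlap the node
		},
	}
	onDecommissioning(context.Background(), s, 4)
}
```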

Comparing the time taken to decommission a node out of a 6-node cluster
containing a 200-warehouse TPC-C dataset:

With this patch:
```
time rp ssh aayushs-test:1 -- "./cockroach node decommission --insecure --self";
...
roachprod ssh aayushs-test:1 --   0.07s user 0.07s system 0% cpu 59.125 total
```

Without this patch:
```
time rp ssh aayushs-test-wo:1 -- "./cockroach node decommission --insecure --self";
...
roachprod ssh aayushs-test-wo:1 --   0.17s user 0.15s system 0% cpu 3:47.34 total
```

Resolves #79453

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20220427_proactiveDecommission branch from b77e9ea to 5d2d33b on April 28, 2022 08:38
@aayushshah15 aayushshah15 force-pushed the 20220427_proactiveDecommission branch from 5d2d33b to 5a7969f on April 28, 2022 16:35
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request May 2, 2022
Note: This PR is an alternative to, but subsumes,
cockroachdb#80695.

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This had a few issues:
1. It meant that even when decommissioning a mostly empty node, the
   worst-case time to mark that node fully decommissioned was bounded below
   by _one full scanner interval_ (which is 10 minutes by default).
2. If the `replicateQueue` ran into an error while rebalancing a
   decommissioning replica (see cockroachdb#79266 for instance), it would only
   retry that replica after either one full scanner interval or the
   purgatory interval.

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
This callback spins up an async task that will first proactively enqueue
all of the decommissioning node's ranges (that have a replica on the
local node) into the local node's `replicateQueue`s. Then, this task will
periodically nudge the decommissioning node's straggling replicas in
order to requeue them (to alleviate (2) from above).

All of this is coordinated by a lightweight `decommissionMonitor`, which
manages the lifecycle of these async tasks.
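
As a rough illustration of that lifecycle, below is a minimal, self-contained
Go sketch of an async task that periodically re-nudges straggling ranges until
none remain or the task is cancelled. The names (`decommissionMonitor`,
`stragglingRanges`, `nudge`) and the ticker-based loop are assumptions made for
this example, not the actual implementation in this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// decommissionMonitor manages one async task per decommissioning node and
// cancels it once the node is fully decommissioned (or recommissioned).
type decommissionMonitor struct {
	cancel context.CancelFunc
	// stragglingRanges returns ranges that still have a replica on the
	// decommissioning node.
	stragglingRanges func(nodeID int32) []int64
	// nudge re-enqueues a range into its leaseholder store's replicate queue.
	nudge func(ctx context.Context, rangeID int64)
}

// start spins up the async nudging task for decomNodeID. In the real,
// idempotent callback a second call for the same node would be a no-op;
// that bookkeeping is omitted here.
func (m *decommissionMonitor) start(ctx context.Context, decomNodeID int32, interval time.Duration) {
	ctx, m.cancel = context.WithCancel(ctx)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				stragglers := m.stragglingRanges(decomNodeID)
				if len(stragglers) == 0 {
					return // node is fully drained of replicas
				}
				for _, rangeID := range stragglers {
					m.nudge(ctx, rangeID)
				}
			}
		}
	}()
}

// stop tears the task down, e.g. when the node finishes decommissioning.
func (m *decommissionMonitor) stop() { m.cancel() }

func main() {
	m := &decommissionMonitor{
		stragglingRanges: func(nodeID int32) []int64 { return []int64{12, 47} },
		nudge: func(ctx context.Context, rangeID int64) {
			fmt.Printf("re-enqueued straggling range r%d\n", rangeID)
		},
	}
	m.start(context.Background(), 4, 50*time.Millisecond)
	time.Sleep(200 * time.Millisecond)
	m.stop()
}
```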

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request May 3, 2022
@aayushshah15
Contributor Author

Closing in favor of #80993
