server: introduce a decommission monitor task #80695

Conversation

aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Apr 28, 2022

NOTE: This is a WIP. Don't review / look yet.

This commit introduces a `decommissionMonitor` that is responsible for a few
key things:

- When a node begins decommissioning, its `decommissionMonitor` is spun up,
which proactively tells other nodes in the system to enqueue its ranges into
the `replicateQueue`. This means that a decommission process no longer has a
worst-case lower bound of 10 minutes (i.e. the default replica scanner
interval). A minimal sketch of this proactive enqueue step follows this list.

- In a future patch, this `decommissionMonitor` will selectively nudge some of
its straggling replicas' ranges to be re-enqueued into their leaseholder
store's `replicateQueue`s. This will be done with the intention of collecting
and persisting (or dumping, initially) traces from these enqueue operations.
This should help reduce the time it takes to root-cause (RCA) a slow
decommission process.
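
To make the first bullet concrete, here is a minimal, self-contained Go sketch
of the proactive enqueue step. Everything in it (`store`, `replicateQueue`,
`maybeAddAsync`, `rangesWithReplicaOn`, `onDecommissioning`) is a placeholder
invented for illustration, not CockroachDB's actual types or APIs; the only
point it shows is that the enqueue happens eagerly, rather than waiting for the
replica scanner to rediscover the affected ranges.

```go
package main

import (
	"context"
	"fmt"
)

// replicateQueue stands in for a store's replicate queue.
type replicateQueue struct{}

// maybeAddAsync asynchronously considers a range for rebalancing.
// In this sketch it just logs the range ID.
func (q *replicateQueue) maybeAddAsync(ctx context.Context, rangeID int64) {
	fmt.Printf("enqueued r%d into the local replicate queue\n", rangeID)
}

// store is a placeholder for the small slice of store state this sketch needs.
type store struct {
	queue *replicateQueue
	// rangesWithReplicaOn returns the IDs of locally held ranges that also
	// have a replica on the given node.
	rangesWithReplicaOn func(nodeID int32) []int64
}

// onDecommissioning is the proactive reaction to learning (e.g. via gossip)
// that a node is decommissioning: enqueue every affected range into the local
// replicate queue instead of waiting for the periodic replica scanner.
func onDecommissioning(ctx context.Context, s *store, decomNodeID int32) {
	for _, rangeID := range s.rangesWithReplicaOn(decomNodeID) {
		s.queue.maybeAddAsync(ctx, rangeID)
	}
}

func main() {
	s := &store{
		queue: &replicateQueue{},
		rangesWithReplicaOn: func(nodeID int32) []int64 {
			return []int64{12, 47, 103} // pretend these ranges overlap the node
		},
	}
	onDecommissioning(context.Background(), s, 4)
}
```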

Comparing the time taken to decommission a node out of a 6-node cluster
containing a 200-warehouse TPC-C dataset:

With this patch:
```
time rp ssh aayushs-test:1 -- "./cockroach node decommission --insecure --self";
...
roachprod ssh aayushs-test:1 --   0.07s user 0.07s system 0% cpu 59.125 total
```

Without this patch:
```
time rp ssh aayushs-test-wo:1 -- "./cockroach node decommission --insecure --self";
...
roachprod ssh aayushs-test-wo:1 --   0.17s user 0.15s system 0% cpu 3:47.34 total
```

Resolves #79453

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20220427_proactiveDecommission branch from b77e9ea to 5d2d33b on April 28, 2022 08:38
@aayushshah15 aayushshah15 force-pushed the 20220427_proactiveDecommission branch from 5d2d33b to 5a7969f on April 28, 2022 16:35
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request May 2, 2022
Note: This PR is an alternative to, but subsumes,
cockroachdb#80695.

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This had a few issues:
1. It meant that even when decommissioning a mostly empty node, the
   worst-case time to mark that node fully decommissioned was bounded below
   by _one full scanner interval_ (which is 10 minutes by default).
2. If the `replicateQueue` ran into an error while rebalancing a
   decommissioning replica (see cockroachdb#79266 for instance), it would only
   retry that replica after either one full scanner interval or the
   purgatory interval.

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
This callback spins up an async task that will first proactively enqueue
all of the decommissioning node's ranges (that have a replica on the
local node) into the local node's `replicateQueue`s. Then, this task will
periodically nudge the decommissioning node's straggling replicas in
order to requeue them (to alleviate (2) from above).

All of this is coordinated by a lightweight `decommissionMonitor`, which
manages the lifecycle of these async tasks.
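
As a rough illustration of that lifecycle, below is a minimal, self-contained
Go sketch of an async task that periodically re-nudges straggling ranges until
none remain or the task is cancelled. The names (`decommissionMonitor`,
`stragglingRanges`, `nudge`) and the ticker-based loop are assumptions made for
this example, not the actual implementation in this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// decommissionMonitor manages one async task per decommissioning node and
// cancels it once the node is fully decommissioned (or recommissioned).
type decommissionMonitor struct {
	cancel context.CancelFunc
	// stragglingRanges returns ranges that still have a replica on the
	// decommissioning node.
	stragglingRanges func(nodeID int32) []int64
	// nudge re-enqueues a range into its leaseholder store's replicate queue.
	nudge func(ctx context.Context, rangeID int64)
}

// start spins up the async nudging task for decomNodeID. In the real,
// idempotent callback a second call for the same node would be a no-op;
// that bookkeeping is omitted here.
func (m *decommissionMonitor) start(ctx context.Context, decomNodeID int32, interval time.Duration) {
	ctx, m.cancel = context.WithCancel(ctx)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				stragglers := m.stragglingRanges(decomNodeID)
				if len(stragglers) == 0 {
					return // node is fully drained of replicas
				}
				for _, rangeID := range stragglers {
					m.nudge(ctx, rangeID)
				}
			}
		}
	}()
}

// stop tears the task down, e.g. when the node finishes decommissioning.
func (m *decommissionMonitor) stop() { m.cancel() }

func main() {
	m := &decommissionMonitor{
		stragglingRanges: func(nodeID int32) []int64 { return []int64{12, 47} },
		nudge: func(ctx context.Context, rangeID int64) {
			fmt.Printf("re-enqueued straggling range r%d\n", rangeID)
		},
	}
	m.start(context.Background(), 4, 50*time.Millisecond)
	time.Sleep(200 * time.Millisecond)
	m.stop()
}
```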

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request May 3, 2022
@aayushshah15
Contributor Author

Closing in favor of #80993
