release-22.1: server: react to decommissioning nodes by proactively enqueuing their replicas #82680
Conversation
Release note: None
… replicas

Note: This patch implements a subset of cockroachdb#80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and on their `replicateQueue` to then rebalance them. This meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was _one full scanner interval_ (10 minutes by default).

This patch improves on that by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. When run, the callback enqueues all replicas on the local stores that belong to ranges that also have replicas on the decommissioning node.

Release note (performance improvement): Decommissioning should now be substantially faster, particularly for small to moderately loaded nodes.
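For illustration, here is a hedged Go sketch of what that callback does, under assumed names (`Store`, `Replica`, and `EnqueueDecommissioningReplicas` are hypothetical, not CockroachDB's actual types): each local store walks its replicas and enqueues the ones whose ranges also have a replica on the decommissioning node.

```go
// A minimal, self-contained sketch of the enqueue pass described above.
// Store, Replica, and EnqueueDecommissioningReplicas are illustrative
// names, not CockroachDB's actual internals.
package main

import "fmt"

type NodeID int
type RangeID int

// Replica is a local replica together with the node IDs holding the
// other replicas of its range (as known from the range descriptor).
type Replica struct {
	RangeID RangeID
	Nodes   []NodeID
}

// Store holds the replicas on one local store.
type Store struct {
	Replicas []Replica
}

// EnqueueDecommissioningReplicas stands in for the callback body: it
// enqueues every local replica whose range also has a replica on the
// decommissioning node. Here "enqueue" just logs; in the real system the
// replica would go into the store's replicate queue for rebalancing.
func (s *Store) EnqueueDecommissioningReplicas(decommissioning NodeID) {
	for _, r := range s.Replicas {
		for _, n := range r.Nodes {
			if n == decommissioning {
				fmt.Printf("enqueueing r%d into the replicate queue\n", r.RangeID)
				break
			}
		}
	}
}

func main() {
	s := &Store{Replicas: []Replica{
		{RangeID: 1, Nodes: []NodeID{1, 2, 3}},
		{RangeID: 2, Nodes: []NodeID{1, 4, 5}},
	}}
	// Gossip just told us node 3 is DECOMMISSIONING: only r1 overlaps it.
	s.EnqueueDecommissioningReplicas(3)
}
```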
Thanks for opening a backport. Please check the backport criteria before merging.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport.
Don't stamp yet, there's a bug in the original patch that I need to fix first.
This commit fixes a bug from cockroachdb#80993. Without it, nodes might re-run the callback to enqueue a decommissioning node's ranges into their replicate queues whenever they received a gossip update from that node that was perceived to be newer. Re-running this callback on every newer gossip update from a decommissioning node would be too expensive for nodes with a lot of replicas.

Release note: None
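To make the fix concrete, here is a minimal sketch under assumed names (`decommissionWatcher` and `maybeEnqueue` are hypothetical, not the actual patch): track which nodes have already triggered the callback, so that newer gossip updates from a node that is still `DECOMMISSIONING` don't re-run the expensive enqueue pass.

```go
// A minimal sketch (assumed names, not CockroachDB's actual code) of the
// fix described above: remember which nodes have already triggered the
// callback so that newer gossip updates from a node that is still
// DECOMMISSIONING don't re-run the expensive enqueue pass.
package main

import (
	"fmt"
	"sync"
)

type NodeID int

type decommissionWatcher struct {
	mu       sync.Mutex
	notified map[NodeID]bool // nodes whose replicas we've already enqueued
}

// maybeEnqueue runs the enqueue callback at most once per decommissioning
// node, regardless of how many gossip updates we see for it.
func (w *decommissionWatcher) maybeEnqueue(n NodeID, enqueue func(NodeID)) {
	w.mu.Lock()
	already := w.notified[n]
	w.notified[n] = true
	w.mu.Unlock()
	if !already {
		enqueue(n)
	}
}

func main() {
	w := &decommissionWatcher{notified: map[NodeID]bool{}}
	enqueue := func(n NodeID) { fmt.Printf("enqueuing replicas for n%d\n", n) }
	// Three gossip updates for the same decommissioning node: the callback
	// runs only on the first one.
	w.maybeEnqueue(3, enqueue)
	w.maybeEnqueue(3, enqueue)
	w.maybeEnqueue(3, enqueue)
}
```

Because the notified set lives in memory, a restarted node naturally re-runs the callback once for each node that is still decommissioning, matching the restart behavior described in the PR body below.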
Force-pushed from df69c6a to 66570f2.
@kvoli and / or @AlexTalks: could I get a stamp on this? This patch has been baking on master for over a week and we haven't seen any fallout related to it.
LGTM
This patch fixes a merge skew introduced by cockroachdb#82680 and #82800.

Release note: None
Backport 2/2 commits from #80993 and 1/1 commit from #82683.
/cc @cockroachdb/release
Note: This patch implements a subset of #80836
Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and on their `replicateQueue` to then rebalance them. This meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was one full scanner interval (10 minutes by default).

This patch improves on that by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. When run, the callback enqueues all replicas on the local stores that belong to ranges that also have replicas on the decommissioning node. Note that when nodes in the system restart, they re-invoke this callback for any node that is already `DECOMMISSIONING` (see the sketch below).

Resolves #79453
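As a hedged illustration of that restart behavior (all names here are hypothetical, not CockroachDB's actual code), a node can scan the liveness records it sees at startup and re-invoke the same callback for any node already marked `DECOMMISSIONING`:

```go
// Hypothetical sketch: on startup, re-invoke the decommissioning callback
// for any node whose liveness record already says DECOMMISSIONING.
package main

import "fmt"

type NodeID int

type LivenessStatus int

const (
	Live LivenessStatus = iota
	Decommissioning
)

// onDecommissioning stands in for the callback installed by this patch;
// the real callback enqueues overlapping replicas into the replicate queue.
func onDecommissioning(n NodeID) {
	fmt.Printf("n%d is decommissioning; enqueueing overlapping replicas\n", n)
}

func main() {
	// Liveness records as seen at startup (assumed example data).
	records := map[NodeID]LivenessStatus{
		1: Live,
		2: Decommissioning,
		3: Live,
	}
	for n, status := range records {
		if status == Decommissioning {
			onDecommissioning(n)
		}
	}
}
```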
Release note (performance improvement): Decommissioning should now be
substantially faster, particularly for small to moderately loaded nodes.
Release justification: non-invasive performance improvement for node decommissioning