
kvserver: retry failures to rebalance decommissioning replicas #81005

Conversation

@aayushshah15 (Contributor) commented May 4, 2022

Related to #80993
Relates to #79453

This commit makes it so that failures to rebalance replicas on
decommissioning nodes no longer remove those replicas from the
replicateQueue, as they previously did. Instead, such failures now
place the replicas into the replicateQueue's purgatory, which retries
them every minute.

This is intended to speed up the tail end of decommissioning, since
previously, failures to rebalance these replicas meant that they were
only retried after about 10 minutes.

Release note: None
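
A minimal Go sketch of the mechanism described above, using hypothetical names (PurgatoryError, decommissionPurgatoryError, processDecommissioning) rather than the actual CockroachDB types: a rebalance failure is wrapped in an error that the queue recognizes as a purgatory error, so the replica is parked and retried instead of being dropped until the next scanner pass.

```go
package main

import (
	"errors"
	"fmt"
)

// PurgatoryError marks errors whose replicas should be parked in purgatory
// and retried, rather than dropped from the queue. (Hypothetical interface.)
type PurgatoryError interface {
	error
	purgatoryErrorMarker()
}

// decommissionPurgatoryError wraps a rebalance failure for a replica that
// lives on a decommissioning node. (Hypothetical type.)
type decommissionPurgatoryError struct{ cause error }

func (e decommissionPurgatoryError) Error() string         { return e.cause.Error() }
func (e decommissionPurgatoryError) purgatoryErrorMarker() {}

// processDecommissioning sketches the relevant control flow: on failure,
// return a purgatory error so the queue retries the replica on the purgatory
// interval instead of waiting for the next scanner pass (~10 minutes).
func processDecommissioning(rebalanceErr error) error {
	if rebalanceErr != nil {
		return decommissionPurgatoryError{cause: rebalanceErr}
	}
	return nil
}

func main() {
	err := processDecommissioning(errors.New("no rebalance target found"))
	var purgErr PurgatoryError
	fmt.Println(errors.As(err, &purgErr)) // true: the replica goes to purgatory
}
```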

@cockroach-teamcity (Member)

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 4 times, most recently from 62e916a to 7823bd2 on May 4, 2022 21:16
@aayushshah15 aayushshah15 marked this pull request as ready for review May 4, 2022 22:05
@aayushshah15 aayushshah15 requested a review from a team as a code owner May 4, 2022 22:05
@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 3 times, most recently from cf9817c to 9e0a6ee on May 5, 2022 20:27
@kvoli (Collaborator) left a comment

Are there any existing tests for purgatory retry, or for ensuring that, on failure, a replica is reprocessed at replicateQueuePurgatoryCheckInterval?

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @AlexTalks, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 395 at r1 (raw file):

	}

	// Register gossip and node liveness callbacks to signal that

I'm curious why not have both a callback on a gossip update and also a retry interval?

@aayushshah15 (Contributor, Author) left a comment

Are there any existing tests for purgatory retry, or for ensuring that, on failure, a replica is reprocessed at replicateQueuePurgatoryCheckInterval?

There's TestBaseQueuePurgatory, which asserts on the behavior of the base queue purgatory. This patch isn't doing much on top of that, since we're just tying the purg channel to a golang ticker (this is also what the mergeQueue currently does).
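
For illustration, a rough sketch (not the actual mergeQueue or replicateQueue code) of what tying purgatory reprocessing to a Go ticker looks like; the interval constant and processPurgatory helper are stand-ins:

```go
package main

import (
	"fmt"
	"time"
)

// replicateQueuePurgatoryCheckInterval is the retry cadence discussed above.
const replicateQueuePurgatoryCheckInterval = time.Minute

// processPurgatory stands in for the base queue's purgatory reprocessing.
func processPurgatory() { fmt.Println("reprocessing purgatory replicas") }

func main() {
	// The purgatory channel is just a ticker's channel: everything parked in
	// purgatory gets another processing attempt once per interval.
	ticker := time.NewTicker(replicateQueuePurgatoryCheckInterval)
	defer ticker.Stop()

	stop := time.After(5 * time.Minute) // stand-in for stopper shutdown
	for {
		select {
		case <-ticker.C:
			processPurgatory()
		case <-stop:
			return
		}
	}
}
```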

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @kvoli, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 395 at r1 (raw file):

Previously, kvoli (Austen) wrote…

I'm curious why not have both a callback on a gossip update and also a retry interval?

That callback was never being used because none of the replicate queue errors were being marked as purgatory errors. Those callbacks would've only been useful if we were appropriately marking some "interesting" allocator errors as purg-errors.

However, determining "interesting" here with some accuracy is kind of hard without a refactor of some of the allocator's rebalancing logic. We'd need to restructure that code in a way that makes it more feasible to disambiguate between when the allocator can't find rebalance opportunities due to zone configs and when it can't find opportunities due to other reasons (like when the system is already balanced).

At the moment, it seems simpler to switch this to a more general "retry replicas in the purgatory every minute".

@AlexTalks (Contributor) left a comment

Overall looks good, but would it be worth adding a test for the case where we (now) get one of the decommissionPurgatoryErrors when replacing/removing a decommissioning replica? (Happy to take this over and add it if needed, of course; it could be a useful exercise.)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 3 times, most recently from 64a1c62 to 9f8af91 on May 17, 2022 17:30
@aayushshah15 (Contributor, Author) left a comment

but would it be worth adding a test for the case where we (now) get one of the decommissionPurgatoryErrors on replacing/removing a decommissioning replica

Done, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)

@kvoli (Collaborator) left a comment

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@AlexTalks (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 2 times, most recently from 5f268b7 to 02d85af on June 7, 2022 18:17
@aayushshah15 (Contributor, Author)

TFTRs!

bors r+

@craig bot (Contributor) commented Jun 7, 2022

This PR was included in a batch that was canceled; it will be automatically retried.

@aayushshah15 (Contributor, Author)

bors r-

@craig bot (Contributor) commented Jun 7, 2022

Canceled.

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 6 times, most recently from eeb16ee to db0f0e8 on June 11, 2022 19:19
@aayushshah15 (Contributor, Author) left a comment

That callback was never being used because none of the replicate queue errors were being marked as purgatory errors.

Turns out this was a lie. There were indeed a couple of cases where the replicateQueue's errors were being marked as purgatory errors.

See the first commit in this patch for how I've addressed the issue. cc @kvoli

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @kvoli and @nvanbenschoten)

@aayushshah15 aayushshah15 requested a review from kvoli June 11, 2022 19:26
@kvoli (Collaborator) left a comment

Reviewed 8 of 14 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)

Previously, replicas in the replicateQueue purgatory were only
reprocessed after changes to the cluster's topology, such as changes to
a node's liveness or to an individual store descriptor.

This commit additionally attempts to reprocess replicas in the
replicateQueue purgatory every minute. This lets the queue retry
purgatory errors caused by transient hardware issues (e.g. rebalances
failing because snapshots temporarily time out), even when the
cluster's topology does not change.

Release note: None
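
A hedged sketch of what this commit describes, with illustrative names only: the purgatory signal fires both when a gossip/liveness callback reports a topology change and on a fixed one-minute interval, so purgatory replicas are retried even when the cluster is otherwise quiet.

```go
package main

import "time"

const replicateQueuePurgatoryCheckInterval = time.Minute

// purgatorySignal merges two triggers for reprocessing purgatory: updateCh,
// which a gossip/liveness callback would signal on topology changes, and a
// ticker that fires every minute regardless. Names here are illustrative.
func purgatorySignal(stopCh <-chan struct{}, updateCh <-chan struct{}) <-chan time.Time {
	out := make(chan time.Time, 1)
	go func() {
		ticker := time.NewTicker(replicateQueuePurgatoryCheckInterval)
		defer ticker.Stop()
		for {
			var now time.Time
			select {
			case now = <-ticker.C:
			case <-updateCh:
				now = time.Now()
			case <-stopCh:
				return
			}
			// Non-blocking send: if a reprocessing pass is already pending,
			// there is no need to queue another one.
			select {
			case out <- now:
			default:
			}
		}
	}()
	return out
}

func main() {
	stopCh := make(chan struct{})
	updateCh := make(chan struct{}, 1)
	purgCh := purgatorySignal(stopCh, updateCh)

	updateCh <- struct{}{} // simulate a liveness/gossip update
	<-purgCh               // purgatory reprocessing would run here
	close(stopCh)
}
```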
@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch from db0f0e8 to 6f5122b on June 12, 2022 05:26
@aayushshah15 (Contributor, Author)

TFTRs

bors r+

@craig bot (Contributor) commented Jun 12, 2022

Build succeeded:
