
kvserver: retry failures to rebalance decommissioning replicas #81005

Conversation

@aayushshah15 (Contributor) commented May 4, 2022

Related to #80993
Relates to #79453

This commit makes it so that failures to rebalance replicas on
decommissioning nodes no longer remove those replicas from the
replicateQueue, as they previously did. Instead, such failures now
place the replicas into the replicateQueue's purgatory, which retries
them every minute.

This is intended to speed up the tail end of decommissioning, since
previously, failures to rebalance these replicas meant that they were
only retried after about 10 minutes.

Release note: None
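
A minimal Go sketch of the mechanism described above, using hypothetical names (PurgatoryError, decommissionPurgatoryError, processDecommissioning) rather than the actual CockroachDB types: a rebalance failure is wrapped in an error that the queue recognizes as a purgatory error, so the replica is parked and retried instead of being dropped until the next scanner pass.

```go
package main

import (
	"errors"
	"fmt"
)

// PurgatoryError marks errors whose replicas should be parked in purgatory
// and retried, rather than dropped from the queue. (Hypothetical interface.)
type PurgatoryError interface {
	error
	purgatoryErrorMarker()
}

// decommissionPurgatoryError wraps a rebalance failure for a replica that
// lives on a decommissioning node. (Hypothetical type.)
type decommissionPurgatoryError struct{ cause error }

func (e decommissionPurgatoryError) Error() string         { return e.cause.Error() }
func (e decommissionPurgatoryError) purgatoryErrorMarker() {}

// processDecommissioning sketches the relevant control flow: on failure,
// return a purgatory error so the queue retries the replica on the purgatory
// interval instead of waiting for the next scanner pass (~10 minutes).
func processDecommissioning(rebalanceErr error) error {
	if rebalanceErr != nil {
		return decommissionPurgatoryError{cause: rebalanceErr}
	}
	return nil
}

func main() {
	err := processDecommissioning(errors.New("no rebalance target found"))
	var purgErr PurgatoryError
	fmt.Println(errors.As(err, &purgErr)) // true: the replica goes to purgatory
}
```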

@cockroach-teamcity (Member)

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 4 times, most recently from 62e916a to 7823bd2 on May 4, 2022 21:16
@aayushshah15 aayushshah15 marked this pull request as ready for review May 4, 2022 22:05
@aayushshah15 aayushshah15 requested a review from a team as a code owner May 4, 2022 22:05
@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 3 times, most recently from cf9817c to 9e0a6ee on May 5, 2022 20:27
@kvoli (Collaborator) left a comment

Are there any existing tests for purgatory retry, or for ensuring that, on failure, a replica is reprocessed at replicateQueuePurgatoryCheckInterval?

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @AlexTalks, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 395 at r1 (raw file):

	}

	// Register gossip and node liveness callbacks to signal that

I'm curious why not have both a callback on a gossip update and also a retry interval?

@aayushshah15 (Contributor, Author) left a comment

Are there any existing tests for purgatory retry, or for ensuring that, on failure, a replica is reprocessed at replicateQueuePurgatoryCheckInterval?

There's TestBaseQueuePurgatory, which asserts on the behavior of the base queue purgatory. This patch isn't doing much on top of that, since we're just tying the purg channel to a golang ticker (this is also what the mergeQueue currently does).
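
For illustration, a rough sketch (not the actual mergeQueue or replicateQueue code) of what tying purgatory reprocessing to a Go ticker looks like; the interval constant and processPurgatory helper are stand-ins:

```go
package main

import (
	"fmt"
	"time"
)

// replicateQueuePurgatoryCheckInterval is the retry cadence discussed above.
const replicateQueuePurgatoryCheckInterval = time.Minute

// processPurgatory stands in for the base queue's purgatory reprocessing.
func processPurgatory() { fmt.Println("reprocessing purgatory replicas") }

func main() {
	// The purgatory channel is just a ticker's channel: everything parked in
	// purgatory gets another processing attempt once per interval.
	ticker := time.NewTicker(replicateQueuePurgatoryCheckInterval)
	defer ticker.Stop()

	stop := time.After(5 * time.Minute) // stand-in for stopper shutdown
	for {
		select {
		case <-ticker.C:
			processPurgatory()
		case <-stop:
			return
		}
	}
}
```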

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @kvoli, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 395 at r1 (raw file):

Previously, kvoli (Austen) wrote…

I'm curious why not have both a callback on a gossip update and also a retry interval?

That callback was never being used because none of the replicate queue errors were being marked as purgatory errors. Those callbacks would've only been useful if we were appropriately marking some "interesting" allocator errors as purg-errors.

However, determining "interesting" here with some accuracy is kind of hard without a refactor of some of the allocator's rebalancing logic. We'd need to restructure that code in a way that makes it more feasible to disambiguate between when the allocator can't find rebalance opportunities due to zone configs and when it can't find opportunities due to other reasons (like when the system is already balanced).

At the moment, it seems simpler to switch this to a more general "retry replicas in the purgatory every minute".

@AlexTalks (Contributor) left a comment

Overall looks good, but would it be worth adding a test for the case where we (now) get one of the decommissionPurgatoryErrors when replacing/removing a decommissioning replica? (Happy to take this over and add it if needed, of course; it could be a useful exercise.)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 3 times, most recently from 64a1c62 to 9f8af91 on May 17, 2022 17:30
@aayushshah15 (Contributor, Author) left a comment

but would it be worth adding a test for the case where we (now) get one of the decommissionPurgatoryErrors on replacing/removing a decommissioning replica

Done, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)

@kvoli (Collaborator) left a comment

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@AlexTalks (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 2 times, most recently from 5f268b7 to 02d85af on June 7, 2022 18:17
@aayushshah15 (Contributor, Author)

TFTRs!

bors r+

@craig bot (Contributor) commented Jun 7, 2022

This PR was included in a batch that was canceled; it will be automatically retried.

@aayushshah15 (Contributor, Author)

bors r-

@craig bot (Contributor) commented Jun 7, 2022

Canceled.

@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch 6 times, most recently from eeb16ee to db0f0e8 on June 11, 2022 19:19
@aayushshah15 (Contributor, Author) left a comment

That callback was never being used because none of the replicate queue errors were being marked as purgatory errors.

Turns out this was a lie. There were indeed a couple of cases where the replicateQueue's errors were being marked as purgatory errors.

See the first commit in this patch for how I've addressed the issue. cc @kvoli

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @kvoli and @nvanbenschoten)

@aayushshah15 aayushshah15 requested a review from kvoli June 11, 2022 19:26
@kvoli (Collaborator) left a comment

Reviewed 8 of 14 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)

Previously, replicas in the replicateQueue purgatory were only
reprocessed after changes to the cluster's topology, such as changes to
a node's liveness or to an individual store descriptor.

This commit additionally attempts to reprocess replicas in the
replicateQueue purgatory every minute. This lets the queue retry
purgatory errors caused by transient hardware issues (e.g. rebalances
failing because snapshots temporarily time out), even when the
cluster's topology does not change.

Release note: None
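
A hedged sketch of what this commit describes, with illustrative names only: the purgatory signal fires both when a gossip/liveness callback reports a topology change and on a fixed one-minute interval, so purgatory replicas are retried even when the cluster is otherwise quiet.

```go
package main

import "time"

const replicateQueuePurgatoryCheckInterval = time.Minute

// purgatorySignal merges two triggers for reprocessing purgatory: updateCh,
// which a gossip/liveness callback would signal on topology changes, and a
// ticker that fires every minute regardless. Names here are illustrative.
func purgatorySignal(stopCh <-chan struct{}, updateCh <-chan struct{}) <-chan time.Time {
	out := make(chan time.Time, 1)
	go func() {
		ticker := time.NewTicker(replicateQueuePurgatoryCheckInterval)
		defer ticker.Stop()
		for {
			var now time.Time
			select {
			case now = <-ticker.C:
			case <-updateCh:
				now = time.Now()
			case <-stopCh:
				return
			}
			// Non-blocking send: if a reprocessing pass is already pending,
			// there is no need to queue another one.
			select {
			case out <- now:
			default:
			}
		}
	}()
	return out
}

func main() {
	stopCh := make(chan struct{})
	updateCh := make(chan struct{}, 1)
	purgCh := purgatorySignal(stopCh, updateCh)

	updateCh <- struct{}{} // simulate a liveness/gossip update
	<-purgCh               // purgatory reprocessing would run here
	close(stopCh)
}
```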
@aayushshah15 aayushshah15 force-pushed the 20220504_putDecommissioningReplicasInPurgatory branch from db0f0e8 to 6f5122b on June 12, 2022 05:26
@aayushshah15 (Contributor, Author)

TFTRs

bors r+

@craig bot (Contributor) commented Jun 12, 2022

Build succeeded:
