kvserver: retry failures to rebalance decommissioning replicas #81005
Conversation
Force-pushed 62e916a to 7823bd2
Force-pushed cf9817c to 9e0a6ee
Are there any existing tests for purgatory retry; or ensuring that on failure it is reprocessed at replicateQueuePurgatoryCheckInterval?
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @AlexTalks, and @nvanbenschoten)
pkg/kv/kvserver/replicate_queue.go
line 395 at r1 (raw file):
} // Register gossip and node liveness callbacks to signal that
I'm curious why not have both a callback on a gossip update and also a retry interval?
Are there any existing tests for purgatory retry; or ensuring that on failure it is reprocessed at replicateQueuePurgatoryCheckInterval?
There's TestBaseQueuePurgatory, which asserts on the behavior of the base queue purgatory. This patch isn't doing much on top of that, since we're just tying the purg channel to a golang ticker (this is also what the mergeQueue currently does).
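For illustration, a minimal sketch of that ticker-backed purgatory channel idea; the names (purgatoryCheckInterval, replicateQueueSketch) are stand-ins and not the actual kvserver identifiers:

```go
package main

import (
	"fmt"
	"time"
)

// Stand-in for the interval at which purgatory is re-examined.
const purgatoryCheckInterval = time.Minute

// replicateQueueSketch models only the purgatory-channel plumbing: a base
// queue (not shown here) reprocesses every replica in purgatory each time
// this channel receives a value, so backing it with a time.Ticker yields a
// fixed retry cadence regardless of gossip or liveness updates.
type replicateQueueSketch struct {
	purgChan <-chan time.Time
}

func newReplicateQueueSketch() *replicateQueueSketch {
	return &replicateQueueSketch{purgChan: time.NewTicker(purgatoryCheckInterval).C}
}

// purgatoryChan is what a base-queue-style processing loop would select on.
func (rq *replicateQueueSketch) purgatoryChan() <-chan time.Time {
	return rq.purgChan
}

func main() {
	rq := newReplicateQueueSketch()
	t := <-rq.purgatoryChan() // blocks for one interval, then a purgatory pass would run
	fmt.Println("reprocessing purgatory at", t)
}
```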
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @kvoli, and @nvanbenschoten)
pkg/kv/kvserver/replicate_queue.go
line 395 at r1 (raw file):
Previously, kvoli (Austen) wrote…
I'm curious why not have both a callback on a gossip update and also a retry interval?
That callback was never being used because none of the replicate queue errors were being marked as purgatory errors. Those callbacks would've only been useful if we were appropriately marking some "interesting" allocator errors as purg-errors.
However, determining "interesting" here with some accuracy is kind of hard without a refactor of some of the allocator's rebalancing logic. We'd need to restructure that code in a way that makes it more feasible to disambiguate between when the allocator can't find rebalance opportunities due to zone configs and when it can't find opportunities due to other reasons (like when the system is already balanced).
At the moment, it seems simpler to switch this to a more general "retry replicas in the purgatory every minute".
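As a sketch of what "marking" an error as a purgatory error amounts to (the interface and type names below are illustrative assumptions, not the exact kvserver definitions): the base queue checks whether the error returned from processing satisfies a purgatory-error marker interface, and if so parks the replica in purgatory instead of dropping it from the queue.

```go
package main

import (
	"errors"
	"fmt"
)

// purgatoryError is an assumed marker interface: any error implementing it
// sends the replica to purgatory rather than out of the queue.
type purgatoryError interface {
	error
	purgatoryErrorMarker()
}

// decommissionPurgatoryError wraps a failure to rebalance a replica off a
// decommissioning node so that the replica is retried from purgatory.
type decommissionPurgatoryError struct{ cause error }

func (e decommissionPurgatoryError) Error() string         { return e.cause.Error() }
func (e decommissionPurgatoryError) purgatoryErrorMarker() {}

// shouldPark mimics the queue's decision: purgatory-marked errors are retried
// on the purgatory channel's cadence, everything else is dropped.
func shouldPark(err error) bool {
	var pe purgatoryError
	return errors.As(err, &pe)
}

func main() {
	rebalanceErr := errors.New("snapshot to target store timed out")
	fmt.Println(shouldPark(rebalanceErr))                                    // false: plain failure
	fmt.Println(shouldPark(decommissionPurgatoryError{cause: rebalanceErr})) // true: retried from purgatory
}
```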
Overall looks good, but would it be worth adding a test for the case where we (now) get one of the decommissionPurgatoryErrors on replacing/removing a decommissioning replica? (Happy to take over and add if needed of course, which could be a useful exercise).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)
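For a rough sense of what such a test could assert, here is a sketch with stand-in types (the actual patch's error type and test live in kvserver and are not reproduced here): simulate a rebalance failure for a replica on a decommissioning node and check that the resulting error is classified as a purgatory error.

```go
package kvserver_sketch

import (
	"errors"
	"testing"
)

// Illustrative stand-ins for the marker interface and error type added in
// this patch; the real identifiers may differ.
type purgatoryError interface {
	error
	purgatoryErrorMarker()
}

type decommissionPurgatoryError struct{ cause error }

func (e decommissionPurgatoryError) Error() string         { return e.cause.Error() }
func (e decommissionPurgatoryError) purgatoryErrorMarker() {}

// TestDecommissionRebalanceFailureIsRetried checks that a failed rebalance of
// a replica on a decommissioning node surfaces as a purgatory error, i.e. the
// replicate queue will retry it rather than forget about it.
func TestDecommissionRebalanceFailureIsRetried(t *testing.T) {
	err := error(decommissionPurgatoryError{cause: errors.New("snapshot timed out")})
	var pe purgatoryError
	if !errors.As(err, &pe) {
		t.Fatalf("expected purgatory error, got %v", err)
	}
}
```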
Force-pushed 64a1c62 to 9f8af91
but would it be worth adding a test for the case where we (now) get one of the decommissionPurgatoryErrors on replacing/removing a decommissioning replica
Done, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Force-pushed 5f268b7 to 02d85af
TFTRs! bors r+
This PR was included in a batch that was canceled, it will be automatically retried
bors r-
Canceled.
Force-pushed eeb16ee to db0f0e8
That callback was never being used because none of the replicate queue errors were being marked as purgatory errors.
Turns out this was a lie. There were indeed a couple of cases where the replicateQueue's errors were being marked as purgatory errors.
See the first commit in this patch for how I've addressed the issue. cc @kvoli
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @kvoli and @nvanbenschoten)
Reviewed 8 of 14 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)
Previously, replicas in the replicateQueue purgatory were only reprocessed after changes to the cluster's topology. This included things like changes to any node's liveness or changes to individual store descriptors. This commit makes it such that replicas in the replicateQueue purgatory are also retried every minute. This extension allows, for instance, replicateQueue purgatory errors caused by hardware-related issues in the cluster (e.g. rebalances failing due to snapshots temporarily timing out) to be retried without waiting for a topology change. Release note: None
This commit makes it such that failures to rebalance replicas on decommissioning nodes no longer move the replica out of the replicateQueue as they previously used to. Instead, these failures now put these replicas into the replicateQueue's purgatory, which will retry these replicas every minute. All this is intended to improve the speed of decommissioning towards its tail end, since previously, failures to rebalance these replicas meant that they were only retried after about 10 minutes. Release note: None
Force-pushed db0f0e8 to 6f5122b
TFTRs bors r+
Build succeeded:
Related to #80993
Relates to #79453
This commit makes it such that failures to rebalance replicas on
decommissioning nodes no longer move the replica out of the
replicateQueue as they previously used to. Instead, these failures now
put these replicas into the replicateQueue's purgatory, which will retry
these replicas every minute.
All this is intended to improve the speed of decommissioning towards
its tail end, since previously, failures to rebalance these replicas
meant that they were only retried after about 10 minutes.
Release note: None