-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: add api for decommission pre-flight checks #90222
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need a separate API endpoint instead of executing pre-flight checks as part of the Decom request?
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be done as part of the Decommission
request, however I think having it as a separate request will make things a bit more flexible for a few reasons.
- The
Decommission
request is idempotent and can be/is run repeatedly, even if the node is already decommissioning, and always takes a new target liveness membership (eitherDECOMMISSIONING
orDECOMMISSIONED
), whereas we want to run the pre-checks once, prior to being in the decommissioning state. - Making the
DecommissionPreCheck
a separate API call means that it will be simpler to be flexible on what we call based on the flags passed to thecockroach node decommission
command. For example, if we run with a--skip-checks
flag, we can only call theDecommission
RPC. If we run with a--checks-only
flag, we can call theDecommissionPreCheck
RPC and return after. (By default, the plan is to call theDecommissionPreCheck
RPC, and then theDecommission
RPC if the first had no issue). Making the API separate means we won't have to pass in such flags/options via gRPC, particularly in the existing request. - The output that we want from the pre-checks, as opposed to the "decommissioning status" responses returned to the periodic repeating
Decommission
calls, is much different - we want allocator errors and traces for the replicas that we couldn't find upreplication targets for. If possible, we'd also like to summarize said errors across ranges, and provide potential remediation steps.
Some of these can definitely be worked around, but in conjunction my thought is it makes a bit more sense as a separate API - let me know if these don't make sense though at all as I'm happy to chat!
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhartunian let me know if you have any additional thoughts! Also tagged @andrewbaptist for KV review...
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code itself looks good to me. However, I'm not sure about the value of merging the API before the implementation is done. It seems likely there could be some change in the protobuf that becomes apparent once you implement it. Since this is all new code, there is no "merge conflict" benefit of merging early, and potentially a compatibility consideration if you do.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)
pkg/server/serverpb/admin.proto
line 502 at r2 (raw file):
message Replica { int32 replica_id = 2 [ (gogoproto.customname) = "ReplicaID",
Is there a reason you started at 2?
b3d0b14
to
7397660
Compare
This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on cockroachdb#93758, cockroachdb#90222. Epic: CRDB-20924 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhartunian let me know if you have any additional thoughts!
I'm good, thx for the detailed justification 👍
since the implementation commit is now also up and does not contain further changes.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks and @andrewbaptist)
This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on cockroachdb#93758, cockroachdb#90222. Epic: CRDB-20924 Release note: None
188b456
to
9e7a243
Compare
While we have an API for checking the status of an in-progress decommission, we did not previously have an API to execute sanity checks prior to requesting a node to move into the `DECOMMISSIONING` state. This adds an API to do just that, intended to be called by the CLI prior to issuing a subsequent `Decommission` RPC request. Fixes cockroachdb#91568. Release note: None
This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on cockroachdb#93758, cockroachdb#90222. Epic: CRDB-20924 Release note: None
bors r+ |
👎 Rejected by code reviews |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @AlexTalks)
bors r+ |
Build failed (retrying...): |
bors r+ |
Already running a review |
Build succeeded: |
This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on cockroachdb#93758, cockroachdb#90222. Epic: CRDB-20924 Release note: None
This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on cockroachdb#93758, cockroachdb#90222. Epic: CRDB-20924 Release note: None
93950: server: implement decommission pre-check api r=AlexTalks a=AlexTalks This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in #93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on #93758, #90222. Epic: [CRDB-20924](https://cockroachlabs.atlassian.net/browse/CRDB-20924) Release note: None Co-authored-by: Alex Sarkesian <[email protected]>
While we have an API for checking the status of an in-progress decommission, we did not previously have an API to execute sanity checks prior to requesting a node to move into the
DECOMMISSIONING
state. This adds an API to do just that, intended to be called by the CLI prior to issuing a subsequentDecommission
RPC request.Fixes #91568.
Release note: None