Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: add api for decommission pre-flight checks #90222

Merged
merged 1 commit into from
Jan 20, 2023

Conversation

AlexTalks
Copy link
Contributor

@AlexTalks AlexTalks commented Oct 19, 2022

While we have an API for checking the status of an in-progress decommission, we did not previously have an API to execute sanity checks prior to requesting a node to move into the DECOMMISSIONING state. This adds an API to do just that, intended to be called by the CLI prior to issuing a subsequent Decommission RPC request.

Fixes #91568.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@AlexTalks AlexTalks marked this pull request as ready for review November 8, 2022 04:26
@AlexTalks AlexTalks requested a review from a team November 8, 2022 04:26
@AlexTalks AlexTalks requested a review from a team as a code owner November 8, 2022 04:26
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need a separate API endpoint instead of executing pre-flight checks as part of the Decom request?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be done as part of the Decommission request, however I think having it as a separate request will make things a bit more flexible for a few reasons.

  1. The Decommission request is idempotent and can be/is run repeatedly, even if the node is already decommissioning, and always takes a new target liveness membership (either DECOMMISSIONING or DECOMMISSIONED), whereas we want to run the pre-checks once, prior to being in the decommissioning state.
  2. Making the DecommissionPreCheck a separate API call means that it will be simpler to be flexible on what we call based on the flags passed to the cockroach node decommission command. For example, if we run with a --skip-checks flag, we can only call the Decommission RPC. If we run with a --checks-only flag, we can call the DecommissionPreCheck RPC and return after. (By default, the plan is to call the DecommissionPreCheck RPC, and then the Decommission RPC if the first had no issue). Making the API separate means we won't have to pass in such flags/options via gRPC, particularly in the existing request.
  3. The output that we want from the pre-checks, as opposed to the "decommissioning status" responses returned to the periodic repeating Decommission calls, is much different - we want allocator errors and traces for the replicas that we couldn't find upreplication targets for. If possible, we'd also like to summarize said errors across ranges, and provide potential remediation steps.

Some of these can definitely be worked around, but in conjunction my thought is it makes a bit more sense as a separate API - let me know if these don't make sense though at all as I'm happy to chat!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@AlexTalks AlexTalks requested a review from a team as a code owner November 10, 2022 20:35
@AlexTalks AlexTalks requested review from aayushshah15 and andrewbaptist and removed request for aayushshah15 November 10, 2022 20:35
Copy link
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhartunian let me know if you have any additional thoughts! Also tagged @andrewbaptist for KV review...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)

Copy link
Collaborator

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code itself looks good to me. However, I'm not sure about the value of merging the API before the implementation is done. It seems likely there could be some change in the protobuf that becomes apparent once you implement it. Since this is all new code, there is no "merge conflict" benefit of merging early, and potentially a compatibility consideration if you do.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)


pkg/server/serverpb/admin.proto line 502 at r2 (raw file):

  message Replica {
    int32 replica_id = 2 [ (gogoproto.customname) = "ReplicaID",

Is there a reason you started at 2?

@AlexTalks AlexTalks force-pushed the dpf_api branch 3 times, most recently from b3d0b14 to 7397660 Compare December 19, 2022 22:05
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Dec 19, 2022
This change implements the `DecommissionPreCheck` RPC on the `Admin`
service, using the support for evaluating node decommission readiness by
checking each range introduced in cockroachdb#93758. In checking node decommission
readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness
status are checked, and ranges with replicas on the checked nodes that
encounter errors in attempting to allocate replacement replicas are
reported in the response. Ranges that have replicas on multiple checked
nodes have their errors reported for each nodeID in the request list.

Depends on cockroachdb#93758, cockroachdb#90222.

Epic: CRDB-20924

Release note: None
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhartunian let me know if you have any additional thoughts!

I'm good, thx for the detailed justification 👍
:lgtm: since the implementation commit is now also up and does not contain further changes.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks and @andrewbaptist)

AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Jan 7, 2023
This change implements the `DecommissionPreCheck` RPC on the `Admin`
service, using the support for evaluating node decommission readiness by
checking each range introduced in cockroachdb#93758. In checking node decommission
readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness
status are checked, and ranges with replicas on the checked nodes that
encounter errors in attempting to allocate replacement replicas are
reported in the response. Ranges that have replicas on multiple checked
nodes have their errors reported for each nodeID in the request list.

Depends on cockroachdb#93758, cockroachdb#90222.

Epic: CRDB-20924

Release note: None
@AlexTalks AlexTalks force-pushed the dpf_api branch 3 times, most recently from 188b456 to 9e7a243 Compare January 19, 2023 22:09
While we have an API for checking the status of an in-progress
decommission, we did not previously have an API to execute sanity checks
prior to requesting a node to move into the `DECOMMISSIONING` state.
This adds an API to do just that, intended to be called by the CLI prior
to issuing a subsequent `Decommission` RPC request.

Fixes cockroachdb#91568.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Jan 19, 2023
This change implements the `DecommissionPreCheck` RPC on the `Admin`
service, using the support for evaluating node decommission readiness by
checking each range introduced in cockroachdb#93758. In checking node decommission
readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness
status are checked, and ranges with replicas on the checked nodes that
encounter errors in attempting to allocate replacement replicas are
reported in the response. Ranges that have replicas on multiple checked
nodes have their errors reported for each nodeID in the request list.

Depends on cockroachdb#93758, cockroachdb#90222.

Epic: CRDB-20924

Release note: None
@AlexTalks
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 20, 2023

👎 Rejected by code reviews

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @AlexTalks)

@AlexTalks
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 20, 2023

Build failed (retrying...):

@AlexTalks
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 20, 2023

Already running a review

@craig craig bot merged commit 1b79102 into cockroachdb:master Jan 20, 2023
@craig
Copy link
Contributor

craig bot commented Jan 20, 2023

Build succeeded:

@AlexTalks AlexTalks deleted the dpf_api branch January 20, 2023 23:58
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Jan 27, 2023
This change implements the `DecommissionPreCheck` RPC on the `Admin`
service, using the support for evaluating node decommission readiness by
checking each range introduced in cockroachdb#93758. In checking node decommission
readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness
status are checked, and ranges with replicas on the checked nodes that
encounter errors in attempting to allocate replacement replicas are
reported in the response. Ranges that have replicas on multiple checked
nodes have their errors reported for each nodeID in the request list.

Depends on cockroachdb#93758, cockroachdb#90222.

Epic: CRDB-20924

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Jan 28, 2023
This change implements the `DecommissionPreCheck` RPC on the `Admin`
service, using the support for evaluating node decommission readiness by
checking each range introduced in cockroachdb#93758. In checking node decommission
readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness
status are checked, and ranges with replicas on the checked nodes that
encounter errors in attempting to allocate replacement replicas are
reported in the response. Ranges that have replicas on multiple checked
nodes have their errors reported for each nodeID in the request list.

Depends on cockroachdb#93758, cockroachdb#90222.

Epic: CRDB-20924

Release note: None
craig bot pushed a commit that referenced this pull request Jan 28, 2023
93950: server: implement decommission pre-check api r=AlexTalks a=AlexTalks

This change implements the `DecommissionPreCheck` RPC on the `Admin`
service, using the support for evaluating node decommission readiness by
checking each range introduced in #93758. In checking node decommission
readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness
status are checked, and ranges with replicas on the checked nodes that
encounter errors in attempting to allocate replacement replicas are
reported in the response. Ranges that have replicas on multiple checked
nodes have their errors reported for each nodeID in the request list.

Depends on #93758, #90222.

Epic: [CRDB-20924](https://cockroachlabs.atlassian.net/browse/CRDB-20924)

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: implement decommission pre-check api
5 participants