server: add api for decommission pre-flight checks #90222

AlexTalks · 2022-10-19T03:33:57Z

While we have an API for checking the status of an in-progress decommission, we previously had no API to run sanity checks before requesting that a node move into the `DECOMMISSIONING` state. This change adds one, intended to be called by the CLI before it issues the subsequent `Decommission` RPC request.

Fixes #91568.

Release note: None

dhartunian

Why do you need a separate API endpoint instead of executing pre-flight checks as part of the Decom request?

Reviewable status: complete! 0 of 0 LGTMs obtained

AlexTalks

This could be done as part of the Decommission request, however I think having it as a separate request will make things a bit more flexible for a few reasons.

1. The `Decommission` request is idempotent and is run repeatedly, even if the node is already decommissioning, and it always takes a new target liveness membership (either `DECOMMISSIONING` or `DECOMMISSIONED`). The pre-checks, by contrast, should run once, before the node enters the decommissioning state.
2. Making `DecommissionPreCheck` a separate API call makes it simpler to vary what we call based on the flags passed to the `cockroach node decommission` command. For example, if we run with a `--skip-checks` flag, we can call only the `Decommission` RPC. If we run with a `--checks-only` flag, we can call the `DecommissionPreCheck` RPC and return afterwards. (By default, the plan is to call the `DecommissionPreCheck` RPC, and then the `Decommission` RPC if the first reported no issues.) Keeping the API separate means we won't have to pass such flags or options via gRPC, particularly in the existing request.
3. The output we want from the pre-checks is quite different from the "decommissioning status" responses returned to the periodic, repeating `Decommission` calls: we want allocator errors and traces for the replicas for which we couldn't find upreplication targets. If possible, we'd also like to summarize those errors across ranges and suggest potential remediation steps.

Some of these can definitely be worked around, but taken together I think a separate API makes a bit more sense. Let me know if any of this doesn't make sense; I'm happy to chat!

Reviewable status: complete! 0 of 0 LGTMs obtained

AlexTalks

@dhartunian let me know if you have any additional thoughts! Also tagged @andrewbaptist for KV review...

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)

andrewbaptist

This code itself looks good to me. However, I'm not sure about the value of merging the API before the implementation is done. It seems likely there could be some change in the protobuf that becomes apparent once you implement it. Since this is all new code, there is no "merge conflict" benefit of merging early, and potentially a compatibility consideration if you do.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)

pkg/server/serverpb/admin.proto line 502 at r2 (raw file):

  message Replica {
    int32 replica_id = 2 [ (gogoproto.customname) = "ReplicaID",

Is there a reason you started at 2?

This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list.

Depends on cockroachdb#93758, cockroachdb#90222.

Epic: CRDB-20924

Release note: None
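The reporting behaviour described in this commit message can be illustrated with a small sketch. Everything here is hypothetical: `membership`, `rangeError`, and `preCheck` are made-up names, "valid liveness" is modelled simply as having an entry in a map, and the real implementation in cockroachdb#93950 works on actual liveness records and allocator results.

```go
package main

import "fmt"

// membership is a simplified, assumed model of liveness membership.
type membership int

const (
	active membership = iota
	decommissioning
	decommissioned
)

// rangeError is a hypothetical record of a range for which the allocator
// could not find a replacement replica.
type rangeError struct {
	rangeID      int64
	replicaNodes []int32 // nodes currently holding a replica of the range
	err          string
}

// preCheck returns, for each checked node, the ranges with allocation errors.
// Nodes with no liveness entry or with DECOMMISSIONED membership are skipped.
// A range with replicas on several checked nodes is reported under each one.
func preCheck(requested []int32, liveness map[int32]membership, failures []rangeError) map[int32][]rangeError {
	checked := make(map[int32]bool)
	for _, id := range requested {
		m, ok := liveness[id]
		if !ok || m == decommissioned {
			continue
		}
		checked[id] = true
	}
	out := make(map[int32][]rangeError)
	for _, f := range failures {
		for _, n := range f.replicaNodes {
			if checked[n] {
				out[n] = append(out[n], f)
			}
		}
	}
	return out
}

func main() {
	liveness := map[int32]membership{1: active, 2: decommissioned, 3: decommissioning}
	failures := []rangeError{
		{rangeID: 10, replicaNodes: []int32{1, 3}, err: "no valid allocation target"},
		{rangeID: 11, replicaNodes: []int32{2}, err: "no valid allocation target"},
	}
	for node, errs := range preCheck([]int32{1, 2, 3, 9}, liveness, failures) {
		fmt.Println("node", node, "problem ranges:", len(errs))
	}
}
```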

dhartunian

@dhartunian let me know if you have any additional thoughts!

I'm good, thx for the detailed justification 👍
since the implementation commit is now also up and does not contain further changes.

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks and @andrewbaptist)

AlexTalks · 2023-01-20T00:28:50Z

bors r+

craig · 2023-01-20T00:28:53Z

👎 Rejected by code reviews

knz

Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @AlexTalks)

AlexTalks · 2023-01-20T19:20:04Z

bors r+

craig · 2023-01-20T20:25:44Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

AlexTalks · 2023-01-20T22:08:34Z

bors r+

craig · 2023-01-20T22:08:36Z

Already running a review

craig · 2023-01-20T23:53:27Z

Build succeeded:

Bazel Essential CI (Cockroach)

93950: server: implement decommission pre-check api r=AlexTalks a=AlexTalks

This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in #93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list.

Depends on #93758, #90222.

Epic: [CRDB-20924](https://cockroachlabs.atlassian.net/browse/CRDB-20924)

Release note: None

Co-authored-by: Alex Sarkesian

AlexTalks force-pushed the dpf_api branch from baedc89 to 13a3153 Compare November 8, 2022 04:19

AlexTalks marked this pull request as ready for review November 8, 2022 04:26

AlexTalks requested a review from a team November 8, 2022 04:26

AlexTalks requested a review from a team as a code owner November 8, 2022 04:26

dhartunian reviewed Nov 8, 2022

AlexTalks force-pushed the dpf_api branch from 13a3153 to 1c7e84b Compare November 9, 2022 04:40

AlexTalks commented Nov 9, 2022

AlexTalks force-pushed the dpf_api branch from 1c7e84b to 21172b2 Compare November 10, 2022 20:35

AlexTalks requested a review from a team as a code owner November 10, 2022 20:35

AlexTalks requested review from aayushshah15 and andrewbaptist and removed request for aayushshah15 November 10, 2022 20:35

AlexTalks commented Nov 17, 2022

andrewbaptist requested changes Nov 17, 2022

AlexTalks force-pushed the dpf_api branch 3 times, most recently from b3d0b14 to 7397660 Compare December 19, 2022 22:05

AlexTalks mentioned this pull request Dec 19, 2022

server: implement decommission pre-check api #93950

Merged

dhartunian approved these changes Jan 3, 2023

AlexTalks force-pushed the dpf_api branch from 7397660 to 0a14307 Compare January 7, 2023 09:31

AlexTalks force-pushed the dpf_api branch 3 times, most recently from 188b456 to 9e7a243 Compare January 19, 2023 22:09

AlexTalks force-pushed the dpf_api branch from 9e7a243 to f6cf9b3 Compare January 19, 2023 22:12

knz approved these changes Jan 20, 2023

andrewbaptist approved these changes Jan 20, 2023

craig bot merged commit 1b79102 into cockroachdb:master Jan 20, 2023

AlexTalks deleted the dpf_api branch January 20, 2023 23:58
