server: add api for decommission pre-flight checks
We already have an API for checking the status of an in-progress
decommission, but until now there was no API for running sanity checks
before asking a node to move into the `DECOMMISSIONING` state.
This adds that API, intended to be called by the CLI before it issues
the `Decommission` RPC request. The server-side handler is currently a
stub that returns `Unimplemented`.

Fixes cockroachdb#91568.

Release note: None
AlexTalks committed Dec 19, 2022
1 parent 8c4d3bb commit 169073d
Showing 4 changed files with 196 additions and 0 deletions.
97 changes: 97 additions & 0 deletions docs/generated/http/full.md
@@ -6276,6 +6276,103 @@ DrainResponse is the response to a successful DrainRequest.



## DecommissionPreCheck



DecommissionPreCheck requests that the server execute preliminary checks
to evaluate the possibility of successfully decommissioning a given node.

Support status: [reserved](#support-status)

#### Request Parameters




DecommissionPreCheckRequest requests that preliminary checks be run to
ensure that the specified node(s) can be decommissioned successfully.


| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| node_ids | [int32](#cockroach.server.serverpb.DecommissionPreCheckRequest-int32) | repeated | | [reserved](#support-status) |
| num_replica_report | [int32](#cockroach.server.serverpb.DecommissionPreCheckRequest-int32) | | The maximum number of ranges for which to report errors. | [reserved](#support-status) |
| strict_readiness | [bool](#cockroach.server.serverpb.DecommissionPreCheckRequest-bool) | | If true, all ranges on the checked nodes must only need replacement or removal for decommissioning. | [reserved](#support-status) |







#### Response Parameters




DecommissionPreCheckResponse returns the number of replicas that encountered
errors when running preliminary decommissioning checks, as well as the
associated error messages and traces, for each node.


| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| checked_nodes | [DecommissionPreCheckResponse.NodeCheckResult](#cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.server.serverpb.DecommissionPreCheckResponse.NodeCheckResult) | repeated | Status of the preliminary decommission checks across nodes. | [reserved](#support-status) |






<a name="cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.server.serverpb.DecommissionPreCheckResponse.NodeCheckResult"></a>
#### DecommissionPreCheckResponse.NodeCheckResult



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| node_id | [int32](#cockroach.server.serverpb.DecommissionPreCheckResponse-int32) | | | [reserved](#support-status) |
| decommission_readiness | [DecommissionPreCheckResponse.NodeReadiness](#cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.server.serverpb.DecommissionPreCheckResponse.NodeReadiness) | | | [reserved](#support-status) |
| liveness_status | [cockroach.kv.kvserver.liveness.livenesspb.NodeLivenessStatus](#cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.kv.kvserver.liveness.livenesspb.NodeLivenessStatus) | | The liveness status of the given node. | [reserved](#support-status) |
| replica_count | [int64](#cockroach.server.serverpb.DecommissionPreCheckResponse-int64) | | The number of total replicas on the node, computed by scanning range descriptors. | [reserved](#support-status) |
| checked_ranges | [DecommissionPreCheckResponse.RangeCheckResult](#cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.server.serverpb.DecommissionPreCheckResponse.RangeCheckResult) | repeated | The details and recorded traces from preprocessing each range with a replica on the checked nodes that resulted in error, up to the maximum specified in the request. | [reserved](#support-status) |





<a name="cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.server.serverpb.DecommissionPreCheckResponse.RangeCheckResult"></a>
#### DecommissionPreCheckResponse.RangeCheckResult



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| range_id | [int32](#cockroach.server.serverpb.DecommissionPreCheckResponse-int32) | | | [reserved](#support-status) |
| allocatorAction | [string](#cockroach.server.serverpb.DecommissionPreCheckResponse-string) | | The action determined by the allocator that is needed for the range. | [reserved](#support-status) |
| events | [TraceEvent](#cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.server.serverpb.TraceEvent) | repeated | All trace events collected while processing the range in the allocator. | [reserved](#support-status) |
| error | [string](#cockroach.server.serverpb.DecommissionPreCheckResponse-string) | | The error message from the allocator's processing, if any. | [reserved](#support-status) |





<a name="cockroach.server.serverpb.DecommissionPreCheckResponse-cockroach.server.serverpb.TraceEvent"></a>
#### TraceEvent



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| time | [google.protobuf.Timestamp](#cockroach.server.serverpb.DecommissionPreCheckResponse-google.protobuf.Timestamp) | | | [reserved](#support-status) |
| message | [string](#cockroach.server.serverpb.DecommissionPreCheckResponse-string) | | | [reserved](#support-status) |






## Decommission


8 changes: 8 additions & 0 deletions pkg/server/admin.go
@@ -2656,6 +2656,14 @@ func (s *adminServer) getStatementBundle(ctx context.Context, id int64, w http.R
	_, _ = io.Copy(w, &bundle)
}

// DecommissionPreCheck runs checks and returns the DecommissionPreCheckResponse
// for the given nodes.
func (s *systemAdminServer) DecommissionPreCheck(
	ctx context.Context, req *serverpb.DecommissionPreCheckRequest,
) (*serverpb.DecommissionPreCheckResponse, error) {
	return nil, grpcstatus.Errorf(codes.Unimplemented, "method DecommissionPreCheck not implemented")
}

// DecommissionStatus returns the DecommissionStatus for all or the given nodes.
func (s *systemAdminServer) DecommissionStatus(
	ctx context.Context, req *serverpb.DecommissionStatusRequest,
27 changes: 27 additions & 0 deletions pkg/server/admin_test.go
@@ -2442,6 +2442,33 @@ func TestEndpointTelemetryBasic(t *testing.T) {
)))
}

// TestDecommissionPreCheck tests the basic functionality of the
// DecommissionPreCheck endpoint.
func TestDecommissionPreCheck(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	skip.UnderRace(t) // can't handle 7-node clusters

	ctx := context.Background()
	tc := serverutils.StartNewTestCluster(t, 7, base.TestClusterArgs{
		ReplicationMode: base.ReplicationManual, // saves time
	})
	defer tc.Stopper().Stop(ctx)

	adminSrv := tc.Server(4)
	conn, err := adminSrv.RPCContext().GRPCDialNode(
		adminSrv.RPCAddr(), adminSrv.NodeID(), rpc.DefaultClass).Connect(ctx)
	require.NoError(t, err)
	adminClient := serverpb.NewAdminClient(conn)

	resp, err := adminClient.DecommissionPreCheck(ctx, &serverpb.DecommissionPreCheckRequest{
		NodeIDs: []roachpb.NodeID{tc.Server(5).NodeID()},
	})
	require.Error(t, err)
	require.Equal(t, codes.Unimplemented, status.Code(err))
	require.Nil(t, resp)
}

func TestDecommissionSelf(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
64 changes: 64 additions & 0 deletions pkg/server/serverpb/admin.proto
@@ -479,6 +479,65 @@ message DrainResponse {
  reserved 1;
}

// DecommissionPreCheckRequest requests that preliminary checks be run to
// ensure that the specified node(s) can be decommissioned successfully.
message DecommissionPreCheckRequest {
  repeated int32 node_ids = 1 [(gogoproto.customname) = "NodeIDs",
    (gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"];

  // The maximum number of ranges for which to report errors.
  int32 num_replica_report = 2;

  // If true, all ranges on the checked nodes must only need replacement or
  // removal for decommissioning.
  bool strict_readiness = 3;
}

// DecommissionPreCheckResponse returns the number of replicas that encountered
// errors when running preliminary decommissioning checks, as well as the
// associated error messages and traces, for each node.
message DecommissionPreCheckResponse {
  enum NodeReadiness {
    UNKNOWN = 0;
    READY = 1;
    ALREADY_DECOMMISSIONED = 2;
    ALLOCATION_ERRORS = 3;
  }

  message RangeCheckResult {
    int32 range_id = 1 [ (gogoproto.customname) = "RangeID",
      (gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.RangeID"];
    // The action determined by the allocator that is needed for the range.
    string allocatorAction = 2;
    // All trace events collected while processing the range in the allocator.
    repeated TraceEvent events = 3;
    // The error message from the allocator's processing, if any.
    string error = 4;
  }

  message NodeCheckResult {
    int32 node_id = 1 [ (gogoproto.customname) = "NodeID",
      (gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"];

    NodeReadiness decommission_readiness = 2;

    // The liveness status of the given node.
    kv.kvserver.liveness.livenesspb.NodeLivenessStatus liveness_status = 3;

    // The number of total replicas on the node, computed by scanning range
    // descriptors.
    int64 replica_count = 4;

    // The details and recorded traces from preprocessing each range with a
    // replica on the checked nodes that resulted in error, up to the maximum
    // specified in the request.
    repeated RangeCheckResult checked_ranges = 5 [(gogoproto.nullable) = false];
  }

  // Status of the preliminary decommission checks across nodes.
  repeated NodeCheckResult checked_nodes = 1 [(gogoproto.nullable) = false];
}
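
To illustrate how a consumer might read the response shape defined above, here is a minimal sketch that mirrors a subset of the messages as plain Go structs and lists the nodes not reported as `READY`. It is not part of the commit and does not use the generated `serverpb` code: the Go type and constant names are hypothetical, only the enum values and field meanings come from the proto, and reporting every non-`READY` node (including `ALREADY_DECOMMISSIONED`) is an assumption.

```go
package main

import "fmt"

// NodeReadiness mirrors the enum values from the proto; the Go identifiers
// are hypothetical and differ from what the proto compiler would generate.
type NodeReadiness int32

const (
	ReadinessUnknown       NodeReadiness = 0
	Ready                  NodeReadiness = 1
	AlreadyDecommissioned  NodeReadiness = 2
	AllocationErrors       NodeReadiness = 3
)

func (r NodeReadiness) String() string {
	switch r {
	case Ready:
		return "READY"
	case AlreadyDecommissioned:
		return "ALREADY_DECOMMISSIONED"
	case AllocationErrors:
		return "ALLOCATION_ERRORS"
	default:
		return "UNKNOWN"
	}
}

// RangeCheckResult and NodeCheckResult mirror a subset of the proto fields.
type RangeCheckResult struct {
	RangeID         int32
	AllocatorAction string
	Error           string
}

type NodeCheckResult struct {
	NodeID        int32
	Readiness     NodeReadiness
	ReplicaCount  int64
	CheckedRanges []RangeCheckResult
}

// notReady returns one summary line per node whose readiness is not READY.
func notReady(nodes []NodeCheckResult) []string {
	var out []string
	for _, n := range nodes {
		if n.Readiness == Ready {
			continue
		}
		out = append(out, fmt.Sprintf("n%d: %s (%d replicas, %d ranges with errors reported)",
			n.NodeID, n.Readiness, n.ReplicaCount, len(n.CheckedRanges)))
	}
	return out
}

func main() {
	nodes := []NodeCheckResult{
		{NodeID: 4, Readiness: Ready, ReplicaCount: 30},
		{NodeID: 5, Readiness: AllocationErrors, ReplicaCount: 12,
			CheckedRanges: []RangeCheckResult{{RangeID: 7, AllocatorAction: "replace", Error: "no target"}}},
	}
	for _, line := range notReady(nodes) {
		fmt.Println(line)
	}
}
```

Since `checked_ranges` is capped by `num_replica_report` in the request, the count of reported ranges can be lower than the number of ranges that actually hit errors.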

// DecommissionStatusRequest requests the decommissioning status for the
// specified or, if none are specified, all nodes.
message DecommissionStatusRequest {
@@ -1072,6 +1131,11 @@ service Admin {
  rpc Drain(DrainRequest) returns (stream DrainResponse) {
  }

  // DecommissionPreCheck requests that the server execute preliminary checks
  // to evaluate the possibility of successfully decommissioning a given node.
  rpc DecommissionPreCheck(DecommissionPreCheckRequest) returns (DecommissionPreCheckResponse) {
  }

  // Decommission puts the node(s) into the specified decommissioning state.
  // If this ever becomes exposed via HTTP, ensure that it performs
  // authorization. See #42567.
