cli: evaluate readiness prior to node decommission

This changes `cockroach node decommission` to run preliminary readiness checks before starting the decommission of the nodes. If the checks find that nodes are not ready for decommission, they report the errors observed and the nodes on which they occurred, so that the cluster's configuration can be rectified before reattempting node decommission.

The readiness checks are enabled by default, but can be controlled with the following new flags:

```
      --dry-run        Only evaluate decommission readiness and check decommission
                       status, without actually decommissioning the node.
      --checks string  Specifies how to evaluate readiness checks prior to node
                       decommission. Takes any of the following values:
                       - enabled  evaluate readiness prior to starting node decommission.
                       - strict   use strict readiness evaluation mode prior to node decommission.
                       - skip     skip readiness checks and immediately request node decommission.
```

Issues blocking decommission are presented grouped by node and error, e.g.

```
$ ./cockroach node decommission 1 4 5 --checks=enabled --insecure --dry-run

  id | is_live | replicas | is_decommissioning | membership | is_draining |     readiness     | blocking_ranges
-----+---------+----------+--------------------+------------+-------------+-------------------+------------------
   1 |  true   |       53 |       false        |   active   |    false    | allocation errors |              47
   4 |  true   |       52 |       false        |   active   |    false    | allocation errors |              46
   5 |  true   |       54 |       false        |   active   |    false    | allocation errors |              48
(3 rows)

ranges blocking decommission detected
n1 has 34 replicas blocked with error: "0 of 1 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); likely not enough nodes in cluster"
n1 has 13 replicas blocked with error: "0 of 1 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); replicas must match constraints [{+node1:1} {+node4:1} {+node5:1}]; voting replicas must match voter_constraints []"
n4 has 13 replicas blocked with error: "0 of 1 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); replicas must match constraints [{+node1:1} {+node4:1} {+node5:1}]; voting replicas must match voter_constraints []"
n4 has 33 replicas blocked with error: "0 of 1 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); likely not enough nodes in cluster"
n5 has 35 replicas blocked with error: "0 of 1 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); likely not enough nodes in cluster"
...more blocking errors detected.

ERROR: Cannot decommission nodes.
Failed running "node decommission"
```

Fixes: #91893

Release note (cli change): `cockroach node decommission` operations now preliminarily check the ability of the node to complete decommissioning, given the cluster configuration and the ranges with replicas present on the node. This step can be skipped by using the flag `--checks=skip`. When errors are detected that would result in the inability to complete node decommission, they will be printed to stderr and the command will exit, instead of marking the node as `decommissioning` and beginning the node decommission process. When the strict readiness evaluation mode is used by setting the flag `--checks=strict`, any ranges that need any preliminary actions prior to replacement for the decommission process (e.g. ranges that are not yet fully upreplicated) will block the decommission process.
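
For illustration only, the gating behavior described above can be sketched in Python. This is not the code in this commit (which is Go): `Checks`, `RangeCheck`, `blocking_summary`, `should_decommission`, and `format_summary` are hypothetical names, and the per-replica result shape is an assumption. The sketch only mirrors the described semantics: `skip` bypasses the checks, `enabled` blocks on allocation errors, `strict` additionally blocks on ranges that need preliminary action, and `--dry-run` never sends the decommission request.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Checks(Enum):
    """Values accepted by --checks, as listed in the flag help above."""
    ENABLED = "enabled"
    STRICT = "strict"
    SKIP = "skip"


@dataclass(frozen=True)
class RangeCheck:
    """Hypothetical per-replica readiness result (assumed shape)."""
    node_id: int
    error: str = ""             # allocation error text; empty if none
    needs_action: bool = False  # e.g. range not yet fully upreplicated


def blocking_summary(results, checks):
    """Count blocking replicas, grouped by (node, error), for a check mode."""
    counts = Counter()
    if checks is Checks.SKIP:
        return counts  # readiness is not evaluated at all
    for r in results:
        if r.error:
            counts[(r.node_id, r.error)] += 1
        elif checks is Checks.STRICT and r.needs_action:
            # Strict mode also blocks on ranges needing preliminary action.
            counts[(r.node_id, "range requires preliminary action")] += 1
    return counts


def should_decommission(results, checks, dry_run=False):
    """Return True only if the decommission request would be sent."""
    if dry_run:
        return False  # evaluate and report only
    return not blocking_summary(results, checks)


def format_summary(counts):
    """Render one line per (node, error) group, similar to the CLI output."""
    return [
        f'n{node} has {n} replicas blocked with error: "{err}"'
        for (node, err), n in sorted(counts.items())
    ]
```

Under these assumptions, a range that is merely under-replicated lets `--checks=enabled` proceed but stops `--checks=strict`, while any allocation error stops both.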