roachprod: validate host before running command
This commit wraps every command run on a remote roachprod cluster with a check that the node still belongs to the cluster roachprod thinks it does. Specifically, it stops roachprod commands from interfering with unrelated clusters when the roachprod cache is out of date.

Consider the following set of steps:

```console
$ roachprod create cluster-a -n 1 --lifetime 1h
$ roachprod put cluster-a ./cockroach ./cockroach
$ roachprod start cluster-a
```

One hour later, `cluster-a`'s lifetime expires and the VM is destroyed. Another user (or a TeamCity agent) creates another cluster:

```console
$ roachprod create cluster-b -n 1
$ roachprod put cluster-b ./cockroach ./cockroach
$ roachprod start cluster-b
```

In the process of creating `cluster-b`, it is possible that the public IP of `cluster-a`'s VM is reused and assigned to `cluster-b`'s VM. To simulate that situation on a single client, we can manually edit roachprod's cache.

Now suppose the creator of `cluster-a`, not knowing that the cluster has expired and still holding a stale cache, runs:

```console
$ roachprod stop cluster-a
```

This will unintentionally stop cockroach on `cluster-b`! A client with an updated cache will see the following output:

```console
$ roachprod status cluster-b
cluster-b: status 1/1
17:14:31 main.go:518: 1: not running
```

In addition, from `cluster-b`'s perspective it is confusing why the cockroach process died: all we know is that it was killed and that the process exited with code 137:

```console
$ roachprod run cluster-b 'sudo journalctl' | grep cockroach
Mar 06 17:18:33 renato-cluster-b-0001 systemd[1]: Starting /usr/bin/bash ./cockroach.sh run...
Mar 06 17:18:33 renato-cluster-b-0001 bash[13384]: cockroach start: Mon Mar 6 17:18:33 UTC 2023, logging to logs
Mar 06 17:18:34 renato-cluster-b-0001 systemd[1]: Started /usr/bin/bash ./cockroach.sh run.
Mar 06 17:19:04 renato-cluster-b-0001 bash[13381]: ./cockroach.sh: line 67: 13386 Killed "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Mar 06 17:19:04 renato-cluster-b-0001 bash[13617]: cockroach exited with code 137: Mon Mar 6 17:19:04 UTC 2023
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Failed with result 'exit-code'.
```

With the changes in this commit, the `roachprod stop` call now fails because the hostnames don't match:

```console
$ roachprod stop cluster-a
cluster-a: stopping and waiting 1/1
0: COMMAND_PROBLEM: exit status 1
(1) COMMAND_PROBLEM
Wraps: (2) exit status 1
Error types: (1) errors.Cmd (2) *exec.ExitError: expected host to be part of cluster-a, but is cluster-b-0001
```

Finally, `roachprod ssh` calls now also fail if the hostnames do not match, instead of silently connecting to the wrong cluster.

Resolves #89437.
Resolves #63637.

Release note: None
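
For illustration only, the kind of host check described in this message could be sketched in POSIX shell as follows. This is not roachprod's actual implementation: the function names `host_in_cluster` and `run_validated` are hypothetical, and the `<cluster>-<digits>` hostname pattern is an assumption inferred from the `cluster-b-0001` hostname in the error output.

```shell
# Hypothetical sketch; these helpers are not part of roachprod.

# host_in_cluster CLUSTER HOSTNAME
# Succeeds when HOSTNAME has the form "<CLUSTER>-<digits>". That naming
# pattern is an assumption based on hosts such as "cluster-b-0001".
host_in_cluster() {
  # Strip the literal "<CLUSTER>-" prefix; if nothing was stripped,
  # the host does not start with the cluster name.
  suffix=${2#"$1"-}
  [ "$suffix" != "$2" ] || return 1
  # What remains must be a non-empty run of digits.
  case "$suffix" in
    '' | *[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# run_validated CLUSTER HOSTNAME COMMAND...
# Runs COMMAND only if HOSTNAME belongs to CLUSTER. Otherwise it prints
# an error modeled on the message quoted in this commit and returns 1.
run_validated() {
  cluster=$1
  host=$2
  shift 2
  if ! host_in_cluster "$cluster" "$host"; then
    echo "expected host to be part of $cluster, but is $host" >&2
    return 1
  fi
  "$@"
}
```

On a node, a call might look like `run_validated cluster-a "$(hostname)" <command>`, so that a stale cache entry pointing at another cluster's VM produces an error instead of running the command. Requiring an all-digit suffix also keeps a cluster named `cluster` from matching a host such as `cluster-b-0001`.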