
roachprod: validate host before running command #98076

Merged: 1 commit merged into cockroachdb:master from rc/roachprod-validate-host on Mar 9, 2023

Conversation

@renatolabs renatolabs (Contributor) opened this pull request:

This commit wraps every command run on a remote roachprod cluster with a check that the node still belongs to the cluster roachprod thinks it does. Specifically, it stops roachprod commands from interfering with unrelated clusters when the roachprod cache is out of date.

Consider the following set of steps:

```console
$ roachprod create cluster-a -n 1 --lifetime 1h
$ roachprod put cluster-a ./cockroach ./cockroach
$ roachprod start cluster-a
```

One hour later, `cluster-a`'s lifetime expires and the VM is destroyed. Another user (or a TeamCity agent) creates another cluster:

```console
$ roachprod create cluster-b -n 1
$ roachprod put cluster-b ./cockroach ./cockroach
$ roachprod start cluster-b
```

In the process of creating `cluster-b`, it is possible that the public IP for `cluster-a`'s VM is reused and assigned to `cluster-b`'s VM. To simulate this situation on a single client, we can manually edit roachprod's cache.

Now suppose the creator of `cluster-a`, not knowing that the cluster has expired and operating with a stale cache, runs:

```console
$ roachprod stop cluster-a
```

This will unintentionally stop cockroach on `cluster-b`! A client with an up-to-date cache will see the following output:

```console
$ roachprod status cluster-b
cluster-b: status 1/1
17:14:31 main.go:518:    1: not running
```

In addition, from `cluster-b`'s perspective, it is confusing why the cockroach process died -- all we know is that it was killed and the process exited with code 137:

```console
$ roachprod run cluster-b 'sudo journalctl' | grep cockroach
Mar 06 17:18:33 renato-cluster-b-0001 systemd[1]: Starting /usr/bin/bash ./cockroach.sh run...
Mar 06 17:18:33 renato-cluster-b-0001 bash[13384]: cockroach start: Mon Mar  6 17:18:33 UTC 2023, logging to logs
Mar 06 17:18:34 renato-cluster-b-0001 systemd[1]: Started /usr/bin/bash ./cockroach.sh run.
Mar 06 17:19:04 renato-cluster-b-0001 bash[13381]: ./cockroach.sh: line 67: 13386 Killed                  "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Mar 06 17:19:04 renato-cluster-b-0001 bash[13617]: cockroach exited with code 137: Mon Mar  6 17:19:04 UTC 2023
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Failed with result 'exit-code'.
```

With the changes in this commit, the `roachprod stop` call will now fail, since the hostnames don't match:

```console
$ roachprod stop cluster-a
cluster-a: stopping and waiting 1/1
0: COMMAND_PROBLEM: exit status 1
(1) COMMAND_PROBLEM
Wraps: (2) exit status 1
Error types: (1) errors.Cmd (2) *exec.ExitError: expected host to be part of cluster-a, but is cluster-b-0001
```

Finally, `roachprod ssh` calls will now also fail if the hostnames do not match, instead of silently connecting to the wrong cluster.
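
Conceptually, the wrapper prepends a hostname check to every remote command. Below is a minimal, self-contained sketch of the idea, pieced together from the code snippets quoted in the review comments further down; the helper names and the exact script body are illustrative assumptions, not the PR's verbatim code:

```go
package main

import "fmt"

type Node int

type SyncedCluster struct{ Name string }

// vmName approximates roachprod's vm.Name helper: cluster name plus a
// zero-padded node index (an assumption inferred from hostnames like
// "cluster-b-0001" seen in the output above).
func vmName(cluster string, node int) string {
	return fmt.Sprintf("%s-%04d", cluster, node)
}

// validateHostnameCmd wraps cmd in a shell preamble that verifies the
// remote hostname matches the node we think we are talking to. If the
// hostnames don't match, the script fails before cmd ever runs.
func (c *SyncedCluster) validateHostnameCmd(cmd string, node Node) string {
	expected := vmName(c.Name, int(node))
	return fmt.Sprintf(`if [[ $(hostname) != '%s' ]]; then
  echo "expected host to be part of %s, but is $(hostname)" >&2
  exit 1
fi
%s`, expected, c.Name, cmd)
}

func main() {
	c := &SyncedCluster{Name: "cluster-a"}
	fmt.Println(c.validateHostnameCmd("./cockroach node status", 1))
}
```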

Resolves #89437.
Resolves #63637.

Release note: None

@renatolabs renatolabs requested a review from a team as a code owner March 6, 2023 18:36
@renatolabs renatolabs requested review from herkolategan and srosenberg and removed request for a team March 6, 2023 18:36

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch 2 times, most recently from 13b3021 to 3de9c69 on March 6, 2023 20:04
@herkolategan herkolategan (Collaborator) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 331 at r2 (raw file):

```go
// By wrapping every command with a hostname check as is done here, we
// ensure that the cached cluster information is still correct.
func (c *SyncedCluster) validateHost(cmd string, node Node) string {
```

Nit: The name of this function could be a bit more descriptive to let the caller know it's not actually validating the host, but rather supplying a wrapped command that can do so (I guess the return type is somewhat indicative of this). But having tried to think of a better name, I couldn't really come up with one.


pkg/roachprod/install/cluster_synced.go line 343 at r2 (raw file):

```go
	}

	return fmt.Sprintf(`
```

I almost want to recommend using `text/template` for the contents of this function, but I think here it might be more effort than it's worth. And the `elseBranch` logic might be a challenge. Since the `exit 1` short-circuits the script, the branch might not be required (just append the command). But you most likely already considered this.
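
For illustration only, the `text/template` variant being weighed here might look roughly like the sketch below (an assumption about shape, not code from the PR; as noted, the `exit 1` short-circuits the script, so the command can simply be appended rather than placed in an else branch):

```go
package main

import (
	"os"
	"text/template"
)

// wrapperTmpl is a hypothetical text/template rendering of the
// hostname-check wrapper discussed in this thread.
var wrapperTmpl = template.Must(template.New("wrapper").Parse(
	`if [[ $(hostname) != '{{.Expected}}' ]]; then
  echo "expected host to be part of {{.Cluster}}, but is $(hostname)" >&2
  exit 1
fi
{{.Cmd}}
`))

func main() {
	// No else branch: the exit 1 above short-circuits, so the wrapped
	// command is appended directly after the check.
	_ = wrapperTmpl.Execute(os.Stdout, struct {
		Expected, Cluster, Cmd string
	}{"cluster-a-0001", "cluster-a", "./cockroach node status"})
}
```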

@herkolategan herkolategan (Collaborator) left a comment:

LGTM. Left a few general comments, but they are just for thought, not necessarily actionable.

Reviewed 2 of 3 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)

@srosenberg srosenberg (Member) left a comment:

Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/roachprod/install/cluster_synced.go line 331 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

> Nit: The name of this function could be a bit more descriptive to let the caller know it's not actually validating the host, but rather supplying a wrapped command that can do so (I guess the return type is somewhat indicative of this). But having tried to think of a better name, I couldn't really come up with one.

validateHostnameCmd?


pkg/roachprod/install/cluster_synced.go line 332 at r3 (raw file):

```go
// ensure that the cached cluster information is still correct.
func (c *SyncedCluster) validateHost(cmd string, node Node) string {
	isValidHost := fmt.Sprintf("[[ `hostname` == '%s' ]]", vm.Name(c.Name, int(node)))
```

I see two potential issues with this type of check:

  • is `hostname` guaranteed to be set to the expected value across all supported cloud providers?
  • what about the case when a fresh cluster with the same name is created?

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from 3de9c69 to 64e84ab on March 7, 2023 20:34
@renatolabs renatolabs (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 332 at r3 (raw file):

> is hostname guaranteed to be set to the expected value across all supported cloud providers?

Good call; the default hostname on AWS is indeed different, but it's thankfully easy to update (see the changes to the start script now included in this PR); I confirmed we get the same behavior on AWS now.

As for Azure, I think it should also work without changes based on their documentation, but I'll run some tests soon to verify.

> what about the case when a fresh cluster with the same name is created?

This will indeed not reject commands if 1) the new cluster has the same name; and 2) the old one was deleted and re-created and the same IP was reused.

I'm struggling to think of a scenario where this could actually happen. The timing of the events would need to be very precise; coupled with the facts that 1) we typically prefix cluster names with our usernames and 2) TeamCity cluster names are unique, I think the hostname check is a fairly reasonable compromise.

That said, I'm open to other solutions if we think there's a better path here. I see that one comment in the original issue suggested using instance metadata or labels. That would not be susceptible to your second issue, but it has a couple of downsides: we would rely on cloud-specific code to do the check, and querying that data would add potentially non-negligible latency to every remote session.

Thoughts?
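
For context, the instance-metadata alternative mentioned above would mean a per-session, cloud-specific lookup. A sketch of what that looks like on GCE (the endpoint and header below are GCE's documented metadata API; AWS and Azure each require different endpoints, which is exactly the coupling concern):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// gceInstanceName fetches the instance name from the GCE metadata
// server. Each remote session would pay this extra round trip, and
// every cloud provider would need its own variant of this code.
func gceInstanceName() (string, error) {
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/instance/name", nil)
	if err != nil {
		return "", err
	}
	// The GCE metadata server rejects requests without this header.
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	name, err := io.ReadAll(resp.Body)
	return string(name), err
}

func main() {
	name, err := gceInstanceName()
	if err != nil {
		fmt.Println("metadata lookup failed:", err)
		return
	}
	fmt.Println("instance name:", name)
}
```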

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from 64e84ab to dd5595f on March 7, 2023 21:48
@renatolabs renatolabs (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 331 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

> validateHostnameCmd?

Updated to `validateHostnameCmd`.


pkg/roachprod/install/cluster_synced.go line 343 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

> I almost want to recommend using text/template for the contents of this function, but I think here it might be more effort than it's worth. And the elseBranch logic might be a challenge. Since the exit 1 short circuits the script, the branch might not be required (just append the command). But you most likely already considered this.

I thought about it and came to the same conclusion as you: it seemed like more work than it's worth. I believe the script here is small enough that it's relatively straightforward to understand what is going on; we can revisit if we find ourselves updating this logic (which I hope won't happen for a long time).

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from dd5595f to f0d2e46 on March 7, 2023 22:03
@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from f0d2e46 to 465a0f4 on March 7, 2023 22:19
@srosenberg srosenberg (Member) left a comment:

:lgtm:

Thanks for testing this in AWS and Azure!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/roachprod/install/cluster_synced.go line 332 at r3 (raw file):

> I think the hostname check is a fairly reasonable compromise.

Yep, I completely agree. I only raised it as a potential, secondary issue. It's indeed rare, and we have safeguards like unique cluster names. The contrived scenario is someone creating a shared cluster, say `$team-test`; someone else destroys it; subsequently it's created again, and the original creator, with a stale (local) cache, runs `roachprod stop` or something else destructive. :)

> That said, I'm open to other solutions if we think there's a better path here.

Considering that (a) the hostname check works everywhere, (b) the contrived scenario has extremely low probability, and (c) other workarounds add more overhead, more coupling, or both, I am perfectly content with the one in this PR!

@srosenberg srosenberg (Member) left a comment:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/roachprod/vm/aws/support.go line 154 at r6 (raw file):

```bash
# set hostname according to the name used by roachprod. There's host
# validation logic that relies on this -- see comment on cluster_synced.go
sudo hostnamectl set-hostname {{.VMName}}
```

Let's add `--static` so that it survives a reboot. None of the existing roachtests do this today, although tpccbench has (or had) that option, I believe. Manually created clusters could in theory be rebooted, albeit not frequently.

@renatolabs renatolabs (Contributor, Author) left a comment:

Since this change has the potential for significant disruption (it changes the actual command run in every session), I ran a nightly build on this branch on both GCP [1] and AWS [2]. Everything looks good; the failures observed are the same ones we see on master, so this is ready to merge.

I also ran a manual test on Azure, and the hostname validation works without changes.

[1] https://teamcity.cockroachdb.com/viewLog.html?buildId=8962079&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=8972458&buildTypeId=Cockroach_Nightlies_RoachtestNightlyAwsBazel

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/roachprod/vm/aws/support.go line 154 at r6 (raw file):
According to the documentation for `set-hostname` [1]:

> Set the system hostname to NAME. By default, this will alter the pretty, the static, and the transient hostname alike; however, if one or more of --static, --transient, --pretty are used, only the selected hostnames are changed.

So what we're doing here is a superset of what would be done if we passed `--static`. I verified that rebooting an AWS VM created by roachprod does not reset the hostname.

[1] https://manpages.ubuntu.com/manpages/xenial/man1/hostnamectl.1.html

@renatolabs renatolabs (Contributor, Author) commented:

bors r=herkolategan,srosenberg

TFTRs!

@craig craig bot (Contributor) commented on Mar 9, 2023:

Build succeeded.

@craig craig bot merged commit d629fbf into cockroachdb:master Mar 9, 2023
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Mar 15, 2023
In cockroachdb#98076, we started validating hostnames before running any commands
to avoid situations where a stale cache could lead to unintended
interference with other clusters due to public IP reuse. The check
relies on the VM's `hostname` matching the expected cluster name in
the cache. GCP and Azure clusters set the hostname to the instance
name by default, but that is not the case for AWS; the aforementioned
PR explicitly sets the hostname when the instance is created.

However, in the case of long running AWS clusters (created before host
validation was introduced) or clusters that are created with an
outdated version of `roachprod`, the hostname will still be the
default AWS hostname, and any interaction with that cluster will fail
if using a recent `roachprod` version. To remedy this situation, this
commit includes:

* better error reporting. When we attempt to run a command on an AWS
cluster and host validation fails, we display a message to the user
explaining that their hostnames may need fixing.

* if the user confirms that the cluster still exists (by running
`roachprod list`), they are able to automatically fix the hostnames to
the expected value by running a new `fix-long-running-aws-hostnames`
command. This is a temporary workaround that should be removed once we
no longer have clusters that would be affected by this issue.

This commit will be reverted once we no longer have clusters created
with the default hostnames; this will be easier to achieve once we
have an easy way for everyone to upgrade their `roachprod` (see cockroachdb#97311).

Epic: none

Release note: None
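
The shape of the improved error reporting described above might look something like the following sketch (entirely hypothetical names and wording; the actual change lives in the follow-up PR #98682):

```go
package main

import (
	"errors"
	"fmt"
)

// errHostnameMismatch stands in for the host-validation failure.
var errHostnameMismatch = errors.New("hostname validation failed")

// maybeAddAWSHint augments a hostname-validation failure on AWS
// clusters with a pointer to the workaround command mentioned above.
func maybeAddAWSHint(err error, provider string) error {
	if provider == "aws" && errors.Is(err, errHostnameMismatch) {
		return fmt.Errorf("%w; if `roachprod list` shows the cluster still exists, "+
			"its hostnames can be fixed with the fix-long-running-aws-hostnames command", err)
	}
	return err
}

func main() {
	fmt.Println(maybeAddAWSHint(errHostnameMismatch, "aws"))
}
```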
craig bot pushed a commit that referenced this pull request Mar 16, 2023
98682: roachprod: provide workaround for long-running AWS clusters r=srosenberg a=renatolabs


98717: tree: fix tuple encoding performance regression r=mgartner a=mgartner

This commit fixes a performance regression in pgwire encoding of tuples
introduced in #95009.

Informs #98306

Epic: None

Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>