
roachprod: validate host before running command #98076

Merged: 1 commit merged into cockroachdb:master from rc/roachprod-validate-host on Mar 9, 2023

Conversation

@renatolabs renatolabs (Contributor) opened this pull request:

This commit wraps every command run on a remote roachprod cluster with a check that the node still belongs to the cluster roachprod thinks it does. Specifically, it stops roachprod commands from interfering with unrelated clusters when the roachprod cache is out of date.

Consider the following set of steps:

```console
$ roachprod create cluster-a -n 1 --lifetime 1h
$ roachprod put cluster-a ./cockroach ./cockroach
$ roachprod start cluster-a
```

One hour later, `cluster-a`'s lifetime expires and the VM is destroyed. Another user (or a TeamCity agent) creates another cluster:

```console
$ roachprod create cluster-b -n 1
$ roachprod put cluster-b ./cockroach ./cockroach
$ roachprod start cluster-b
```

In the process of creating `cluster-b`, it is possible that the public IP for `cluster-a`'s VM is reused and assigned to `cluster-b`'s VM. To simulate this situation on a single client, we can manually edit roachprod's cache.

Now suppose the creator of `cluster-a`, not knowing that the cluster has expired and operating with a stale cache, runs:

```console
$ roachprod stop cluster-a
```

This will unintentionally stop cockroach on `cluster-b`! A client with an up-to-date cache will see the following output:

```console
$ roachprod status cluster-b
cluster-b: status 1/1
17:14:31 main.go:518:    1: not running
```

In addition, from `cluster-b`'s perspective, it is confusing why the cockroach process died -- all we know is that it was killed and the process exited with code 137:

```console
$ roachprod run cluster-b 'sudo journalctl' | grep cockroach
Mar 06 17:18:33 renato-cluster-b-0001 systemd[1]: Starting /usr/bin/bash ./cockroach.sh run...
Mar 06 17:18:33 renato-cluster-b-0001 bash[13384]: cockroach start: Mon Mar  6 17:18:33 UTC 2023, logging to logs
Mar 06 17:18:34 renato-cluster-b-0001 systemd[1]: Started /usr/bin/bash ./cockroach.sh run.
Mar 06 17:19:04 renato-cluster-b-0001 bash[13381]: ./cockroach.sh: line 67: 13386 Killed                  "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Mar 06 17:19:04 renato-cluster-b-0001 bash[13617]: cockroach exited with code 137: Mon Mar  6 17:19:04 UTC 2023
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Failed with result 'exit-code'.
```

With the changes in this commit, the `roachprod stop` call will now fail, since the hostnames don't match:

```console
$ roachprod stop cluster-a
cluster-a: stopping and waiting 1/1
0: COMMAND_PROBLEM: exit status 1
(1) COMMAND_PROBLEM
Wraps: (2) exit status 1
Error types: (1) errors.Cmd (2) *exec.ExitError: expected host to be part of cluster-a, but is cluster-b-0001
```

Finally, `roachprod ssh` calls will now also fail if the hostnames do not match, instead of silently connecting to the wrong cluster.
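
Conceptually, the wrapper prepends a hostname check to every remote command. Below is a minimal, self-contained sketch of the idea, pieced together from the code snippets quoted in the review comments further down; the helper names and the exact script body are illustrative assumptions, not the PR's verbatim code:

```go
package main

import "fmt"

type Node int

type SyncedCluster struct{ Name string }

// vmName approximates roachprod's vm.Name helper: cluster name plus a
// zero-padded node index (an assumption inferred from hostnames like
// "cluster-b-0001" seen in the output above).
func vmName(cluster string, node int) string {
	return fmt.Sprintf("%s-%04d", cluster, node)
}

// validateHostnameCmd wraps cmd in a shell preamble that verifies the
// remote hostname matches the node we think we are talking to. If the
// hostnames don't match, the script fails before cmd ever runs.
func (c *SyncedCluster) validateHostnameCmd(cmd string, node Node) string {
	expected := vmName(c.Name, int(node))
	return fmt.Sprintf(`if [[ $(hostname) != '%s' ]]; then
  echo "expected host to be part of %s, but is $(hostname)" >&2
  exit 1
fi
%s`, expected, c.Name, cmd)
}

func main() {
	c := &SyncedCluster{Name: "cluster-a"}
	fmt.Println(c.validateHostnameCmd("./cockroach node status", 1))
}
```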

Resolves #89437.
Resolves #63637.

Release note: None

@renatolabs renatolabs requested a review from a team as a code owner March 6, 2023 18:36
@renatolabs renatolabs requested review from herkolategan and srosenberg and removed request for a team March 6, 2023 18:36

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch 2 times, most recently from 13b3021 to 3de9c69 on March 6, 2023 20:04
@herkolategan herkolategan (Collaborator) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 331 at r2 (raw file):

```go
// By wrapping every command with a hostname check as is done here, we
// ensure that the cached cluster information is still correct.
func (c *SyncedCluster) validateHost(cmd string, node Node) string {
```

Nit: The name of this function could be a bit more descriptive to let the caller know it's not actually validating the host, but rather supplying a wrapped command that can do so (I guess the return type is somewhat indicative of this). But having tried to think of a better name, I couldn't really come up with one.


pkg/roachprod/install/cluster_synced.go line 343 at r2 (raw file):

```go
	}

	return fmt.Sprintf(`
```

I almost want to recommend using `text/template` for the contents of this function, but I think here it might be more effort than it's worth. And the `elseBranch` logic might be a challenge. Since the `exit 1` short-circuits the script, the branch might not be required (just append the command). But you most likely already considered this.
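
For illustration only, the `text/template` variant being weighed here might look roughly like the sketch below (an assumption about shape, not code from the PR; as noted, the `exit 1` short-circuits the script, so the command can simply be appended rather than placed in an else branch):

```go
package main

import (
	"os"
	"text/template"
)

// wrapperTmpl is a hypothetical text/template rendering of the
// hostname-check wrapper discussed in this thread.
var wrapperTmpl = template.Must(template.New("wrapper").Parse(
	`if [[ $(hostname) != '{{.Expected}}' ]]; then
  echo "expected host to be part of {{.Cluster}}, but is $(hostname)" >&2
  exit 1
fi
{{.Cmd}}
`))

func main() {
	// No else branch: the exit 1 above short-circuits, so the wrapped
	// command is appended directly after the check.
	_ = wrapperTmpl.Execute(os.Stdout, struct {
		Expected, Cluster, Cmd string
	}{"cluster-a-0001", "cluster-a", "./cockroach node status"})
}
```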

@herkolategan herkolategan (Collaborator) left a comment:

LGTM. Left a few general comments, but they are just for thought, not necessarily actionable.

Reviewed 2 of 3 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)

@srosenberg srosenberg (Member) left a comment:

Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/roachprod/install/cluster_synced.go line 331 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

> Nit: The name of this function could be a bit more descriptive to let the caller know it's not actually validating the host, but rather supplying a wrapped command that can do so (I guess the return type is somewhat indicative of this). But having tried to think of a better name, I couldn't really come up with one.

validateHostnameCmd?


pkg/roachprod/install/cluster_synced.go line 332 at r3 (raw file):

```go
// ensure that the cached cluster information is still correct.
func (c *SyncedCluster) validateHost(cmd string, node Node) string {
	isValidHost := fmt.Sprintf("[[ `hostname` == '%s' ]]", vm.Name(c.Name, int(node)))
```

I see two potential issues with this type of check:

  • is `hostname` guaranteed to be set to the expected value across all supported cloud providers?
  • what about the case when a fresh cluster with the same name is created?

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from 3de9c69 to 64e84ab on March 7, 2023 20:34
@renatolabs renatolabs (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 332 at r3 (raw file):

> is hostname guaranteed to be set to the expected value across all supported cloud providers?

Good call; the default hostname on AWS is indeed different, but it's thankfully easy to update (see the changes to the start script now included in this PR); I confirmed we get the same behavior on AWS now.

As for Azure, I think it should also work without changes based on their documentation, but I'll run some tests soon to verify.

> what about the case when a fresh cluster with the same name is created?

This will indeed not reject commands if 1) the new cluster has the same name; and 2) the old one was deleted and re-created and the same IP was reused.

I'm struggling to think of a scenario where this could actually happen. The timing of the events would need to be very precise; coupled with the facts that 1) we typically prefix cluster names with our usernames and 2) TeamCity cluster names are unique, I think the hostname check is a fairly reasonable compromise.

That said, I'm open to other solutions if we think there's a better path here. I see that one comment in the original issue suggested using instance metadata or labels. That would not be susceptible to your second issue, but it has a couple of downsides: we would rely on cloud-specific code to do the check, and querying that data would add potentially non-negligible latency to every remote session.

Thoughts?
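
For context, the instance-metadata alternative mentioned above would mean a per-session, cloud-specific lookup. A sketch of what that looks like on GCE (the endpoint and header below are GCE's documented metadata API; AWS and Azure each require different endpoints, which is exactly the coupling concern):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// gceInstanceName fetches the instance name from the GCE metadata
// server. Each remote session would pay this extra round trip, and
// every cloud provider would need its own variant of this code.
func gceInstanceName() (string, error) {
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/instance/name", nil)
	if err != nil {
		return "", err
	}
	// The GCE metadata server rejects requests without this header.
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	name, err := io.ReadAll(resp.Body)
	return string(name), err
}

func main() {
	name, err := gceInstanceName()
	if err != nil {
		fmt.Println("metadata lookup failed:", err)
		return
	}
	fmt.Println("instance name:", name)
}
```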

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from 64e84ab to dd5595f on March 7, 2023 21:48
@renatolabs renatolabs (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 331 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

> validateHostnameCmd?

Updated to `validateHostnameCmd`.


pkg/roachprod/install/cluster_synced.go line 343 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

> I almost want to recommend using text/template for the contents of this function, but I think here it might be more effort than it's worth. And the elseBranch logic might be a challenge. Since the exit 1 short circuits the script, the branch might not be required (just append the command). But you most likely already considered this.

I thought about it and came to the same conclusion as you: it seemed like more work than it's worth. I believe the script here is small enough that it's relatively straightforward to understand what is going on; we can revisit if we find ourselves updating this logic (which I hope won't happen for a long time).

@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from dd5595f to f0d2e46 on March 7, 2023 22:03
@renatolabs renatolabs force-pushed the rc/roachprod-validate-host branch from f0d2e46 to 465a0f4 on March 7, 2023 22:19
@srosenberg srosenberg (Member) left a comment:

:lgtm:

Thanks for testing this in AWS and Azure!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/roachprod/install/cluster_synced.go line 332 at r3 (raw file):

> I think the hostname check is a fairly reasonable compromise.

Yep, I completely agree. I only raised it as a potential, secondary issue. It's indeed rare, and we have safeguards like unique cluster names. The contrived scenario is someone creating a shared cluster, say `$team-test`; someone else destroys it; subsequently it's created again, and the original creator, with a stale (local) cache, runs `roachprod stop` or something else destructive. :)

> That said, I'm open to other solutions if we think there's a better path here.

Considering that (a) the hostname check works everywhere, (b) the contrived scenario has extremely low probability, and (c) other workarounds add more overhead, more coupling, or both, I am perfectly content with the one in this PR!

@srosenberg srosenberg (Member) left a comment:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/roachprod/vm/aws/support.go line 154 at r6 (raw file):

```bash
# set hostname according to the name used by roachprod. There's host
# validation logic that relies on this -- see comment on cluster_synced.go
sudo hostnamectl set-hostname {{.VMName}}
```

Let's add `--static` so that it survives a reboot. None of the existing roachtests do this today, although tpccbench has (or had) that option, I believe. Manually created clusters could in theory be rebooted, albeit not frequently.

@renatolabs renatolabs (Contributor, Author) left a comment:

Since this change has the potential for significant disruption (it changes the actual command run in every session), I ran a nightly build on this branch on both GCP [1] and AWS [2]. Everything looks good; the failures observed are the same ones we see on master, so this is ready to merge.

I also ran a manual test on Azure, and the hostname validation works without changes.

[1] https://teamcity.cockroachdb.com/viewLog.html?buildId=8962079&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=8972458&buildTypeId=Cockroach_Nightlies_RoachtestNightlyAwsBazel

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/roachprod/vm/aws/support.go line 154 at r6 (raw file):
According to the documentation for `set-hostname` [1]:

> Set the system hostname to NAME. By default, this will alter the pretty, the static, and the transient hostname alike; however, if one or more of --static, --transient, --pretty are used, only the selected hostnames are changed.

So what we're doing here is a superset of what would be done if we passed `--static`. I verified that rebooting an AWS VM created by roachprod does not reset the hostname.

[1] https://manpages.ubuntu.com/manpages/xenial/man1/hostnamectl.1.html

@renatolabs renatolabs (Contributor, Author) commented:

bors r=herkolategan,srosenberg

TFTRs!

@craig craig bot (Contributor) commented on Mar 9, 2023:

Build succeeded.

@craig craig bot merged commit d629fbf into cockroachdb:master Mar 9, 2023
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Mar 15, 2023
In cockroachdb#98076, we started validating hostnames before running any commands
to avoid situations where a stale cache could lead to unintended
interference with other clusters due to public IP reuse. The check
relies on the VM's `hostname` matching the expected cluster name in
the cache. GCP and Azure clusters set the hostname to the instance
name by default, but that is not the case for AWS; the aforementioned
PR explicitly sets the hostname when the instance is created.

However, in the case of long running AWS clusters (created before host
validation was introduced) or clusters that are created with an
outdated version of `roachprod`, the hostname will still be the
default AWS hostname, and any interaction with that cluster will fail
if using a recent `roachprod` version. To remedy this situation, this
commit includes:

* better error reporting. When we attempt to run a command on an AWS
cluster and host validation fails, we display a message to the user
explaining that their hostnames may need fixing.

* if the user confirms that the cluster still exists (by running
`roachprod list`), they are able to automatically fix the hostnames to
the expected value by running a new `fix-long-running-aws-hostnames`
command. This is a temporary workaround that should be removed once we
no longer have clusters that would be affected by this issue.

This commit will be reverted once we no longer have clusters created
with the default hostnames; this will be easier to achieve once we
have an easy way for everyone to upgrade their `roachprod` (see cockroachdb#97311).

Epic: none

Release note: None
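
The shape of the improved error reporting described above might look something like the following sketch (entirely hypothetical names and wording; the actual change lives in the follow-up PR #98682):

```go
package main

import (
	"errors"
	"fmt"
)

// errHostnameMismatch stands in for the host-validation failure.
var errHostnameMismatch = errors.New("hostname validation failed")

// maybeAddAWSHint augments a hostname-validation failure on AWS
// clusters with a pointer to the workaround command mentioned above.
func maybeAddAWSHint(err error, provider string) error {
	if provider == "aws" && errors.Is(err, errHostnameMismatch) {
		return fmt.Errorf("%w; if `roachprod list` shows the cluster still exists, "+
			"its hostnames can be fixed with the fix-long-running-aws-hostnames command", err)
	}
	return err
}

func main() {
	fmt.Println(maybeAddAWSHint(errHostnameMismatch, "aws"))
}
```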
craig bot pushed a commit that referenced this pull request Mar 16, 2023
98682: roachprod: provide workaround for long-running AWS clusters r=srosenberg a=renatolabs


98717: tree: fix tuple encoding performance regression r=mgartner a=mgartner

This commit fixes a performance regression in pgwire encoding of tuples
introduced in #95009.

Informs #98306

Epic: None

Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>