roachprod: validate host before running command #98076
Conversation
Force-pushed from 13b3021 to 3de9c69
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)
pkg/roachprod/install/cluster_synced.go
line 331 at r2 (raw file):
```go
// By wrapping every command with a hostname check as is done here, we
// ensure that the cached cluster information is still correct.
func (c *SyncedCluster) validateHost(cmd string, node Node) string {
```
Nit: The name of this function could be a bit more descriptive to let the caller know it's not actually validating the host, but rather supplying a wrapped command that can do so (I guess the return type is somewhat indicative of this). But having tried to think of a better name, I couldn't really come up with one.
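For context, here is a minimal, self-contained sketch of the kind of wrapper being discussed: a helper that takes a command and returns a shell snippet that first compares `hostname` against the VM name roachprod expects for the node, and only then runs the command. This is an illustrative reconstruction, not the actual roachprod code; the VM-name format (`<cluster>-<4-digit index>`) and the error text are assumptions based on fragments quoted elsewhere in this thread.

```go
package main

import "fmt"

// validateHostnameCmd is a rough, illustrative reconstruction (not the actual
// roachprod implementation): it wraps cmd so that the remote shell first checks
// that the machine's hostname matches the VM name expected for this node, and
// refuses to run cmd otherwise.
func validateHostnameCmd(cmd, clusterName string, node int) string {
	// Assumption: roachprod VM names look like "<cluster>-<4-digit node index>",
	// e.g. "cluster-b-0001", as seen in the journalctl output quoted in this PR.
	expectedHost := fmt.Sprintf("%s-%04d", clusterName, node)
	return fmt.Sprintf(`
if [[ $(hostname) == '%[1]s' ]]; then
  %[2]s
else
  echo "expected host to be part of %[3]s, but is $(hostname)" >&2
  exit 1
fi`, expectedHost, cmd, clusterName)
}

func main() {
	// Example: print the wrapped form of a simple command for node 1 of "cluster-a".
	fmt.Println(validateHostnameCmd("./cockroach node status", "cluster-a", 1))
}
```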
pkg/roachprod/install/cluster_synced.go
line 343 at r2 (raw file):
```go
}
return fmt.Sprintf(`
```
I almost want to recommend using `text/template` for the contents of this function, but I think here it might be more effort than it's worth. And the `elseBranch` logic might be a challenge. Since the `exit 1` short-circuits the script, the branch might not be required (just append the command). But you most likely already considered this.
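To make the `text/template` suggestion concrete, here is a sketch of what that alternative might look like, instead of building the script with `fmt.Sprintf` as in the sketch above. The template name, field names, and script shape are hypothetical; as noted in the comment, with `exit 1` short-circuiting the script the command can simply be appended after the check, so no else branch is needed.

```go
package main

import (
	"os"
	"text/template"
)

// hostValidationTmpl is a hypothetical template for the wrapper script; the
// actual PR builds the script with fmt.Sprintf instead.
var hostValidationTmpl = template.Must(template.New("validate-host").Parse(`
if [[ $(hostname) != '{{.ExpectedHost}}' ]]; then
  echo "expected host {{.ExpectedHost}}, but is $(hostname)" >&2
  exit 1
fi
{{.Cmd}}
`))

func main() {
	// Render the wrapper for an example command; because the check exits early
	// on mismatch, the command is simply appended after it.
	err := hostValidationTmpl.Execute(os.Stdout, struct {
		ExpectedHost, Cmd string
	}{
		ExpectedHost: "cluster-a-0001",
		Cmd:          "./cockroach node status",
	})
	if err != nil {
		panic(err)
	}
}
```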
LGTM, left a few comments in general, but just for thought, not necessarily actionable.
Reviewed 2 of 3 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)
Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/roachprod/install/cluster_synced.go
line 331 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Nit: The name of this function could be a bit more descriptive to let the caller know it's not actually validating the host, but rather supplying a wrapped command that can do so (I guess the return type is somewhat indicative of this). But having tried to think of a better name, I couldn't really come up with one.
`validateHostnameCmd`?
pkg/roachprod/install/cluster_synced.go
line 332 at r3 (raw file):
```go
// ensure that the cached cluster information is still correct.
func (c *SyncedCluster) validateHost(cmd string, node Node) string {
	isValidHost := fmt.Sprintf("[[ `hostname` == '%s' ]]", vm.Name(c.Name, int(node)))
```
I see two potential issues with this type of check:

- Is `hostname` guaranteed to be set to the expected value across all supported cloud providers?
- What about the case when a fresh cluster with the same name is created?
Force-pushed from 3de9c69 to 64e84ab
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
pkg/roachprod/install/cluster_synced.go
line 332 at r3 (raw file):
is hostname guaranteed to be set to the expected value across all supported cloud providers?
Good call; the default hostname for AWS is indeed different, but it's thankfully easy to update (see changes in start script now included in this PR); I confirmed we get the same behavior on AWS now.
As for Azure, I think it should also work without changes from what I read in their documentation, but I'll run some tests soon to verify.
what about the case when a fresh cluster with the same name is created?
This will indeed not reject commands if 1) cluster has the same name; and 2) it was deleted and re-created and the same IP was reused.
I'm struggling to think of a scenario where this could actually happen. The timing for the events needs to be very precise; coupled with the fact that 1) we typically prefix cluster names with our usernames and 2) teamcity cluster names are unique, I think the hostname check is a fairly reasonable compromise.
That said, I'm open to other solutions if we think there's a better path here. I see that one comment in the original issue was to use instance metadata or labels. That would not be susceptible to your second issue, but a couple of downsides are that we would rely on cloud-specific code to do this check, and querying that data would add potentially non-negligible delay to every remote session.
Thoughts?
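To illustrate the instance-metadata alternative mentioned above, and why it is cloud-specific and adds a network round trip to every session, here is a sketch of fetching a GCE VM's instance name from the metadata server. The endpoint and header are the standard GCE ones; AWS (via 169.254.169.254) and Azure expose the equivalent data through different endpoints and headers, which is the coupling concern. This is illustrative only and not part of the PR.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// gceInstanceName queries the GCE metadata server for the instance name.
// This only works from inside a GCE VM; AWS and Azure use different
// endpoints and headers, so a metadata-based check would need per-provider code.
func gceInstanceName() (string, error) {
	req, err := http.NewRequest(
		"GET", "http://metadata.google.internal/computeMetadata/v1/instance/name", nil)
	if err != nil {
		return "", err
	}
	// The metadata server requires this header; requests without it are rejected.
	req.Header.Set("Metadata-Flavor", "Google")

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	name, err := gceInstanceName()
	if err != nil {
		fmt.Println("not on GCE or metadata server unreachable:", err)
		return
	}
	fmt.Println("instance name:", name)
}
```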
Force-pushed from 64e84ab to dd5595f
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
pkg/roachprod/install/cluster_synced.go
line 331 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
`validateHostnameCmd`?
Updated to `validateHostnameCmd`.
pkg/roachprod/install/cluster_synced.go
line 343 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
I almost want to recommend using `text/template` for the contents of this function, but I think here it might be more effort than it's worth. And the `elseBranch` logic might be a challenge. Since the `exit 1` short-circuits the script, the branch might not be required (just append the command). But you most likely already considered this.
I thought about it and came to the same conclusion as you: it seemed like more work than it's worth. I believe the size of the script here is small enough that it should be relatively straightforward to understand what is going on; we can change that if we see ourselves updating this logic (which I hope won't happen for a long time).
Force-pushed from dd5595f to f0d2e46
This commit adds a wrapper to every command run on a remote roachprod cluster with checks that the node still belongs to the cluster roachprod thinks it does. Specifically, it stops roachprod commands from interfering with unrelated clusters if the roachprod cache is not up to date.

Consider the following set of steps:

```console
$ roachprod create cluster-a -n 1 --lifetime 1h
$ roachprod put cluster-a ./cockroach ./cockroach
$ roachprod start cluster-a
```

One hour later, `cluster-a`'s lifetime expires and the VM is destroyed. Another user (or a Team City agent) creates another cluster:

```console
$ roachprod create cluster-b -n 1
$ roachprod put cluster-b ./cockroach ./cockroach
$ roachprod start cluster-b
```

In the process of creating `cluster-b`, it is possible that the public IP for cluster A's VM is reused and assigned to cluster B's VM. To simulate that situation on a single client, we can manually edit roachprod's cache.

Now suppose the creator of `cluster-a`, not knowing the cluster was expired and now with a stale cache, runs:

```console
$ roachprod stop cluster-a
```

This will unintentionally stop cockroach on `cluster-b`! A client with an updated cache will see the following output:

```console
$ roachprod status cluster-b
cluster-b: status 1/1
17:14:31 main.go:518: 1: not running
```

In addition, it's confusing, from cluster B's perspective, why the cockroach process died -- all we know is that it was killed and the process exited with code 137:

```console
$ roachprod run cluster-b 'sudo journalctl' | grep cockroach
Mar 06 17:18:33 renato-cluster-b-0001 systemd[1]: Starting /usr/bin/bash ./cockroach.sh run...
Mar 06 17:18:33 renato-cluster-b-0001 bash[13384]: cockroach start: Mon Mar 6 17:18:33 UTC 2023, logging to logs
Mar 06 17:18:34 renato-cluster-b-0001 systemd[1]: Started /usr/bin/bash ./cockroach.sh run.
Mar 06 17:19:04 renato-cluster-b-0001 bash[13381]: ./cockroach.sh: line 67: 13386 Killed "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Mar 06 17:19:04 renato-cluster-b-0001 bash[13617]: cockroach exited with code 137: Mon Mar 6 17:19:04 UTC 2023
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Failed with result 'exit-code'.
```

With the changes in this commit, the `roachprod stop` call will now fail since the hostnames don't match:

```console
$ roachprod stop cluster-a
cluster-a: stopping and waiting 1/1
0: COMMAND_PROBLEM: exit status 1
(1) COMMAND_PROBLEM
Wraps: (2) exit status 1
Error types: (1) errors.Cmd (2) *exec.ExitError: expected host to be part of cluster-a, but is cluster-b-0001
```

Finally, `roachprod ssh` calls will now also fail if the hostnames do not match, instead of silently connecting to the wrong cluster.

Resolves cockroachdb#89437.
Resolves cockroachdb#63637.

Release note: None
Force-pushed from f0d2e46 to 465a0f4
Thanks for testing this in AWS and Azure!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/roachprod/install/cluster_synced.go
line 332 at r3 (raw file):
I think the hostname check is a fairly reasonable compromise.
Yep, I completely agree. I only raised it as a potential, secondary issue. It's indeed rare, and we have safeguards like unique cluster names. The contrived scenario is someone creating a shared cluster, say `$team-test`; someone else destroys it; subsequently, it's created again, and that initial someone with the stale (local) cache does `roachprod stop` or something else that's destructive. :)
That said, I'm open to other solutions if we think there's a better path here.
Considering (a) the `hostname` check works everywhere, (b) the extremely low probability of the contrived scenario, and (c) other workarounds add either more overhead or coupling, or both, I am perfectly content with the one in this PR!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/roachprod/vm/aws/support.go
line 154 at r6 (raw file):
```bash
# set hostname according to the name used by roachprod. There's host
# validation logic that relies on this -- see comment on cluster_synced.go
sudo hostnamectl set-hostname {{.VMName}}
```
Let's add `--static` so that it survives reboot. None of the existing roachtests do this today, although tpccbench has/had that option, I believe. The manually created clusters could in theory be rebooted, albeit not frequently.
Since this change has the potential for significant disruption (as it changes the actual command run in every session), I ran a nightly build on this branch on both GCP [1] and AWS [2]. Everything looks good; the failures observed are the same ones we see on master, so this is ready to merge.
I also ran a manual test on Azure, and the hostname validation works without changes.
[1] https://teamcity.cockroachdb.com/viewLog.html?buildId=8962079&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=8972458&buildTypeId=Cockroach_Nightlies_RoachtestNightlyAwsBazel
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
pkg/roachprod/vm/aws/support.go
line 154 at r6 (raw file):
According to the documentation for `set-hostname` [1]:

> Set the system hostname to NAME. By default, this will alter the pretty, the static, and the transient hostname alike; however, if one or more of --static, --transient, --pretty are used, only the selected hostnames are changed.

So what we're doing here is a superset of what would be done if we passed `--static`. I verified that rebooting an AWS VM created by roachprod does not reset the hostname.

[1] https://manpages.ubuntu.com/manpages/xenial/man1/hostnamectl.1.html
bors r=herkolategan,srosenberg

TFTRs!
Build succeeded:
98682: roachprod: provide workaround for long-running AWS clusters r=srosenberg a=renatolabs

In #98076, we started validating hostnames before running any commands to avoid situations where a stale cache could lead to unintended interference with other clusters due to public IP reuse. The check relies on the VM's `hostname` matching the expected cluster name in the cache.

GCP and Azure clusters set the hostname to the instance name by default, but that is not the case for AWS; the aforementioned PR explicitly sets the hostname when the instance is created.

However, in the case of long running AWS clusters (created before host validation was introduced) or clusters that are created with an outdated version of `roachprod`, the hostname will still be the default AWS hostname, and any interaction with that cluster will fail if using a recent `roachprod` version.

To remedy this situation, this commit includes:

* better error reporting. When we attempt to run a command on an AWS cluster and host validation fails, we display a message to the user explaining that their hostnames may need fixing.
* if the user confirms that the cluster still exists (by running `roachprod list`), they are able to automatically fix the hostnames to the expected value by running a new `fix-long-running-aws-hostnames` command.

This is a temporary workaround that should be removed once we no longer have clusters that would be affected by this issue. This commit will be reverted once we no longer have clusters created with the default hostnames; this will be easier to achieve once we have an easy way for everyone to upgrade their `roachprod` (see #97311).

Epic: none

Release note: None

98717: tree: fix tuple encoding performance regression r=mgartner a=mgartner

This commit fixes a performance regression in pgwire encoding of tuples introduced in #95009.

Informs #98306

Epic: None

Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>