roachprod: validate host before running command
This commit wraps every command run on a remote roachprod cluster
with a check that the node still belongs to the cluster roachprod
thinks it does. In particular, this stops roachprod commands from
interfering with unrelated clusters when the roachprod cache is out
of date.
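
Concretely, every remote command is now wrapped by the new
`validateHost` helper (see the diff below), which compares the remote
hostname against the node name roachprod expects. For node 1 of
`cluster-a`, the generated wrapper looks roughly like the sketch
below; the node name and the placeholder in the `else` branch are
illustrative:

```sh
if ! [[ `hostname` == 'cluster-a-0001' ]]; then
  echo "expected host to be part of cluster-a, but is `hostname`"
  exit 1
else
  : # the original command requested by roachprod runs here
fi
```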

Consider the following set of steps:

```console
$ roachprod create cluster-a -n 1 --lifetime 1h
$ roachprod put cluster-a ./cockroach ./cockroach
$ roachprod start cluster-a
```

One hour later, `cluster-a`'s lifetime expires and the VM is
destroyed. Another user (or a TeamCity agent) creates another
cluster:

```console
$ roachprod create cluster-b -n 1
$ roachprod put cluster-b ./cockroach ./cockroach
$ roachprod start cluster-b
```

While `cluster-b` is being created, it is possible that the public IP
of `cluster-a`'s expired VM is reused and assigned to `cluster-b`'s
VM. To simulate that situation on a single client, we can manually
edit roachprod's cache.

Now suppose the creator of `cluster-a`, unaware that the cluster has
expired and therefore working with a stale cache, runs:

```console
$ roachprod stop cluster-a
```

This will unintentionally stop cockroach on `cluster-b`! A client
with an up-to-date cache will see the following output:

```console
$ roachprod status cluster-b
cluster-b: status 1/1
17:14:31 main.go:518:    1: not running
```

In addition, it is confusing, from `cluster-b`'s perspective, why the
cockroach process died -- all we know is that it was killed and
exited with code 137 (128 + 9, i.e., SIGKILL):

```console
$ roachprod run cluster-b 'sudo journalctl' | grep cockroach
Mar 06 17:18:33 renato-cluster-b-0001 systemd[1]: Starting /usr/bin/bash ./cockroach.sh run...
Mar 06 17:18:33 renato-cluster-b-0001 bash[13384]: cockroach start: Mon Mar  6 17:18:33 UTC 2023, logging to logs
Mar 06 17:18:34 renato-cluster-b-0001 systemd[1]: Started /usr/bin/bash ./cockroach.sh run.
Mar 06 17:19:04 renato-cluster-b-0001 bash[13381]: ./cockroach.sh: line 67: 13386 Killed                  "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Mar 06 17:19:04 renato-cluster-b-0001 bash[13617]: cockroach exited with code 137: Mon Mar  6 17:19:04 UTC 2023
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Mar 06 17:19:04 renato-cluster-b-0001 systemd[1]: cockroach.service: Failed with result 'exit-code'.
```

With the changes in this commit, the `roachprod stop` call will now
fail since the hostnames don't match:

```console
$ roachprod stop cluster-a
cluster-a: stopping and waiting 1/1
0: COMMAND_PROBLEM: exit status 1
(1) COMMAND_PROBLEM
Wraps: (2) exit status 1
Error types: (1) errors.Cmd (2) *exec.ExitError: expected host to be part of cluster-a, but is cluster-b-0001
```

Finally, `roachprod ssh` calls will now also fail if the hostnames do
not match, instead of silently connecting to the wrong cluster.
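
In the `ssh` case, the check runs as a standalone command before the
connection is established: `SSH` calls `validateHost` with an empty
command (see the diff below), which produces just the guard, roughly
(node name illustrative):

```sh
if ! [[ `hostname` == 'cluster-a-0001' ]]; then
  echo "expected host to be part of cluster-a, but is `hostname`"
  exit 1
fi
```

If the guard fails, the error is reported and ssh is never exec'd
against the stale IP.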

Resolves #89437.
Resolves #63637.

Release note: None
renatolabs committed Mar 6, 2023
1 parent 856658b commit 13b3021
Showing 3 changed files with 80 additions and 8 deletions.
1 change: 1 addition & 0 deletions pkg/roachprod/install/BUILD.bazel
```diff
@@ -29,6 +29,7 @@ go_library(
         "//pkg/roachprod/logger",
         "//pkg/roachprod/ssh",
         "//pkg/roachprod/ui",
+        "//pkg/roachprod/vm",
         "//pkg/roachprod/vm/aws",
         "//pkg/roachprod/vm/gce",
         "//pkg/roachprod/vm/local",
```
85 changes: 78 additions & 7 deletions pkg/roachprod/install/cluster_synced.go
```diff
@@ -36,6 +36,7 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/roachprod/logger"
 	"github.com/cockroachdb/cockroach/pkg/roachprod/ssh"
 	"github.com/cockroachdb/cockroach/pkg/roachprod/ui"
+	"github.com/cockroachdb/cockroach/pkg/roachprod/vm"
 	"github.com/cockroachdb/cockroach/pkg/roachprod/vm/aws"
 	"github.com/cockroachdb/cockroach/pkg/roachprod/vm/local"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
@@ -291,6 +292,62 @@ func (c *SyncedCluster) roachprodEnvRegex(node Node) string {
 	return fmt.Sprintf(`ROACHPROD=%s[ \/]`, escaped)
 }

+// validateHost wraps the command given with a check that the remote
+// node is still part of the `SyncedCluster`. When `cmd` is empty, the
+// command returned can be used to validate whether the host matches
+// the name expected by roachprod, and can be used to validate the
+// contents of the cache for that cluster.
+//
+// Since `SyncedCluster` is created from a potentially stale cache, it
+// is possible for the following events to happen:
+//
+// Client 1:
+//   - cluster A is created and information is persisted in roachprod's cache.
+//   - cluster's A lifetime expires, VMs are destroyed.
+//
+// Client 2
+//   - cluster B is created; public IP of one of the VMs of cluster
+//     A is reused and assigned to one of cluster B's VMs.
+//
+// Client 1:
+//   - command with side-effects is run on cluster A (e.g.,
+//     `roachprod stop`). Public IP now belongs to a VM in cluster
+//     B. Client unknowingly disturbs cluster B, thinking it's cluster A.
+//
+// Client 2:
+//   - notices that cluster B is behaving strangely (e.g., cockroach
+//     process died). A lot of time is spent trying to debug the
+//     issue.
+//
+// This scenario is possible and has happened a number of times in the
+// past (for one such occurrence, see #97100). A particularly
+// problematic instance of this bug happens when "cluster B" is
+// running a roachtest. This interaction leads to issues that are hard
+// to debug or make sense of, which ultimately wastes a lot of
+// engineering time.
+//
+// By wrapping every command with a hostname check as is done here, we
+// ensure that the cached cluster information is still correct.
+func (c *SyncedCluster) validateHost(cmd string, node Node) string {
+	isValidHost := fmt.Sprintf("[[ `hostname` == '%s' ]]", vm.Name(c.Name, int(node)))
+	errMsg := fmt.Sprintf("expected host to be part of %s, but is `hostname`", c.Name)
+	elseBranch := "fi"
+	if cmd != "" {
+		elseBranch = fmt.Sprintf(`
+else
+%s
+fi
+`, cmd)
+	}
+
+	return fmt.Sprintf(`
+if ! %s; then
+echo "%s"
+exit 1
+%s
+`, isValidHost, errMsg, elseBranch)
+}
+
 // cmdDebugName is the suffix of the generated ssh debug file
 // If it is "", a suffix will be generated from the cmd string
 func (c *SyncedCluster) newSession(
@@ -303,7 +360,7 @@ func (c *SyncedCluster) newSession(
 		node: node,
 		user: c.user(node),
 		host: c.Host(node),
-		cmd:  cmd,
+		cmd:  c.validateHost(cmd, node),
 	}

 	for _, opt := range options {
@@ -2194,8 +2251,22 @@ func (c *SyncedCluster) pghosts(
 	return m, nil
 }

-// SSH TODO(peter): document
+// SSH creates an interactive shell connecting the caller to the first
+// node on the cluster (or to all nodes in an iterm2 split screen if
+// supported).
+//
+// CAUTION: this script will `exec` the ssh utility, so it must not be
+// used in any roachtest code. This is for use within `./roachprod`
+// exclusively.
 func (c *SyncedCluster) SSH(ctx context.Context, l *logger.Logger, sshArgs, args []string) error {
+	targetNode := c.Nodes[0]
+	validateHostCmd := c.validateHost("", targetNode)
+	if err := c.Run(
+		ctx, l, l.Stdout, l.Stderr, Nodes{targetNode}, "validate-ssh-host", validateHostCmd,
+	); err != nil {
+		return err
+	}
+
 	if len(c.Nodes) != 1 && len(args) == 0 {
 		// If trying to ssh to more than 1 node and the ssh session is interactive,
 		// try sshing with an iTerm2 split screen configuration.
@@ -2207,7 +2278,7 @@ func (c *SyncedCluster) SSH(ctx context.Context, l *logger.Logger, sshArgs, args

 	// Perform template expansion on the arguments.
 	e := expander{
-		node: c.Nodes[0],
+		node: targetNode,
 	}
 	var expandedArgs []string
 	for _, arg := range args {
@@ -2223,27 +2294,27 @@ func (c *SyncedCluster) SSH(ctx context.Context, l *logger.Logger, sshArgs, args
 		allArgs = []string{
 			"/bin/bash", "-c",
 		}
-		cmd := fmt.Sprintf("cd %s ; ", c.localVMDir(c.Nodes[0]))
+		cmd := fmt.Sprintf("cd %s ; ", c.localVMDir(targetNode))
 		if len(args) == 0 /* interactive */ {
 			cmd += "/bin/bash "
 		}
 		if len(args) > 0 {
-			cmd += fmt.Sprintf("export ROACHPROD=%s ; ", c.roachprodEnvValue(c.Nodes[0]))
+			cmd += fmt.Sprintf("export ROACHPROD=%s ; ", c.roachprodEnvValue(targetNode))
 			cmd += strings.Join(expandedArgs, " ")
 		}
 		allArgs = append(allArgs, cmd)
 	} else {
 		allArgs = []string{
 			"ssh",
-			fmt.Sprintf("%s@%s", c.user(c.Nodes[0]), c.Host(c.Nodes[0])),
+			fmt.Sprintf("%s@%s", c.user(targetNode), c.Host(targetNode)),
 			"-o", "UserKnownHostsFile=/dev/null",
 			"-o", "StrictHostKeyChecking=no",
 		}
 		allArgs = append(allArgs, sshAuthArgs()...)
 		allArgs = append(allArgs, sshArgs...)
 		if len(args) > 0 {
 			allArgs = append(allArgs, fmt.Sprintf(
-				"export ROACHPROD=%s ;", c.roachprodEnvValue(c.Nodes[0]),
+				"export ROACHPROD=%s ;", c.roachprodEnvValue(targetNode),
 			))
 		}
 		allArgs = append(allArgs, expandedArgs...)
```
2 changes: 1 addition & 1 deletion pkg/roachprod/vm/aws/aws.go
```diff
@@ -343,7 +343,7 @@ func (o *ProviderOpts) ConfigureCreateFlags(flags *pflag.FlagSet) {
" rate limit (per second) for instance creation. This is used to avoid hitting the request"+
" limits from aws, which can vary based on the region, and the size of the cluster being"+
" created. Try lowering this limit when hitting 'Request limit exceeded' errors.")
flags.StringVar(&providerInstance.IAMProfile, ProviderName+"- iam-profile", providerInstance.IAMProfile,
flags.StringVar(&providerInstance.IAMProfile, ProviderName+"-iam-profile", providerInstance.IAMProfile,
"the IAM instance profile to associate with created VMs if non-empty")

}
```
