roachtest: re-evaluate SSH flake detection and prevention #100875

Open
renatolabs opened this issue Apr 6, 2023 · 3 comments
Comments

@renatolabs
Contributor

renatolabs commented Apr 6, 2023

A few months ago, we started using ssh's exit code to identify "SSH flakes": failures that we believe are caused by the SSH connection itself rather than by an actual error in the command being run. As time went on, it became clear that this mechanism is less than ideal, at least as it's implemented today: there is no way to differentiate a 255 exit code coming from ssh itself (arguably an SSH flake, see below) from one where the command being run returned 255 for its own reasons (which should translate to a legitimate test failure).
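To see the ambiguity in isolation, here is a minimal standalone sketch (not roachprod code). It assumes passwordless SSH access to a host named in the HOST environment variable and that port 9 on that host is closed; both invocations surface exit code 255, even though only the second is an SSH-level failure.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// exitCode runs cmd and returns the exit code observed by the caller.
func exitCode(cmd *exec.Cmd) int {
	if err := cmd.Run(); err != nil && cmd.ProcessState == nil {
		panic(err) // e.g., ssh binary not found
	}
	return cmd.ProcessState.ExitCode()
}

func main() {
	host := os.Getenv("HOST") // hypothetical test host with passwordless SSH

	// Case 1: the remote command itself exits with 255 -- a failure of the
	// command that should fail the test.
	fmt.Println(exitCode(exec.Command("ssh", "-o", "BatchMode=yes", host, "exit 255"))) // 255

	// Case 2: ssh cannot establish a connection (port 9 is assumed closed),
	// so ssh itself reports 255 -- the kind of failure we'd call a flake.
	fmt.Println(exitCode(exec.Command("ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", "-p", "9", host, "true"))) // 255
}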

The following documents a few recent instances where test failures were marked as SSH flakes when they were in fact legitimate test failures:

Jepsen test failures
We recently had to update our Jepsen fork [1] to fix an issue that caused every Jepsen roachtest to fail. The Jepsen failures were marked as SSH flakes [2] because Jepsen exits with code 255 when an unhandled exception is caught [3].

ruby-install failures
We have also seen ruby-install failures being marked as SSH flakes for months [4]. It turns out ruby-install exits with code -1 when it fails [5], which translates to an actual exit code of 255 (exit statuses are truncated to 8 bits).
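As an aside, that wraparound is easy to reproduce; a minimal sketch, assuming a local bash is available (this is not roachprod code):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Stand-in for ruby-install's `exit -1`: exit statuses are truncated to
	// 8 bits, so the parent process observes 255 -- the same value ssh uses
	// for its own connection failures.
	cmd := exec.Command("bash", "-c", "exit -1")
	err := cmd.Run()
	fmt.Println(err)                         // exit status 255
	fmt.Println(cmd.ProcessState.ExitCode()) // 255
}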

GetInternalIP failures
GetInternalIP will return the output of hostname --all-ip-addresses. However, hostname itself will return 255 in some (maybe all?) error conditions (e.g., try hostname --invalid; echo $? on Linux).

More generally, there are a lot of commands that return 255 in error conditions [6], and in those cases we do want the test to fail.

How about when ssh itself returns 255 -- that's gotta be an SSH flake, right?
I'd say it depends on how we define "flake". In [7], for example, we see that an authentication failure due to misconfigured SSH keys led to a 255 exit code and therefore to (unwanted) retries. In this case, it's not a "flake" (in the sense that the command fails deterministically), but it is still a failure that, were it to happen in a test, we wouldn't want to notify teams about.

As another example, take this GetInternalIP failure [8]. If we look at the associated SSH logs, we see that the actual reason for the failure is:

debug1: kex_exchange_identification: banner line 0: Exceeded MaxStartups
kex_exchange_identification: Connection closed by remote host

I won't get too much into the details of the semantics of MaxStartups; it's documented in [9]. Suffice it to say that it can cause sshd to randomly drop connections when there are too many concurrent unauthenticated connections. It has previously been a source of flakes, and back in 2019 (#37001) we bumped that limit. However, no amount of bumping can keep up with port scanners. Indeed, shortly before the failure in [8], we see a port scanner in action:

sshd[18234]: Invalid user ts3 from 171.244.62.120 port 60474
sshd[18235]: Invalid user minecraft from 171.244.62.120 port 60504
sshd[18232]: Invalid user pi from 171.244.62.120 port 60544
sshd[18227]: Invalid user testuser from 171.244.62.120 port 60340
sshd[18242]: Invalid user deploy from 171.244.62.120 port 60624
sshd[18241]: Invalid user test from 171.244.62.120 port 60566
sshd[18244]: Invalid user ansible from 171.244.62.120 port 60594
sshd[18239]: Invalid user bob from 171.244.62.120 port 60450
sshd[18238]: Invalid user test from 171.244.62.120 port 60664
sshd[18257]: Invalid user test from 171.244.62.120 port 60604
sshd[18258]: Invalid user testftp from 171.244.62.120 port 60546
sshd[18259]: Invalid user raspberry from 171.244.62.120 port 60432
sshd[18260]: Invalid user test from 171.244.62.120 port 60390
sshd[18262]: Invalid user hadoop from 171.244.62.120 port 60470
sshd[18261]: Invalid user ubnt from 171.244.62.120 port 60540

Which brings us to prevention: we shouldn't let random port scanners cause a test to fail.

This rambly issue can be summarized in a few points:

  • checking the exit status reported by ssh to determine whether something was an SSH flake is too brittle; at the very least, we want to check the exit status of the command being run.
  • ideally, we would also not retry commands that fail due to authentication failures (a rough sketch of such a retry predicate follows this list).
  • we can probably do a better job at preventing SSH flakes, especially in cases where a port scanner causes sshd to drop connections. We can investigate to what extent we can re-enable sshguard [10], or a similar solution. Even better, the longer-term approach would be to get rid of public IPs entirely, as proposed in roachprod: enable provisioning without public ip #77644.
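To make the first two bullets concrete, here is a rough sketch of what a stricter retry predicate could look like. This is purely illustrative: the runResult type below is a stand-in for roachprod's RunResultDetails (the real field names may differ), and the matched strings are assumptions based on the ssh/sshd output quoted above plus common OpenSSH error messages.

package main

import (
	"fmt"
	"strings"
)

// runResult is a stand-in for roachprod's RunResultDetails; the real type
// and its field names may differ.
type runResult struct {
	RemoteExitStatus int
	Stderr           string
}

// shouldRetry treats a failure as a retryable SSH flake only when ssh itself
// reported 255 AND the captured stderr points at a known-transient
// connection-level error. Deterministic failures (e.g., bad keys) and
// non-255 command failures are never retried.
func shouldRetry(res *runResult) bool {
	if res.RemoteExitStatus != 255 {
		// ssh only uses 255 for its own errors; anything else came from the
		// command and should fail the test.
		return false
	}
	if strings.Contains(res.Stderr, "Permission denied (publickey") {
		// Authentication failures are deterministic, not flakes.
		return false
	}
	for _, s := range []string{
		"Exceeded MaxStartups",
		"Connection closed by remote host",
		"Connection timed out",
		"Connection refused",
	} {
		if strings.Contains(res.Stderr, s) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldRetry(&runResult{RemoteExitStatus: 255, Stderr: "kex_exchange_identification: Connection closed by remote host"})) // true
	fmt.Println(shouldRetry(&runResult{RemoteExitStatus: 255, Stderr: "some command output"}))                                            // false
}

The obvious caveat, discussed in the comments below, is that matching on ssh's output couples us to a specific logging format.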

[1] cockroachdb/jepsen#34
[2] Example: #90695 (comment)
[3] https://github.com/cockroachdb/jepsen/blob/3d7c345d6958f067edb097f9b82ab8e7a4a752c7/jepsen/src/jepsen/cli.clj#L278-L280
[4] Example: #99828 (comment)
[5] https://github.com/postmodern/ruby-install/blob/f59dd9cb31a073c2e2c94eb679a348f7ec3ca1b1/share/ruby-install/logging.sh#L71-L75
[6] Others have expressed their confusion on the topic as well, see https://www.redhat.com/sysadmin/exit-codes-demystified for example.
[7] https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1679342613439679 (internal)
[8] #99828 (comment)
[9] https://man7.org/linux/man-pages/man5/sshd_config.5.html
[10] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/vm/gce/utils.go#L134-L136

Jira issue: CRDB-26665

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Apr 6, 2023
@blathers-crl

blathers-crl bot commented Apr 6, 2023

cc @cockroachdb/test-eng

@smg260
Contributor

smg260 commented Apr 6, 2023

Thanks for compiling this issue. I agree with the observations, especially in cases where the commands themselves return 255.

checking the exit status reported by ssh to determine whether something was an SSH flake is too brittle; at the very least, we want to check the exit status of the command being run.

That's essentially what we are doing now. Non-zero exit codes of the command are passed through to the ssh caller.

How else would you propose checking the exit status of the underlying command? If a command did return -1, it will only ever be visible as 255 (two's complement). Is there a way to disambiguate that from ssh itself returning 255, without nasty manipulation of the original command? One possible way is to match on substrings in the debug logs.

The default predicate we use to determine whether a command should be retried is relatively easy to modify: it has access to the command's stdout/stderr as well as its exit status, should we want to look for certain output. [1]

[1] -

var DefaultSSHRetryOpts = newRunRetryOpts(defaultRetryOpt, func(res *RunResultDetails) bool { return errors.Is(res.Err, rperrors.ErrSSH255) })

@renatolabs
Contributor Author

Is there a way to disambiguate from ssh returning 255, without nasty manipulation of the original command? One possible way is to match on substrings in the debug logs.

I was intentionally trying to document behavior we probably don't want (marking a test as an SSH flake whenever the command returned 255) without prescribing a solution. That said, what you mentioned here was one of the things that came to mind: we can likely check the SSH logs to determine the source of the 255. That would couple us to a specific logging format and we might need to revisit the logic when we upgrade SSH/Ubuntu, etc., but it might be a worthwhile compromise to avoid mislabeling issues.
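For concreteness, a rough sketch of what that kind of log matching could look like (purely illustrative; the strings are taken from the output quoted in this issue plus common OpenSSH messages, and the exact format would be tied to the OpenSSH version we run):

package main

import (
	"fmt"
	"strings"
)

// classify inspects the captured `ssh -vvv` debug output for a run that
// exited with 255 and guesses where the 255 came from. The strings below are
// assumptions about the OpenSSH log format and would need to be revisited on
// SSH/Ubuntu upgrades.
func classify(debugLog string) string {
	// If ssh logged an exit status, the remote command actually ran and
	// returned that status itself: a legitimate test failure.
	if strings.Contains(debugLog, "Exit status 255") {
		return "command returned 255"
	}
	// Otherwise, look for connection-level errors like the MaxStartups drop
	// seen in the logs above.
	for _, s := range []string{
		"kex_exchange_identification",
		"Connection closed by remote host",
		"Connection refused",
		"Connection timed out",
	} {
		if strings.Contains(debugLog, s) {
			return "ssh-level failure (candidate flake)"
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(classify("debug1: Exit status 255"))                                            // command returned 255
	fmt.Println(classify("kex_exchange_identification: Connection closed by remote host"))      // ssh-level failure (candidate flake)
}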

There might be other solutions that I'm not aware of as well.
