roachprod: increase concurrent unauthenticated SSH connections #37001
Conversation
This PR bumps the permitted number of concurrent unauthenticated SSH connections from 10 to 64. Above this limit, sshd starts randomly dropping connections. It's possible this is what we have been running into with the frequent "Connection closed by remote host" errors in roachtests. See: - https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Load_Balancing - http://edoceo.com/notabene/ssh-exchange-identification Release note: None
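The knob involved is sshd's `MaxStartups`. A minimal sketch of the idea, with illustrative values (the PR edits /etc/ssh/sshd_config on the remote hosts; here we append to a temp file so the snippet is runnable as-is):

```shell
# MaxStartups is "start:rate:full": sshd begins probabilistically
# dropping unauthenticated connections once "start" are pending, and
# refuses all of them at "full". The stock default is 10:30:100.
# Values below are illustrative, not necessarily what the PR uses.
conf=$(mktemp)                        # stand-in for /etc/ssh/sshd_config
echo 'MaxStartups 64:30:128' >> "$conf"
grep '^MaxStartups' "$conf"
# On a real host, follow up with: sudo service ssh restart
```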
Though why do you suspect we're performing so many concurrent SSH connections? roachprod and roachtest will open up a number of connections to a host during a test, but most of them are done serially.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @petermattis)
I think @bdarnell pointed out that the VMs usually get port scans as soon as they come up; perhaps we're just sometimes seeing close to 10 scanbots brute-forcing SSH logins.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/cmd/roachprod/vm/aws/support.go, line 91 at r1 (raw file):
```
# sshguard can prevent frequent ssh connections to the same host. Disable it.
sudo service sshguard stop
```
If we're getting port scanned, I think we want to keep sshguard enabled. Otherwise the attackers will be able to take up as many of those connection slots as they want. Maybe we should be re-enabling sshguard on GCE instead of turning it off on AWS?
If sshguard is blocking our own tools, we should figure out how to reuse connections (which may be as easy as turning on ControlMaster).
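Connection reuse via ControlMaster is plain OpenSSH client configuration; a hedged sketch (the settings and durations are made up, not roachprod's actual config, and it's written to a temp file here rather than ~/.ssh/config so it's runnable):

```shell
# Multiplex all sessions to a host over one already-authenticated TCP
# connection, so repeated commands don't count against sshd's
# unauthenticated-connection limit (and don't retrigger sshguard).
cfg=$(mktemp)                         # stand-in for ~/.ssh/config
cat >> "$cfg" <<'EOF'
Host *
  ControlMaster auto
  ControlPath ~/.ssh/cm-%r@%h:%p
  ControlPersist 10m
EOF
grep -c 'Control' "$cfg"              # the three Control* directives
```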
FWIW, in this test #36412 the number of … BTW, the gossip test is interesting because it looks like it could be misconstrued as a DDoS: cockroach/pkg/cmd/roachtest/gossip.go, lines 113 to 118 in 875e7ff.
Here's one that I haven't seen so far:
From #36759
I went and linked some other issues to this PR. Not necessarily all the same, though.
Yeah, that one would not be helped by this change. This change would only affect failures without any other output, I think.
That's the same symptom I saw on port 26257 in the SYN cookie issue (#36745). However, the dmesg logs here do not show SYN cookie activations, so that's not the issue here.
One place where we do create a large number of connections is when we set up the known_hosts files for each host:
Each node, in parallel, will create an SSH session with every other host. That being said, some of the referenced failures (#36759) happen during …
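To see why the all-pairs keyscan can blow past the default cap: n nodes each scanning their n−1 peers open n·(n−1) unauthenticated connections in a burst (the node count below is just an example):

```shell
# Every node keyscans every other node in parallel, so the burst of
# unauthenticated connections grows quadratically with cluster size.
n=8                      # hypothetical cluster size
echo $((n * (n - 1)))    # 56 concurrent scans, vs sshd's default cap of 10
```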
Hmm, I wonder if we could do the …
Treedist wouldn't put us over the 10-connection limit on its own, but it would (I think) use 8, and then a couple of extra connections (port scans, or parts of roachprod/roachtest we're not thinking about at the moment) could put it over the top.
Good idea. In any case, the default limit of 10 seems really low, so I think it's a good idea to increase it no matter what else we do.
Agreed!
Before this PR, we would run ssh-keyscan in parallel on all of the hosts. I suspect the retry loop in the keyscan script is there to deal with the connection failures which cockroachdb#37001 attempts to mitigate, but I don't feel eager to rip it out right now.

In addition to performing the keyscan from a single host and distributing it to all, this change also distributes the known_hosts file to the shared user, in anticipation of further changes to make the shared user the default user for roachprod commands on GCE.

Release note: None
37077: roachprod: keyscan from a single host and distribute known_hosts r=ajwerner a=ajwerner

Co-authored-by: Andrew Werner <[email protected]>
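A dry-run sketch of the single-host scheme, with placeholder hostnames (the real roachprod implementation differs in detail): one node runs `ssh-keyscan` against the whole cluster, and the resulting file is copied out to everyone.

```shell
# Dry run: print the commands instead of executing them, since the
# hosts here are fictional.
hosts="node1 node2 node3"
first=${hosts%% *}        # do the scan from a single host only
echo "ssh $first ssh-keyscan $hosts > known_hosts"
for h in $hosts; do
  echo "scp known_hosts $h:.ssh/known_hosts"
done
```

This replaces the O(n²) burst of unauthenticated connections with n scans from one host plus n file copies.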
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mberhault)
pkg/cmd/roachprod/vm/aws/support.go, line 91 at r1 (raw file):
@mberhault introduced this in cockroachdb/roachprod@1f3f698. He says:
sshguard will block ssh after logging in a few times in succession (maybe 3 times a minute? no idea what the defaults are). This got triggered when using ssh to configure a cluster where we performed each step across all nodes, then moved on to the next step across all nodes. This would turn into multiple ssh sessions in quick succession and get blocked.
Disabling sshguard might trigger more of these obscure ssh failures, so I'm inclined to keep it disabled until we find that we need it with this increased number of unauthenticated connections. Does that sound reasonable?
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/cmd/roachprod/vm/aws/support.go, line 91 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
@mberhault introduced this in cockroachdb/roachprod@1f3f698. He says:
sshguard will block ssh after logging in a few times in succession (maybe 3 times a minute? no idea what the defaults are). This got triggered when using ssh to configure a cluster where we performed each step across all nodes, then moved on to the next step across all nodes. This would turn into multiple ssh sessions in quick succession and get blocked.
Disabling sshguard might trigger more of these obscure ssh failures, so I'm inclined to keep it disabled until we find that we need it with this increased number of unauthenticated connections. Does that sound reasonable?
ok
bors r+
Build failed (retrying...)
Build failed (retrying...)
Build failed |
37001: roachprod: increase concurrent unauthenticated SSH connections r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
Build succeeded
See cockroachdb#36929. Whenever these flakes happen, it'll be good to have verbose logs. Anecdotally we're seeing fewer of them now, perhaps due to cockroachdb#37001. Release note: None
Fixes cockroachdb#38785. Fixes cockroachdb#35326.

Because roachprod does everything it does through SSH, we're particularly susceptible to network delays, packet drops, etc. We've seen this before, or at least pointed to this being a problem before, over at cockroachdb#37001. Setting timeouts around our calls to roachprod helps to better surface these kinds of errors.

The underlying issue in cockroachdb#38785 and cockroachdb#35326 is that we're running roachprod commands that may (reasonably) fail due to connection issues, and we're unable to retry them safely (the underlying commands are non-idempotent). Presently we simply fail the entire test, when really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes

Release note: None
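The roachtest fix wraps its roachprod calls in Go contexts with deadlines; the shell analogue is `timeout`, sketched here with an arbitrary bound and a stand-in command:

```shell
# Bound a possibly-hanging network command with a deadline so a dropped
# SSH connection surfaces as a timeout instead of a stuck test.
# GNU timeout exits with status 124 when the deadline fires.
timeout 5 sh -c 'echo roachprod-command-ok'     # finishes well within bound
timeout 1 sh -c 'sleep 10' || echo "timed out with status $?"
```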
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785. Fixes #35326.

Co-authored-by: irfan sharif <[email protected]>

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734. Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:
- if the license acquisition was disabled by the env var config, the error message would not be clear.
- the licensing code would deadlock silently on OSS-only builds (because the license failure channel was not written in that control branch).
- the error/warning messages would be interleaved on the same line as the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the setup of the example database and workload would be performed asynchronously, with unclear signalling of when the user can expect to use them interactively.

After this patch:
- it's possible to override the license acquisition URL with COCKROACH_DEMO_LICENSE_URL; this is used in tests.
- setting up the example database, partitioning, and workload is done before presenting the interactive prompt.
- partitioning the example database, if requested by --geo-partitioned-replicas, waits for license acquisition to complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early (or earlier).

For example, on OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with enterprise features in firewalled network environments, it is now possible to redirect the license acquisition with the environment variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>