
Paramiko occasionally fails to open transport -- 'SSHException: Error reading SSH protocol banner' #1853

Closed
borellim opened this issue Aug 8, 2018 · 37 comments · Fixed by #2019

Comments

@borellim
Member

borellim commented Aug 8, 2018

On my machine, the daemon occasionally fails to open a transport to the remote computer. The issue appears semi-randomly (it is hard to reproduce consistently), but once the daemon encounters it, it seems to keep failing until it is restarted. This is not the case when the issue is triggered manually from a verdi shell.

An excerpt from the daemon log:

08/07/2018 11:51:51 AM <12595> aiida.work.job_processes: [INFO] updating calculation<25178>
08/07/2018 11:51:51 AM <12595> aiida.scheduler.slurm: [DEBUG] squeue command: SLURM_TIME_FORMAT='standard' squeue --noheader -o '%i^^^%t^^^%r^^^%B^^^%u^^^%D^^^%C^^^%R^^^%P^^^%l^^^%M^^^%S^^^%j^^^%V' --jobs=8942443
08/07/2018 11:51:51 AM <12595> aiida.transport.SshTransport: [DEBUG] Command to be executed: cd '/users/mborelli' && SLURM_TIME_FORMAT='standard' squeue --noheader -o '%i^^^%t^^^%r^^^%B^^^%u^^^%D^^^%C^^^%R^^^%P^^^%l^^^%M^^^%S^^^%j^^^%V' --jobs=8942443
08/07/2018 11:51:52 AM <12595> aiida.scheduler: [DEBUG] Datetime to serialize in JobInfo is naive, this should be fixed!
08/07/2018 11:51:52 AM <12595> aiida.scheduler: [DEBUG] Datetime to serialize in JobInfo is naive, this should be fixed!
08/07/2018 11:51:52 AM <12595> aiida.work.transports: [DEBUG] Transport request closing transport for AuthInfo for [email protected] on daint5
08/07/2018 11:51:52 AM <12595> aiida.work.job_processes: [INFO] updating calculation<25178> successful
08/07/2018 11:51:52 AM <12595> aiida.work.transports: [DEBUG] transport callback with interval 5
08/07/2018 11:51:57 AM <12595> aiida.work.transports: [DEBUG] Transport request opening transport for AuthInfo for [email protected] on daint5
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] Exception: Error reading SSH protocol banner
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] Traceback (most recent call last):
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/paramiko/transport.py", line 1893, in run
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]     self._check_banner()
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/paramiko/transport.py", line 2049, in _check_banner
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]     'Error reading SSH protocol banner' + str(e)
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] SSHException: Error reading SSH protocol banner
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] 
08/07/2018 11:52:12 AM <12595> aiida.transport.SshTransport: [ERROR] Error connecting through SSH: [SSHException] Error reading SSH protocol banner, connect_args were: {'username': u'mborelli', 'allow_agent': True, 'sock': <aiida.transport.util._DetachedProxyCommand object at 0x7f9f96d87250>, 'gss_deleg_creds': False, 'key_filename': u'/home/mborelli/.ssh/id_rsa-cscs2', 'compress': True, 'look_for_keys': True, 'timeout': 60, 'gss_kex': False, 'gss_host': u'daint.cscs.ch', 'port': 22, 'gss_auth': False}
08/07/2018 11:52:12 AM <12595> aiida.work.transports: [ERROR] exception occurred while trying to open transport:
 Error reading SSH protocol banner
08/07/2018 11:52:12 AM <12595> aiida.work.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/transports.py", line 102, in request_transport
    yield transport_request.future
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/job_processes.py", line 118, in do_update
    transport = yield request
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SSHException: Error reading SSH protocol banner

08/07/2018 11:52:12 AM <12595> aiida.work.transports: [DEBUG] Transport request closing transport for AuthInfo for [email protected] on daint5
08/07/2018 11:52:12 AM <12595> aiida.daemon.runner: [DEBUG] Ticking the legacy workflows
08/07/2018 11:52:12 AM <12595> aiida.daemon.runner: [DEBUG] Checking for workflows to manage
08/07/2018 11:52:12 AM <12595> aiida.daemon.runner: [DEBUG] Running execute_steps
08/07/2018 11:52:12 AM <12595> aiida.workflowmanager: [DEBUG] Querying the worflow DB
08/07/2018 11:52:12 AM <12595> aiida.aiida_quantumespresso.calculations.pw.PwCalculation: [WARNING] iteration 1 of do_update excepted, retrying after 10 seconds
Traceback (most recent call last):
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/utils.py", line 73, in exponential_backoff_retry
    result = yield coro()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/job_processes.py", line 118, in do_update
    transport = yield request
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SSHException: Error reading SSH protocol banner

08/07/2018 11:52:22 AM <12595> aiida.work.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/transports.py", line 102, in request_transport
    yield transport_request.future
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/job_processes.py", line 118, in do_update
    transport = yield request
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SSHException: Error reading SSH protocol banner

08/07/2018 11:52:22 AM <12595> aiida.work.transports: [DEBUG] Transport request closing transport for AuthInfo for [email protected] on daint5
08/07/2018 11:52:22 AM <12595> aiida.aiida_quantumespresso.calculations.pw.PwCalculation: [WARNING] iteration 2 of do_update excepted, retrying after 20 seconds

This continues for 3 more times until the exponential backoff procedure gives up.
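(For reference, the retry behaviour seen in the log is plain exponential backoff. The sketch below is only illustrative, written for this report rather than taken from aiida-core, and the function name is made up; the real implementation is the tornado coroutine aiida.work.utils.exponential_backoff_retry seen in the traceback.)

import time

def retry_with_backoff(func, initial_delay=10, max_attempts=5):
    # Call func(); on failure wait 10, 20, 40, 80 seconds before retrying,
    # and re-raise the last exception once all attempts are exhausted.
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            print('iteration %d excepted, retrying after %d seconds' % (attempt, delay))
            time.sleep(delay)
            delay *= 2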

Both do_update and do_submit have been shown to trigger the error.

A script to reproduce it from a shell is as follows (change the computer name as appropriate):

def testing():
    import time
    from aiida.orm.computer import Computer
    for i in range(50):  # 10 is often too few to trigger the error
        computer = Computer.get('daint5')
        t = computer.get_transport()
        t.open()
        time.sleep(1)
        t.close()

try:
    testing()
except Exception as e:
    import traceback
    traceback.print_exc()

Repeat it as necessary until an exception SSHException: Error reading SSH protocol banner is thrown:

timeout 60 1533720989.21
readline timeout 15
TEST
Traceback (most recent call last):
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 2048, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/packet.py", line 353, in readline
    buf += self._read_timeout(timeout)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/packet.py", line 555, in _read_timeout
    raise socket.timeout()
timeout
TEST
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] Exception: Error reading SSH protocol banner
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] Traceback (most recent call last):
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 1896, in run
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]     self._check_banner()
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 2057, in _check_banner
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]     'Error reading SSH protocol banner' + str(e)
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] SSHException: Error reading SSH protocol banner
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] 
None
08/08/2018 11:35:44 AM <18341> aiida.transport.SshTransport: [ERROR] Error connecting through SSH: [SSHException] Error reading SSH protocol banner, connect_args were: {'username': u'mborelli', 'allow_agent': True, 'sock': <aiida.transport.util._DetachedProxyCommand object at 0x7fab3e211b90>, 'gss_deleg_creds': False, 'key_filename': u'/home/mborelli/.ssh/id_rsa-cscs2', 'compress': True, 'look_for_keys': True, 'timeout': 60, 'gss_kex': False, 'gss_host': u'daint.cscs.ch', 'port': 22, 'gss_auth': False}
Traceback (most recent call last):
  File "<ipython-input-2-ceeb06cb44db>", line 2, in <module>
    testing()
  File "<ipython-input-1-75457a502b8c>", line 7, in testing
    t.open()
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/transport/plugins/ssh.py", line 488, in open
    self._client.connect(self._machine, **connection_arguments)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/client.py", line 392, in connect
    t.start_client(timeout=timeout)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 548, in start_client
    raise e
SSHException: Error reading SSH protocol banner

Note that I changed the file [...]/site-packages/paramiko/transport.py around line 2052 in order to show the socket.timeout exception before it is swallowed by Paramiko and turned into an SSHException.

@greschd
Member

greschd commented Aug 10, 2018

Is it possible that there's some sort of limit on how many times you can connect to that machine? I've also encountered this issue semi-randomly, and for me simply increasing the "Connection cooldown time" (in verdi computer configure) to 20 seconds seems to have resolved it. I've also increased "Connection timeout" to 120s, but I don't think that's relevant here.

My guess is that once you've hit the rate limit, it will take a while until you can connect again, which is why it will continue failing.

However, I haven't tested this enough to know whether it's a stable fix or just makes the problem less frequent. Maybe it would be nice to catch the error instead and then just wait for a (longer than usual) while before re-trying the connection.

@greschd
Member

greschd commented Aug 10, 2018

Incidentally, the machine I've had this issue with is also at CSCS -- it might make sense to ask them what exactly their SSH rate limits are.

@greschd
Member

greschd commented Aug 10, 2018

(link to a StackOverflow post about the "Error reading SSH protocol banner" error, referenced in the reply below)

@borellim
Member Author

Thank you Dominik.

It does indeed seem to be a case of a long response time from the server, as suggested in your StackOverflow post. But I don't think I am getting rate-limited, for two reasons:

  • in the daemon, we are using an exponential backoff procedure that waits 10, 20, 40, 80 seconds before retrying (and finally gives up after that), and yet all attempts fail in this procedure if the first attempt fails;
  • if instead I run the python snippet above in a verdi shell, even if I rapid-fire connections (by removing calls to time.sleep()), I don't get consistent strings of failures, just random ones.

I found a variable banner_timeout at line 399 of [...]/site-packages/paramiko/transport.py that determines the waiting time before an exception is thrown. I increased it from 15 to 60 seconds, but I still see about 1 in 10 connections taking more than that to establish (and therefore failing) when testing in the shell. I'll see if anything changes when actually using the daemon.
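(As an aside, and only as a hedged sketch rather than anything aiida-core does: the same timeout can be set per connection instead of editing the installed paramiko, since SSHClient.connect() accepts a banner_timeout argument, if I remember correctly since paramiko 1.15, which is forwarded to the underlying Transport. The ProxyCommand sock argument from the logs above is omitted for brevity.)

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    'daint.cscs.ch',
    username='mborelli',
    key_filename='/home/mborelli/.ssh/id_rsa-cscs2',
    timeout=60,          # TCP connect timeout, as in the connect_args above
    banner_timeout=60,   # how long to wait for the remote "SSH-2.0-..." banner line
)
client.close()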

Paging @sphuber whom I forgot to mention at the beginning.

@greschd
Member

greschd commented Oct 2, 2018

Since switching to Piz Daint, I'm having this issue again.

You've mentioned that you can connect even after a failure when triggering it from a verdi shell. Did you use the same instance of SshTransport (or the same paramiko SSHClient) when trying this? I think it would be extremely helpful if we could fix this to a point where it just doesn't require restarting the daemon.

@sphuber
Contributor

sphuber commented Oct 2, 2018

I just configured Daint yesterday for the first time and I also got this problem straight away. Likewise, the exponential backoff didn't help, as all retries failed. I agree that this is really annoying and should be fixed, but I have no idea where to look. It really did seem to come from the banner timeout, but I cannot think of a reason why this would trigger consistently in a daemon worker but only randomly in a shell.

@greschd
Member

greschd commented Oct 2, 2018

My thinking was that maybe there's something in the state of the SSHClient which is wrong after the initial exception, so maybe just "resetting" the SshTransport to have a new client after it fails would fix the issue (at least partially).

@borellim
Member Author

borellim commented Oct 3, 2018

@greschd No, I am not reusing the same SshTransport. If I try to do so, I get this error from the second connection onwards:
[ERROR] Error connecting through SSH: [ProxyCommandFailure] ('ssh -A [email protected] /usr/bin/netcat daint.cscs.ch 22', 'Broken pipe')

Two ideas:

  • I suspect this error is linked to "OSError: [Errno 24] Too many open files" #1923 (Paramiko definitely leaks ssh processes), but I haven't been able to confirm the link conclusively.
  • Are you all using a ProxyCommand to get to Piz Daint? Maybe that has something to do with it.

@greschd
Member

greschd commented Oct 3, 2018

Yes, I also use a ProxyCommand to connect to Piz Daint. As far as I can tell though, I'm using a different ProxyCommand, ssh -W %h:%p ela, whereas you seem to be using the netcat solution. Anyhow, this could well be part of the reason for this issue -- paramiko creates a subprocess when a proxy is used, but not otherwise.
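(For illustration only, a hedged sketch rather than code from aiida-core: this is roughly how a ProxyCommand turns into a child process in paramiko. The command is wrapped in a subprocess, and the resulting socket-like object is passed to connect() as sock; host and user names are taken from the logs above.)

import paramiko

# Spawns an ssh subprocess that tunnels to daint.cscs.ch via the login host 'ela'
proxy = paramiko.ProxyCommand('ssh -W daint.cscs.ch:22 ela')

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('daint.cscs.ch', username='mborelli', sock=proxy)
# ... use the connection ...
client.close()
proxy.close()   # if this is skipped or fails, the proxy ssh process can linger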

@borellim
Member Author

borellim commented Oct 3, 2018

I agree. In fact, my ProxyCommand is the exact command that is filling the output of ps after the daemon has been running for a while. I can even reproduce it by opening+closing SshTransport objects in a loop as per the snippet above, but these processes are (mostly, not always) cleaned up when I terminate the loop. I suspect that the python garbage collector is complicating the matter, and it wouldn't be the first time for Paramiko, from a quick look at the long discussions here, here, or here.

Of course, a question remains about what is causing Error reading SSH protocol banner in the first place. I tried to reproduce it by manually opening 50 concurrent ssh shells to Piz Daint (via the same proxy), but none failed.

@greschd
Member

greschd commented Oct 3, 2018

Two things which I've found so far:

  • I'm not sure if ProxyCommand.close is called when the SshTransport closes. It's possible that it happens when the client closes, but this needs verification. UPDATE: close is in fact called when closing the client.
  • Manually closing the proxy doesn't completely kill the process, but leaves an ssh <defunct> process lying around.

@greschd
Member

greschd commented Oct 3, 2018

Ok, I've fixed two issues in https://github.com/greschd/aiida_core/tree/close_proxycommand:

  • The _DetachedProxyCommand had a super().__init__ call, which created a second instance of the subprocess.
  • The close method of the ProxyCommand didn't poll or wait on the subprocess after killing it, which resulted in the defunct processes.

I've also made the killing of the subprocess a bit more aggressive, using kill if the terminate hasn't worked after 2 seconds.
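(A hedged sketch of that terminate-then-kill logic, not the actual code on the branch; paramiko's ProxyCommand keeps its child in self.process, so a Popen object is assumed here, e.g. close_proxy_process(proxy.process).)

import time

def close_proxy_process(process):
    # Ask the proxy subprocess to exit, give it roughly 2 seconds, then force-kill it,
    # and finally wait() so the child is reaped and does not stay <defunct>.
    if process.poll() is None:          # still running
        process.terminate()             # polite request to exit
        for _ in range(20):             # wait up to ~2 seconds in 0.1 s steps
            if process.poll() is not None:
                break
            time.sleep(0.1)
        else:
            process.kill()              # escalate if terminate did not work
    process.wait()                      # reap the child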

I'll check if this changes anything about the issue.

@sphuber
Contributor

sphuber commented Oct 3, 2018

Did you find a way to reproduce the problem consistently? How are you testing this?

@greschd
Member

greschd commented Oct 3, 2018

No, so far I've gone by the "looking at the code" method 😄. But it's happening often enough that I can probably tell with reasonable certainty whether it's fixed.

@greschd
Member

greschd commented Oct 3, 2018

The other thing I've done to test is just open / close the transport and see if there are any ssh processes lingering around.

@greschd
Member

greschd commented Oct 3, 2018

It seems this does not fix the issue, but at least I think I have a decent way of reproducing it: launch a calculation, then disconnect the internet while the update task is going on. Reconnect, and the update task will still fail.

@greschd
Member

greschd commented Oct 3, 2018

The next thing to do is to check the state of the transport after an exception was thrown, and see whether we can fix it with some exception handling.

@greschd
Member

greschd commented Oct 3, 2018

@sphuber: I'm looking at https://github.com/aiidateam/aiida_core/blob/develop/aiida/work/transports.py#L91

In the case where the transport raises while opening, shouldn't there be a mechanism to ensure that we don't use the same transport request again? Something like the finally case below?

Or instead, check that the transport request hasn't excepted right after the .get(...).

@greschd
Member

greschd commented Oct 3, 2018

Haven't completely wrapped my head around that code, though.

@borellim
Member Author

borellim commented Oct 3, 2018

@greschd, I manually applied these 3 commits from your fork, and they seem to fix it! The ssh process leak is gone, and my code snippet above can't trigger the SSHException: Error reading SSH protocol banner any more. I am a few commits behind develop, though.

I will try running the actual aiida daemon with this patch.

@greschd
Member

greschd commented Oct 3, 2018

Well, I guess it's possible this makes the proxy connection more stable. However, if I manually interrupt the connection, the error still persists, in that it requires a daemon restart.

I think it might be because the future which fails is not removed. In that case, it would actually be the same SSH error showing up multiple times, not a different one.

@sphuber
Contributor

sphuber commented Oct 3, 2018

shouldn't there be a mechanism to ensure that we don't use the same transport request again?

I thought this was already done correctly. If the open call excepts, the exception is set on the future. Now all the request_transport context callers that are waiting for this future will except as well, and in the finally clause, when all references are gone, the request is cleaned up. But I think you might be right that there is a stale future/transport here that is not cleaned up properly. @muhrin could you also have a look?

@greschd
Member

greschd commented Oct 3, 2018

It should be quite straightforward to put something in the exception message to see if they are always the same one.

@greschd
Member

greschd commented Oct 3, 2018

I'm not used to tornado-style coroutines: does the yield future part raise when there's an exception in the future?
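(A minimal standalone sketch to check exactly that, not code from aiida-core: with tornado-style coroutines, yielding a future whose exception has been set re-raises that exception at the yield point.)

from tornado import gen, ioloop
from tornado.concurrent import Future

@gen.coroutine
def demo():
    future = Future()
    future.set_exception(RuntimeError('banner error'))
    try:
        yield future   # the stored exception is re-raised here
    except RuntimeError as exc:
        print('re-raised at the yield point: %s' % exc)

ioloop.IOLoop.current().run_sync(demo)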

@greschd
Member

greschd commented Oct 3, 2018

Yup, adding a timestamp to the error message confirms that it's definitely the same error being re-thrown.

@greschd
Member

greschd commented Oct 3, 2018

Aah, I think del self._transport_requests[authinfo.id] should happen not only when the transport_request.count is zero, but also in the exception case.

@sphuber
Contributor

sphuber commented Oct 3, 2018

I was thinking about that, but was afraid that would lead to other exceptions.

@sphuber
Contributor

sphuber commented Oct 3, 2018

So might it be that, as the open excepts, before it gets to the finally to clean it up, another context manager is opened, upping the ref count and preventing the del from being hit in the finally?

@sphuber
Contributor

sphuber commented Oct 3, 2018

I think what we have to do is keep the logic but change the conditional on line 77: if we change if transport_request is None to something like if transport_request is None or transport_request.future.get_exception() is not None: (or however you determine whether an exception was set on the future), then also when the open call excepted, a new future is created.
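(A hedged sketch of that proposed check, not the actual aiida-core code: it is only a fragment of the request-handling logic, TransportRequest and the attribute names are assumed from the discussion above, and done()/exception() are used in place of the hypothetical get_exception().)

transport_request = self._transport_requests.get(authinfo.id, None)

# Treat a cached request whose future already failed as if it did not exist,
# so a fresh open() is attempted instead of re-raising the old SSHException forever.
request_failed = (
    transport_request is not None
    and transport_request.future.done()
    and transport_request.future.exception() is not None
)

if transport_request is None or request_failed:
    transport_request = TransportRequest()
    self._transport_requests[authinfo.id] = transport_request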

@greschd
Member

greschd commented Oct 3, 2018

Since the _transport_requests dict is only used in that method, I think those two things would be functionally equivalent.

@greschd
Member

greschd commented Oct 3, 2018

I think there are roughly three ways to fix this:

  • remove it from the dict in the except after the yield
  • remove it in the except in the do_update method
  • check if it failed at the beginning

From my initial tests (1st method) this seems to actually solve the problem 🎉. Let me know which option you prefer, and I'll add it to my branch with the proxy fixes.

@greschd
Member

greschd commented Oct 3, 2018

Do you think there's a reasonable way we can test this? I know we have the torquessh test, but how would we get that to fail predictably?

@sphuber
Contributor

sphuber commented Oct 4, 2018

For the first one, do you mean the except of try: transport_request.count += 1; yield transport_request.future? Because I think that is the wrong place: that should catch exceptions that are raised when the open transport is being used, and it is simply re-raising them. We should actually put it in the exception handling of the open call in the do_open callback.

Regarding the second, I wouldn't do it there, because we would have to do it in all the transport tasks. They should not know anything about invalidating the transport requests.

I'm not sure what you mean by the third.

@greschd
Member

greschd commented Oct 4, 2018

Yeah, that makes sense. Of course, when using it we will also get an error if the future excepted, but that except is indeed too broad.

@sphuber
Contributor

sphuber commented Oct 4, 2018

Are you confident about your commits regarding the ProxyCommand? I made a separate issue. It would be good if you could make a PR for that. Then I will create a PR with the fix for the transport future invalidation.

@sphuber
Contributor

sphuber commented Oct 4, 2018

OK, so after @greschd fixed the ProxyCommand issue, the SSHExceptions were no longer reproducible. It appears that the cause was not the number of attempted connections per unit time, but the number of connections open at one time. Now that connections are properly closed, even when using a proxy, the timeouts have stopped occurring.

@sphuber
Contributor

sphuber commented Oct 4, 2018

The behavior seen in this issue was actually a manifestation of #2018, which was fixed in #2019. The fact that the transport futures never seemed to recover from this exception is actually due to another exception, detailed in #2020. This discussion will be continued there, so I am closing this issue.

@sphuber sphuber closed this as completed Oct 4, 2018