
Paramiko occasionally fails to open transport -- 'SSHException: Error reading SSH protocol banner' #1853

Closed
borellim opened this issue Aug 8, 2018 · 37 comments · Fixed by #2019

Comments

@borellim
Member

borellim commented Aug 8, 2018

On my machine, the daemon occasionally fails to open a transport to the remote computer. The issue appears semi-randomly (it is hard to reproduce consistently), but once the daemon encounters it, it seems to keep failing until it is restarted. This is not the case when the issue is triggered manually from a verdi shell.

An excerpt from the daemon log:

08/07/2018 11:51:51 AM <12595> aiida.work.job_processes: [INFO] updating calculation<25178>
08/07/2018 11:51:51 AM <12595> aiida.scheduler.slurm: [DEBUG] squeue command: SLURM_TIME_FORMAT='standard' squeue --noheader -o '%i^^^%t^^^%r^^^%B^^^%u^^^%D^^^%C^^^%R^^^%P^^^%l^^^%M^^^%S^^^%j^^^%V' --jobs=8942443
08/07/2018 11:51:51 AM <12595> aiida.transport.SshTransport: [DEBUG] Command to be executed: cd '/users/mborelli' && SLURM_TIME_FORMAT='standard' squeue --noheader -o '%i^^^%t^^^%r^^^%B^^^%u^^^%D^^^%C^^^%R^^^%P^^^%l^^^%M^^^%S^^^%j^^^%V' --jobs=8942443
08/07/2018 11:51:52 AM <12595> aiida.scheduler: [DEBUG] Datetime to serialize in JobInfo is naive, this should be fixed!
08/07/2018 11:51:52 AM <12595> aiida.scheduler: [DEBUG] Datetime to serialize in JobInfo is naive, this should be fixed!
08/07/2018 11:51:52 AM <12595> aiida.work.transports: [DEBUG] Transport request closing transport for AuthInfo for [email protected] on daint5
08/07/2018 11:51:52 AM <12595> aiida.work.job_processes: [INFO] updating calculation<25178> successful
08/07/2018 11:51:52 AM <12595> aiida.work.transports: [DEBUG] transport callback with interval 5
08/07/2018 11:51:57 AM <12595> aiida.work.transports: [DEBUG] Transport request opening transport for AuthInfo for [email protected] on daint5
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] Exception: Error reading SSH protocol banner
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] Traceback (most recent call last):
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/paramiko/transport.py", line 1893, in run
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]     self._check_banner()
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/paramiko/transport.py", line 2049, in _check_banner
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR]     'Error reading SSH protocol banner' + str(e)
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] SSHException: Error reading SSH protocol banner
08/07/2018 11:52:12 AM <12595> paramiko.transport: [ERROR] 
08/07/2018 11:52:12 AM <12595> aiida.transport.SshTransport: [ERROR] Error connecting through SSH: [SSHException] Error reading SSH protocol banner, connect_args were: {'username': u'mborelli', 'allow_agent': True, 'sock': <aiida.transport.util._DetachedProxyCommand object at 0x7f9f96d87250>, 'gss_deleg_creds': False, 'key_filename': u'/home/mborelli/.ssh/id_rsa-cscs2', 'compress': True, 'look_for_keys': True, 'timeout': 60, 'gss_kex': False, 'gss_host': u'daint.cscs.ch', 'port': 22, 'gss_auth': False}
08/07/2018 11:52:12 AM <12595> aiida.work.transports: [ERROR] exception occurred while trying to open transport:
 Error reading SSH protocol banner
08/07/2018 11:52:12 AM <12595> aiida.work.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/transports.py", line 102, in request_transport
    yield transport_request.future
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/job_processes.py", line 118, in do_update
    transport = yield request
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SSHException: Error reading SSH protocol banner

08/07/2018 11:52:12 AM <12595> aiida.work.transports: [DEBUG] Transport request closing transport for AuthInfo for [email protected] on daint5
08/07/2018 11:52:12 AM <12595> aiida.daemon.runner: [DEBUG] Ticking the legacy workflows
08/07/2018 11:52:12 AM <12595> aiida.daemon.runner: [DEBUG] Checking for workflows to manage
08/07/2018 11:52:12 AM <12595> aiida.daemon.runner: [DEBUG] Running execute_steps
08/07/2018 11:52:12 AM <12595> aiida.workflowmanager: [DEBUG] Querying the worflow DB
08/07/2018 11:52:12 AM <12595> aiida.aiida_quantumespresso.calculations.pw.PwCalculation: [WARNING] iteration 1 of do_update excepted, retrying after 10 seconds
Traceback (most recent call last):
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/utils.py", line 73, in exponential_backoff_retry
    result = yield coro()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/job_processes.py", line 118, in do_update
    transport = yield request
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SSHException: Error reading SSH protocol banner

08/07/2018 11:52:22 AM <12595> aiida.work.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/transports.py", line 102, in request_transport
    yield transport_request.future
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/work/job_processes.py", line 118, in do_update
    transport = yield request
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/mborelli/venvs/aiida_dev/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SSHException: Error reading SSH protocol banner

08/07/2018 11:52:22 AM <12595> aiida.work.transports: [DEBUG] Transport request closing transport for AuthInfo for [email protected] on daint5
08/07/2018 11:52:22 AM <12595> aiida.aiida_quantumespresso.calculations.pw.PwCalculation: [WARNING] iteration 2 of do_update excepted, retrying after 20 seconds

This continues for 3 more times until the exponential backoff procedure gives up.
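(For reference, the retry behaviour seen in the log is plain exponential backoff. The sketch below is only illustrative, written for this report rather than taken from aiida-core, and the function name is made up; the real implementation is the tornado coroutine aiida.work.utils.exponential_backoff_retry seen in the traceback.)

import time

def retry_with_backoff(func, initial_delay=10, max_attempts=5):
    # Call func(); on failure wait 10, 20, 40, 80 seconds before retrying,
    # and re-raise the last exception once all attempts are exhausted.
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            print('iteration %d excepted, retrying after %d seconds' % (attempt, delay))
            time.sleep(delay)
            delay *= 2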

Both do_update and do_submit have been shown to trigger the error.

A script to reproduce it from a shell is as follows (change the computer name as appropriate):

def testing():
    import time
    from aiida.orm.computer import Computer
    for i in range(50):  # 10 is often too few to trigger the error
        computer = Computer.get('daint5')
        t = computer.get_transport()
        t.open()
        time.sleep(1)
        t.close()

try:
    testing()
except Exception as e:
    import traceback
    traceback.print_exc()

Repeat it as necessary until an exception SSHException: Error reading SSH protocol banner is thrown:

timeout 60 1533720989.21
readline timeout 15
TEST
Traceback (most recent call last):
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 2048, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/packet.py", line 353, in readline
    buf += self._read_timeout(timeout)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/packet.py", line 555, in _read_timeout
    raise socket.timeout()
timeout
TEST
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] Exception: Error reading SSH protocol banner
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] Traceback (most recent call last):
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 1896, in run
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]     self._check_banner()
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]   File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 2057, in _check_banner
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR]     'Error reading SSH protocol banner' + str(e)
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] SSHException: Error reading SSH protocol banner
08/08/2018 11:35:44 AM <18341> paramiko.transport: [ERROR] 
None
08/08/2018 11:35:44 AM <18341> aiida.transport.SshTransport: [ERROR] Error connecting through SSH: [SSHException] Error reading SSH protocol banner, connect_args were: {'username': u'mborelli', 'allow_agent': True, 'sock': <aiida.transport.util._DetachedProxyCommand object at 0x7fab3e211b90>, 'gss_deleg_creds': False, 'key_filename': u'/home/mborelli/.ssh/id_rsa-cscs2', 'compress': True, 'look_for_keys': True, 'timeout': 60, 'gss_kex': False, 'gss_host': u'daint.cscs.ch', 'port': 22, 'gss_auth': False}
Traceback (most recent call last):
  File "<ipython-input-2-ceeb06cb44db>", line 2, in <module>
    testing()
  File "<ipython-input-1-75457a502b8c>", line 7, in testing
    t.open()
  File "/home/mborelli/repos/aiida_dev/aiida-core/aiida/transport/plugins/ssh.py", line 488, in open
    self._client.connect(self._machine, **connection_arguments)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/client.py", line 392, in connect
    t.start_client(timeout=timeout)
  File "/home/mborelli/venvs/aiida_dev/local/lib/python2.7/site-packages/paramiko/transport.py", line 548, in start_client
    raise e
SSHException: Error reading SSH protocol banner

Note that I changed the file [...]/site-packages/paramiko/transport.py around line 2052 in order to show the socket.timeout exception before it is swallowed by Paramiko and turned into an SSHException.

@greschd
Member

greschd commented Aug 10, 2018

Is it possible that there's some sort of limit on how many times you can connect to that machine? I've also encountered this issue semi-randomly, and for me simply increasing the "Connection cooldown time" (in verdi computer configure) to 20 seconds seems to have resolved it. I've also increased "Connection timeout" to 120s, but I don't think that's relevant here.

My guess is that once you've hit the rate limit, it will take a while until you can connect again, which is why it will continue failing.

However, I haven't tested this enough to know whether it's a stable fix or just makes the problem less frequent. Maybe it would be nice to catch the error instead and then just wait for a (longer than usual) while before re-trying the connection.

@greschd
Member

greschd commented Aug 10, 2018

Incidentally, the machine I've had this issue with is also at CSCS -- it might make sense to ask them what exactly their SSH rate limits are.

@greschd
Member

greschd commented Aug 10, 2018

(link to a StackOverflow post about the "Error reading SSH protocol banner" error, referenced in the reply below)

@borellim
Member Author

Thank you Dominik.

It does indeed seem to be a case of a long response time from the server, as suggested in your StackOverflow post. But I don't think I am getting rate-limited, for two reasons:

  • in the daemon, we are using an exponential backoff procedure that waits 10, 20, 40, 80 seconds before retrying (and finally gives up after that), and yet all attempts fail in this procedure if the first attempt fails;
  • if instead I run the python snippet above in a verdi shell, even if I rapid-fire connections (by removing calls to time.sleep()), I don't get consistent strings of failures, just random ones.

I found a variable banner_timeout at line 399 of [...]/site-packages/paramiko/transport.py that determines the waiting time before an exception is thrown. I increased it from 15 to 60 seconds, but I still see about 1 in 10 connections taking more than that to establish (and therefore failing) when testing in the shell. I'll see if anything changes when actually using the daemon.
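(As an aside, and only as a hedged sketch rather than anything aiida-core does: the same timeout can be set per connection instead of editing the installed paramiko, since SSHClient.connect() accepts a banner_timeout argument, if I remember correctly since paramiko 1.15, which is forwarded to the underlying Transport. The ProxyCommand sock argument from the logs above is omitted for brevity.)

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    'daint.cscs.ch',
    username='mborelli',
    key_filename='/home/mborelli/.ssh/id_rsa-cscs2',
    timeout=60,          # TCP connect timeout, as in the connect_args above
    banner_timeout=60,   # how long to wait for the remote "SSH-2.0-..." banner line
)
client.close()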

Paging @sphuber whom I forgot to mention at the beginning.

@greschd
Member

greschd commented Oct 2, 2018

Since switching to Piz Daint, I'm having this issue again.

You've mentioned that you can connect even after a failure when triggering it from a verdi shell. Did you use the same instance of SshTransport (or the same paramiko SSHClient) when trying this? I think it would be extremely helpful if we could fix this to a point where it just doesn't require restarting the daemon.

@sphuber
Contributor

sphuber commented Oct 2, 2018

I just configured Daint yesterday for the first time and I also got this problem straight away. Likewise, the exponential backoff didn't help, as all retries failed. I agree that this is really annoying and should be fixed, but I have no idea where to look. It really did seem to come from the banner timeout, but I cannot think of a reason why this would trigger consistently in a daemon worker but only randomly in a shell.

@greschd
Member

greschd commented Oct 2, 2018

My thinking was that maybe there's something in the state of the SSHClient which is wrong after the initial exception, so maybe just "resetting" the SshTransport to have a new client after it fails would fix the issue (at least partially).

@borellim
Member Author

borellim commented Oct 3, 2018

@greschd No, I am not reusing the same SshTransport. If I try to do so, I get this error from the second connection onwards:
[ERROR] Error connecting through SSH: [ProxyCommandFailure] ('ssh -A [email protected] /usr/bin/netcat daint.cscs.ch 22', 'Broken pipe')

Two ideas:

  • I suspect this error is linked to "OSError: [Errno 24] Too many open files" #1923 (Paramiko definitely leaks ssh processes), but I haven't been able to confirm the link conclusively.
  • Are you all using a ProxyCommand to get to Piz Daint? Maybe that has something to do with it.

@greschd
Member

greschd commented Oct 3, 2018

Yes, I also use a ProxyCommand to connect to Piz Daint. As far as I can tell though, I'm using a different ProxyCommand, ssh -W %h:%p ela, whereas you seem to be using the netcat solution. Anyhow, this could well be part of the reason for this issue -- paramiko creates a subprocess when a proxy is used, but not otherwise.
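(For illustration only, a hedged sketch rather than code from aiida-core: this is roughly how a ProxyCommand turns into a child process in paramiko. The command is wrapped in a subprocess, and the resulting socket-like object is passed to connect() as sock; host and user names are taken from the logs above.)

import paramiko

# Spawns an ssh subprocess that tunnels to daint.cscs.ch via the login host 'ela'
proxy = paramiko.ProxyCommand('ssh -W daint.cscs.ch:22 ela')

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('daint.cscs.ch', username='mborelli', sock=proxy)
# ... use the connection ...
client.close()
proxy.close()   # if this is skipped or fails, the proxy ssh process can linger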

@borellim
Member Author

borellim commented Oct 3, 2018

I agree. In fact, my ProxyCommand is the exact command that is filling the output of ps after the daemon has been running for a while. I can even reproduce it by opening+closing SshTransport objects in a loop as per the snippet above, but these processes are (mostly, not always) cleaned up when I terminate the loop. I suspect that the python garbage collector is complicating the matter, and it wouldn't be the first time for Paramiko, from a quick look at the long discussions here, here, or here.

Of course, a question remains about what is causing Error reading SSH protocol banner in the first place. I tried to reproduce it by manually opening 50 concurrent ssh shells to Piz Daint (via the same proxy), but none failed.

@greschd
Member

greschd commented Oct 3, 2018

Two things which I've found so far:

  • I'm not sure if ProxyCommand.close is called when the SshTransport closes. It's possible that it happens when the client closes, but this needs verification. UPDATE: close is in fact called when closing the client.
  • Manually closing the proxy doesn't completely kill the process, but leaves an ssh <defunct> process lying around.

@greschd
Member

greschd commented Oct 3, 2018

Ok, I've fixed two issues in https://github.com/greschd/aiida_core/tree/close_proxycommand:

  • The _DetachedProxyCommand had a super().__init__ call, which created a second instance of the subprocess.
  • The close method of the ProxyCommand didn't poll or wait on the subprocess after killing it, which resulted in the defunct processes.

I've also made the killing of the subprocess a bit more aggressive, using kill if the terminate hasn't worked after 2 seconds.
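(A hedged sketch of that terminate-then-kill logic, not the actual code on the branch; paramiko's ProxyCommand keeps its child in self.process, so a Popen object is assumed here, e.g. close_proxy_process(proxy.process).)

import time

def close_proxy_process(process):
    # Ask the proxy subprocess to exit, give it roughly 2 seconds, then force-kill it,
    # and finally wait() so the child is reaped and does not stay <defunct>.
    if process.poll() is None:          # still running
        process.terminate()             # polite request to exit
        for _ in range(20):             # wait up to ~2 seconds in 0.1 s steps
            if process.poll() is not None:
                break
            time.sleep(0.1)
        else:
            process.kill()              # escalate if terminate did not work
    process.wait()                      # reap the child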

I'll check if this changes anything about the issue.

@sphuber
Contributor

sphuber commented Oct 3, 2018

Did you find a way to reproduce the problem consistently? How are you testing this?

@greschd
Member

greschd commented Oct 3, 2018

No, so far I've gone by the "looking at the code" method 😄. But it's happening often enough that I can probably tell with reasonable certainty whether it's fixed.

@greschd
Member

greschd commented Oct 3, 2018

The other thing I've done to test is just open / close the transport and see if there are any ssh processes lingering around.

@greschd
Member

greschd commented Oct 3, 2018

It seems this does not fix the issue, but at least I think I have a decent way of reproducing it: launch a calculation, then disconnect the internet while the update task is going on. Reconnect, and the update task will still fail.

@greschd
Member

greschd commented Oct 3, 2018

The next thing to do is to check the state of the transport after an exception was thrown, and see whether we can fix it with some exception handling.

@greschd
Member

greschd commented Oct 3, 2018

@sphuber: I'm looking at https://github.com/aiidateam/aiida_core/blob/develop/aiida/work/transports.py#L91

In the case where the transport raises while opening, shouldn't there be a mechanism to ensure that we don't use the same transport request again? Something like the finally case below?

Or instead, check that the transport request hasn't excepted right after the .get(...).

@greschd
Member

greschd commented Oct 3, 2018

Haven't completely wrapped my head around that code, though.

@borellim
Member Author

borellim commented Oct 3, 2018

@greschd, I manually applied these 3 commits from your fork, and they seem to fix it! The ssh process leak is gone, and my code snippet above can't trigger the SSHException: Error reading SSH protocol banner any more. I am a few commits behind develop, though.

I will try running the actual aiida daemon with this patch.

@greschd
Member

greschd commented Oct 3, 2018

Well, I guess it's possible this makes the proxy connection more stable. However, if I manually interrupt the connection, the error still persists, in that it requires a daemon restart.

I think it might be because the future which fails is not removed. In that case, it would actually be the same SSH error showing up multiple times, not a different one.

@sphuber
Contributor

sphuber commented Oct 3, 2018

shouldn't there be a mechanism to ensure that we don't use the same transport request again?

I thought this was already done correctly. If the open call excepts, the exception is set on the future. Now all the request_transport context callers that are waiting for this future will except as well, and in the finally clause, when all references are gone, the request is cleaned up. But I think you might be right that there is a stale future/transport here that is not cleaned up properly. @muhrin could you also have a look?

@greschd
Member

greschd commented Oct 3, 2018

It should be quite straightforward to put something in the exception message to see if they are always the same one.

@greschd
Member

greschd commented Oct 3, 2018

I'm not used to tornado-style coroutines: does the yield future part raise when there's an exception in the future?
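(A minimal standalone sketch to check exactly that, not code from aiida-core: with tornado-style coroutines, yielding a future whose exception has been set re-raises that exception at the yield point.)

from tornado import gen, ioloop
from tornado.concurrent import Future

@gen.coroutine
def demo():
    future = Future()
    future.set_exception(RuntimeError('banner error'))
    try:
        yield future   # the stored exception is re-raised here
    except RuntimeError as exc:
        print('re-raised at the yield point: %s' % exc)

ioloop.IOLoop.current().run_sync(demo)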

@greschd
Member

greschd commented Oct 3, 2018

Yup, adding a timestamp to the error message confirms that it's definitely the same error being re-thrown.

@greschd
Member

greschd commented Oct 3, 2018

Aah, I think del self._transport_requests[authinfo.id] should happen not only when the transport_request.count is zero, but also in the exception case.

@sphuber
Contributor

sphuber commented Oct 3, 2018

I was thinking about that, but was afraid that would lead to other exceptions.

@sphuber
Contributor

sphuber commented Oct 3, 2018

So might it be that, as the open excepts, before it gets to the finally to clean it up, another context manager is opened, upping the ref count and preventing the del from being hit in the finally?

@sphuber
Contributor

sphuber commented Oct 3, 2018

I think what we have to do is keep the logic but change the conditional on line 77: if we change if transport_request is None to something like if transport_request is None or transport_request.future.get_exception() is not None: (or however you determine whether an exception was set on the future), then also when the open call excepted, a new future is created.
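(A hedged sketch of that proposed check, not the actual aiida-core code: it is only a fragment of the request-handling logic, TransportRequest and the attribute names are assumed from the discussion above, and done()/exception() are used in place of the hypothetical get_exception().)

transport_request = self._transport_requests.get(authinfo.id, None)

# Treat a cached request whose future already failed as if it did not exist,
# so a fresh open() is attempted instead of re-raising the old SSHException forever.
request_failed = (
    transport_request is not None
    and transport_request.future.done()
    and transport_request.future.exception() is not None
)

if transport_request is None or request_failed:
    transport_request = TransportRequest()
    self._transport_requests[authinfo.id] = transport_request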

@greschd
Member

greschd commented Oct 3, 2018

Since the _transport_requests dict is only used in that method, I think those two things would be functionally equivalent.

@greschd
Member

greschd commented Oct 3, 2018

I think there are roughly three ways to fix this:

  • remove it from the dict in the except after the yield
  • remove it in the except in the do_update method
  • check if it failed at the beginning

From my initial tests (1st method) this seems to actually solve the problem 🎉. Let me know which option you prefer, and I'll add it to my branch with the proxy fixes.

@greschd
Member

greschd commented Oct 3, 2018

Do you think there's a reasonable way we can test this? I know we have the torquessh test, but how would we get that to fail predictably?

@sphuber
Contributor

sphuber commented Oct 4, 2018

For the first one, do you mean the except of try: transport_request.count += 1; yield transport_request.future? Because I think that is the wrong place: that should catch exceptions that are raised when the open transport is being used, and it is simply re-raising them. We should actually put it in the exception handling of the open call in the do_open callback.

Regarding the second, I wouldn't do it there, because we would have to do it in all the transport tasks. They should not know anything about invalidating the transport requests.

I'm not sure what you mean by the third.

@greschd
Member

greschd commented Oct 4, 2018

Yeah, that makes sense. Of course, when using it we will also get an error if the future excepted, but that except is indeed too broad.

@sphuber
Contributor

sphuber commented Oct 4, 2018

Are you confident about your commits regarding the ProxyCommand? I made a separate issue. It would be good if you could make a PR for that. Then I will create a PR with the fix for the transport future invalidation.

@sphuber
Contributor

sphuber commented Oct 4, 2018

OK, so after @greschd fixed the ProxyCommand issue, the SSHExceptions were no longer reproducible. It appears that the cause was not the number of attempted connections per unit time, but the number of connections open at one time. Now that connections are properly closed, even when using a proxy, the timeouts have stopped occurring.

@sphuber
Contributor

sphuber commented Oct 4, 2018

The behavior seen in this issue was actually a manifestation of #2018, which was fixed in #2019. The fact that the transport futures never seemed to recover from this exception is actually due to another exception, detailed in #2020. This discussion will be continued there, so I am closing this issue.

@sphuber sphuber closed this as completed Oct 4, 2018