Paramiko occasionally fails to open transport -- 'SSHException: Error reading SSH protocol banner' #1853
Is it possible that there's some sort of limit on how many times you can connect to that machine? I've also encountered this issue semi-randomly, and for me simply increasing the "Connection cooldown time" setting helped. My guess is that once you've hit the rate limit, it takes a while until you can connect again, which is why it continues failing. However, I haven't tested this enough to know whether it's a stable fix or whether it just makes the problem less frequent. Maybe it would be nice to catch the error instead and then wait for a (longer than usual) while before re-trying the connection.
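The catch-the-error-and-wait idea could be sketched as a small retry helper. This is hypothetical illustration code, not AiiDA or paramiko API; `connect_with_cooldown` and its parameters are made up:

```python
import time


def connect_with_cooldown(open_fn, max_retries=5, cooldown=5.0, backoff=2.0):
    """Call open_fn() until it succeeds, sleeping an increasing
    cooldown between attempts. Hypothetical helper, not AiiDA API."""
    for attempt in range(max_retries):
        try:
            return open_fn()
        except Exception:
            # On the last attempt, give up and re-raise the error.
            if attempt == max_retries - 1:
                raise
            time.sleep(cooldown * backoff ** attempt)
```

With `cooldown=5.0` and `backoff=2.0` the waits between attempts grow as 5, 10, 20 and 40 seconds before the helper gives up and re-raises.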
Incidentally, the machine I've had this issue with is also at CSCS -- it might make sense to ask them what exactly their SSH rate limits are.
Also, a relevant StackOverflow question: https://stackoverflow.com/questions/25609153/paramiko-error-reading-ssh-protocol-banner
Thank you Dominik. It seems indeed to be a case of a long response time by the server, as suggested in your StackOverflow post. But I don't think I am getting rate-limited, for two reasons: [...]
I found a variable [...]. Paging @sphuber, whom I forgot to mention at the beginning.
Since switching to Piz Daint, I'm having this issue again. You've mentioned that you can connect even after a failure when triggering it from a verdi shell. Did you use the same transport instance?
I just configured Daint yesterday for the first time and I also got this problem straight away. Likewise, the exponential backoff didn't help, as all retries failed. I agree that this is really annoying and should be fixed, but I have no idea where to look. It really did seem to come from the banner timeout, but I cannot think of a reason why this would trigger consistently in a daemon worker but only randomly in a shell.
My thinking was that maybe there's something in the state of the transport.
@greschd No, I am not reusing the same instance. Two ideas: [...]
Yes, I also use a ProxyCommand to connect to Piz Daint. As far as I can tell though, I'm using a different ProxyCommand.
I agree. In fact, my ProxyCommand is the exact command that is filling the output of [...]. Of course, a question remains about what is causing [...].
Two things which I've found so far: [...]
Ok, I've fixed two issues in https://github.com/greschd/aiida_core/tree/close_proxycommand: [...]
I've also made the killing of the subprocess a bit more aggressive, using [...]. I'll check if this changes anything about the issue.
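A terminate-then-kill sequence for a ProxyCommand-style subprocess might look like the following sketch, using only the standard library. The helper name `stop_proxy` and the grace period are assumptions for illustration, not paramiko or AiiDA API:

```python
import subprocess


def stop_proxy(proc, grace=1.0):
    """Terminate a ProxyCommand-style subprocess, escalating to a
    hard kill if it does not exit within `grace` seconds."""
    proc.terminate()  # polite SIGTERM first
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL if it lingers
        proc.wait()
    return proc.returncode
```

For example, `stop_proxy(subprocess.Popen(["sleep", "60"]))` returns promptly with a non-zero return code, leaving no lingering process behind.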
Did you find a way to reproduce the problem consistently? How are you testing this?
No, so far I've gone by the "looking at the code" method 😄. But it's happening often enough that I can probably tell with reasonable certainty if it's fixed.
The other thing I've done to test is just open/close the transport and see if there are any ssh processes lingering around.
It seems this does not fix the issue, but at least I think I have a decent way of reproducing it: launch a calculation, then disconnect the internet while the update task is running. Reconnect, and the update task will still fail.
The next thing to do is to check the state of the transport after an exception was thrown, and see if maybe we can fix it with some exception handling.
@sphuber: I'm looking at https://github.com/aiidateam/aiida_core/blob/develop/aiida/work/transports.py#L91. In the case where the transport raises while opening, shouldn't there be a mechanism to ensure that we don't use the same transport request again? Something like [...]. Or instead, check that the transport request hasn't excepted right after the [...].
Haven't completely wrapped my head around that code, though.
@greschd, I manually applied these 3 commits from your fork, and they seem to fix it! I will try running the actual aiida daemon with this patch.
Well, I guess it's possible this makes the proxy connection more stable. However, if I manually interrupt the connection, the error still persists, in that it requires a daemon restart. I think it might be because the future which fails is not removed. In that case, it would actually be the same SSH error showing up multiple times, not a different one.
I thought this was already done correctly. If the [...]
It should be quite straightforward to put something in the exception to see if it's always the same one.
I'm not used to tornado-style coroutines; does the [...]?
Yup, adding a timestamp to the error message confirms that it's definitely the same error being re-thrown.
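This matches how futures generally behave: a future that has been resolved with an exception re-raises the very same exception object every time its result is retrieved (tornado futures are presumably analogous in this respect). A minimal stdlib demonstration using `concurrent.futures` instead of tornado:

```python
from concurrent.futures import Future

# A future that "failed" once, as a cached transport request would.
fut = Future()
fut.set_exception(RuntimeError("Error reading SSH protocol banner"))

caught = []
for _ in range(2):
    try:
        fut.result()
    except RuntimeError as exc:
        caught.append(exc)

# Both retrievals raise the very same exception object, so a timestamp
# embedded in the message would be identical each time.
assert caught[0] is caught[1]
```

This is consistent with the observation above: if the failed future stays in the cache, every later request surfaces the identical error.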
Aah, I think [...]
I was thinking about that, but was afraid that would lead to other exceptions.
So might it be that as the [...]?
I think what we have to do is keep the logic but change the conditional on line 77. If we change [...]
Since the [...]
I think there are roughly three ways to fix this: [...]
From my initial tests (1st method) this seems to actually solve the problem 🎉. Let me know which option you prefer, and I'll add it to my branch with the proxy fixes.
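One such fix, invalidating a cached transport-request future once it has excepted so the next request opens a fresh transport, could be sketched like this. `TransportCache` is a made-up stand-in for the real queue in `aiida/work/transports.py`, not the actual aiida code:

```python
from concurrent.futures import Future


class TransportCache:
    """Sketch of a per-authinfo transport-request cache that discards
    a cached future again once it has failed (hypothetical)."""

    def __init__(self, open_transport):
        self._open = open_transport  # callable that may raise
        self._requests = {}          # authinfo -> Future

    def request(self, authinfo):
        future = self._requests.get(authinfo)
        # Reuse the cached future only if it is pending or succeeded;
        # a future that excepted is replaced so the open is retried.
        if future is None or (future.done() and future.exception() is not None):
            future = Future()
            self._requests[authinfo] = future
            try:
                result = self._open(authinfo)
            except Exception as exc:
                future.set_exception(exc)
            else:
                future.set_result(result)
        return future
```

Without the `future.exception() is not None` check, every later request would keep re-raising the exact same `SSHException` until a restart, which is the behaviour observed in the daemon.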
Do you think there's a reasonable way we can test this? I know we have the torquessh test, but how would we get that to predictably fail?
For the first one, do you mean the except of [...]? Regarding the second, I wouldn't do it there, because we would have to do it in all the transport tasks. They should not know anything about invalidating the transport requests. Not sure what you mean with the third.
Yeah, that makes sense. Of course when using it we will also get an error if the future excepted, but that's [...]
Are you confident about your commits regarding the [...]?
OK, so after @greschd fixed the [...]
On my machine, the daemon occasionally fails to open a transport to the remote computer. The issue appears semi-randomly (hard to reproduce consistently), but once the daemon encounters it, it seems to keep failing until it is restarted. This is not true when the issue is triggered manually from a verdi shell.
An excerpt from the daemon log: [...]
This continues for 3 more times until the exponential backoff procedure gives up.
Both `do_update` and `do_submit` have been shown to trigger the error.
A script to reproduce it from a shell is as follows (change the computer name as appropriate): [...]
Repeat it as necessary until an exception `SSHException: Error reading SSH protocol banner` is thrown.
Note that I changed the file `[...]/site-packages/paramiko/transport.py` around line 2052 in order to show the `socket.timeout` exception, before it is swallowed by paramiko and turned into an `SSHException`.