One 'bad' Transport thread hangs indefinitely at shutdown when multiple Transports are active #520
Comments
After further investigation it looks like it's hanging on line 204 under read_all in packet.py which is
It looks like it never makes it past this line after the close() method on the Transport object is called. If I explicitly call
before the end of the script it properly closes the Transport object and cleans up as expected. |
I have this same issue and believe it has to do with calling the close method from a __del__ method. Have you tried not calling paramiko's close from any __del__ method and letting Python's garbage collection take care of the thread? |
I don't have any __del__ methods in my classes |
I've altered the library for debugging purposes by changing the following lines in transport.py@1420:

    def stop_thread(self):
        self.active = False
        self.packetizer.close()
        while self.is_alive() and (self is not threading.current_thread()):
            print("Trying to kill thread.")
            self.join(10)
            print("It's alive: {}".format(self.is_alive()))

The result of this is the script outputs the following once it reaches the sys.exit(0):
It seems that the thread is deadlocked somehow and can't close. I'm not sure what the point of this loop is in transport.py, as nothing changes from one iteration to the next. Doing

    while thread.is_alive():
        thread.join(10)

does the same thing, AFAIK, as thread.join(). I could see value in the loop if there was a debug log there to let you know it's locked up, or a loop counter that eventually skips locked threads. Ideally I would propose that it be changed to

    def stop_thread(self):
        self.active = False
        self.packetizer.close()
        if self is not threading.current_thread():
            self.join(10)
            if self.is_alive():
                raise Exception("Timed out while trying to kill thread")

This doesn't solve whatever is causing the overall locked-up thread issue, but it is still a good solution to catch and kill any leftover threads. |
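The bounded-join idea proposed above can be tried outside paramiko. The following is only a minimal sketch: `StoppableWorker`, its `stuck` flag, and the `timeout` parameter are hypothetical names for illustration, not paramiko code. Note that raising does not terminate the stuck thread (Python has no API for killing a thread); it only stops the caller from waiting forever, and the thread is marked as a daemon here so the interpreter can still exit.

```python
import threading


class StoppableWorker(threading.Thread):
    """Hypothetical stand-in for a Transport-like thread (not paramiko code)."""

    def __init__(self, stuck=False):
        # daemon=True lets the interpreter exit even if this thread never ends.
        super().__init__(daemon=True)
        self._stop_event = threading.Event()
        self._stuck = stuck  # simulate a thread that ignores the stop request

    def run(self):
        if self._stuck:
            # Wait on an event nobody will ever set, like the hung read.
            threading.Event().wait()
        self._stop_event.wait()

    def stop_thread(self, timeout=10):
        self._stop_event.set()
        if self is not threading.current_thread():
            # Wait at most `timeout` seconds instead of looping forever.
            self.join(timeout)
            if self.is_alive():
                raise RuntimeError("Timed out while trying to stop thread")
```

A caller would wrap `stop_thread()` in `try`/`except RuntimeError` and log the leftover thread rather than hang at shutdown.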
What version of paramiko are you using? I think that change has already been made, more or less. |
What do you mean by more or less? I'm almost positive I'm on the master but if not it's 1.15.1. I took a look at the master and it looks like it will still hang the same way if a thread is hung. |
I checked, it's running on version 1.15.2 |
I'm experiencing what might be the same issue on Paramiko 1.15.2, although I don't think it's because the Transport object closed. At least, I've set a breakpoint in the Transport object's close() and it is not being hit. (So... maybe it's not?) Still, though, I'm otherwise getting stuck in the Strangely, this script runs just fine when I test it using most servers. This only occurs when I'm connecting to the servers I definitely 100% need to connect to in my office. As well, I can connect to these servers properly from other SFTP software like FileZilla. It's something in conjunction with paramiko and this server... |
Please try changing the |
This is possibly related to #109 |
#698 Looks like the same issue (Hang in packet.py |
@colinmcintosh thanks for your initial post using @botanicvelious mentions a workaround put in place further up the call stack in transport.py
Thanks! |
I wouldn't use my work-around as it doesn't work in all cases, it just works for our specific use case. |
This ticket accurately describes what I'm getting with fabric v2's integration suite and which made me generate fabric/fabric#1473. Doesn't matter how many actual hosts (it can be a bunch of connections to the same host) but the more independent Explicitly calling Having that be a hard requirement of using that API or Paramiko itself, feels super crummy to me, so I'd like this to get solved eventually. Going to poke at this somewhat this weekend myself to see if I can get to the root of it. Difficulty is, threading is often fraught with peril, changes that seem to fix one issue can easily spawn others; and in code this old there's always fun landmines. But this has been an intermittent issue forever so I'd like to at least try fixing it. |
Also of note is that this is all exactly the same stuff #380 describes, though they never got resolution. The only extra wrinkle I see is the assertion that the Still unclear if that's germane, tho as noted I need to doublecheck the different treatment of the various objects involved in the two scenarios. I'm 99% sure I've seen non-hanging threads also terminate via |
MOAR:
Took a different tack and looked at how exactly these bad threads are having
Other random notes:
|
Ways forward brainstorm:
Another idea occurred to me which I think I like better: keep the ultimatum-style join timeout, but set it to a much shorter value if - by the time we're calling it - the transport's socket and packetizer both appear to have entered their closed states. That detects the symptoms of this problem, lets the interpreter exit in a reasonable-to-humans amount of time, but limits the possibility of accidentally terminating "too early" in scenarios unlike the one I am testing under. A couple more wrinkles on this could also be:
|
Yea, mutating the loop to be "not current thread + not socket.closed + not packetizer.closed" and turning the join timeout down to 0.1s seems to do the trick pretty well. I don't have a great way of testing unusually-slow server endpoints right now (something I'd like to get sometime...) but I'm probably going to at least trial this change while I continue hacking on other things. Minor side note, socket objects have no public "is closed?" flag that I can see, but |
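One possible reading of the approach described in the two comments above is sketched below. This is an illustration under stated assumptions, not the committed paramiko change: `FakeEndpoint` and `Worker` are hypothetical stand-ins, and the plain `closed` attribute replaces whatever flag the real socket and packetizer expose. The loop joins in short intervals and gives up once both endpoints report closed, which is the symptom of the hang.

```python
import threading


class FakeEndpoint:
    """Hypothetical stand-in for a socket or packetizer with a closed flag."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Worker(threading.Thread):
    """Hypothetical Transport-like thread (not paramiko code)."""

    def __init__(self, stuck=False):
        super().__init__(daemon=True)
        self.sock = FakeEndpoint()
        self.packetizer = FakeEndpoint()
        self._wake = threading.Event()
        self._stuck = stuck

    def run(self):
        if self._stuck:
            # Simulates a read that never returns.
            threading.Event().wait()
        self._wake.wait()

    def stop_thread(self, poll=0.1):
        self._wake.set()
        self.packetizer.close()
        self.sock.close()
        while self.is_alive() and self is not threading.current_thread():
            self.join(poll)  # short join instead of a 10 second one
            if self.sock.closed and self.packetizer.closed:
                # Both ends are closed; waiting longer will not help.
                break
        # True if the thread actually finished, False if we gave up on it.
        return not self.is_alive()
```

The short interval keeps interpreter exit fast for humans, at the cost of possibly giving up on a thread that would have finished slightly later.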
FWIW problem + fix both seem present/workable on 1.16 too (so, this is in no way related to the switch to pyca/crypto - not that I thought it would've been). I committed a cleaned-up version of what I was testing with and forward-ported (1.16, 1.17, 2.0, master) - it's live now. If anyone watching this can give one of those branches a shot and give a thumbs-up (both re: fixing the issue, if they have it; or at least proving it's not making things worse for them) that would be cool. |
Was reminded by tef on twitter that I never chased down the assertions made in #380 about the issue potentially being how the socket in question is closed from a different thread during the Offhand (recalling that threading is not my expertise) if that's the true race condition, it would mean we do want "I'm |
Sadly doesn't seem like a workable avenue:
|
@bitprophet Should I file a separate issue for the hang that can occur in |
@sanseihappa Yea, that sounds like a good idea, seems orthogonal to me offhand & anything we can do to empower users to get exceptions instead of hangs, would be useful. Please file a PR - thanks! EDIT: if it wasn't obvious, please drop a ref to "#520" in the ticket body somewhere (not the title, GH doesn't scan those for some reason). |
@sanseihappa I tried switching exchange algorithms, but with no luck. Both sides agree on both exchange algorithms, but I get the same errors. |
I'm using 2.0.2 and I have hangs which I think are caused by this problem. Is there a workaround until a solution is released? |
I'm planning to pop out 2.0.3 today, which has a couple related fixes in it. |
Awesome! Thanks!
|
FTR this issue isn't marked as solved yet because it feels like one of those "many causes, similar symptoms" things. We'll see how 2.0.3 and friends do re: fixing it for involved users :) |
It didn't work for me. Looks like I'm having the same errors as previously.
|
I ran into a similar problem: the Linux kernel I was working on refused to close the socket, which would hang the transport thread. Maybe also check the OS you are running on and the kernel version. |
I'm using updated win 10. :( |
It seems I am running into the same issue when |
Hi all, I am not sure that I have exactly the same problem, but paramiko hangs in
Now, in order to investigate this I made a small function that walks all the threads in
EDIT: explicit The most interesting thing is that if you remove the last line you get the symptoms back... I have absolutely no clue or explanation why Let me know if I should create a new Issue Cheers, Andreas |
Found this thread through Google, I'm running Python 2.7.13 and running into this issue with a multiprocessed script that uses paramiko 2.1.1. Here's my log output where it hangs:
|
Hi, I have the same problem. Maybe somebody can help. Python 2.7.12. Server's parameters: |
@tyler-8 @paltsevpavel Try this: #109 (comment) |
Encountered something similar in Python 3.5.2. Blocks forever if an exception occurs in the auth_handler: For example:
and
I'm not running the latest paramiko (1.16.0-1), so I don't know if this has already been fixed, but if someone stumbles on this via Google, you can band-aid the issue by setting a timer in a thread to close the connection after a certain amount of time, so the deadlock won't block your program indefinitely:

    from contextlib import contextmanager
    from threading import Timer

    @contextmanager
    def close_conn_on_timeout(paramiko_ssh_client):
        # Close the client from a timer thread if the body takes over 10 seconds.
        timer = Timer(10, paramiko_ssh_client.close)
        timer.start()
        try:
            yield
        finally:
            # Cancel the timer even if the body raised.
            timer.cancel()

    with close_conn_on_timeout(paramiko_ssh_client):
        paramiko_ssh_client.connect(host)

|
I am also facing a similar issue, and I am running paramiko 2.4.1,
|
[MAINTAINER NOTE: a variant of this issue under Python 3 has been worked around as per this comment but it's presumably still at large for Python 2 in some scenarios.]
When running SSH connections to multiple devices, the script will sometimes hang indefinitely once sys.exit(0) is reached. It doesn't happen every time, but the more devices the script runs against, the more likely it is to happen. It seems to be related to the amount of time the script takes to run.
The last log message paramiko outputs is DEBUG:EOF in transport thread
Using the faulthandler module I dumped a stacktrace for when it hangs:
It looks like all but the LAST ONE of the resources close correctly. The same thing happens if .close() is explicitly called on the Transport object: it will sometimes hang at that .close() call indefinitely.
There is not one specific device it happens for either. I have tried many hosts and OS's with no specific one standing out as a problem.
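For anyone trying to capture a similar dump, the standard-library faulthandler module mentioned above can be used roughly as follows. This is a small sketch: the helper names `arm_hang_dump` and `dump_threads_now` and the 30-second timeout are assumptions made for illustration, not what the original script used.

```python
import faulthandler
import sys


def arm_hang_dump(timeout=30, file=sys.stderr):
    """Dump every thread's traceback to `file` if still running after `timeout` seconds."""
    faulthandler.dump_traceback_later(timeout, repeat=False, file=file, exit=False)


def dump_threads_now(file=sys.stderr):
    """Write the current traceback of all threads to `file` immediately."""
    faulthandler.dump_traceback(file=file, all_threads=True)
```

Calling `arm_hang_dump()` shortly before `sys.exit(0)` should print where each thread is blocked if shutdown stalls; `faulthandler.cancel_dump_traceback_later()` disarms it. The `file` argument must be a real file object with a file descriptor.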