System transport session.read() stuck indefinitely #233
Hey @netixx thanks for opening this! Are you able to share more of the frame dump stuff? I want to try to see which timeout flavor is being run. It should be the multiprocessing timeout... since we can't use signals from anywhere other than the main thread (also it should always be the multiprocessing timeout with system transport), but it's possible we have a bug in the selection of the timeout flavor. Also, just for sanity's sake, can you confirm that transport_timeout is non-zero? How difficult would it be to share a minimal reproducible compose file? That would be pretty awesome to help hunt this down, if it is at all possible! Totally unrelated, but, out of curiosity, why bother with nornir at all if you already have the means/setup to spawn more processes? Thanks again! Carl
Almost forgot to add -- I would bet ssh2/paramiko will work in this scenario... and if you are running this in containers it's likely that all the benefits/things I like about system transport are probably pretty pointless anyway (since you probably are not putting weird openssh things like proxy jump/control persist/custom kex/cipher/etc. in your container). So... I definitely want to try to find the root cause and fix it, but it may be worth giving one of the other transports a shot to see if that sorts things as well!
Hello, here is the full frame dump I gathered:
I can confirm that transport_timeout is non-zero; I am using the default value (30.0, checked with a debugger breakpoint). We are using fancy ssh features (we connect through a special proxy that basically only forwards stdin/stdout to/from the devices: https://github.com/ovh/the-bastion), which is one of the reasons we use scrapli. The basic principle of the app is as follows: a user submits a check on one or more devices in the network in the frontend, a worker consumes the tasks and performs the check, returning the result to the frontend via a queue. Nornir helps us streamline our code by taking care of:
Here is an MVP script:
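(The script itself did not survive in this copy of the thread. A minimal sketch of the shape described -- nornir-scrapli fetching config over netconf across many threads -- might look like the following; the hosts.yaml file and worker count are placeholder assumptions, not the original script.)

```python
from nornir import InitNornir
from nornir_scrapli.tasks import netconf_get_config

# num_workers > 1 matters here: the hang only shows up with multiple threads
nr = InitNornir(
    runner={"plugin": "threaded", "options": {"num_workers": 20}},
    inventory={
        "plugin": "SimpleInventory",
        "options": {"host_file": "hosts.yaml"},  # ~60 IOS-XR devices
    },
)

result = nr.run(task=netconf_get_config, source="running")
for host, multi_result in result.items():
    print(host, "failed" if multi_result.failed else "ok")
```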
Awesome, thanks for all the detail! I'll try to poke around with this over the next few days to see if I can reproduce and figure out what's going on!
Would it be possible to get logs from one of the failing devices as well? (scrapli logs, I mean). Curious to see what that looks like. Spent a bit of time poking around (and trying to re-remember how things work in the timeouts!). I think I was able to mostly recreate this by patching the read1 method and making it block. If that read blocks, the thread we run the read in (the one the decorator spawns) can never finish, and we have no way of externally killing it (afaik). So, in my testing I would make it block for, say, 10s but have a 2s timeout value -- after 10s the read1 thing would unblock, and then the decorator could finally raise the exception/close the connection. I'm not 100% sure how/where that read1 method can get stuck blocking in real life (logs may help with that?). If I connect to an XR box via netconf (just to be as similar as possible to your issue -- I imagine, as you said, that this happens regardless of the target device) and set a timeout lower than what the response should take to return to me, the decorator kills the connection, which causes read1 to not be able to read, and things work as you'd expect. So, not 100% sure I've got all this sorted in my head yet, but wanted to comment to keep ya in the loop, for posterity, and to ask about those logs. Carl
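(To illustrate the mechanism with a self-contained toy -- none of this is scrapli code -- a future-based timeout raises on schedule, but the worker thread stuck in the read cannot be killed, so anything that later waits on it blocks until the read returns on its own:)

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def blocked_read() -> bytes:
    # stands in for fileobj.read1() blocking on a wedged ssh process
    time.sleep(10)
    return b"too late"


pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(blocked_read)
try:
    future.result(timeout=2)
except FutureTimeout:
    print("timeout raised after 2s, as expected")

# ...but the worker thread is still alive and blocked; Python offers no way
# to kill it, so the executor cannot shut down until the read returns
pool.shutdown(wait=True)  # blocks here until the full 10s elapse
print("only reached once the blocked read finally returned")
```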
Just to make sure I understand: shouldn't we run the 'monitoring' loop in a dedicated subprocess then (this would need an overhaul, because we cannot spawn a process for each call to read()!)? That way we can kill the process if it gets stuck. Another idea: in the case of SystemTransport (which I think is also the cause of this issue), we could kill the underlying SSH process when the timeout fires; this should release the read() call? I gathered the scrapli logs by adding the thread id to the logging formatter and matching it to the thread that is stuck:
It seems that the device stops sending its capabilities mid-flight!
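(For anyone reproducing this correlation: Python's logging module exposes the thread id as %(thread)d and the thread name as %(threadName)s, so a formatter along these lines does it -- a sketch, not the exact formatter used above:)

```python
import logging

# include the thread id so scrapli log lines can be matched to a stuck thread
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(thread)d %(threadName)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("scrapli").setLevel(logging.DEBUG)
```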
Yeah, no way to kill the thread for sure -- as far as read actually blocking and being the thing that is causing this, that is my best guess!
Maybe?! 😁 If I'm understanding you correctly, are you basically saying to run the timeout wrapper as a process, such that we can kill the process and not have to worry about the thread deadlocking us? My initial reaction to this is: "that would be very costly". In fact, my personal "settings" when I use scrapli literally disable the transport timeout entirely, as this causes us to spawn many fewer threads, which makes it use less cpu and makes things a tiny bit faster. This could potentially also be a short-term fix for this situation.... 🤔
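(For reference, the "disable the transport timeout" setting mentioned here is just a driver argument; a sketch with placeholder host/auth values -- in scrapli the keyword is timeout_transport, and 0 disables it:)

```python
from scrapli.driver.core import IOSXRDriver

# timeout_transport=0 disables the transport timeout entirely, so no extra
# timeout thread is spawned per read
conn = IOSXRDriver(
    host="device.example.com",
    auth_username="user",
    auth_password="password",
    auth_strict_key=False,
    timeout_transport=0,
)
```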
I may be misremembering as I messed with all the timeouts quite a while ago, but here is my recollection:
Yep, good thought, but the
Interesting! I wonder if you could correlate that with So.... takeaways....
Thanks a bunch for sticking with me on this one -- interesting issue for sure! Carl
One clarification: I think we should kill the underlying ssh process in this function, before we wait on the threads (which are locked in read). I tested with transport_timeout=0, but I still get the locking issue. The threads are locked on the read1 call.
I did some more testing/looking around on this issue. The _handle_timeout function does in fact call close() on the transport, and for SystemTransport, it tries to kill the subprocess:
However, upon debugging, I found that I changed the process pool name in the decorator to However, upon looking at the logs in detail, I could see a log for the decorator timeout for this device (I tuned the log to include the device name as well):
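(The renamed-pool output itself is not preserved above. For context, giving an executor's threads an identifiable name is a single argument, which is presumably what made them easy to spot in the frame dump -- a sketch:)

```python
from concurrent.futures import ThreadPoolExecutor

# a recognizable thread_name_prefix makes the decorator's timeout threads
# easy to pick out in frame dumps and in %(threadName)s log fields
pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="scrapli_timeout")
```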
yeah, this makes sense in hindsight -- we would still be blocking on read1, and the channel timeout wrapper would still apply. I suspect if we set both transport and channel timeouts to 0 we would see it hang and/or have some other issue...
OK, tracking so far -- this feels "good" (in that that sounds like what it should be doing!)
If you see that log message, wouldn't that mean it was fired? Been a bit of a long day so maybe I'm not thinking clearly -- but it sounds like the important part of your most recent findings is that Thanks again for all your work on digging into this! Carl
Sorry, I made a typo: I meant that I did not see a log for the transport timeout in the case where read is locked (i.e. when looking back at the logs for a specific device which is in a locked state). To rephrase as a positive sentence: I now think the issue is that the _handle_timeout method is not called. I was doing more tests to find out what the state of the future would be (assuming
Testing with this server (changing the 'host' dynamically in nornir so that the target address is replaced by localhost:2222 for all hosts), I found that I couldn't reproduce the issue when I had a single worker (nornir My thought at this point is: why use a ThreadPoolExecutor for this purpose? Wouldn't it be simpler to launch a I had some success with (still needs some ironing out):
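(The snippet itself did not survive in this copy of the thread. Assuming the "simpler" approach is a threading.Timer -- which matches the "thread timing thing" referenced later -- a rough sketch might look like this; timeout_transport and close(force=True) are assumptions about the transport object's shape, not the actual scrapli API:)

```python
import threading
from functools import wraps

from scrapli.exceptions import ScrapliTimeout


def timer_timeout(wrapped_func):
    # run the read in the calling thread itself; a Timer fires in the
    # background and force-closes the transport, which kills the ssh process
    # and unblocks the stuck read -- no extra thread spawned per read
    @wraps(wrapped_func)
    def timeout_wrapper(transport, *args, **kwargs):
        fired = threading.Event()

        def _on_timeout() -> None:
            fired.set()
            transport.close(force=True)  # assumed force-close, sketched below

        timer = threading.Timer(transport.timeout_transport, _on_timeout)
        timer.start()
        try:
            result = wrapped_func(transport, *args, **kwargs)
        except Exception:
            if fired.is_set():
                raise ScrapliTimeout("transport timeout fired during read")
            raise
        finally:
            timer.cancel()
        if fired.is_set():
            raise ScrapliTimeout("transport timeout fired during read")
        return result

    return timeout_wrapper
```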
Along with using force=True in the close call (SIGHUP, SIGCONT, SIGINT do not cut it to kill the ssh process that is stuck) - but it needs more testing to make sure it works as expected:
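(And a sketch of that force flavor of close for the system transport; session here is assumed to be the ptyprocess.PtyProcess that system transport wraps -- kill/isalive/terminate are real ptyprocess methods, while the close signature is an assumption:)

```python
import signal


def close(self, force: bool = False) -> None:
    # the polite signals (SIGHUP/SIGCONT/SIGINT) were not enough to kill the
    # wedged ssh process, so a forced close goes straight to SIGKILL, which
    # finally unblocks the thread sitting in read1()
    if self.session.isalive():
        if force:
            self.session.kill(signal.SIGKILL)
        else:
            self.session.terminate(force=False)
```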
Thank you also for investigating this issue :)
Ok, cool makes more sense!
I'm able to reproduce without nornir and just core scrapli (no netconf) using pretty much your same setup -- with a single future submitted to a ThreadPoolExecutor it works as we want it to; as soon as there are two futures submitted, we block forever.
Literally didn't know this existed till you brought it up here -- but it looks nice! Doing a quick bit of playing around with it, it does seem like it works better... I see it raising the ScrapliTimeout exception, but it still looks like things are hanging somewhere/somehow. I also added the .... many minutes later.... 😁 I think maybe I've found out what's up... I went w/ classes for decorators for.... I'm honestly not 100% sure anymore -- I think it was partly just to break things up nicer but keep things grouped together, and maybe to retain some state about the type of timeout to run or something. Regardless... I'm still not 100% sure how, but what I think is happening is that we are trying to kill the session of the same process. This works of course if there is only one, but as soon as we add a second thread/process, for whatever reason we just keep trying to kill the same process twice. You can probably (hopefully!) confirm this by just dropping a print of The reason for bringing up the class decorators is that I did a super down-and-dirty version of this as a function and it looks like we do not block (this is without caring about the threading.timeout thing and just leaving the existing way; I don't think that has anything to do with it either way). You can try decorating the read of system transport with something like this to see if this solves the issue.
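(The snippet from this comment is also missing from this copy. A function-decorator version in the spirit described -- every piece of state is a local variable, so concurrent calls cannot clobber each other -- might look like this; timeout_transport and close() are again assumptions about the transport object, not the eventual scrapli implementation:)

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from functools import wraps

from scrapli.exceptions import ScrapliTimeout


def transport_timeout(message: str = "transport timeout fired"):
    def decorator(wrapped_func):
        @wraps(wrapped_func)
        def timeout_wrapper(transport, *args, **kwargs):
            timeout = transport.timeout_transport
            if not timeout:
                return wrapped_func(transport, *args, **kwargs)
            # all state is local to this call: a fresh pool and future per
            # read, nothing stored on a shared decorator instance
            pool = ThreadPoolExecutor(max_workers=1)
            future = pool.submit(wrapped_func, transport, *args, **kwargs)
            try:
                return future.result(timeout=timeout)
            except FutureTimeout:
                # closing kills *this* transport's ssh process, which in
                # turn unblocks the worker thread stuck in the read
                transport.close()
                raise ScrapliTimeout(message)
            finally:
                pool.shutdown(wait=False)

        return timeout_wrapper

    return decorator
```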
If that is the case then I'm fine with going to function decorators, but I really want to find out why the class way is being difficult. I dropped some prints right before calling the So... maybe/hopefully there is something we can clean up in ptyprocess to make this problem go away... That's all the brain power I have for the moment, will take another look later/tomorrow!
I'm like 99.9% sure I'm going to move to the thread timing thing you showed, if for no other reason than it seems less resource intensive than spawning threads for timeouts (but I'll test that theory first, of course!). I think in the case of solving this issue it's just kind of making it look different, but I don't think it's actually changing any behavior. I really want to find out what's up with the decorator bits, but at this point I'm leaning toward just overhauling the decorator(s) and moving them to functions. That feels kind of more "normal" for python anyway, so maybe this is just the motivation to do that. Probably won't look at this again till the weekend, but maybe you'll find something dumb I did and fix it all before then 🤓
I suppose the problem lies here in the original setup:
Using @TransportTimeout() creates a single instance of the class, shared by every call to the wrapped function; the function is then wrapped by the class's __call__ method. I think this means that When there is only one thread, this doesn't cause an issue. When there are multiple threads, it depends on the sequencing of the calls to If you want to keep the classes for looks or other reasons, then we only need to make sure to use only local variables (i.e. no references to self) in __call__ and the other methods of the class - though that defeats the purpose of the class :). I tested your modified decorator function and it worked, so I guess we managed to find the issue in the end! Regarding
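(A toy demonstration of that sharing, with no scrapli involved and all names made up: the decorator class is instantiated once at method-definition time, so every thread writes into the same attribute, and whoever wrote last is the connection that gets "timed out":)

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ClassTimeout:
    # instantiated exactly once, when the decorated method is defined, so
    # this single object's attributes are shared by every thread/connection
    def __call__(self, wrapped_func):
        def wrapper(conn):
            self.conn = conn   # every caller writes into the same slot
            time.sleep(0.05)   # window for another thread to clobber it
            # on a real timeout we would now kill self.conn's ssh process,
            # which may belong to a completely different connection
            return f"{conn.name}: would kill {self.conn.name}"
        return wrapper


class FakeConnection:
    def __init__(self, name: str) -> None:
        self.name = name

    @ClassTimeout()
    def read(self) -> str:
        return ""


conns = [FakeConnection("conn-a"), FakeConnection("conn-b")]
with ThreadPoolExecutor(max_workers=2) as pool:
    for line in pool.map(lambda c: c.read(), conns):
        print(line)  # typically both lines name the same connection
```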
Yeah, it all totally makes sense now! A hard way to learn it, but I guess I will remember it now! I think I got it in my head that every
Awesome. I'm going to mess around and see how I want to go forward with fixing this. Keep classes but do it right, or ditch them and just go with functions etc... will need to update tests and such so may take a bit to get dialed in.
Makes sense -- I'm not too flustered about it... especially at the moment. Longer term it could be neat to improve the efficiency of things here, because I think this is the weakest part of scrapli -- or, put a better way, the least efficient part. But.... that's for another day 😁 Will keep ya posted and defo will make sure that this gets sorted and merged before the next release (2022.06.30). Thank you a ton for all the help on debugging and talking through this, it's been a fun one! Carl
Going to close this -- if you have a few and can give the PR over in #237 a try, that would be great! Thanks again for all the help on this one -- I ended up just smashing everything down into simple function decorators, so it should just work ™️ now! Carl
I have, I believe, the exact same problem with asyncssh, as well as paramiko. I found the "easy" way to deadlock is to add an autocommand to a vty, and watch it die.
Infinite hangs, stuck in the read loop forever.
Describe the bug
The system transport communication with the device seems to get stuck indefinitely. This seems impossible, as the read is wrapped with @TransportTimeout; however, we could see it stuck for at least 30 minutes, which means that the exception does not fire. I think this has to do with the use of scrapli in threads, because I couldn't reproduce it with a simple script.
To Reproduce
We are using nornir-scrapli, getting config via netconf for around 60 devices.
For around 1 out of 3 runs, one of the threads gets stuck on the self.fileobj.read1(size) call. The behavior seems quite random (we couldn't single out a host).
The full setup is as follows: each scrapli connection runs over the system transport, which spawns an ssh process. We only have Cisco IOS-XR platforms at the moment (but I don't think it's relevant, as this seems a low-level issue).
Expected behavior
A timeout exception should have been raised by the call to session.read taking too long (more than 30s by default).
Stack Trace
OS