Timeouts on newer RPyC versions #527
Script used to validate my changes:
#!/usr/bin/env bash
function ansi_wrap_msg() { printf "\e[1;${1}m${2}\e[m"; }
function blue_msg() { ansi_wrap_msg "34" "$1"; }
function red_msg() { ansi_wrap_msg "31" "$1"; }
function getpid() {
printf "%s" "$(ps -aux | grep 'versions/3[^/]*/bin/flexget' | grep -v vim | awk '{print $2}' | tr -d '\s')"
}
function testflex() {
timeout 5 flexget execute --tasks 'digest 1' && blue_msg "exit code $?: task-passed\n"
}
pid=$(getpid)
[[ "$pid" =~ ^[0-9]+$ ]] && kill -9 "$pid"
printf "" > "/tmp/rpyc.log"
flexget --loglevel DEBUG daemon start -d
testflex || red_msg "exit code $?: task-failed\n"
pid=$(getpid)
[[ "$pid" =~ ^[0-9]+$ ]] && kill -9 "$pid"
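For anyone who wants the same pass/fail/timeout check without bash, here is a minimal Python sketch of what `timeout 5 ...` does in the script above. The helper name `run_with_timeout` is hypothetical, and the two commands are stand-ins that only launch the Python interpreter; swap in the real `flexget execute` command line yourself.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and classify the outcome as 'passed', 'failed', or 'timed-out'."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        proc = subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        return "timed-out"
    return "passed" if proc.returncode == 0 else "failed"


# Stand-in commands (assumption): a no-op and a sleep that outlives its budget.
fast = run_with_timeout([sys.executable, "-c", "pass"], 5)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
print(fast, slow)
```

Unlike the shell version, this separates a hang from an ordinary non-zero exit, which is the distinction that matters for this issue.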
Thanks so much! |
Thank you! You gave me a much needed test case. I plan to do a release this week. Maybe tomorrow 👍 |
Hi, since I was experiencing this class of issues a while ago and put some effort into fixing it (#492), I thought it would make sense to shed some extra light on this because the "fix" actually unfixes a fixed issue. Consider thread A at Line 438 in ec5fbe5
Line 47 in ec5fbe5
Connection.serve .
The current "fix" contains the following path
If #492 led to Line 445 in 60608d4
on the other hand references a commit that isn't on the surface of any of the newer releases. So please revert the "fix" or tell me that I'm wrong :) * I was wrong about that, sorry! There are still many ways to lock up rpyc in a multithreaded environment because of its default behavior: any thread may process any request. You could try #507 and see if it helps. It is experimental for a few reasons, but not because it is unreliable. With an actual project using or needing it (other than my own), I would go the extra mile and write documentation, tests, benchmarks, ... |
I was definitely on 5.2.1 and 5.3.0 release versions when experiencing the issue.
I actually did try that. On 5.3.0 it crashes like this with bind_threads on, and just locks up with it off.
On 5.3.1, everything works fine with bind_threads on or off. |
Yes, I saw the code, didn't find it in the release commits and wrongly deduced that you must have used it. My bad! 5.3.1 with bind_threads off works, until it doesn't... Originally, #492 proposed a more fine-grained release strategy for the lock, which might have covered your use case, or not, who knows. It's probably worth figuring out what causes the timeout on 5.3.0 and fixing the issue properly, if possible. |
I'm assuming the not working times would be the same as in 5.1.0? We've been using that version (or older) for many years (and many users) with no issues. Not trying to deny that there are issues your fixes solve, just that there weren't any issues with the old way for our use case, but with the changes introduced by 5.2 (and reverted in 5.3.1) we have a consistent deadlock.
I agree. It could be that we are using rpyc "wrong," but in a way that never caused issues until now. I was having a hell of a time debugging or simplifying the exact cause though. |
I get the sentiment. It worked for your use case, and now it doesn't. For me it was entirely different though. I ran into the race condition I described soon after I learned about and started using rpyc, and I was able to reliably reproduce it. Also there are a significant number of issues opened, "fixed" and closed related to this race, just search for timeout, dead lock, race condition and the like, or look at the commit history related to
yes
sometimes I wish issues were consistent xD
I doubt it, there is no wrong. The fix for the race condition just buys correctness at the expense of multithreading-ness which in general is a weak point of rpyc. Thread binding was implemented to provide both. |
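To make the trade-off concrete, here is a toy model of the thread-binding idea using only the standard library. It is a sketch of the concept as discussed in this thread, not RPyC's implementation or API: `BoundDispatcher`, `fake_server`, and `client` are hypothetical names, and the "server" is just a thread that doubles a number. Each reply is routed to the inbox of the thread that sent the request, instead of being handled by whichever thread happens to be serving.

```python
import queue
import threading


class BoundDispatcher:
    """Toy model: route each reply to the thread that issued the request."""

    def __init__(self):
        self._inboxes = {}
        self._lock = threading.Lock()

    def register(self):
        # Each client thread gets its own inbox, keyed by its thread id.
        inbox = queue.Queue()
        with self._lock:
            self._inboxes[threading.get_ident()] = inbox
        return inbox

    def deliver(self, thread_id, reply):
        with self._lock:
            inbox = self._inboxes[thread_id]
        inbox.put(reply)


def fake_server(dispatcher, requests):
    # Hypothetical "remote side": doubles the value and tags the reply
    # with the id of the thread that asked for it.
    while True:
        item = requests.get()
        if item is None:
            break
        thread_id, value = item
        dispatcher.deliver(thread_id, value * 2)


def client(dispatcher, requests, value, out):
    inbox = dispatcher.register()
    requests.put((threading.get_ident(), value))
    # The client only ever waits on its own inbox.
    out[value] = inbox.get(timeout=5)


dispatcher = BoundDispatcher()
requests = queue.Queue()
server = threading.Thread(target=fake_server, args=(dispatcher, requests))
server.start()

out = {}
clients = [
    threading.Thread(target=client, args=(dispatcher, requests, v, out))
    for v in range(4)
]
for t in clients:
    t.start()
for t in clients:
    t.join()

requests.put(None)  # Tell the fake server to stop.
server.join()
print(out)
```

Because no thread can consume a reply meant for another, there is no window in which one thread sits in a timeout waiting for a reply that a different thread already took.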
Good news! It is possible to fix the issue by using the release strategy proposed in #492 The client requests It is not necessary to hold the lock for all paths of Even then, I just realized that the race was never truly fixed because it would require the lock to extend out to |
Sry for the spam! Just fyi: https://github.com/notEvil/rpyc/tree/benchmark/benchmark |
@gazpachoking yay!
@gazpachoking, I would rephrase "using rpyc wrong" as "not compensating for RPyC's shortcomings." A good library should be easy to use right and hard to use wrong. To be more specific, based on GH issues, RPyC's connections/protocol tend to struggle around exceptions/threading/coroutines/locks.
Me too xD. This is why I ask for unit tests to prevent regressions like this.
@notEvil linking to benchmarking is not spam, it is a godsend 💯.
From what @gazpachoking stated, their issue is resolved. If you could be more specific and provide some tests or a client-server example, stable or not, please open a new issue. My biggest hesitation with #492 is my lack of clarity around why some changes are introduced, but I'll give it another look. Namely, changes around the struct are unclear and I could have given the work more time/attention. Given the back and forth around code changes in those areas, I'd like to expand unit tests to be able to prove the fix works and prevent regressions. Pull requests with unit tests are always welcome, but I tend to push back if it looks like there are changes unrelated to the fix when it comes to threading... mostly because threading in RPyC is problematic as is so the bar for contribution is a bit higher.
Yeah, I agree that the current locking implementation locks too many operations. Protecting the right data/operations is non-trivial, to say the least.
I think thread binding is the way forward. Thread binding seems like it might make it easier to acquire the lock for fewer operations and protect the right data/operations.
I tend to be more stubborn when there aren't unit tests to prove code works or minimal test cases to demonstrate that a bug exists. From my perspective, unit tests and minimal test cases allow me to determine whether the issue is in the protocol or in how people are using RPyC. Of course, RPyC should be easier to use right and harder to use wrong, but it is not there yet. My frustrations were with RPyC's threading implementation/behavior/bugs, not you @notEvil. The biggest design issues that RPyC currently has are related to threading and coroutines; most open issues are related to this topic. With all that said, @notEvil let's start a new issue around #492 because it seems there are some improvements in that area we can make. |
@comrumino I agree, unit tests are important. Back then I thought the race is obvious and doesn't require explanation. And if it wasn't for thread binding, I probably would've tried again after a few weeks. from PR message:
e.g. no need to hold the lock while unpacking the entire payload. It could be huge ... :) Again, I probably should have been more verbose about this. |
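A small sketch of the "don't hold the lock while unpacking" point, using only the standard library. This is not RPyC's code: `ToyChannel` and `handle_one` are hypothetical, and `pickle` stands in for whatever serializer is in use. The lock covers only the read from the shared channel; decoding the (possibly huge) payload happens after the lock is released, so other threads can keep reading.

```python
import pickle
import threading


class ToyChannel:
    """Hypothetical stand-in for a shared connection; not RPyC's API."""

    def __init__(self, frames):
        self._frames = list(frames)
        self._recv_lock = threading.Lock()

    def recv_raw(self):
        # Only the access to shared state is protected by the lock.
        with self._recv_lock:
            return self._frames.pop(0) if self._frames else None


def handle_one(channel, results):
    raw = channel.recv_raw()  # Lock is held only inside recv_raw().
    if raw is None:
        return
    # Potentially slow unpacking runs with the lock already released.
    results.append(pickle.loads(raw))


frames = [pickle.dumps({"seq": i, "payload": list(range(1000))}) for i in range(8)]
channel = ToyChannel(frames)
results = []

threads = [
    threading.Thread(target=handle_one, args=(channel, results)) for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

seqs = sorted(r["seq"] for r in results)
print(seqs)
```

Every frame is still consumed exactly once, which is the only property the lock needs to guarantee in this toy setup.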
I'm having an issue on RPyC versions 5.2.1 and above which we didn't have earlier. (Flexget/Flexget#3601 for reference) It seems that after this change 94dbe36 we started hitting the timeout here
rpyc/rpyc/core/protocol.py
Line 387 in 94dbe36
rpyc/rpyc/core/protocol.py
Lines 445 to 446 in 18a44e5
I didn't open a PR because I have no idea what the intent of these things is, or whether doing this swap breaks something somewhere else. This could also be an issue with how I'm using RPyC, because it's not quite standard, and I'm having trouble creating a minimal example to reproduce. I'm hoping maybe there's something obvious to someone else about what's going wrong, or that the fix I've done is obviously correct to someone in the know. If not, I'll keep banging away to dig further and try to make a minimal reproduction.