Sporadic "Connection not ready" exceptions with BlockingConnectionPool since 3.2.0 #1136
Comments
Do you have a traceback by chance? How often is this happening?
Also, what version of redis-py did you upgrade from? I assume you weren't seeing errors in that version?
Interesting. Do you happen to know if eventlet is patching the `select` module? Would it be possible for you to run the following snippet in your environment (with eventlet activated)?

```python
from redis.selector import has_selector
has_selector('poll')
```

This way we'll know whether redis-py is using `poll()` or falling back to `select()`.
It's in line with what I was skimming in the eventlet docs. So that means that in your environment redis-py is likely using the `select()`-based selector.

If you can correlate these errors with traffic spikes, that might further suggest that we're on the right track. One thing you could try is to reduce the number of sockets/file descriptors the process keeps open.

We could also make the selector more pluggable such that you could inject your own logic or turn off the health checks for your environment. You could actually do this now with an (albeit ugly) monkey patch like so:

```python
from redis.selector import SelectSelector

class MySelector(SelectSelector):
    def check_is_ready_for_command(self, timeout):
        return True

from redis import connection
connection.DefaultSelector = MySelector
```

Or another possible solution is a simple flag on the connection pool to turn off the health checks. Enabling such a flag would cause the pool to behave the same as it did in 3.1.0.
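For illustration, a hedged sketch of how the monkey patch above might be applied in practice: rebind `connection.DefaultSelector` before any clients or pools are created, so that connections made afterwards presumably pick up the overridden selector. The Redis address below is hypothetical.

```python
# Hedged sketch: the same override as the snippet above, applied before
# creating a client so that new connections use it. Address is hypothetical.
import redis
from redis import connection
from redis.selector import SelectSelector

class MySelector(SelectSelector):
    def check_is_ready_for_command(self, timeout):
        # Skip the readiness/health check entirely, as suggested above.
        return True

connection.DefaultSelector = MySelector   # apply before any connections exist

client = redis.Redis(host='localhost', port=6379)
client.ping()   # connections created from here on should use MySelector
```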
That makes a lot of sense. The server is also using lots of sockets for other tasks (other than connecting to Redis) and has the file descriptor limit raised. Enabling such a flag in the connection pool would solve this in a cleaner way (for now I tested the monkey patch above and it does work).
We were investigating a similar issue today regarding 3.2.0 and a monkey-patched `select` with eventlet. In our case, 3.2.0 solved many of the ConnectionErrors caused by Redis closing inactive connections; however, they didn't disappear completely and would still show up on a regular basis.

We traced these latter errors down to a broken/incomplete eventlet `select` implementation. As it turns out, when a socket is in both a readable and a writable state, `select` will randomly return it as either writable or readable, but never both at the same time (or as an exception, for that matter). You can see this in the following code from eventlet: each event handler will switch back to the original greenlet without knowing about any other events on the same file descriptor.

As a workaround, with limited testing but so far no more errors, we found that the following change may do the job:

We haven't had a ConnectionError due to inactive connections since then.
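For readers following along, here is a minimal sketch of the eventlet behavior described above. It assumes eventlet is installed and a Redis server is reachable at a hypothetical localhost:6379.

```python
# Sketch of the quirk described above: with eventlet's monkey-patched select,
# a socket that is both readable and writable may show up in only one of the
# returned lists per call, while the stdlib select reports it in both.
import eventlet
eventlet.monkey_patch()

import select
import socket

sock = socket.create_connection(('localhost', 6379))  # hypothetical Redis address
sock.sendall(b'PING\r\n')   # inline command; Redis will reply +PONG
eventlet.sleep(0.1)         # give the reply time to arrive (socket becomes readable)

# The socket is now both writable and readable; under eventlet this call may
# report it as only one of the two.
readable, writable, _ = select.select([sock], [sock], [], 0)
print('readable:', bool(readable), 'writable:', bool(writable))
```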
@gogglesguy Thanks for digging into this. This is incredibly helpful. My current thought is that we should make an
Perhaps something like this:
Although the workaround worked fine on our servers, the unit test (test_selector) still randomly failed (roughly 1 run out of 10). I came up with a slight variation that I'm testing now, which doesn't fail the unit tests. Will send a pull request.
Related eventlet issue and fix to select: eventlet/eventlet#551
We have been testing the eventlet patch for the last couple of days, and it is clear to me that we should keep redis-py as-is and fix the issue in eventlet.
Why not simply drop the usage of select altogether and simply use socket.settimeout() on the socket? This will do the same thing. Using the select module in blocking code is more or less pointless; it is meant for async code that handles multiple sockets at once. Using socket.settimeout() will raise an exception after the specified timeout, which achieves the same thing as using the select module, but in a cleaner way. It would also allow the removal of some clunky code around the socket handling, and you would not need to take care of other async frameworks that monkey-patch core libraries.
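A brief sketch of the socket.settimeout() behavior described above (this is not redis-py code; the address is hypothetical):

```python
# socket.settimeout() bounds how long any subsequent blocking operation may
# take; if nothing arrives in time, socket.timeout is raised.
import socket

sock = socket.create_connection(('localhost', 6379), timeout=5)  # hypothetical address
sock.settimeout(2)              # applies to every subsequent send()/recv()
try:
    data = sock.recv(4096)      # blocks for at most 2 seconds
except socket.timeout:
    print('no data within 2 seconds; the connection may be stalled or dead')
```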
@schlitzered I was under the impression that there are specific scenarios where a connection can be broken or stalled and reading from the connection blocks until the socket timeout is hit. If this is true and we only rely on socket timeouts, a dead connection wouldn't be noticed until the full timeout expires. I can't recall a specific scenario that produces this behavior, but I seem to remember it coming up at some point years ago. If I'm wrong, then your suggestion may very well work.
I am not aware of such a scenario. If you use socket.settimeout(), it is guaranteed that a socket.timeout exception is raised after the specified timeout.
@NirBenor @gogglesguy @schlitzered I've put together a patch that uses nonblocking sockets to test the health of connections. This completely removes the usage of selectors. If you guys could help test before I merge this patch, I'd really appreciate it! https://github.com/andymccurdy/redis-py/tree/nonblocking
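Not the actual patch from that branch, but a rough sketch of the general technique it describes: checking connection health with a non-blocking read instead of a selector. The helper name and details below are assumptions, not redis-py APIs.

```python
# Hypothetical helper illustrating a non-blocking health check: peek at the
# socket without blocking. No data available (BlockingIOError) means the
# connection is idle and looks healthy; an empty read means the peer closed it.
import socket

def connection_looks_healthy(sock):
    sock.setblocking(False)
    try:
        data = sock.recv(1, socket.MSG_PEEK)  # peek so pending replies aren't consumed
        return data != b''                    # b'' means the remote end closed the socket
    except BlockingIOError:
        return True                           # nothing to read right now: healthy
    except OSError:
        return False                          # any other socket error: treat as dead
    finally:
        sock.setblocking(True)
```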
Sure, I'll update here in a few days.
Appears to work fine 👍
Great, thanks for testing @NirBenor. I'll get this merged in a few days.
Is this issue fixed then?
I believe this to be fixed. Please open a new issue if anyone continues to see this behavior.
Version:
3.2.0
Platform:
Python 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
CentOS Linux release 7.5.1804 (Core) on docker
The redis server is using the official docker image.
Redis server v=4.0.11 sha=00000000:0 malloc=jemalloc-4.0.3 bits=64 build=74253224a862200c
Description:
Since upgrading to 3.2.0, we have been getting sporadic "Connection not ready" errors when getting connections from the pool.
The code that is running looks like this:
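For illustration only (not the reporter's exact code): a minimal sketch of a BlockingConnectionPool setup of this kind, as it might be used from an eventlet gunicorn worker. Host, port, key names, and pool limits are hypothetical.

```python
# Hypothetical sketch of a BlockingConnectionPool setup; all parameters are
# placeholders, not the reporter's actual values.
import redis

pool = redis.BlockingConnectionPool(
    host='localhost',     # hypothetical Redis address
    port=6379,
    max_connections=50,   # hypothetical pool size
    timeout=20,           # seconds to wait for a free connection from the pool
)
client = redis.Redis(connection_pool=pool)

def handle_request():
    # In the reported setup this would run inside an eventlet gunicorn worker.
    return client.get('some-key')   # hypothetical key
```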
Additional information:
This code is running inside an eventlet gunicorn worker.
The server is very "network heavy" and opens lots of sockets (for example, for DNS queries).