
Sporadic "Connection not ready" exceptions with BlockingConnectionPool since 3.2.0 #1136

Closed
NirBenor opened this issue Feb 19, 2019 · 22 comments


@NirBenor

NirBenor commented Feb 19, 2019

Version:
3.2.0

Platform:
Python 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
CentOS Linux release 7.5.1804 (Core) on docker

The redis server is using the official docker image.
Redis server v=4.0.11 sha=00000000:0 malloc=jemalloc-4.0.3 bits=64 build=74253224a862200c

Description:
Since upgrading to 3.2.0, we started getting sporadic errors when getting connections from the pool.
[Sentry screenshot: redis.exceptions.ConnectionError: Connection not ready]

The code that is running looks like this:

    pool = BlockingConnectionPool(max_connections=config.REDIS_CONNECTIONS_PER_WORKER,
                                  host=config.REDIS_HOST, port=config.REDIS_PORT,
                                  db=0, timeout=config.REDIS_TIMEOUT)
    redis = StrictRedis(connection_pool=pool)
    redis.setex(...) / redis.get(...)

Additional information:
This code is running inside an eventlet gunicorn worker.
The server is very "network heavy" and opens lots of sockets (for example, for DNS queries).

127.0.0.1:6379> config get maxclients
1) "maxclients"
2) "10000"
127.0.0.1:6379> info clients
# Clients
connected_clients:104
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
@NirBenor NirBenor changed the title Sparse "Connection not ready" exceptions with BlockingConnectionPool since 3.2.0 Sporadic "Connection not ready" exceptions with BlockingConnectionPool since 3.2.0 Feb 19, 2019
@andymccurdy
Contributor

Do you have a traceback by chance?

How often is this happening?

@andymccurdy
Contributor

Also, what version of redis-py did you upgrade from? I assume you weren't seeing errors in that version?

@NirBenor
Author

NirBenor commented Feb 20, 2019

This does not occur in 3.1.0 (we reverted to that and we no longer get these errors).

It happens about 2 times per minute (visible in the first graph below), and we send about 1-2K commands/minute (the second graph shows "cache misses" only), so it occurs in less than 0.1% of commands.

Each bar in the graphs represents 20 minutes.
[Graph: exception rate over time]
[Graph: cache-miss rate over time]

The attached image in the issue contains a traceback from Sentry.
Here is a text version of a similar traceback:

[2019-02-18 09:18:57.161794] ERROR: hydrogen.app: An unexpected error has occurred on: 172.18.0.2 POST http /enrich/bulk?
Traceback (most recent call last):
  File "/opt/hydrogen/hydrogen/app.py", line 196, in enrich_bulk
    enriched_data = _enrich_bulk(dispatcher, bulk_data, use_mirror, metadata)
  File "/opt/hydrogen/hydrogen/app.py", line 117, in _enrich_bulk
    use_mirror=use_mirror, metadata=metadata)
  File "/opt/hydrogen/hydrogen/core/dispatcher.py", line 99, in enrich
    enrichment_data = enricher.enrich(data_type=data_type, value=value, use_mirror=use_mirror)
  File "/opt/hydrogen/hydrogen/core/enrichers/<>.py", line 55, in enrich
    cached_<> = self.redis.get(<>)
  File "/opt/conda/envs/hydrogen/lib/python3.6/site-packages/redis/client.py", line 1264, in get
    return self.execute_command('GET', name)
  File "/opt/conda/envs/hydrogen/lib/python3.6/site-packages/redis/client.py", line 772, in execute_command
    connection = pool.get_connection(command_name, **options)
  File "/opt/conda/envs/hydrogen/lib/python3.6/site-packages/redis/connection.py", line 1151, in get_connection
    raise ConnectionError('Connection not ready')
redis.exceptions.ConnectionError: Connection not ready

@andymccurdy
Contributor

Interesting. Do you happen to know if eventlet is patching the select module? Do you have select.poll available in your environment?

Would it be possible for you to run the following snippet in your environment (with eventlet activated)?

from redis.selector import has_selector
has_selector('poll')

This way we'll know whether redis-py is using select.select or select.poll to determine the healthiness of the socket.

@NirBenor
Author

Yes, eventlet does patch the select module.
I followed their monkey_patch code, and it looks like it ultimately uses select rather than poll:
[Screenshot of eventlet's monkey_patch source, showing select being used rather than poll]

I can replicate this in a different environment if you think it's necessary (it'll take some time, and I obviously won't touch the production server).

@andymccurdy
Contributor

That's in line with what I gathered from skimming the eventlet docs. It means that in your environment redis-py is likely using select.select to validate the health of a connection.

select.select has a number of issues, most notably that it can only poll file descriptors with numbers below ~1024 (FD_SETSIZE). If you're running over that limit, the current implementation simply reports that the connection isn't ready, which would explain why you're seeing the error. See here: https://github.com/andymccurdy/redis-py/blob/master/redis/selector.py#L47
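
To see that limit in action, here's a minimal sketch (not redis-py code; it assumes the process's open-file limit, ulimit -n, has been raised well above 1024 so the high-numbered descriptor can actually be created):

import select
import socket

# Occupy enough file descriptors that the next socket is assigned an fd >= 1024.
holders = [socket.socket() for _ in range(1100)]
probe = socket.socket()
try:
    # CPython's select.select() rejects descriptors >= FD_SETSIZE (~1024) with
    # "filedescriptor out of range in select()".
    select.select([probe], [], [], 0)
except ValueError as exc:
    print("select.select() cannot handle fd", probe.fileno(), "->", exc)
finally:
    probe.close()
    for s in holders:
        s.close()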

If you can correlate these errors around traffic spikes, that might further suggest that we're on the right track.

One thing you could try is to reduce the max_connections in the pool. Fewer connections means fewer file descriptors which should reduce the chance of hitting the select.select issue.

We could also make the selector more pluggable such that you could inject your own logic or turn off the health checks for your environment.

You could actually do this now with an (albeit ugly) monkey patch like so:

from redis.selector import SelectSelector

class MySelector(SelectSelector):
    def check_is_ready_for_command(self, timeout):
        # Always report the connection as ready, effectively disabling the check.
        return True


from redis import connection
connection.DefaultSelector = MySelector

Another possible solution is a simple flag on the connection pool to turn off the health checks. Enabling such a flag would cause the pool to behave the same as it did in 3.1.0.

@NirBenor
Author

NirBenor commented Feb 20, 2019

That makes a lot of sense. The server also uses lots of sockets for other tasks (besides connecting to redis) and has its file descriptor limit (ulimit -n) set higher than the default 1024, so lowering max_connections won't do the trick.

A flag on the connection pool would solve this in a cleaner way (for now I've tested the monkey patch above and it does work).
Thanks!

@gogglesguy

gogglesguy commented Feb 23, 2019

We were investigating a similar issue today regarding 3.2.0 and a monkey-patched select with eventlet. In our case, 3.2.0 solved many of the ConnectionErrors caused by redis closing inactive connections; however, they didn't disappear completely and still showed up on a regular basis.

We traced these latter errors down to a broken/incomplete eventlet select implementation.

As it turns out, when a socket is both readable and writable, eventlet's select will return it as either writable or readable, but never both at the same time (nor as an exception, for that matter). You can see this in the following code from eventlet:

https://github.com/eventlet/eventlet/blob/a915bb642dd6cd4e92c959addff30509977a637c/eventlet/green/select.py#L51-L61

Each event handler will switch back to the original greenlet without knowing about any other events on the same file descriptor.
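
For illustration, a small sketch of that behavior (it assumes eventlet is installed; which list comes back populated under the green select is effectively arbitrary):

import socket
from eventlet.green import select as green_select  # assumes eventlet is installed

# After the peer writes, sock_a is both readable (pending data) and writable
# (send buffer has room), so the stdlib select.select() would return it in
# both the read and write lists at once.
sock_a, sock_b = socket.socketpair()
sock_b.sendall(b"x")

r, w, e = green_select.select([sock_a], [sock_a], [sock_a], 1)
# Under eventlet's green select (before eventlet/eventlet#551) only one of
# r and w is populated -- whichever event handler fires first.
print("readable:", bool(r), "writable:", bool(w))

sock_a.close()
sock_b.close()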

As a workaround (with limited testing, but no more errors so far), we found that the following change may do the job:

def check_is_ready_for_command(self, timeout):
    # First check readability/errors with no timeout: pending data or an
    # error state means the connection isn't safe to hand out.
    r, _, e = select.select([self.sock], [], [self.sock], 0)
    if r or e:
        return False

    # Then wait up to `timeout` for the socket to become writable.
    r, w, e = select.select([self.sock], [self.sock], [self.sock], timeout)
    return bool(w and not r and not e)

We haven't had a ConnectionError due to inactive connections since then.

@andymccurdy
Contributor

@gogglesguy Thanks for digging into this. This is incredibly helpful.

My current thought is that we should make an EventletSelector implementation based on your patch above. Is there a canonical way to know that eventlet has patched a module in the current interpreter process? If so, we could modify DefaultSelector (https://github.com/andymccurdy/redis-py/blob/master/redis/selector.py#L186) to look for the presence of eventlet and return the EventletSelector.

@gogglesguy

Perhaps something like this:
https://github.com/ask/kombu/blob/master/kombu/syn.py
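
For reference, a rough detection sketch along those lines (the helper name is hypothetical; it relies on eventlet's eventlet.patcher.is_monkey_patched):

import socket
import sys

def running_under_eventlet():
    # Only bother if eventlet has actually been imported in this process.
    if 'eventlet' not in sys.modules:
        return False
    try:
        from eventlet.patcher import is_monkey_patched
    except ImportError:
        return False
    # True when eventlet.monkey_patch() has replaced the socket module.
    return is_monkey_patched(socket)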

@gogglesguy

Although the workaround worked fine on our servers, the unit test (test_selector) still failed randomly (about 1 run in 10). I came up with a slight variation that I'm testing now which doesn't fail the unit tests. I'll send a pull request.

@gogglesguy

Related eventlet issue and fix to select: eventlet/eventlet#551
Not sure how actively maintained eventlet is these days...

@gogglesguy

Having tested the eventlet patch over the last couple of days, it's clear to me that we should keep redis-py as is and fix the issue in eventlet.

@schlitzered

Why not simply drop the usage of select altogether and use socket.settimeout() on the socket instead? It will do the same thing. Using the select module in blocking code is more or less pointless; it is meant for async code that handles multiple sockets at once.

Using socket.settimeout() will raise an exception after the specified timeout. This achieves the same thing as using the select module, but in a cleaner way.

Using socket.settimeout() would also allow the removal of some clunky code around socket handling, and you would not need to take care of other async frameworks that monkey-patch core libraries.
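
A minimal sketch of that approach (illustrative only, not redis-py code; it assumes a Redis server listening on localhost:6379):

import socket

# Connect with a connect timeout, then arm a per-operation timeout on the socket.
sock = socket.create_connection(("localhost", 6379), timeout=5)
sock.settimeout(2.0)

try:
    # Redis sends nothing unsolicited, so this recv() blocks until the
    # 2-second timeout elapses and socket.timeout is raised.
    sock.recv(4096)
except socket.timeout:
    print("no data within 2s -- the timeout bounds every blocking call")
finally:
    sock.close()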

@andymccurdy
Contributor

@schlitzered I was under the impression that there are specific scenarios where a connection can be broken or stalled and reading from the connection blocks until the socket timeout is hit.

If this is true and we only rely on socket.settimeout() to control the blocking interval, we would never hit our global socket_timeout to indicate that the connection is having an issue. You would see this in the pubsub listen loop and possibly other places.

I can't recall a specific scenario that produces this behavior but I seem to remember it coming up at some point years ago. If I'm wrong, then your suggestion may very well work.

@schlitzered

I am not aware of such a scenario. If you use socket.settimeout(), it is guaranteed that a socket.timeout exception is raised after the specified timeout.

@andymccurdy
Contributor

@NirBenor @gogglesguy @schlitzered

I've put together a patch that uses nonblocking sockets to test the health of connections. This completely removes the usage of selectors. If you could help test before I merge this patch, I'd really appreciate it!

https://github.com/andymccurdy/redis-py/tree/nonblocking
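
Roughly, the idea behind such a non-blocking health probe looks like this (a sketch of the general technique, not the actual code on that branch):

import socket

def connection_looks_healthy(sock):
    """Probe an idle connection without blocking (illustrative sketch only)."""
    sock.setblocking(False)
    try:
        data = sock.recv(1)
    except BlockingIOError:
        # Nothing buffered and the peer hasn't closed the socket: looks healthy.
        return True
    except OSError:
        # Hard socket error: definitely not healthy.
        return False
    finally:
        sock.setblocking(True)
    # b'' means the server closed the connection; anything else is a stale,
    # unread reply. Either way, don't hand this connection out.
    return False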

@NirBenor
Author

NirBenor commented Jun 2, 2019

Sure, I'll update here in a few days.

@NirBenor
Author

NirBenor commented Jun 6, 2019

Appears to work fine 👍

@andymccurdy
Contributor

Great, thanks for testing @NirBenor. I'll get this merged in a few days.

@dralley

dralley commented Jan 6, 2020

Is this issue fixed then?

@andymccurdy
Contributor

I believe this to be fixed. Please open a new issue if anyone continues to see this behavior.
