
Sporadic "Connection not ready" exceptions with BlockingConnectionPool since 3.2.0 #1136

Closed
NirBenor opened this issue Feb 19, 2019 · 22 comments


@NirBenor

NirBenor commented Feb 19, 2019

Version:
3.2.0

Platform:
Python 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
CentOS Linux release 7.5.1804 (Core) on docker

The redis server is using the official docker image.
Redis server v=4.0.11 sha=00000000:0 malloc=jemalloc-4.0.3 bits=64 build=74253224a862200c

Description:
Since upgrading to 3.2.0, we started getting sporadic errors when getting connections from the pool.
[Sentry screenshot: redis.exceptions.ConnectionError: Connection not ready]

The code that is running looks like this:

    pool = BlockingConnectionPool(max_connections=config.REDIS_CONNECTIONS_PER_WORKER,
                                  host=config.REDIS_HOST, port=config.REDIS_PORT,
                                  db=0, timeout=config.REDIS_TIMEOUT)
    redis = StrictRedis(connection_pool=pool)
    redis.setex(...) / redis.get(...)

Additional information:
This code is running inside an eventlet gunicorn worker.
The server is very "network heavy" and opens lots of sockets (for example, for DNS queries).

127.0.0.1:6379> config get maxclients
1) "maxclients"
2) "10000"
127.0.0.1:6379> info clients
# Clients
connected_clients:104
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
@NirBenor NirBenor changed the title Sparse "Connection not ready" exceptions with BlockingConnectionPool since 3.2.0 Sporadic "Connection not ready" exceptions with BlockingConnectionPool since 3.2.0 Feb 19, 2019
@andymccurdy
Contributor

Do you have a traceback by chance?

How often is this happening?

@andymccurdy
Contributor

Also, what version of redis-py did you upgrade from? I assume you weren't seeing errors in that version?

@NirBenor
Author

NirBenor commented Feb 20, 2019

This does not occur in 3.1.0 (we reverted to that and we no longer get these errors).

It happens about 2 times per minute (visible in the first graph below), and we send about 1-2K commands/minute (the second graph shows "cache misses" only), so it occurs in less than 0.1% of commands.

Each bar in the graphs represents 20 minutes.
[Graph: exception rate over time]
[Graph: cache-miss rate over time]

The attached image in the issue contains a traceback from Sentry.
Here is a text version of a similar traceback:

[2019-02-18 09:18:57.161794] ERROR: hydrogen.app: An unexpected error has occurred on: 172.18.0.2 POST http /enrich/bulk?
Traceback (most recent call last):
  File "/opt/hydrogen/hydrogen/app.py", line 196, in enrich_bulk
    enriched_data = _enrich_bulk(dispatcher, bulk_data, use_mirror, metadata)
  File "/opt/hydrogen/hydrogen/app.py", line 117, in _enrich_bulk
    use_mirror=use_mirror, metadata=metadata)
  File "/opt/hydrogen/hydrogen/core/dispatcher.py", line 99, in enrich
    enrichment_data = enricher.enrich(data_type=data_type, value=value, use_mirror=use_mirror)
  File "/opt/hydrogen/hydrogen/core/enrichers/<>.py", line 55, in enrich
    cached_<> = self.redis.get(<>)
  File "/opt/conda/envs/hydrogen/lib/python3.6/site-packages/redis/client.py", line 1264, in get
    return self.execute_command('GET', name)
  File "/opt/conda/envs/hydrogen/lib/python3.6/site-packages/redis/client.py", line 772, in execute_command
    connection = pool.get_connection(command_name, **options)
  File "/opt/conda/envs/hydrogen/lib/python3.6/site-packages/redis/connection.py", line 1151, in get_connection
    raise ConnectionError('Connection not ready')
redis.exceptions.ConnectionError: Connection not ready

@andymccurdy
Contributor

Interesting. Do you happen to know if eventlet is patching the select module? Do you have select.poll available in your environment?

Would it be possible for you to run the following snippet in your environment (with eventlet activated)?

from redis.selector import has_selector
has_selector('poll')

This way we'll know whether redis-py is using select.select or select.poll to determine the healthiness of the socket.

@NirBenor
Author

Yes, eventlet does patch the select module.
I followed their monkey_patch code, and it looks like it ultimately uses select rather than poll:
[Screenshot of eventlet's monkey_patch source, showing select being used rather than poll]

I can replicate this in a different environment if you think it's necessary (it'll take some time, and I obviously won't touch the production server).

@andymccurdy
Contributor

That's in line with what I gathered from skimming the eventlet docs. It means that in your environment redis-py is likely using select.select to validate the health of a connection.

select.select has a number of issues, most notably that it can only poll file descriptors with numbers below ~1024 (FD_SETSIZE). If you're running over that limit, the current implementation simply reports that the connection isn't ready, which would explain why you're seeing the error. See here: https://github.com/andymccurdy/redis-py/blob/master/redis/selector.py#L47
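
To see that limit in action, here's a minimal sketch (not redis-py code; it assumes the process's open-file limit, ulimit -n, has been raised well above 1024 so the high-numbered descriptor can actually be created):

import select
import socket

# Occupy enough file descriptors that the next socket is assigned an fd >= 1024.
holders = [socket.socket() for _ in range(1100)]
probe = socket.socket()
try:
    # CPython's select.select() rejects descriptors >= FD_SETSIZE (~1024) with
    # "filedescriptor out of range in select()".
    select.select([probe], [], [], 0)
except ValueError as exc:
    print("select.select() cannot handle fd", probe.fileno(), "->", exc)
finally:
    probe.close()
    for s in holders:
        s.close()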

If you can correlate these errors around traffic spikes, that might further suggest that we're on the right track.

One thing you could try is to reduce the max_connections in the pool. Fewer connections means fewer file descriptors which should reduce the chance of hitting the select.select issue.

We could also make the selector more pluggable such that you could inject your own logic or turn off the health checks for your environment.

You could actually do this now with an (albeit ugly) monkey patch like so:

from redis.selector import SelectSelector

class MySelector(SelectSelector):
    def check_is_ready_for_command(self, timeout):
        # Always report the connection as ready, effectively disabling the check.
        return True


from redis import connection
connection.DefaultSelector = MySelector

Another possible solution is a simple flag on the connection pool to turn off the health checks. Enabling such a flag would cause the pool to behave the same as it did in 3.1.0.

@NirBenor
Author

NirBenor commented Feb 20, 2019

That makes a lot of sense. The server also uses lots of sockets for other tasks (besides connecting to redis) and has its file descriptor limit (ulimit -n) set higher than the default 1024, so lowering max_connections won't do the trick.

A flag on the connection pool would solve this in a cleaner way (for now I've tested the monkey patch above and it does work).
Thanks!

@gogglesguy

gogglesguy commented Feb 23, 2019

We were investigating a similar issue today regarding 3.2.0 and a monkey-patched select with eventlet. In our case, 3.2.0 solved many of the ConnectionErrors caused by redis closing inactive connections; however, they didn't disappear completely and still showed up on a regular basis.

We traced these latter errors down to a broken/incomplete eventlet select implementation.

As it turns out, when a socket is both readable and writable, eventlet's select will return it as either writable or readable, but never both at the same time (nor as an exception, for that matter). You can see this in the following code from eventlet:

https://github.com/eventlet/eventlet/blob/a915bb642dd6cd4e92c959addff30509977a637c/eventlet/green/select.py#L51-L61

Each event handler will switch back to the original greenlet without knowing about any other events on the same file descriptor.
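
For illustration, a small sketch of that behavior (it assumes eventlet is installed; which list comes back populated under the green select is effectively arbitrary):

import socket
from eventlet.green import select as green_select  # assumes eventlet is installed

# After the peer writes, sock_a is both readable (pending data) and writable
# (send buffer has room), so the stdlib select.select() would return it in
# both the read and write lists at once.
sock_a, sock_b = socket.socketpair()
sock_b.sendall(b"x")

r, w, e = green_select.select([sock_a], [sock_a], [sock_a], 1)
# Under eventlet's green select (before eventlet/eventlet#551) only one of
# r and w is populated -- whichever event handler fires first.
print("readable:", bool(r), "writable:", bool(w))

sock_a.close()
sock_b.close()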

As a workaround (with limited testing, but no more errors so far), we found that the following change may do the job:

def check_is_ready_for_command(self, timeout):
    # First check readability/errors with no timeout: pending data or an
    # error state means the connection isn't safe to hand out.
    r, _, e = select.select([self.sock], [], [self.sock], 0)
    if r or e:
        return False

    # Then wait up to `timeout` for the socket to become writable.
    r, w, e = select.select([self.sock], [self.sock], [self.sock], timeout)
    return bool(w and not r and not e)

We haven't had a ConnectionError due to inactive connections since then.

@andymccurdy
Contributor

@gogglesguy Thanks for digging into this. This is incredibly helpful.

My current thought is that we should make an EventletSelector implementation based on your patch above. Is there a canonical way to know that eventlet has patched a module in the current interpreter process? If so, we could modify DefaultSelector (https://github.com/andymccurdy/redis-py/blob/master/redis/selector.py#L186) to look for the presence of eventlet and return the EventletSelector.

@gogglesguy

Perhaps something like this:
https://github.com/ask/kombu/blob/master/kombu/syn.py
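
For reference, a rough detection sketch along those lines (the helper name is hypothetical; it relies on eventlet's eventlet.patcher.is_monkey_patched):

import socket
import sys

def running_under_eventlet():
    # Only bother if eventlet has actually been imported in this process.
    if 'eventlet' not in sys.modules:
        return False
    try:
        from eventlet.patcher import is_monkey_patched
    except ImportError:
        return False
    # True when eventlet.monkey_patch() has replaced the socket module.
    return is_monkey_patched(socket)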

@gogglesguy

Although the workaround worked fine on our servers, the unit test (test_selector) still failed randomly (about 1 run in 10). I came up with a slight variation that I'm testing now which doesn't fail the unit tests. I'll send a pull request.

@gogglesguy

Related eventlet issue and fix to select: eventlet/eventlet#551
Not sure how actively maintained eventlet is these days...

@gogglesguy

Having tested the eventlet patch over the last couple of days, it's clear to me that we should keep redis-py as is and fix the issue in eventlet.

@schlitzered

Why not simply drop the usage of select altogether and use socket.settimeout() on the socket instead? It will do the same thing. Using the select module in blocking code is more or less pointless; it is meant for async code that handles multiple sockets at once.

Using socket.settimeout() will raise an exception after the specified timeout. This achieves the same thing as using the select module, but in a cleaner way.

Using socket.settimeout() would also allow the removal of some clunky code around socket handling, and you would not need to take care of other async frameworks that monkey-patch core libraries.
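
A minimal sketch of that approach (illustrative only, not redis-py code; it assumes a Redis server listening on localhost:6379):

import socket

# Connect with a connect timeout, then arm a per-operation timeout on the socket.
sock = socket.create_connection(("localhost", 6379), timeout=5)
sock.settimeout(2.0)

try:
    # Redis sends nothing unsolicited, so this recv() blocks until the
    # 2-second timeout elapses and socket.timeout is raised.
    sock.recv(4096)
except socket.timeout:
    print("no data within 2s -- the timeout bounds every blocking call")
finally:
    sock.close()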

@andymccurdy
Contributor

@schlitzered I was under the impression that there are specific scenarios where a connection can be broken or stalled and reading from the connection blocks until the socket timeout is hit.

If this is true and we only rely on socket.settimeout() to control the blocking interval, we would never hit our global socket_timeout to indicate that the connection is having an issue. You would see this in the pubsub listen loop and possibly other places.

I can't recall a specific scenario that produces this behavior but I seem to remember it coming up at some point years ago. If I'm wrong, then your suggestion may very well work.

@schlitzered

I am not aware of such a scenario. If you use socket.settimeout(), it is guaranteed that a socket.timeout exception is raised after the specified timeout.

@andymccurdy
Contributor

@NirBenor @gogglesguy @schlitzered

I've put together a patch that uses nonblocking sockets to test the health of connections. This completely removes the usage of selectors. If you could help test before I merge this patch, I'd really appreciate it!

https://github.com/andymccurdy/redis-py/tree/nonblocking
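
Roughly, the idea behind such a non-blocking health probe looks like this (a sketch of the general technique, not the actual code on that branch):

import socket

def connection_looks_healthy(sock):
    """Probe an idle connection without blocking (illustrative sketch only)."""
    sock.setblocking(False)
    try:
        data = sock.recv(1)
    except BlockingIOError:
        # Nothing buffered and the peer hasn't closed the socket: looks healthy.
        return True
    except OSError:
        # Hard socket error: definitely not healthy.
        return False
    finally:
        sock.setblocking(True)
    # b'' means the server closed the connection; anything else is a stale,
    # unread reply. Either way, don't hand this connection out.
    return False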

@NirBenor
Author

NirBenor commented Jun 2, 2019

Sure, I'll update here in a few days.

@NirBenor
Author

NirBenor commented Jun 6, 2019

Appears to work fine 👍

@andymccurdy
Contributor

Great, thanks for testing @NirBenor. I'll get this merged in a few days.

@dralley

dralley commented Jan 6, 2020

Is this issue fixed then?

@andymccurdy
Contributor

I believe this to be fixed. Please open a new issue if anyone continues to see this behavior.
