Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.1.0 causing intermittent Connection closed by server error #1127

Closed
andyfoundi opened this issue Jan 29, 2019 · 35 comments
Closed

3.1.0 causing intermittent Connection closed by server error #1127

andyfoundi opened this issue Jan 29, 2019 · 35 comments

Comments

@andyfoundi
Copy link

Version:
redis-py: 3.1.0
redis: 3.2.4
django-redis: 4.10.0

Platform:
Python 2.7 on Alpine-Linux inside Docker

Description:
After upgrading from redis-py 3.0.1, our service becomes very unstable talking to the existing redis server. It generates around 30 'Connection closed by server.' errors in 10 minutes while the server is under ~20 QPS. The error is intermittent and I am not able to reproduce what exactly caused the errors. I tried restarting the redis server, rebuild our Docker images without any cache, and none of them worked.

After rolling back to redis-py==3.0.1, all errors are gone.

I understand that I don't really provide enough information to fix the problem, but I hope to at least highlight this problem and others might provide more.

Errors

File "lib/last_seen/models.py" in user_seen
  96.     seen = cache.get(cache_key)

File "/usr/lib/python2.7/site-packages/django_redis/cache.py" in _decorator
  39.             raise e.parent

Exception Type: ConnectionError at /helper/listing/list_591.5917864/
Exception Value: Error while reading from socket: (u'Connection closed by server.',)
@andymccurdy
Copy link
Contributor

Hi, thanks for reporting this. Would it be possible for you to re-test with the retry_on_timeout flag set to True?

3.1.0 fixed a bug that caused the retry logic to trigger even when it was turned off. I’m wondering if you were benefiting from the retry logic in previous versions and now aren’t using it.

@andyfoundi
Copy link
Author

I am trying the config on a lower QPS machine and I am still getting the error, though just one time. Looking at client.py it seems the fix you mentioned is for TimeoutError, but I am getting ConnectionError.

Any idea?

@andymccurdy
Copy link
Contributor

Do you have a traceback that shows specifically where in the redis.connection module the error is originating from?

@andyfoundi
Copy link
Author

Unfortunately I can't really traceback to where it happens in redis-py as it is called indirectly from the django_redis module.

This is where the error is raised: django_redis

It happens when I try to call cache.get(key) with Django's cache framework.

@andymccurdy
Copy link
Contributor

And you're sure the connections are being created with retry_on_timeout=True? That's the only connection-related change between 3.0.1 and 3.1.0.

@andyfoundi
Copy link
Author

andyfoundi commented Jan 30, 2019

After tracing the live code on prod, I can be sure that retry_on_timeout is set correctly

In [38]: connection_pool.connection_class
Out[38]: redis.connection.Connection

In [39]: connection_pool.connection_kwargs
Out[39]:
{u'db': 0,
 u'host': u'10.140.0.16',
 'parser_class': redis.connection.PythonParser,
 u'password': None,
 u'port': 6379,
 'retry_on_timeout': True}

In [40]: print connection_pool
ConnectionPool<Connection<host=10.140.0.16,port=6379,db=0>>

In [41]: connection_pool.connection_class
Out[41]: redis.connection.Connection

In [42]: connection = connection_pool.get_connection('get', 'abc')
In [43]: connection.retry_on_timeout
Out[43]: True

Given the above, I wonder if it is really the timeout problem. As I mentioned earlier, the exception is actually ConnectionError instead of timeout.

Also, the error disappears after I switch back to 3.0.1 again.

@andymccurdy
Copy link
Contributor

andymccurdy commented Jan 30, 2019

I think you are seeing the side effects of the retry_on_timeout fix in this commit: b940d07

The previous (<=3.0.1) condition check would retry the command on both TimeoutError and ConnectionError exceptions when the retry_on_timeout option was False. Obviously that conditional was wrong on multiple levels.

When you tested 3.1.0 with retry_on_timeout=True, I suspect that the one error you encountered was a legitimate ConnectionError, which was raised properly (ConnectionErrors are now never retried).

I went through all the changes between 3.0.1...3.1.0. The retry_on_timeout fix was the only thing that could be impacting this.

@andyfoundi
Copy link
Author

You are saying the ConnectionError was incorrectly ignored in 3.0.1, right? I think it makes sense.
In this case, Redis intermittently returns ConnectionError somehow, but it will success after retries.

Is it normal that Redis return Connection reset by peers intermittently? Any hint on how I can debug this?

Thanks a lot

@andymccurdy
Copy link
Contributor

The error message you're seeing is defined here: https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L59 and raised in 5 or so cases throughout the connection module.

What I suspect is happening is that you have connections that sit idle for long enough and are killed by either the Redis server or a network device. Eventually the connection pool provides that connection to you and you attempt to issue a command which fails because the connection is closed. In previous versions, the command would be silently retried because of the faulty retry_on_timeout logic.

You could try to use the socket_keepalive and socket_keepalive_options parameters to keep the connections healthy. They are passed directly to the Python socket object here: https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L529. I apologize there isn't better documentation on these settings.

I'm in the process now of attempting to improve the Connection and ConnectionPool classes. Making sure the ConnectionPool only returns healthy connections is one of my goals although that will take a little bit of time.

@andyfoundi
Copy link
Author

@andymccurdy This is great! I would have never come up with this insight myself. I will work on the approach you provide and let you know the result later. Thanks a lot

@bartels
Copy link

bartels commented Jan 30, 2019

I'm getting this error as well, also with retry_on_timeout. In my case the server has timeout 300 set in redis.conf. I can get this to happen consistently by setting a really low timeout 1.

I believe what's happening is the connection is timed out by the server, but isn't being removed from the client's connection pool. A subsequent request that attempts to use that connection triggers ConnectionError: Error while reading from socket: ('Connection closed by server.',)

Previously, version 3.0.1 would retry and succeed, presumably with another working connection in the pool. In 3.1.0 it fails with an exception. This resutls in a 500 error with django-redis.

Issue #306 seems like it could be involved here. If a connection that has timed out on the server is not removed from the pool until it is tried again and fails, we'd get this behavior. The retry_on_timeout behavior in 3.0.1 mitigated this.

@gogglesguy
Copy link

gogglesguy commented Jan 30, 2019

One could argue that ConnectionErrors on already existing connections should be retried to handle redis dropping idle connections, as there is a good chance it will recover. While perhaps ConnectionErrors on totally new connections should be passed on to the user (as was intended in 3.1.0).

@andyfoundi
Copy link
Author

I notice in my redis.conf (mostly default values) there are configs like this:

# On Linux, the specified value (in seconds) is the period used to send ACKs.
# Note that to close the connection the double of the time is needed.
# On other kernels the period depends on the kernel configuration.
tcp-keepalive 300

# Close the connection after a client is idle for N seconds (0 to disable)
timeout 120

It seems odd to me that the timeout is less than tcp-keepalive as it will timeout before it ever has a chance to send keepalive ACKs.

@bartels @andymccurdy Do you think settings these two values properly, say timeout=120; tcp-keepalive=60, is a way to fix this?

Sorry I don't really have an easy way to test it out before I ask.

@bartels
Copy link

bartels commented Jan 31, 2019

@andyfoundi Yeah, I believe you are correct. With your initial settings, ACK won't have been sent before timeout kicks in.

Maybe @andymccurdy can shed some more light, but testing this out, it appears timeout and tcp-keepalive operate separately, and timeout will close connections even if keepalive shows them as healthy. With your new values, the first ACK will be sent after 60s, but the server still considers the connection idle if it hasn't been used, so closes it after 120s.

I think the only time it makes sense to use timeout is if you definitely want connections closed after some period of inactivity, regardless of keepalive. So disabling entirely might make sense, or setting to a much higher value.

The larger problem is that anything that closes a connection on the server side (timeout, or redis server restart for example) results in ConnectionErrorin redis-py, as the connection pool is unware and blindly tries the connection again. It looks like pull request #886 tries to address this.

@andyfoundi
Copy link
Author

My Redis server is shared between multiple servers, so I suppose timeout is still needed just in case one instance has a spike and holds a large portion of the connections. keepalive on the other hand solves the problem for dead connections for scaled-down instances.

So in conclusion, I think the "only return valid connections" approach is still needed, or django_redis should handle the retry itself. Personally, I think redis-py to handle the valid connections makes more sense.

Any idea why #886 was never committed?

@andymccurdy
Copy link
Contributor

I tested #886 a bit and couldn't get it to work. The code in the PR itself doesn't work on OSX (which is described in the comments of that PR). The commenter who pointed that out posted a separate patch to make it more cross platform, but when I tested it I could never get it to report an error condition.

I spent some time this weekend working on this. I hope to have a branch published later today or tomorrow. @andyfoundi are you able to easily swap to a git version of redis-py to test against?

@matthiask
Copy link

I have seen those errors too and can easily switch to an editable requirement in the staging environment.

@andyfoundi
Copy link
Author

@andymccurdy I can swap redis-py to a specific version and test it
Thanks!

@andymccurdy
Copy link
Contributor

@matthiask @andyfoundi Can you guys give the "healthy_connections" branch (https://github.com/andymccurdy/redis-py/tree/healty_connections) a try please? Thanks!

@andyfoundi
Copy link
Author

@andymccurdy I already pushed the "healthy_connections" branch and so far I don't see any error. Let's wait a bit longer.

@matthiask
Copy link

I have just deployed the branch as well. Let's see what happens!

@matthiask
Copy link

Not a single ConnectionError since deploying the healthy_connections branch. Awesome! Thanks @andymccurdy

@andyfoundi
Copy link
Author

Same here. Not a single error. Awesome

@andymccurdy
Copy link
Contributor

Great, glad things are going well. I'm going to add an EPollSelector today or over the weekend, write a few more tests and then get this merged to master.

Thanks for helping test this stuff!

@andymccurdy
Copy link
Contributor

@andyfoundi @matthiask Have either of you seen any errors yet with the "healthy_connections" branch?

I just pushed a few bug fixes and better test coverage. Would it be possible to grab the latest code from this branch and test with your apps? Assuming all goes well with the testing I would like to get a new release out ASAP. Thanks!

@andyfoundi
Copy link
Author

Have not yet seen any error with the previous fix. I will update the new version tomorrow to test again. Thanks

@matthiask
Copy link

matthiask commented Feb 12, 2019

@andymccurdy Not a single error. I just deployed the latest code and will report back if anything goes wrong. Thanks!

--

UPDATE: And again, no error yet @andymccurdy

@andymccurdy
Copy link
Contributor

I assume everything is still good with the latest changes? If so I'll merge it today and create a new pypi release.

@matthiask
Copy link

@andymccurdy Yes, I've deployed the additional changes two days ago and haven't seen a single error since. I think that the errors appeared after a few minutes only before, so if there was anything amiss it surely would have shown up in the last days.

@andyfoundi
Copy link
Author

Sorry for the late reply. It took me sometime to finally put it up for tests.
After testing it for couple hours without a single error, it looks great.
Comparing to errors appearing right away as @matthiask mentioned, I am convinced the problem is fixed with this patch.

@andyfoundi
Copy link
Author

I deleted the previous post regarding the error showing up again. It was actually caused by another server that just updated py-redis to the latest version, not the server in question. I also push the fix to that server, and the problem does seem to go away now which double verified the cure.

@andymccurdy
Copy link
Contributor

Thanks so much for helping test this. I've just released 3.2.0 with these changes.

@andyfoundi
Copy link
Author

great! Thanks @andyfoundi

@yuanzhenbd
Copy link

File "test_tbmask.py", line 44, in mget
res = rc.mget(keys)
File "/root/.jumbo/lib/python3.6/site-packages/redis/client.py", line 1397, in mget
return self.execute_command('MGET', *args, **options)
File "/root/.jumbo/lib/python3.6/site-packages/redis/client.py", line 839, in execute_command
return self.parse_response(conn, command_name, **options)
File "/root/.jumbo/lib/python3.6/site-packages/redis/client.py", line 853, in parse_response
response = connection.read_response()
File "/root/.jumbo/lib/python3.6/site-packages/redis/connection.py", line 700, in read_response
response = self._parser.read_response()
File "/root/.jumbo/lib/python3.6/site-packages/redis/connection.py", line 457, in read_response
self.read_from_socket()
File "/root/.jumbo/lib/python3.6/site-packages/redis/connection.py", line 416, in read_from_socket
raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

who can help me?why i got this error?
centos 6.3
python 3.6
redis-py 3.3.0

@romainsacchi
Copy link

Using py-redis 4.0.2 here, and still the same issue. Will try to downgrade to 3.0.1...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants