3.1.0 causing intermittent Connection closed by server error #1127
Version:
redis-py: 3.1.0
redis: 3.2.4
django-redis: 4.10.0

Platform:
Python 2.7 on Alpine Linux inside Docker

Description:
After upgrading from redis-py 3.0.1, our service became very unstable talking to the existing redis server. It generates around 30 "Connection closed by server." errors in 10 minutes while the server is under ~20 QPS. The error is intermittent and I am not able to reproduce what exactly causes it. I tried restarting the redis server and rebuilding our Docker images without any cache; none of it helped.

After rolling back to redis-py==3.0.1, all errors are gone.

I understand that I don't really provide enough information to fix the problem, but I hope to at least highlight it so that others might provide more.

Comments
Hi, thanks for reporting this. Would it be possible for you to re-test with retry_on_timeout=True? 3.1.0 fixed a bug that caused the retry logic to trigger even when it was turned off. I'm wondering if you were benefiting from the retry logic in previous versions and now aren't using it.
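For reference, enabling that flag looks something like this; a minimal sketch with placeholder connection details, not the reporter's actual setup:

```python
import redis

# retry_on_timeout=True makes redis-py retry a command once when it
# times out. Host and port below are placeholders.
client = redis.Redis(host='localhost', port=6379, retry_on_timeout=True)
client.set('greeting', 'hello')
print(client.get('greeting'))
```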
I tried the config on a lower-QPS machine and I am still getting the error, though just one time. Looking at client.py, it seems the fix you mentioned is for TimeoutError, but I am getting ConnectionError. Any idea?
Do you have a traceback that shows specifically where in the redis-py code the error is raised?
Unfortunately I can't really trace back to where it happens in redis-py, as it is called indirectly from the django_redis module. This is where the error is raised: django_redis. It happens when I try to call cache.get(key) with Django's cache framework.
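For context, a typical django-redis setup along these lines puts redis-py behind Django's cache API; the names and URL below are illustrative, not the reporter's actual settings:

```python
# settings.py -- illustrative django-redis configuration.
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',  # placeholder address
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
        },
    },
}

# Elsewhere in the app: every cache.get() goes through redis-py under the hood.
from django.core.cache import cache
value = cache.get('key')
```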
And you're sure the connections are being created with retry_on_timeout=True?
After tracing the live code on prod, I can be sure that retry_on_timeout=True is set on the connections.
Given the above, I wonder if it is really the timeout problem. As I mentioned earlier, the exception is actually ConnectionError, not TimeoutError. Also, the error disappears after I switch back to 3.0.1 again.
I think you are seeing the side effects of the retry_on_timeout fix. The previous (<=3.0.1) condition check would retry the command on both TimeoutError and ConnectionError, even with retry_on_timeout turned off. When you tested 3.1.0 with retry_on_timeout=True, commands were only retried on TimeoutError, so the ConnectionErrors that used to be hidden now surface. I went through all the changes between 3.0.1...3.1.0. The retry_on_timeout fix is the only one that looks related to what you're seeing.
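A paraphrased sketch of the two checks being described, simplified into standalone functions rather than redis-py's actual execute_command code:

```python
from redis.exceptions import ConnectionError, TimeoutError

# <=3.0.1 (paraphrased): only a TimeoutError with retry_on_timeout=False
# escapes; every ConnectionError falls through to a silent retry.
def should_retry_old(error, retry_on_timeout):
    if isinstance(error, TimeoutError) and not retry_on_timeout:
        return False
    return True  # retries ConnectionError unconditionally

# 3.1.0 (paraphrased): retry only a TimeoutError, and only when
# retry_on_timeout is enabled; ConnectionError always propagates.
def should_retry_new(error, retry_on_timeout):
    return retry_on_timeout and isinstance(error, TimeoutError)

assert should_retry_old(ConnectionError(), retry_on_timeout=False) is True
assert should_retry_new(ConnectionError(), retry_on_timeout=True) is False
```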
You are saying the ConnectionError was happening all along and was previously just retried silently? Is it normal that Redis returns ConnectionError this often? Thanks a lot
The error message you're seeing is defined here: https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L59 and raised in 5 or so places throughout the connection module. What I suspect is happening is that you have connections that sit idle long enough to be killed by either the Redis server or a network device. Eventually the connection pool provides that connection to you, and the command you issue fails because the connection is closed. In previous versions, the command would be silently retried because of the faulty retry_on_timeout logic. You could try the socket_keepalive and socket_keepalive_options settings so that idle connections are kept alive or detected as dead sooner. I'm in the process now of attempting to improve the Connection and ConnectionPool classes. Making sure the ConnectionPool only returns healthy connections is one of my goals, although that will take a little bit of time.
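A minimal sketch of the keepalive suggestion, assuming a Linux host (the TCP_KEEP* constants are platform-specific) and with made-up probe timings:

```python
import socket
import redis

# Made-up timings: first probe after 60s idle, then every 10s, give up
# after 3 failed probes. Tune these for your environment.
keepalive_options = {
    socket.TCP_KEEPIDLE: 60,
    socket.TCP_KEEPINTVL: 10,
    socket.TCP_KEEPCNT: 3,
}

client = redis.Redis(
    host='localhost',  # placeholder
    port=6379,
    socket_keepalive=True,
    socket_keepalive_options=keepalive_options,
)
```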
@andymccurdy This is great! I would never have come up with this insight myself. I will work on the approach you provided and let you know the result later. Thanks a lot
I'm getting this error as well, also with retry_on_timeout=True. I believe what's happening is that the connection is timed out by the server but isn't removed from the client's connection pool. A subsequent request that attempts to use that connection triggers ConnectionError: Connection closed by server. Previously, version 3.0.1 would retry and succeed, presumably with another working connection from the pool. In 3.1.0 it fails with an exception, which results in a 500 error with django-redis. Issue #306 seems like it could be involved here: if a connection that has timed out on the server is not removed from the pool until it is tried again and fails, we'd get exactly this behavior.
One could argue that ConnectionErrors on already-existing connections should be retried to handle redis dropping idle connections, since there is a good chance the retry will recover, while ConnectionErrors on totally new connections should be passed on to the user (as was intended in 3.1.0).
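As an application-level stopgap in the spirit of that argument, one could retry a command once on ConnectionError; get_with_retry below is a hypothetical helper, not part of redis-py:

```python
import redis
from redis.exceptions import ConnectionError

# Hypothetical helper: retry a read once when a stale pooled connection
# fails. redis-py disconnects the bad connection on error, so the retry
# is served by a fresh (or healthy) connection from the pool.
def get_with_retry(client, key):
    try:
        return client.get(key)
    except ConnectionError:
        return client.get(key)

client = redis.Redis(host='localhost', port=6379)  # placeholder address
value = get_with_retry(client, 'some-key')
```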
I notice in my redis server config there are two settings that look relevant here: timeout (how long the server lets a client connection sit idle before closing it) and tcp-keepalive (how often keepalive probes are sent on client connections).
It seems odd to me that the defaults let idle connections be killed without the client ever noticing. @bartels @andymccurdy Do you think setting these two values properly, say a keepalive interval shorter than the server timeout, would avoid the error? Sorry, I don't really have an easy way to test it out before I ask.
@andyfoundi Yeah, I believe you are correct. With your initial settings, a keepalive ACK won't have been sent before the server timeout kicks in. Maybe @andymccurdy can shed some more light, but testing this out, it appears keepalive traffic doesn't prevent the server from closing a connection once its timeout is reached. I think the only time it makes sense to use socket_keepalive is when the server timeout is disabled, so that connections dropped elsewhere in the network still get detected. The larger problem is that anything that closes a connection on the server side (timeout, or a redis server restart, for example) results in a ConnectionError the next time the pool hands out that connection.
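For what it's worth, both server settings can be inspected from the client via CONFIG GET; a small sketch with a placeholder address:

```python
import redis

client = redis.Redis(host='localhost', port=6379)  # placeholder address

# Inspect the two server-side settings discussed above.
print(client.config_get('timeout'))        # e.g. {'timeout': '300'}
print(client.config_get('tcp-keepalive'))  # e.g. {'tcp-keepalive': '0'}

# Disabling the idle timeout (illustration only; on a shared server this
# affects every client, as noted below).
# client.config_set('timeout', 0)
```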
My Redis server is shared between multiple servers, so I suppose changing its timeout just for this client isn't really an option. So in conclusion, I think the "only return valid connections" approach is still needed, or retrying ConnectionErrors on existing connections as suggested above. Any idea why #886 was never committed?
I tested #886 a bit and couldn't get it to work. The code in the PR itself doesn't work on OSX (which is described in the comments of that PR). The commenter who pointed that out posted a separate patch to make it more cross-platform, but when I tested it I could never get it to report an error condition. I spent some time this weekend working on this and hope to have a branch published later today or tomorrow. @andyfoundi, are you able to easily swap to a git version of redis-py to test against?
I have seen those errors too and can easily switch to an editable requirement in the staging environment.
@andymccurdy I can swap redis-py to a specific version and test it.
@matthiask @andyfoundi Can you guys give the "healthy_connections" branch (https://github.com/andymccurdy/redis-py/tree/healty_connections) a try please? Thanks!
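For anyone else joining in, installing straight from that branch can be done with pip's git support; a hypothetical one-liner built from the URL above (branch spelling kept as-is):

```
pip install git+https://github.com/andymccurdy/redis-py.git@healty_connections
```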
@andymccurdy I have deployed the "healthy_connections" branch and so far I don't see any errors. Let's wait a bit longer.
I have just deployed the branch as well. Let's see what happens!
Not a single ConnectionError so far!
Same here. Not a single error. Awesome!
Great, glad things are going well. I'm going to add an EPollSelector today or over the weekend, write a few more tests and then get this merged to master. Thanks for helping test this stuff!
@andyfoundi @matthiask Have either of you seen any errors yet with the "healthy_connections" branch? I just pushed a few bug fixes and better test coverage. Would it be possible to grab the latest code from this branch and test with your apps? Assuming all goes well with the testing, I would like to get a new release out ASAP. Thanks!
Have not yet seen any errors with the previous fix. I will update to the new version tomorrow to test again. Thanks
@andymccurdy Not a single error. I just deployed the latest code and will report back if anything goes wrong. Thanks! -- UPDATE: And again, no error yet @andymccurdy
I assume everything is still good with the latest changes? If so I'll merge it today and create a new PyPI release.
@andymccurdy Yes, I deployed the additional changes two days ago and haven't seen a single error since. Before, the errors appeared after only a few minutes, so if anything were amiss it surely would have shown up over the last few days.
Sorry for the late reply. It took me some time to finally put it up for tests.
I deleted the previous post regarding the error showing up again. It was actually caused by another server that had just updated redis-py to the latest version, not the server in question. I pushed the fix to that server as well, and the problem does seem to go away now, which double-verified the cure.
Thanks so much for helping test this. I've just released 3.2.0 with these changes.
Great! Thanks @andyfoundi
File "test_tbmask.py", line 44, in mget who can help me?why i got this error? |
Using redis-py 4.0.2 here, and still the same issue. Will try to downgrade to 3.0.1...