Connection does not re-establish for 15 minutes when running on Linux #1848
Comments
Looks like we faced the same issue. After migrating our infrastructure to Linux and .NET 5, we started getting the following issue: some instances of the application hang on Redis operations for about 15 minutes, after which they recover without any external intervention. We are using client version 1.2.6. We have two dumps of the application from two different instances in this state. They contain sensitive data, so I cannot share them, but I can share some data needed for the investigation. The Started Requests Count drop in the middle is just a deploy.
We were able to reproduce the problem under controlled conditions. We will try to check how
Connection stalls lasting 15 minutes like this are often caused by very optimistic default TCP settings in some Linux distros (confirmed on CentOS so far). When a server stops responding without gracefully closing the connection, the client TCP stack will continue retransmitting packets for 15 minutes before declaring the connection dead and allowing the StackExchange.Redis reconnect logic to kick in.

With Azure Cache for Redis, it's fairly easy to reproduce this by rebooting nodes as mentioned above. In this case, the machine goes down abruptly and the Redis server isn't able to transmit a FIN packet to the client. The client TCP stack continues retransmitting on the same socket, hoping the server will come back up. Even when the node has rebooted and come back, it has no record of that connection, so it continues ignoring the client. If the client gave up and created a NEW connection, it would be able to resume communication with the server much sooner than 15 minutes.

As you found, there are TCP settings you can change on the client machine to force it to time out the connection sooner and allow for reconnect. In addition to tcp_retries2, you can try tuning the keepalive settings as discussed here: redis/lettuce#1428 (comment). It should be safe to reduce these timeouts to more realistic durations machine-wide, unless you have systems that actually depend on the unusually long retransmits.

An additional approach is using the ForceReconnect pattern recommended in the Azure best practices. If you're seeing issues like this, it's perfectly appropriate to trigger reconnect on RedisTimeoutExceptions in addition to RedisConnectionExceptions. Just don't be too aggressive with it, because an overloaded server can also result in persistent RedisTimeoutExceptions; recreating connections in that situation can cause additional server load and a cascading failure.

Unfortunately there's not much the StackExchange.Redis library can do about this situation, because the Linux TCP stack is hiding the lost connection. Detecting the stall at the library level would require making assumptions that would almost certainly lead to false positives in some scenarios. Instead, it's better for the client application to implement some detection/reconnection logic based on what it knows about its load and latency patterns.
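The 15-minute figure falls out of the kernel's retransmission backoff. A rough back-of-the-envelope sketch (assuming a 200 ms initial RTO that doubles up to a 120 s cap, matching common Linux defaults; the kernel's exact bookkeeping differs slightly):

```python
def retransmit_window(tcp_retries2: int, rto: float = 0.2, cap: float = 120.0) -> float:
    """Approximate seconds the kernel keeps retransmitting on a dead
    connection before declaring it failed, for a given net.ipv4.tcp_retries2.
    Assumes a 200 ms initial RTO doubling up to a 120 s cap (typical Linux
    defaults); the kernel's real accounting differs slightly."""
    total = 0.0
    for _ in range(tcp_retries2):
        total += rto
        rto = min(rto * 2, cap)
    return total

print(f"tcp_retries2=15 -> ~{retransmit_window(15):.0f}s")  # on the order of 13-15 minutes
print(f"tcp_retries2=5  -> ~{retransmit_window(5):.0f}s")   # a few seconds
```

With the default of 15 retries this lands in the 13–15 minute range reported in this thread, while 5 retries gives up within seconds and lets the reconnect logic take over.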
This may be because a connection in the connection pool is invalid; check the keep-alive time setting.
Closing this issue since I think this documents the issue and how to remedy it when encountered.
Currently, those of us using Azure App Service Linux containers cannot adjust the values, as we can neither pass parameters to the docker run command nor run in privileged mode. One of these is required to modify underlying values such as tcp_retries2.
@ShaneCourtrille fair point that the TCP configuration is not accessible in many client app environments. In those cases, it's best to implement a ForceReconnect pattern to detect and replace connections that have stalled. You can find examples of the pattern in the quickstart samples here: https://github.com/Azure-Samples/azure-cache-redis-samples/tree/main/quickstart
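The linked samples are C#. As a language-agnostic sketch of the ForceReconnect idea (hypothetical wrapper and names, not the StackExchange.Redis API): recreate the connection when it appears stalled, but rate-limit reconnects so an already overloaded server isn't hammered with fresh connections.

```python
import threading
import time

class ReconnectingClient:
    """Sketch of the ForceReconnect pattern (hypothetical wrapper, not the
    StackExchange.Redis API): replace a stalled connection on demand, but
    throttle reconnects to avoid a cascade under server overload."""

    def __init__(self, connect, min_interval_s: float = 60.0):
        self._connect = connect                # factory returning a fresh connection
        self._min_interval_s = min_interval_s  # minimum seconds between forced reconnects
        self._lock = threading.Lock()
        self._last_reconnect = float("-inf")   # never reconnected yet
        self._conn = connect()

    @property
    def connection(self):
        return self._conn

    def force_reconnect(self) -> bool:
        """Call after persistent timeout/connection errors. Returns True if a
        new connection was created, False if throttled."""
        with self._lock:
            now = time.monotonic()
            if now - self._last_reconnect < self._min_interval_s:
                return False                   # too soon; keep the current connection
            old, self._conn = self._conn, self._connect()
            self._last_reconnect = now
            if hasattr(old, "close"):
                old.close()                    # dispose the stalled connection
            return True
```

In a real client, the exception handler around Redis operations would call `force_reconnect()` when it sees repeated timeout or connection exceptions, then retry on `connection`.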
@philon-msft Are you aware of any implementation of the ForceReconnect pattern when DistributedCache is in use? I opened this issue with them, but it's not going to be looked at for a while.
@ShaneCourtrille For DistributedCache it looks like any ForceReconnect pattern will need to be implemented in aspnetcore code, so the issue you opened is the right long-term approach.
cross ref: dotnet/aspnetcore#45261
@philon-msft Are you of the opinion that the net.ipv4.tcp_retries2 setting is still needed in addition to the ForceReconnect pattern? In AKS, we need to allow privileged containers to make this change, and that goes against both what we really want to do and the Azure policy that is in place. We can also have workloads on those pods that may not benefit from this lower value, or that could be adversely affected by it. I see them both in the best practices, so I was not sure if that was a consideration here. Thanks for your insights!
@adamyager Configuring net.ipv4.tcp_retries2 would be an additional layer of defense in depth, but I agree it's not a good fit in AKS.
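Even in environments where the value can't be changed (App Service, unprivileged AKS pods), it can at least be inspected from inside the container to confirm how long the node would keep retransmitting. A minimal sketch (Linux-only path; returns None elsewhere):

```python
from pathlib import Path

def read_tcp_retries2(path: str = "/proc/sys/net/ipv4/tcp_retries2"):
    """Return the node's TCP retransmission limit, or None if unreadable
    (non-Linux host, restricted /proc, etc.)."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None

limit = read_tcp_retries2()
if limit is not None and limit > 5:
    print(f"tcp_retries2={limit}: a stalled socket may retransmit for ~15 minutes")
```

Lowering the value (where permitted) requires root, e.g. `sysctl -w net.ipv4.tcp_retries2=5`; in locked-down environments the ForceReconnect approach above is the practical alternative.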
This is great context. I do wonder if this could be called out in the best practices guide. Azure Support is pointing us to this, and I think the answer is more nuanced. I would speculate that most Linux workloads in Azure run on AKS rather than Linux IaaS. As was called out, Linux on App Service cannot have its TCP settings modified. So a large majority need a better solution!
@adamyager great point - I've created a work item to get the Azure Redis docs updated to include this suggestion for clients running in Kubernetes.
@philon-msft This is super helpful. Thank you. One other item: a peer of yours at Microsoft who is a heavy contributor to this client is suggesting that none of this is needed if we use a newer client SDK. At least that's how I read his post below. Here is what he says for reference.
@adamyager It's true that with recent versions of StackExchange.Redis it's very rare for client apps to experience the types of hangs or stalls that require ForceReconnect to recover. And improvements in the Azure Redis service have made it unlikely that clients will experience stalled sockets where default net.ipv4.tcp_retries2 settings would delay recovery. Years ago, we hadn't flushed out all those issues, so we recommended ForceReconnect more strongly.
@philon-msft great context again! Super helpful. We are a big Azure customer and speak to Product all the time on various topics. I would really like a session with you on our experience; Azure Cache for Redis is one topic I think we can have a great discussion on that will benefit both parties. We did speak with the company Redis via our Azure account team, but that was a bit less helpful as they are more focused on Enterprise Edition.
@adamyager can you shoot me an email to get in touch? nick craver @ microsoft without spaces, yada yada :)
@NickCraver Thanks so much. I have sent an email and look forward to our chat.
To simulate a network failure we reboot both the primary and replica nodes in an Azure Cache for Redis instance and have found that the library reacts differently based on the host it is deployed to.
Application

Expected Result
The application throws StackExchange.Redis.RedisConnectionException exceptions.

Windows & Docker on Windows Result
The application reconnects approximately 1 minute after the nodes went down, as expected.
Error:
Load Test Result

Linux Result
The application throws TimeoutExceptions and does not reconnect for 15 minutes.
Error:
Load Test Result

Observations
net.ipv4.tcp_retries2: this setting decides the total time before a connection failure is declared. Lowering this setting to 5, I found that the application threw the correct type of errors (StackExchange.Redis.RedisConnectionException) and reconnected approximately 1 minute after the nodes went down. The downside to making this change is that it is a TCP setting for the whole server, so if you have multiple applications running on that server, they are all affected. Setting net.ipv4.tcp_retries2 to 5 and running the application as a container did not reconnect quickly: updating the setting did not have any impact on when the application reconnected, and it still reconnected after 15 minutes.

Questions
Referenced Issues
#1782
#1822