Managing Redis Connection Loss on Azure Linux App Service #2595
Comments
cc @philon-msft for similar reports of the 15-minute issue here
@mgravell am I understanding the UseForceReconnect code correctly? Any ideas for detecting/replacing stalled connections faster in this situation?
That's how it seemed to us. Out of desperation, we modified the IDistributedCache implementation to account for RedisTimeoutExceptions, but this is obviously not a long-term solution. Luckily, we never encounter any timeouts during normal operation. Couldn't there be a problem with the library itself? We also have Java applications using Jedis, not on the same App Plan but connected to the same Redis server, and the problem never occurs there.
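For illustration, here is a minimal sketch of the kind of wrapper we mean: a hypothetical decorator over the registered IDistributedCache that degrades a RedisTimeoutException on reads into a cache miss (the class name and fallback behavior are illustrative, not our exact code):

```csharp
using Microsoft.Extensions.Caching.Distributed;
using StackExchange.Redis;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical decorator: treat a RedisTimeoutException on reads as a
// cache miss instead of letting it bubble up and fail the request.
public sealed class TimeoutTolerantCache : IDistributedCache
{
    private readonly IDistributedCache _inner;

    public TimeoutTolerantCache(IDistributedCache inner) => _inner = inner;

    public byte[]? Get(string key)
    {
        try { return _inner.Get(key); }
        catch (RedisTimeoutException) { return null; } // degrade to a miss
    }

    public async Task<byte[]?> GetAsync(string key, CancellationToken token = default)
    {
        try { return await _inner.GetAsync(key, token); }
        catch (RedisTimeoutException) { return null; } // degrade to a miss
    }

    // The write/refresh members simply delegate; whether to swallow
    // timeouts here too depends on how stale the cache is allowed to get.
    public void Set(string key, byte[] value, DistributedCacheEntryOptions options)
        => _inner.Set(key, value, options);
    public Task SetAsync(string key, byte[] value, DistributedCacheEntryOptions options, CancellationToken token = default)
        => _inner.SetAsync(key, value, options, token);
    public void Refresh(string key) => _inner.Refresh(key);
    public Task RefreshAsync(string key, CancellationToken token = default)
        => _inner.RefreshAsync(key, token);
    public void Remove(string key) => _inner.Remove(key);
    public Task RemoveAsync(string key, CancellationToken token = default)
        => _inner.RemoveAsync(key, token);
}
```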
The OS is telling us the socket is fine, and that's the main problem here. But we also know that commands are timing out (due to no responses) and that we haven't read anything at all off the socket. I'm taking a stab at handling this better so that we recover much faster (within 4 timeouts) over in #2610. Other libraries open a new connection per command or take other approaches (slower, with different issues), or maybe they have handling for this - not sure, but I'm seeing if we can make the multiplexed case better there.
Good news: we will install this version on our test environment with traces activated. @philon-msft I guess it's better to disable the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch?
I'd recommend leaving ForceReconnect enabled for an additional layer of protection.
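For anyone wondering where this is toggled: a minimal sketch, assuming the switch is set before the cache is first used (AppContext.SetSwitch is the standard mechanism; the switch name is the one discussed above):

```csharp
// Enable the ForceReconnect safety net in the Redis distributed cache.
// Must run before the cache is first touched, e.g. at the top of Program.cs.
AppContext.SetSwitch(
    "Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect",
    isEnabled: true);
```

The same switch can also be set declaratively via `configProperties` in runtimeconfig.json instead of in code.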
Is anyone facing any issues with https://github.com/StackExchange/StackExchange.Redis/releases/tag/2.7.10? We need the fix for the 15-minute saga but are concerned about load and other issues.
@kalaumapathy There's a null ref we fixed in 2.7.17, but it was not new - it just solved a rare use-after-dispose case a few users were hitting. I'd have no reservations about 2.7.x, and I advise using the latest.
Thank you @NickCraver. We will do a load test before sending it to prod and will report here if we see scale issues. Also, how could we simulate this in a lower environment for testing these events?
1. We would like to log all the maintenance events from MS and validate that this fix helps us recover from them rather than waiting 15 minutes.
2. Try removing a shard from an existing setup, or removing/swapping a replica node, and see how that behaves under load.
3. What other scenarios would you recommend validating to confirm the fix is working, and what other areas do we need to keep an eye on?
4. We would also like to use the newly introduced logging feature.
Hello,
We are experiencing undetected connection losses with Azure Cache for Redis once or twice a month on our .NET 6 application, hosted with Docker on an Azure Linux App Service. This leads to RedisTimeoutException errors, and the number of clients connected to the Redis server drops sharply. The problem occurs at the same time on about half of the instances, and all applications hosted on the same instance experience it simultaneously.
The problem resolves itself after 15 minutes. We believe this is due to the TCP parameter net.ipv4.tcp_retries2 (#1848): its default of 15 retransmissions means the Linux kernel only gives up on an unacknowledged connection after roughly 15 minutes of exponential retransmission backoff.
We understand that the origin of these errors is external to our application, but the frequency with which they occur surprises us. Is it normal for this to happen so regularly?
How can this problem be corrected or mitigated? We've tried updating the library to the latest version and enabling the ForceReconnect pattern via the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch, but with no effect.
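For reference, this is the kind of connection configuration we've been experimenting with. It's a sketch, not a guaranteed fix: the host is a placeholder and the values are illustrative. The idea is that a shorter KeepAlive makes the multiplexer ping the server more often, so a socket the OS still considers healthy surfaces as unresponsive sooner than the kernel's retransmit timeout:

```csharp
using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "contoso.redis.cache.windows.net:6380" }, // placeholder host
    Ssl = true,
    AbortOnConnectFail = false,     // keep retrying in the background
    KeepAlive = 10,                 // seconds between keepalive pings (default is 60)
    ConnectTimeout = 5000,          // ms to wait when (re)establishing a connection
    ReconnectRetryPolicy = new ExponentialRetry(5000), // backoff between reconnect attempts
};

var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```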
Here is an example stack trace from when this happens:
Here's the evolution of the number of clients connected to the Redis server: