Managing Redis Connection Loss on Azure Linux App Service #2595

Closed
FPierrel opened this issue Nov 12, 2023 · 10 comments
@FPierrel

Hello,

We are experiencing undetected connection losses with Azure Cache for Redis once or twice a month in our .NET 6 application, hosted with Docker on an Azure Linux App Service. This leads to RedisTimeoutExceptions, and the number of clients connected to the Redis server drops sharply. The problem occurs at the same time on about half of the instances, and all applications hosted on the same instance experience it simultaneously.

The problem resolves itself after 15 minutes. We believe it's due to the TCP parameter net.ipv4.tcp_retries2 (#1848).
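
For context, the Linux default of `net.ipv4.tcp_retries2 = 15` lets the kernel keep retransmitting on a dead connection for roughly 15 minutes before surfacing an error, which matches the window we observe. A quick, illustrative way to check the effective value from inside the container (not part of our application code):

```csharp
using System;
using System.IO;

// Linux only: read the kernel's TCP retransmission limit from procfs.
// The default of 15 retries means a stalled connection is only reported
// as broken to the application after roughly 15 minutes.
string retries2 = File.ReadAllText("/proc/sys/net/ipv4/tcp_retries2").Trim();
Console.WriteLine($"net.ipv4.tcp_retries2 = {retries2}");
```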

We understand that the origin of these errors is external to our application, but the frequency with which they occur surprises us. Is it normal for this to happen so regularly?

How can this problem be corrected or mitigated? We've tried updating the library to the latest version and enabling the ForceReconnect pattern with the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch, with no effect.
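
For completeness, this is roughly how we enable the switch at startup; a simplified sketch of a Program.cs, with an illustrative connection string name:

```csharp
// Program.cs in a .NET 6 minimal API project (implicit usings enabled),
// with the Microsoft.Extensions.Caching.StackExchangeRedis package referenced.

// Opt in to the cache's force-reconnect behavior; this must run before the
// distributed cache is first used.
AppContext.SetSwitch(
    "Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect", true);

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddStackExchangeRedisCache(options =>
{
    // "Redis" is an illustrative connection string name.
    options.Configuration = builder.Configuration.GetConnectionString("Redis");
});

var app = builder.Build();
app.Run();
```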


Here is an example stack trace from when this happens:

Timeout awaiting response (outbound=0KiB, inbound=15KiB, 5984ms elapsed, timeout is 5000ms), command=HMGET, next: EXPIRE ecentred119300e-be34-276a-2255-d347328600d2, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 223371, async-ops: 753868, serverEndpoint: xxxx:6380, conn-sec: 8596.01, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: 3a734420707e(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=250,Max=1000), WORKER: (Busy=3,Free=32764,Min=250,Max=32767), POOL: (Threads=19,QueuedItems=0,CompletedItems=9254999,Timers=256), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) 

StackExchange.Redis.RedisTimeoutException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisCache+<GetAndRefreshAsync>d__39.MoveNext (Microsoft.Extensions.Caching.StackExchangeRedis, Version=8.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisCache+<GetAsync>d__27.MoveNext (Microsoft.Extensions.Caching.StackExchangeRedis, Version=8.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Session.DistributedSession+<CommitAsync>d__31.MoveNext (Microsoft.AspNetCore.Session, Version=6.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)

Here's the evolution of the number of clients connected to the Redis server:
[chart: connected client count over time]

@NickCraver
Collaborator

cc @philon-msft for similar reports of the 15-minute issue here

@philon-msft
Collaborator

UseForceReconnect is designed to help in this situation, but it may not be getting triggered by the RedisTimeoutExceptions thrown in this scenario. RedisConnectionExceptions or SocketExceptions would trigger it, but when commands are simply timing out while the socket is stalled, only RedisTimeoutExceptions are thrown.

@mgravell am I understanding the UseForceReconnect code correctly? Any ideas for detecting/replacing stalled connections faster in this situation?
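
In other words, my reading of the pattern is something like the following paraphrase (not the actual library code):

```csharp
using System;
using System.Net.Sockets;
using StackExchange.Redis;

internal static class ForceReconnectGate
{
    // Paraphrase of the behavior described above, not the library source:
    // only connection-level failures enter the force-reconnect path, so a
    // RedisTimeoutException from a stalled-but-"open" socket never does.
    public static bool ShouldForceReconnect(Exception ex) =>
        ex is RedisConnectionException || ex is SocketException;
}
```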

@FPierrel
Author

That's how it seemed to us. Out of desperation, we modified the IDistributedCache implementation to take RedisTimeoutExceptions into account, but this is obviously not a long-term solution. Luckily, we never encounter any timeouts during normal operation.
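
Roughly, the workaround looks like the sketch below; it is simplified rather than our exact code, and `forceReconnectAsync` stands in for whatever reconnect wiring the application provides:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using StackExchange.Redis;

public static class CacheGuard
{
    // Wraps a single IDistributedCache read so that a RedisTimeoutException is
    // degraded to a cache miss and also triggers an app-provided reconnect hook,
    // since the built-in pattern only reacts to connection/socket exceptions.
    public static async Task<byte[]?> GetOrReconnectAsync(
        IDistributedCache cache, string key, Func<Task> forceReconnectAsync)
    {
        try
        {
            return await cache.GetAsync(key);
        }
        catch (RedisTimeoutException)
        {
            await forceReconnectAsync(); // app-specific: dispose and recreate the multiplexer
            return null;                 // fall back to the underlying data source
        }
    }
}
```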

Couldn't there be a problem with the library itself? We also have Java applications using Jedis, not on the same App Service plan but connected to the same Redis server, and the problem never occurs there.

@NickCraver
Collaborator

The OS is telling us the socket is fine, and that's the main problem here. But we also know that commands are timing out (due to no responses) and that we haven't read anything at all off the socket. I'm taking a stab at handling this better so that we recover much faster (in 4 timeouts) over in #2610.

Other libraries open a new connection per command or take other approaches (slower, with different issues), or maybe they have handling for this; I'm not sure, but I'm seeing if we can make the multiplexed case better there.
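
Conceptually, the detection is along these lines; an illustrative sketch only, not the code in #2610:

```csharp
using System.Threading;

// Illustrative sketch of the heuristic described above (not the code in #2610):
// several consecutive timeouts with zero bytes read means the socket is almost
// certainly dead, even though the OS still reports it as healthy.
internal sealed class StallDetector
{
    private const int ConsecutiveTimeoutLimit = 4;
    private int _timeoutsWithNoReads;

    // Call on every command timeout, passing the bytes read since the last one.
    // Returns true when the connection should be torn down and rebuilt.
    public bool OnTimeout(long bytesReadSinceLastTimeout)
    {
        if (bytesReadSinceLastTimeout > 0)
        {
            Interlocked.Exchange(ref _timeoutsWithNoReads, 0);
            return false; // data is still flowing; the server is just slow
        }

        return Interlocked.Increment(ref _timeoutsWithNoReads) >= ConsecutiveTimeoutLimit;
    }
}
```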

@philon-msft
Collaborator

Update: a new version, 2.7.10, has been released, including #2610 to detect and recover from stalled sockets. This should help prevent the situation where connections can stall for ~15 minutes on Linux clients.

@FPierrel
Author

Good news; we will install this version in our test environment with tracing enabled.

@philon-msft I guess it's better to disable the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch?

@philon-msft
Collaborator

I'd recommend leaving ForceReconnect enabled for an additional layer of protection.

@kalaumapathy

Is anyone facing any issues with https://github.com/StackExchange/StackExchange.Redis/releases/tag/2.7.10? We need the fix for the 15-minute stall, but we're concerned about load and other issues.
Please advise @philon-msft @NickCraver

@NickCraver
Collaborator

@kalaumapathy There's a null ref we fixed in 2.7.17, but this was not new; it just solved a rare use-after-dispose case a few users were hitting. I'd have no reservations about 2.7.x, and I advise using the latest.

@kalaumapathy


Thank you @NickCraver. We will do a load test before sending it to prod and will report here if we see scale issues. Also, how could we simulate this in a lower environment to test these events?

1. We would like to log all the maintenance events from Microsoft and validate that this fix helps us recover from them rather than waiting for 15 minutes.
2. Try removing a shard from an existing setup, or removing/swapping a replica node, and see how that affects things under load.
3. What other scenarios would you recommend validating to confirm the fix is working, and what other areas do we need to keep an eye on?
4. We would also like to use the newly introduced logging feature (see the sketch below).
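
For item 4, this is roughly the diagnostics wiring we have in mind; a sketch assuming the `TextWriter` connection log, the `ServerMaintenanceEvent` hook, and the connection events available in recent 2.7.x builds, with an illustrative endpoint and log file name:

```csharp
using System;
using System.IO;
using StackExchange.Redis;

// Illustrative endpoint; abortConnect=false is the usual Azure recommendation.
var options = ConfigurationOptions.Parse(
    "contoso.redis.cache.windows.net:6380,ssl=true,abortConnect=false");

// The optional TextWriter captures the multiplexer's connection/handshake log.
await using var connectLog = new StreamWriter("redis-connect.log", append: true) { AutoFlush = true };
var muxer = await ConnectionMultiplexer.ConnectAsync(options, connectLog);

// Azure publishes upcoming/ongoing maintenance through this event.
muxer.ServerMaintenanceEvent += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} maintenance: {e}");

muxer.ConnectionFailed += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} connection failed: {e.EndPoint} ({e.FailureType})");
muxer.ConnectionRestored += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} connection restored: {e.EndPoint}");

// Keep the multiplexer alive for the lifetime of the test run.
```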
