Managing Redis Connection Loss on Azure Linux App Service #2595
Comments
cc @philon-msft for similar reports of the 15-minute issue here
@mgravell am I understanding the UseForceReconnect code correctly? Any ideas for detecting/replacing stalled connections faster in this situation?
That's how it seemed to us. Out of desperation, we modified the IDistributedCache implementation to account for RedisTimeoutExceptions, but this is obviously not a long-term solution. Luckily, we never encounter any timeouts during normal operation. Couldn't there be a problem with the library itself? We also have Java applications using Jedis, not on the same App Plan but connected to the same Redis server, and the problem never occurs there.
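For illustration, here is a minimal sketch of the kind of wrapper we mean: a hypothetical decorator over the registered IDistributedCache that degrades a RedisTimeoutException on reads into a cache miss (the class name and fallback behavior are illustrative, not our exact code):

```csharp
using Microsoft.Extensions.Caching.Distributed;
using StackExchange.Redis;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical decorator: treat a RedisTimeoutException on reads as a
// cache miss instead of letting it bubble up and fail the request.
public sealed class TimeoutTolerantCache : IDistributedCache
{
    private readonly IDistributedCache _inner;

    public TimeoutTolerantCache(IDistributedCache inner) => _inner = inner;

    public byte[]? Get(string key)
    {
        try { return _inner.Get(key); }
        catch (RedisTimeoutException) { return null; } // degrade to a miss
    }

    public async Task<byte[]?> GetAsync(string key, CancellationToken token = default)
    {
        try { return await _inner.GetAsync(key, token); }
        catch (RedisTimeoutException) { return null; } // degrade to a miss
    }

    // The write/refresh members simply delegate; whether to swallow
    // timeouts here too depends on how stale the cache is allowed to get.
    public void Set(string key, byte[] value, DistributedCacheEntryOptions options)
        => _inner.Set(key, value, options);
    public Task SetAsync(string key, byte[] value, DistributedCacheEntryOptions options, CancellationToken token = default)
        => _inner.SetAsync(key, value, options, token);
    public void Refresh(string key) => _inner.Refresh(key);
    public Task RefreshAsync(string key, CancellationToken token = default)
        => _inner.RefreshAsync(key, token);
    public void Remove(string key) => _inner.Remove(key);
    public Task RemoveAsync(string key, CancellationToken token = default)
        => _inner.RemoveAsync(key, token);
}
```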
The OS is telling us the socket is fine, and that's the main problem here. But we also know that commands are timing out (due to no responses) and that we haven't read anything at all off the socket. I'm taking a stab at handling this better so that we recover much faster (within 4 timeouts) over in #2610. Other libraries open a new connection per command or take other approaches (slower, with different issues), or maybe they have handling for this - not sure, but I'm seeing if we can make the multiplexed case better there.
Good news: we will install this version on our test environment with traces activated. @philon-msft I guess it's better to disable the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch?
I'd recommend leaving ForceReconnect enabled for an additional layer of protection.
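For anyone wondering where this is toggled: a minimal sketch, assuming the switch is set before the cache is first used (AppContext.SetSwitch is the standard mechanism; the switch name is the one discussed above):

```csharp
// Enable the ForceReconnect safety net in the Redis distributed cache.
// Must run before the cache is first touched, e.g. at the top of Program.cs.
AppContext.SetSwitch(
    "Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect",
    isEnabled: true);
```

The same switch can also be set declaratively via `configProperties` in runtimeconfig.json instead of in code.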
Is anyone facing any issues with https://github.com/StackExchange/StackExchange.Redis/releases/tag/2.7.10? We need the fix for the 15-minute saga but are concerned about load and other issues.
@kalaumapathy There's a null ref we fixed in 2.7.17, but it was not new - it just solved a rare use-after-dispose case a few users were hitting. I'd have no reservations about 2.7.x, and I advise using the latest.
Thank you @NickCraver. We will do a load test before sending it to prod and will report here if we see scale issues. Also, how could we simulate this in a lower environment for testing these events?
1. We would like to log all the maintenance events from MS and validate that this fix helps us recover from them rather than waiting 15 minutes.
2. Try removing a shard from an existing setup, or removing/swapping a replica node, and see how that behaves under load.
3. What other scenarios would you recommend validating to confirm the fix is working, and what other areas do we need to keep an eye on?
4. We would also like to use the newly introduced logging feature.
Hello,
We are experiencing undetected connection losses with Azure Cache for Redis once or twice a month on our .NET 6 application, hosted with Docker on an Azure Linux App Service. This leads to RedisTimeoutException errors, and the number of clients connected to the Redis server drops sharply. The problem occurs at the same time on about half of the instances, and all applications hosted on the same instance experience it simultaneously.
The problem resolves itself after 15 minutes. We believe this is due to the TCP parameter net.ipv4.tcp_retries2 (#1848): its default of 15 retransmissions means the Linux kernel only gives up on an unacknowledged connection after roughly 15 minutes of exponential retransmission backoff.
We understand that the origin of these errors is external to our application, but the frequency with which they occur surprises us. Is it normal for this to happen so regularly?
How can this problem be corrected or mitigated? We've tried updating the library to the latest version and enabling the ForceReconnect pattern via the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch, but with no effect.
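For reference, this is the kind of connection configuration we've been experimenting with. It's a sketch, not a guaranteed fix: the host is a placeholder and the values are illustrative. The idea is that a shorter KeepAlive makes the multiplexer ping the server more often, so a socket the OS still considers healthy surfaces as unresponsive sooner than the kernel's retransmit timeout:

```csharp
using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "contoso.redis.cache.windows.net:6380" }, // placeholder host
    Ssl = true,
    AbortOnConnectFail = false,     // keep retrying in the background
    KeepAlive = 10,                 // seconds between keepalive pings (default is 60)
    ConnectTimeout = 5000,          // ms to wait when (re)establishing a connection
    ReconnectRetryPolicy = new ExponentialRetry(5000), // backoff between reconnect attempts
};

var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```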
Here is an example stack trace from when this happens:
Here's the evolution of the number of clients connected to the Redis server: