Timeout unreasonably longer than AsyncTimeout #2477
Comments
Do any of your other exceptions show higher thread counts? My best guess here is that under load you may have exhausted the thread pool, causing timers not to fire for an extended duration (or you had a lot of timers ahead of us, preventing the next from firing?) - the trends for the …
@NickCraver Redis metrics from Azure App Insights:
Maybe we should reconfigure something in our Redis?
@sabaton1111 It seems to me your min worker thread count is too low?
@rayao Thank you, I will give it a try. :)
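For context, a minimal sketch of raising the thread pool minimums at application startup; the 200/200 values below are purely illustrative, not a recommendation:

```csharp
using System;
using System.Threading;

class ThreadPoolSetup
{
    public static void Configure()
    {
        // Read the current minimums, then raise them if they are below the
        // (illustrative) target so bursts of work don't wait on the pool's
        // slow thread-injection rate.
        ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
        ThreadPool.SetMinThreads(Math.Max(minWorker, 200), Math.Max(minIocp, 200));
    }
}
```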
@NickCraver do you think those statistics explain the delay? What could be done on the consuming side to make the situation better?
@rayao I'm trying to reason about this - how many are in the … I'm looking at adding timer counts to timeout messages for .NET Core in the next release now, so we can rule that out more easily.
To rule out timer overload as the cause, I'm adding this to the logging in #2500.
@NickCraver I can't query the logs anymore; the logs from 5/30 have already been purged for being too old.
We're seeing some instances of quite delayed timeouts, and in at least two deeper investigations it was due to a huge number of timers firing and the backlog timeout assessment timer not triggering for an extended period of time. To help users diagnose this, this adds the cheap counter to the .NET Core pool stats, where `Timer.ActiveCount` is readily available. This is available in all the .NET (non-Framework) versions we support.

Before:
> ...POOL: (Threads=25,QueuedItems=0,CompletedItems=1066)

After:
> ...POOL: (Threads=25,QueuedItems=0,CompletedItems=1066,Timers=46)

See #2477 for another possible instance of this.
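For anyone who wants to watch these numbers from their own code, a rough approximation of that stats line can be assembled from the public .NET counters (this is a sketch of equivalent data, not the library's exact formatting code):

```csharp
using System.Threading;

static class PoolStats
{
    // Approximates the "POOL: (Threads=...,QueuedItems=...,CompletedItems=...,Timers=...)"
    // portion of the timeout message, assuming .NET (non-Framework) where
    // Timer.ActiveCount and the ThreadPool counters are available.
    public static string Snapshot() =>
        $"Threads={ThreadPool.ThreadCount}," +
        $"QueuedItems={ThreadPool.PendingWorkItemCount}," +
        $"CompletedItems={ThreadPool.CompletedWorkItemCount}," +
        $"Timers={Timer.ActiveCount}";
}
```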
@NickCraver, what is the expected count of timers? We've observed some timeout exceptions over the last 3 days, and all of them have >= 23 timers (one of them even has 409). Is that too many, or should we look for the issue elsewhere? Here's an example of a failed HGETALL from a collection that had at most 2 (short string) elements at the time of the exception:
@NickCraver, we've also recently updated and are seeing large numbers of timers in the timeout messages. Sample exception:
@vladislav-karamfilov Not worried about that timer count; on yours I'd check SLOWLOG on the server side, as I don't see anything standing out there.
@stambunan whoa that's a lot of timers, holy crap. I can't say why they're so high (given the volume, I'd try a memory dump at any busy period and you'll probably get lucky with some volume in play). Normally we see this with things like spinning up a timer per command for timeouts or some such wrapping somewhere, so it could be your retry or it could be something else. Please note that we see some thread elevation, queued items, and sync operations against Redis, all pointing to some amount of thread pool starvation. Combined with really high timers, I'm not surprised you'd see some contention/timeouts under load. Recommendations would be: find out what those timers are, trim them down, and eliminate sync ops to Redis.
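On the "eliminate sync ops" point, a minimal sketch of the kind of change meant here: swapping blocking calls for their async counterparts (the wrapper method is illustrative; `StringGet`/`StringGetAsync` are standard StackExchange.Redis APIs):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

static class RedisAccess
{
    // Synchronous calls block a pool thread for the full round trip and make
    // thread pool starvation worse; prefer the *Async variants end to end.
    public static async Task<string> GetValueAsync(IDatabase db, string key)
    {
        // Instead of: var value = db.StringGet(key);    // sync, ties up a thread
        RedisValue value = await db.StringGetAsync(key); // async, frees the thread while waiting
        return value.HasValue ? (string)value : null;
    }
}
```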
Almost all of the commands in the SLOWLOG output are not from our app. They are: …
@vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts
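If it helps, the slow log can also be pulled through the client rather than redis-cli. A sketch, assuming the `IServer.SlowlogGet` API in current StackExchange.Redis versions (member names on `CommandTrace` may vary slightly by version):

```csharp
using System;
using StackExchange.Redis;

static class SlowlogCheck
{
    // Prints recent SLOWLOG entries from every endpoint of the connection.
    public static void PrintSlowlog(ConnectionMultiplexer muxer)
    {
        foreach (var endpoint in muxer.GetEndPoints())
        {
            IServer server = muxer.GetServer(endpoint);
            foreach (CommandTrace entry in server.SlowlogGet(10)) // last 10 slow commands
            {
                Console.WriteLine(
                    $"{entry.Duration.TotalMilliseconds} ms: {string.Join(" ", entry.Arguments)}");
            }
        }
    }
}
```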
Thanks @NickCraver. I think we found a different issue that was causing the active timers to be so high. Now when we see timeouts, the count is in the 200s, but we do use a couple of timers in our code, which might be the reason for this.
This is our server load graph: We made a lot of improvements to decrease the server load (smaller values, fewer connections, an increased worker pool, optimized requests) at the end of June and the beginning of July, and it seemed that we fixed the issue around July 5th (#2486 (comment)). There were no timeout exceptions until July 25th, when all of a sudden timeout exceptions started to appear in our Application Insights logs again. On July 23rd we had scheduled server maintenance (confirmed by following the steps from https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts#server-maintenance). I think the new wave of timeout exceptions is related to this maintenance event. As I wrote in my last comment, last week I also tried a reboot that seemed to fix the issue for a couple of days, but after that we still observe timeout exceptions.
I can confirm that about 2 months ago (or less) we started to get these kinds of errors in bursts, and I've tried every suggestion I've managed to find: upgrading the Redis cache from C2 to C3 (increasing the bandwidth), increasing the worker count (280, as you can see), and increasing the app instance count so there are more units to handle the requests. So far nothing has changed. I still see the log errors (no more, no less).
We are experiencing these timeouts while the server load is < 10%.
Is there any documentation of what …
@NickCraver Can you explain what the issue might be here?
@Magnum1945 How often do you see this? E.g. could it be when the servers are patched every ~6 weeks, or what kind of sporadic are we talking about? Nothing in the message seems out of the ordinary with load or volume; that seems pretty reasonable overall, so we're likely looking at something external.
@NickCraver Here are the error statistics for our US region:
The 15 minute duration sounds like you may be experiencing stalled sockets. I'd recommend upgrading StackExchange.Redis to v2.7.17 to get "Fix #2595: Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments" (#2610 by @NickCraver).
Thanks @philon-msft. I will upgrade StackExchange.Redis to the latest stable version and increase the min worker thread value. I still don't know why the problem is happening, though - not all the errors last for 15 minutes. Currently I'm thinking it could be a bandwidth issue on the client side or between the client and Redis, but I'm not sure that's the case in my situation.
Hello there, we have this log. We increased the minimum thread count too, but see basically no difference:
@Shumaister What are your settings? That message indicates it timed out doing a multi-get in 3ms, on a version with only ~1,000ms fidelity in timeout checks. That seems to me like the timeout is set to 3ms rather than the 3 seconds that was perhaps intended.
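For comparison, a sketch of setting the StackExchange.Redis timeouts explicitly; both `SyncTimeout` and `AsyncTimeout` are in milliseconds, so an intended 3-second timeout is 3000, not 3 (the connection string parameter is illustrative):

```csharp
using StackExchange.Redis;

static class TimeoutConfig
{
    public static ConnectionMultiplexer Connect(string connectionString)
    {
        var options = ConfigurationOptions.Parse(connectionString);
        options.SyncTimeout = 3000;   // 3 seconds, not 3 ms
        options.AsyncTimeout = 3000;  // 3 seconds, not 3 ms
        return ConnectionMultiplexer.Connect(options);
    }
}
```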
Could you clarify whether a stalled socket is expected during regular Azure Redis server maintenance? It would seem very strange to me if regular maintenance didn't take care to close the sockets instead of just dropping them. However, we also see these stalls with Azure Cache for Redis from an AKS service with unexpected frequency, so I am wondering.
Routine Azure Redis server maintenance closes connections gracefully to avoid stalled sockets, but in some cases maintenance on underlying Azure infrastructure (networking or compute) can cause connections to be interrupted without graceful closure, leading to stalled sockets on clients.
Recently our Azure service encountered a Redis instance down incident. During the downtime, our Redis client threw RedisTimeoutException with unreasonably long elapsed times, like this:
On one of our VMs we saw 102,550 timeout exceptions in total: 92% of them took longer than 2 seconds, 80% > 10 seconds, 72% > 20 seconds, 33% > 80 seconds, and I can even see 977 exceptions with an elapsed time > 120 seconds.
This left thousands of requests stuck in the Processing stage, which further resulted in some other very bad situations.
I want to know whether that's avoidable, and if so, how to configure the connection or the client library.
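For readers hitting the same thing, a hedged sketch (not maintainer guidance) of configuration that bounds how long commands can linger during an outage, assuming a StackExchange.Redis version that exposes `BacklogPolicy`; the timeout values are illustrative. With the default backlog policy, commands issued while disconnected can be queued and fail much later than `AsyncTimeout`; `FailFast` makes them fail promptly instead:

```csharp
using StackExchange.Redis;

static class OutageConfig
{
    public static ConfigurationOptions Build(string connectionString)
    {
        var options = ConfigurationOptions.Parse(connectionString);
        options.SyncTimeout = 5000;                     // per-operation sync timeout, in ms
        options.AsyncTimeout = 5000;                    // per-operation async timeout, in ms
        options.BacklogPolicy = BacklogPolicy.FailFast; // don't queue commands while disconnected
        return options;
    }
}
```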