Timeout unreasonably longer than AsyncTimeout #2477

Open · rayao opened this issue Jun 3, 2023 · 28 comments

@rayao

rayao commented Jun 3, 2023

Recently our Azure service encountered a Redis instance outage. During the downtime, our Redis client threw RedisTimeoutException with unreasonably long elapsed times, like this:

Failed to make redis cache call due to exception: StackExchange.Redis.RedisTimeoutException, Timeout awaiting response (outbound=2KiB, inbound=14KiB, 20969ms elapsed, timeout is 2000ms), command=SETEX, next: UNWATCH, inst: 0, qu: 152, qs: 0, aw: True, bw: WritingMessage, rs: ReadAsync, ws: Flushed, in: 0, last-in: 1, cur-in: 0, sync-ops: 16661, async-ops: 2226416, serverEndpoint: 104.42.147.56:15000, conn-sec: 45102.45, mc: 1/1/0, mgr: 10 of 10 available, clientName: RDA085FC4D4170(SE.Redis-v2.6.96.30123), PerfCounterHelperkeyHashSlot: 7889, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=6,Free=32761,Min=320,Max=32767), POOL: (Threads=12,QueuedItems=0,CompletedItems=47866509), v: 2.6.96.30123

On one of our VMs we saw 102,550 timeout exceptions in total: 92% of them elapsed longer than 2 seconds, 80% longer than 10 seconds, 72% longer than 20 seconds, and 33% longer than 80 seconds; 977 exceptions even exceeded 120 seconds.
This left thousands of requests stuck in the Processing stage, which in turn caused some other very bad situations.
I want to know whether that's avoidable, and if so, how to configure the connection or the client library.
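
For readers reproducing this kind of breakdown, a minimal sketch (not from this issue) of how those percentages can be computed by scraping the `NNNNms elapsed` fragment that RedisTimeoutException embeds in its message; how the messages are collected from your logs is assumed:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ElapsedSummary
{
    // Pulls the "<n>ms elapsed" fragment out of each RedisTimeoutException
    // message and reports what share of timeouts exceeds each threshold.
    public static void Summarize(string[] messages)
    {
        var elapsed = messages
            .Select(m => Regex.Match(m, @"(\d+)ms elapsed"))
            .Where(m => m.Success)
            .Select(m => int.Parse(m.Groups[1].Value))
            .ToList();

        foreach (var thresholdMs in new[] { 2_000, 10_000, 20_000, 80_000, 120_000 })
        {
            var pct = 100.0 * elapsed.Count(e => e > thresholdMs) / elapsed.Count;
            Console.WriteLine($"> {thresholdMs} ms: {pct:F0}% of {elapsed.Count} timeouts");
        }
    }
}
```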

@NickCraver
Collaborator

Do any of your other exceptions show higher thread counts? My best guess here is that under load you may have exhausted the thread pool, causing timers not to fire for an extended duration (or you had a lot of timers ahead of ours, preventing the next from firing?). The trends in the POOL section above across your error history would be enlightening there.
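
As a rough way to capture those POOL trends, a hedged sketch (assuming .NET Core 3.0 or later, and that Redis calls are wrapped in a try/catch) that logs thread-pool and timer counters next to each timeout:

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

static class PoolSnapshot
{
    // Call from a catch block around Redis operations so pool pressure can be
    // trended alongside the library's own timeout diagnostics.
    public static void Log(RedisTimeoutException ex)
    {
        ThreadPool.GetMinThreads(out var minWorker, out var minIocp);
        Console.WriteLine(
            $"{DateTime.UtcNow:O} " +
            $"threads={ThreadPool.ThreadCount} " +
            $"queued={ThreadPool.PendingWorkItemCount} " +
            $"minWorker={minWorker} minIocp={minIocp} " +
            $"timers={Timer.ActiveCount} " + // not available on .NET Framework
            $"redis={ex.Message}");
    }
}
```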

@rayao
Author

rayao commented Jun 9, 2023

@NickCraver
During the entire incident the thread count stayed below 100, so it doesn't look like thread pool exhaustion to me.
In our own code the timer creation rate is pretty low, roughly 1 or 2 every 5 seconds.
I collected some data here; hope that helps. We did close and reconnect the Redis connection during the timeout burst; every VM here reconnected once, which is when the avg/max elapsed dropped.
[chart: collected timeout data across VMs]
The VM which has max queued:
[chart]
Another VM:
[chart]
Another VM which has relatively less queued:
[chart]

@sabaton1111

@NickCraver
We are experiencing something similar every day.
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=3KiB, 5125ms elapsed, timeout is 5000ms), command=HMGET, next: INFO, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 7693, last-in: 0, cur-in: 0, sync-ops: 2, async-ops: 297, serverEndpoint: servername.redis.cache.windows.net:6380, conn-sec: 67.51, mc: 1/1/0, mgr: 10 of 10 available, clientName: clname(SE.Redis-v2.6.96.30123), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=11,QueuedItems=22,CompletedItems=2557), v: 2.6.96.30123 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Redis metrics from Azure App Insights:

  • CPU max load is under 30% (11% when timeouts started)
  • Used Memory Percentage is 1%
  • Server load is below 25%
  • Max expired keys ~5.5k

Maybe we should reconfigure something in our Redis?

@rayao
Author

rayao commented Jun 15, 2023

@sabaton1111 it seems to me your minimum worker thread count is too low (your exception shows WORKER Min=1)?
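
If that turns out to be the cause, a hedged sketch of raising the thread-pool minimum at startup (the value 200 is purely illustrative; size it to your own workload):

```csharp
using System.Threading;

// Run once at process startup, before traffic arrives. Raising the minimum
// only affects how quickly the pool grows under a burst; it does not
// pre-allocate threads.
ThreadPool.GetMinThreads(out _, out var minIocp);
ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: minIocp);
```

The same minimum can also be set without code via the `System.Threading.ThreadPool.MinThreads` setting in `runtimeconfig.json`.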

@sabaton1111

@rayao Thank you, I will give it a try. :)

@rayao
Author

rayao commented Jun 17, 2023

@NickCraver do you think those statistics explain the delay? What could be done on the consuming side to improve the situation?

@NickCraver
Collaborator

@rayao I'm trying to reason about this - how many are in the bw: WritingMessage state when you see timeouts? I wonder if you're in a case where we can send but not receive for some reason and the backlog is happily busy purging the queue downstream, not assessing timeouts because it's continually active...but this is purely a theory.

I'm looking at adding timer counts to timeout messages for .NET Core in the next release, so we can rule that out more easily.

@NickCraver
Collaborator

To eliminate the timer overload cause, I'm adding this to logging in #2500.

@rayao
Author

rayao commented Jul 3, 2023

@NickCraver I can't query the logs anymore; the 5/30 logs have already been purged because they're too old.

NickCraver added a commit that referenced this issue Jul 5, 2023
We're seeing some instances of quite delayed timeouts, and in at least two deeper investigations the cause was a huge number of timers firing while the backlog timeout-assessment timer did not trigger for an extended period of time. To help users diagnose this, adding the cheap counter to the .NET Core pool stats, where `Timer.ActiveCount` is readily available.

This is available in all the .NET (non-Framework) versions we support.

Before:
> ...POOL: (Threads=25,QueuedItems=0,CompletedItems=1066)

After:
> ...POOL: (Threads=25,QueuedItems=0,CompletedItems=1066,Timers=46)

See #2477 for another possible instance of this.
@vladislav-karamfilov

> To eliminate the timer overload cause, I'm adding this to logging in #2500.

@NickCraver, what is the expected count of timers? We've observed some timeout exceptions over the last 3 days and all of them have >= 23 timers (one of them even has 409). Is that too many, or should we look for the issue elsewhere?

Here's an example of a failed HGETALL against a collection that had at most 2 (short string) elements at the time of the exception:

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5297ms elapsed, timeout is 5000ms), command=HGETALL, next: HGETALL asterisk-call-timers, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 121753, serverEndpoint: <server>, conn-sec: 43573.57, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: Public000000(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=4,Free=32763,Min=10,Max=32767), POOL: (Threads=8,QueuedItems=0,CompletedItems=2142304,Timers=27), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

@stambunan
Contributor

@NickCraver, we've also recently updated and are seeing large numbers of timers on timeout exceptions. We use Polly to retry failed requests, and I'm wondering whether that could be the reason our Timers count is so high?

Sample exception:

Redis intermittent exception: StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 16250ms elapsed, timeout is 10000ms), command=LRANGE, next: LRANGE value, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 33, last-in: 2, cur-in: 0, sync-ops: 637400, async-ops: 126159, serverEndpoint: redis-endpoint, conn-sec: 220544.69, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: workername(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=10,Max=1000), WORKER: (Busy=12,Free=32755,Min=10,Max=32767), POOL: (Threads=15,QueuedItems=2203,CompletedItems=199785096,Timers=29004), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) 

@NickCraver
Collaborator

@vladislav-karamfilov I'm not worried about that timer count; for yours I'd check SLOWLOG on the server side, as I don't see anything standing out client-side.

@NickCraver
Collaborator

@stambunan whoa, that's a lot of timers, holy crap. I can't say why they're so high (given the volume, I'd try a memory dump at any busy period and you'll probably get lucky with some in flight). Normally we see this with things like spinning up a timer per command for timeouts or some such wrapping somewhere, so it could be your retry layer or could be something else.

Please note that we also see some thread elevation, queued items, and sync operations against Redis, all pointing to some amount of thread pool starvation. Combined with the really high timer count, I'm not surprised you'd see some contention/timeouts under load. Recommendations: find out what those timers are, trim them down, and eliminate sync ops to Redis.
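
To make the "timer per command" failure mode concrete, a hedged sketch of a hypothetical timeout wrapper (not Polly's implementation and not this library's code) that leaves one active timer per call, plus a variant that releases it promptly:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class TimeoutWrappers
{
    // Anti-pattern: Task.Delay schedules a Timer internally, and because the
    // delay is never cancelled, that timer stays in Timer.ActiveCount until
    // the full timeout elapses -- at high call volume this shows up as
    // thousands of active timers.
    public static async Task<T> WithTimeout<T>(Task<T> task, TimeSpan timeout)
    {
        var delay = Task.Delay(timeout);
        if (await Task.WhenAny(task, delay) == delay)
            throw new TimeoutException();
        return await task;
    }

    // Cancelling the delay once the real task wins disposes its timer right away.
    public static async Task<T> WithTimeoutCancelling<T>(Task<T> task, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource();
        var delay = Task.Delay(timeout, cts.Token);
        if (await Task.WhenAny(task, delay) == delay)
            throw new TimeoutException();
        cts.Cancel();
        return await task;
    }
}
```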

@vladislav-karamfilov

> @vladislav-karamfilov I'm not worried about that timer count; for yours I'd check SLOWLOG on the server side, as I don't see anything standing out client-side.

Almost all of the commands in the SLOWLOG output are not from our app. They are: `setclientaddr`, `INFO replication`, `INFOINTERNAL timings`, `ManagementCommand ACK`, `ManagementCommand NO ONE`. I suspect the issue is internal to the Azure resource, because a couple of days ago I noticed spikes of cache reads at roughly one-hour intervals, and after a reboot the spikes were gone. But even without the spikes we still observe timeouts. Any suggestion on how to proceed from here?

@philon-msft
Collaborator

@vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts

@stambunan
Contributor

Thanks @NickCraver. We found a separate issue that was causing the active timer count to be so high. Now when we see timeouts the count is in the 200s, but we do use a couple of timers in our own code, which might explain that.

@vladislav-karamfilov

vladislav-karamfilov commented Aug 7, 2023

> @vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts

This is our server load graph:
[chart: Redis server load over time]

We made a lot of improvements to reduce the server load (smaller values, fewer connections, an increased worker pool, optimized requests) at the end of June and the beginning of July, and it seemed we had fixed the issue around July 5th (#2486 (comment)). There were no timeout exceptions until July 25th, when timeout exceptions suddenly started to appear in our Application Insights logs again. On July 23rd we had a scheduled server maintenance (confirmed by following the steps from https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts#server-maintenance), and I think the new wave of timeout exceptions is related to that maintenance event. As I wrote in my last comment, last week I also tried a reboot that seemed to fix the issue for a couple of days, but after that we still observe timeout exceptions.

@dstoynov-suzu

I can confirm that for the last two months or so we have been getting these kinds of errors in bursts:
Timeout awaiting response (outbound=2KiB, inbound=513KiB, 6062ms elapsed, timeout is 5000ms), command=HMGET, next: HMGET XXXXXXXXXXXXX, inst: 0, qu: 34, qs: 53, aw: True, bw: WritingMessage, rs: DequeueResult, ws: Writing, in: 0, in-pipe: 526205, out-pipe: 0, last-in: 0, cur-in: 526205, sync-ops: 0, async-ops: 3676273, serverEndpoint: rd.redis.cache.windows.net:6379, conn-sec: 93130.31, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: RD0003FF552664(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=280,Max=1000), WORKER: (Busy=40,Free=32727,Min=280,Max=32767), POOL: (Threads=40,QueuedItems=32,CompletedItems=28313302,Timers=46), v: 2.6.122.38350

I've tried every suggestion I managed to find: scaling the Redis from C2 to C3 (for more bandwidth), increasing the minimum worker thread count (280, as you can see), and increasing the app instance count so there are more instances to handle the requests. So far nothing has changed; I still see the same errors in the logs (no more, no less).

@deivydas321

deivydas321 commented Oct 27, 2023

We are experiencing these timeouts while the server load is < 10%.

Timeout awaiting response (outbound=0KiB, inbound=4KiB, 844ms elapsed, timeout is 500ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 1, cur-in: 0, sync-ops: 0, async-ops: 5423477, serverEndpoint: xxx.redis.cache.windows.net:6380, conn-sec: 81518.1, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: 514f69175ded(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=10,Max=1000), WORKER: (Busy=4,Free=32763,Min=10,Max=32767), POOL: (Threads=22,QueuedItems=10,CompletedItems=116322835,Timers=137), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Is there any documentation on what Timers means? In my case it is Timers=141; can it be related to the timeouts I'm experiencing?

@Magnum1945

@NickCraver
We are experiencing timeouts sporadically. We get them from Redis health checks on GetDatabase().PingAsync().ConfigureAwait(false) and server.PingAsync().ConfigureAwait(false). The only strange thing I can see is the large number of async-ops.

"Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5228ms elapsed, timeout is 5000ms), command=PING, next: INFO, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, last-in: 4, cur-in: 0, sync-ops: 4, async-ops: 670885, serverEndpoint: ***.redis.cache.windows.net:6380, conn-sec: 273596.92, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: ***, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=32332523,Timers=7), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)"

Can you explain what the issue might be here?

@NickCraver
Collaborator

@Magnum1945 How often do you see this? For example, could it be when the servers are patched every ~6 weeks, or what kind of sporadic are we talking about? Nothing in the message seems out of the ordinary for load or volume; it looks pretty reasonable overall, so we're likely looking at something external.

@Magnum1945

Magnum1945 commented Jan 25, 2024

@NickCraver Here are the error statistics for our US region:
Dec 06, Dec 15, Jan 09, Jan 11, Jan 14. Each time the errors last for around 15 minutes. The hour at which the issue occurs differs every time, so there is no pattern.
According to our DevOps team, there was no patching at those times.

@philon-msft
Collaborator

The 15-minute duration sounds like you may be experiencing stalled sockets. I'd recommend upgrading StackExchange.Redis to v2.7.17 to get "Fix #2595: Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments" (#2610 by @NickCraver).
That will help connections recover faster, in less than a minute rather than 15, but it won't eliminate the connection drops themselves. They seem to be occurring more frequently than Azure Redis server maintenance can explain; most likely there's another problem somewhere in your app environment causing Redis connections to be lost.
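
For anyone tuning for faster detection in the meantime, a hedged sketch of connection options that can help surface dead connections sooner; the property names are standard ConfigurationOptions members, but the endpoint and values are only illustrative:

```csharp
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("contoso.redis.cache.windows.net:6380,ssl=true");
options.AbortOnConnectFail = false; // keep retrying in the background after a drop
options.KeepAlive = 10;             // send a keepalive every 10 seconds to exercise the socket
options.ConnectTimeout = 5000;      // milliseconds
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```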

@Magnum1945

Thanks @philon-msft. I will upgrade StackExchange.Redis to the latest stable version and also increase the minimum worker thread count. I still don't know why the problem is happening, though; not all of the errors last for 15 minutes. Currently I'm thinking about bandwidth issues on the client side or between the client and Redis, but I'm not sure that's the case here.

@Shumaister

Hello there,

Here is the log we're seeing. We also increased the minimum thread count, but it made no noticeable difference.

StackExchange.Redis.RedisTimeoutException: Timeout performing HMGET (3ms), next: HMGET F2-Session 2646bd0e-6673-791f-5be1-49f54fdcefde, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: redis.fargate-services.local:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: ip-10-0-130-208, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=4,Free=32763,Min=200,Max=32767), v: 2.2.4.27433 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

@NickCraver
Collaborator

NickCraver commented Feb 18, 2024

@Shumaister What are your settings? That message indicates it timed out doing a multi-get with a 3ms timeout, on a version with only ~1,000ms fidelity in timeout checks. That looks to me like the timeout is set to 3ms rather than the 3 seconds that was perhaps intended.
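
For reference, a minimal sketch showing where those values live; both timeout properties are expressed in milliseconds, so `syncTimeout=3` in a connection string means 3 ms, not 3 seconds (the endpoint below is taken from the message above):

```csharp
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("redis.fargate-services.local:6379");
options.SyncTimeout = 3000;  // 3 seconds, expressed in milliseconds
options.AsyncTimeout = 3000; // same unit for async operations
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```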

@hacst

hacst commented Mar 7, 2024

@philon-msft

They seem to be occurring more frequently than can be explained by Azure Redis server maintenance. Most likely, there's another problem somewhere in your app environment causing Redis connections to be lost.

Could you clarify whether a stalled socket is expected during regular Azure Redis server maintenance? It would seem very strange to me if regular maintenance didn't take care to close the sockets rather than just dropping them. However, we also see these stalls with Azure Cache for Redis from an AKS service at an unexpected frequency, so I'm wondering.

@philon-msft
Collaborator

Routine Azure Redis server maintenance closes connections gracefully to avoid stalled sockets, but in some cases maintenance on underlying Azure infrastructure (networking or compute) can cause connections to be interrupted without graceful closure, leading to stalled sockets on clients.
Client apps running in Kubernetes often experience more connection issues like this because there are more networking pieces involved, like the Kubernetes nodes which undergo maintenance themselves. Sidecars like Istio or Envoy can also disrupt Redis connections if they haven't been configured properly to ignore/allow all Redis traffic. We have some best practices documented here: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-kubernetes
