Timeout unreasonably longer than AsyncTimeout #2477
Comments
Do any of your other exceptions show higher thread counts? My best guess here is that under load you may have exhausted the thread pool, causing timers not to fire for an extended duration (or you had a lot of timers ahead of us, preventing the next from firing?) - the trends for the …
@NickCraver Redis metrics from Azure App Insights:
Maybe we should reconfigure something in our Redis?
@sabaton1111 It seems to me your min worker thread count is too low?
@rayao Thank you, I will give it a try. :)
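For context, a minimal sketch of raising the thread pool minimums at application startup; the 200/200 values below are purely illustrative, not a recommendation:

```csharp
using System;
using System.Threading;

class ThreadPoolSetup
{
    public static void Configure()
    {
        // Read the current minimums, then raise them if they are below the
        // (illustrative) target so bursts of work don't wait on the pool's
        // slow thread-injection rate.
        ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
        ThreadPool.SetMinThreads(Math.Max(minWorker, 200), Math.Max(minIocp, 200));
    }
}
```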
@NickCraver do you think those statistics explain the delay? What could be done on the consuming side to make the situation better?
@rayao I'm trying to reason about this - how many are in the … I'm looking at adding timer counts to timeout messages for .NET Core in the next release now, so we can rule that out more easily.
To rule out timer overload as the cause, I'm adding this to the logging in #2500.
@NickCraver I can't query the logs anymore; the logs from 5/30 have already been purged for being too old.
We're seeing some instances of quite delayed timeouts, and in at least two deeper investigations it was due to a huge number of timers firing and the backlog timeout assessment timer not triggering for an extended period of time. To help users diagnose this, this adds the cheap counter to the .NET Core pool stats, where `Timer.ActiveCount` is readily available. This is available in all the .NET (non-Framework) versions we support.

Before:
> ...POOL: (Threads=25,QueuedItems=0,CompletedItems=1066)

After:
> ...POOL: (Threads=25,QueuedItems=0,CompletedItems=1066,Timers=46)

See #2477 for another possible instance of this.
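For anyone who wants to watch these numbers from their own code, a rough approximation of that stats line can be assembled from the public .NET counters (this is a sketch of equivalent data, not the library's exact formatting code):

```csharp
using System.Threading;

static class PoolStats
{
    // Approximates the "POOL: (Threads=...,QueuedItems=...,CompletedItems=...,Timers=...)"
    // portion of the timeout message, assuming .NET (non-Framework) where
    // Timer.ActiveCount and the ThreadPool counters are available.
    public static string Snapshot() =>
        $"Threads={ThreadPool.ThreadCount}," +
        $"QueuedItems={ThreadPool.PendingWorkItemCount}," +
        $"CompletedItems={ThreadPool.CompletedWorkItemCount}," +
        $"Timers={Timer.ActiveCount}";
}
```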
@NickCraver, what is the expected count of timers? We've observed some timeout exceptions over the last 3 days, and all of them have >= 23 timers (one of them even has 409). Is that too many, or should we look for the issue elsewhere? Here's an example of a failed HGETALL from a collection that had at most 2 (short string) elements at the time of the exception:
@NickCraver, we've also recently updated and are seeing large numbers of timers in the timeout messages. Sample exception:
@vladislav-karamfilov Not worried about that timer count; on yours I'd check SLOWLOG on the server side, as I don't see anything standing out there.
@stambunan whoa that's a lot of timers, holy crap. I can't say why they're so high (given the volume, I'd try a memory dump at any busy period and you'll probably get lucky with some volume in play). Normally we see this with things like spinning up a timer per command for timeouts or some such wrapping somewhere, so it could be your retry or it could be something else. Please note that we see some thread elevation, queued items, and sync operations against Redis, all pointing to some amount of thread pool starvation. Combined with really high timers, I'm not surprised you'd see some contention/timeouts under load. Recommendations would be: find out what those timers are, trim them down, and eliminate sync ops to Redis.
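On the "eliminate sync ops" point, a minimal sketch of the kind of change meant here: swapping blocking calls for their async counterparts (the wrapper method is illustrative; `StringGet`/`StringGetAsync` are standard StackExchange.Redis APIs):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

static class RedisAccess
{
    // Synchronous calls block a pool thread for the full round trip and make
    // thread pool starvation worse; prefer the *Async variants end to end.
    public static async Task<string> GetValueAsync(IDatabase db, string key)
    {
        // Instead of: var value = db.StringGet(key);    // sync, ties up a thread
        RedisValue value = await db.StringGetAsync(key); // async, frees the thread while waiting
        return value.HasValue ? (string)value : null;
    }
}
```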
Almost all of the commands in the SLOWLOG output are not from our app. They are: …
@vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts
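If it helps, the slow log can also be pulled through the client rather than redis-cli. A sketch, assuming the `IServer.SlowlogGet` API in current StackExchange.Redis versions (member names on `CommandTrace` may vary slightly by version):

```csharp
using System;
using StackExchange.Redis;

static class SlowlogCheck
{
    // Prints recent SLOWLOG entries from every endpoint of the connection.
    public static void PrintSlowlog(ConnectionMultiplexer muxer)
    {
        foreach (var endpoint in muxer.GetEndPoints())
        {
            IServer server = muxer.GetServer(endpoint);
            foreach (CommandTrace entry in server.SlowlogGet(10)) // last 10 slow commands
            {
                Console.WriteLine(
                    $"{entry.Duration.TotalMilliseconds} ms: {string.Join(" ", entry.Arguments)}");
            }
        }
    }
}
```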
Thanks @NickCraver. I think we found a different issue that was causing the active timers to be so high. Now when we see timeouts, the count is in the 200s, but we do use a couple of timers in our code, which might be the reason for this.
This is our server load graph: We made a lot of improvements to decrease the server load (smaller values, fewer connections, an increased worker pool, optimized requests) at the end of June and the beginning of July, and it seemed that we fixed the issue around July 5th (#2486 (comment)). There were no timeout exceptions until July 25th, when all of a sudden timeout exceptions started to appear in our Application Insights logs again. On July 23rd we had scheduled server maintenance (confirmed by following the steps from https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts#server-maintenance). I think the new wave of timeout exceptions is related to this maintenance event. As I wrote in my last comment, last week I also tried a reboot that seemed to fix the issue for a couple of days, but after that we still observe timeout exceptions.
I can confirm that about 2 months ago (or less) we started to get these kinds of errors in bursts, and I've tried every suggestion I've managed to find: upgrading the Redis cache from C2 to C3 (increasing the bandwidth), increasing the worker count (280, as you can see), and increasing the app instance count so there are more units to handle the requests. So far nothing has changed. I still see the log errors (no more, no less).
We are experiencing these timeouts while the server load is < 10%.
Is there any documentation of what …
@NickCraver Can you explain what the issue might be here?
@Magnum1945 How often do you see this? E.g. could it be when the servers are patched every ~6 weeks, or what kind of sporadic are we talking about? Nothing in the message seems out of the ordinary with load or volume; that seems pretty reasonable overall, so we're likely looking at something external.
@NickCraver Here are the error statistics for our US region:
The 15 minute duration sounds like you may be experiencing stalled sockets. I'd recommend upgrading StackExchange.Redis to v2.7.17 to get "Fix #2595: Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments" (#2610 by @NickCraver).
Thanks @philon-msft. I will upgrade StackExchange.Redis to the latest stable version and increase the min worker thread value. I still don't know why the problem is happening, though - not all the errors last for 15 minutes. Currently I'm thinking it could be a bandwidth issue on the client side or between the client and Redis, but I'm not sure that's the case in my situation.
Hello there, we have this log. We increased the minimum thread count too, but see basically no difference:
@Shumaister What are your settings? That message indicates it timed out doing a multi-get in 3ms, on a version with only ~1,000ms fidelity in timeout checks. That seems to me like the timeout is set to 3ms rather than the 3 seconds that was perhaps intended.
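For comparison, a sketch of setting the StackExchange.Redis timeouts explicitly; both `SyncTimeout` and `AsyncTimeout` are in milliseconds, so an intended 3-second timeout is 3000, not 3 (the connection string parameter is illustrative):

```csharp
using StackExchange.Redis;

static class TimeoutConfig
{
    public static ConnectionMultiplexer Connect(string connectionString)
    {
        var options = ConfigurationOptions.Parse(connectionString);
        options.SyncTimeout = 3000;   // 3 seconds, not 3 ms
        options.AsyncTimeout = 3000;  // 3 seconds, not 3 ms
        return ConnectionMultiplexer.Connect(options);
    }
}
```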
Could you clarify whether a stalled socket is expected during regular Azure Redis server maintenance? It would seem very strange to me if regular maintenance didn't take care to close the sockets instead of just dropping them. However, we also see these stalls with Azure Cache for Redis from an AKS service with unexpected frequency, so I am wondering.
Routine Azure Redis server maintenance closes connections gracefully to avoid stalled sockets, but in some cases maintenance on underlying Azure infrastructure (networking or compute) can cause connections to be interrupted without graceful closure, leading to stalled sockets on clients.
Recently our Azure service encountered a Redis instance down incident. During the downtime, our Redis client threw RedisTimeoutException with unreasonably long elapsed times, like this:
On one of our VMs we saw 102,550 timeout exceptions in total: 92% of them took longer than 2 seconds, 80% > 10 seconds, 72% > 20 seconds, 33% > 80 seconds, and I can even see 977 exceptions with an elapsed time > 120 seconds.
This left thousands of requests stuck in the Processing stage, which further resulted in some other very bad situations.
I want to know whether that's avoidable, and if so, how to configure the connection or the client library.
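For readers hitting the same thing, a hedged sketch (not maintainer guidance) of configuration that bounds how long commands can linger during an outage, assuming a StackExchange.Redis version that exposes `BacklogPolicy`; the timeout values are illustrative. With the default backlog policy, commands issued while disconnected can be queued and fail much later than `AsyncTimeout`; `FailFast` makes them fail promptly instead:

```csharp
using StackExchange.Redis;

static class OutageConfig
{
    public static ConfigurationOptions Build(string connectionString)
    {
        var options = ConfigurationOptions.Parse(connectionString);
        options.SyncTimeout = 5000;                     // per-operation sync timeout, in ms
        options.AsyncTimeout = 5000;                    // per-operation async timeout, in ms
        options.BacklogPolicy = BacklogPolicy.FailFast; // don't queue commands while disconnected
        return options;
    }
}
```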