Expected Behavior
I have an HTTP server where every request triggers a query to a Redis cluster. When the backends are down, I expect each Get call to return with an error.
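For concreteness, a minimal sketch of that setup (the handler path, key, and addresses are illustrative, not our actual service): every request does a cluster Get and should surface any Redis error to the client.

```go
package main

import (
	"net/http"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		// Illustrative addresses; in the failing case none of these are reachable.
		Addrs: []string{"redis-1:6379", "redis-2:6379", "redis-3:6379"},
	})

	http.HandleFunc("/lookup", func(w http.ResponseWriter, r *http.Request) {
		val, err := rdb.Get(r.Context(), "some-key").Result()
		if err != nil {
			// Expected: when the backends are down, every request ends up here.
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		_, _ = w.Write([]byte(val))
	})
	_ = http.ListenAndServe(":8080", nil)
}
```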
Current Behavior
Calls to Get block seemingly indefinitely, eventually causing an out-of-memory condition and a crash.
Possible Solution
The blocking appears to be happening because of this call to cmdsInfoCache.Get:
https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/cluster.go#L1606
Which calls internal.Once.Do: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/command.go#L3431
internal.Once.Do locks a mutex. All other requests that come in while f is running will block on that Mutex: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/internal/once.go#L50
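For reference, that Once has roughly the following shape (a simplified sketch, not the library's exact source): a done flag plus a mutex, with no way for a waiter to give up.

```go
package internal

import (
	"sync"
	"sync/atomic"
)

// Once runs f until it succeeds exactly once. Every caller that arrives while
// f is running queues on the mutex, with no timeout and no Context.
type Once struct {
	mu   sync.Mutex
	done uint32
}

func (o *Once) Do(f func() error) error {
	if atomic.LoadUint32(&o.done) == 1 {
		return nil // fast path after a successful initialization
	}
	o.mu.Lock() // every concurrent caller blocks here while f runs
	defer o.mu.Unlock()
	var err error
	if o.done == 0 {
		err = f()
		if err == nil {
			atomic.StoreUint32(&o.done, 1)
		}
		// On error, done stays 0, so the next waiter in line runs f again.
	}
	return err
}
```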
When the backends are down, f returns an error. That results in unlocking the Mutex, allowing the next query to try and run f, also resulting in a failure. So all the queries are stacked up waiting on a single Mutex, which they take in turn.
Some options to fix:
internal.Once could respect a Context, so the waiters would at least time out rather than blocking indefinitely (a sketch of this option follows the list).
cmdsInfoCache could be treated as best-effort: If the cache is not yet available, queries would go ahead and execute without it. This would result in read-only commands only being sent to masters, at least until the cache became available.
The func that fills the cache could distinguish between temporary and permanent errors. On permanent errors it would log the error and return nil, which would indicate to the internal.Once that initialization was complete and should not be tried again.
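Here is a sketch of the first option: a context-aware Once variant that uses a buffered channel as the lock, so waiters can bail out when their Context is done. The CtxOnce name and API are illustrative, not an existing go-redis type.

```go
package internal

import (
	"context"
	"sync/atomic"
)

// CtxOnce is a hypothetical context-aware variant of internal.Once.
// A buffered channel stands in for the mutex so that waiters can give up
// when their Context is done instead of queueing forever.
type CtxOnce struct {
	sem  chan struct{} // capacity 1: holding a token means holding the lock
	done uint32
}

func NewCtxOnce() *CtxOnce {
	return &CtxOnce{sem: make(chan struct{}, 1)}
}

func (o *CtxOnce) Do(ctx context.Context, f func() error) error {
	if atomic.LoadUint32(&o.done) == 1 {
		return nil
	}
	select {
	case o.sem <- struct{}{}: // acquired the lock
	case <-ctx.Done():
		return ctx.Err() // waiters time out instead of piling up
	}
	defer func() { <-o.sem }()

	if atomic.LoadUint32(&o.done) == 0 {
		if err := f(); err != nil {
			return err // initialization failed; a later caller may retry
		}
		atomic.StoreUint32(&o.done, 1)
	}
	return nil
}
```

The cluster client would pass the per-command Context down to Do, so a request whose deadline expires returns ctx.Err() instead of joining the queue.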
Steps to Reproduce
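The reproduction is essentially: point a ClusterClient at backends that cannot be reached and fire a batch of concurrent Gets. A minimal sketch along those lines (the address, key, goroutine count, and timeout are illustrative, not the original program):

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"127.0.0.1:1"}, // illustrative unreachable backend
	})

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()
			// Expected: an error once the backend proves unreachable. Reported:
			// the call blocks inside cmdsInfoCache.Get and never returns.
			_, _ = rdb.Get(ctx, "some-key").Result()
		}()
	}

	time.Sleep(5 * time.Second)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
	wg.Wait() // in the failing case this never completes
}
```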
This demonstrates that all 100 goroutines are effectively leaked. The numbers can be adjusted to demonstrate any size of leak. In our HTTP server, 4000 requests per second equates to 4000 goroutines leaked per second.
Context (Environment)
go version go1.18 linux/amd64
github.com/go-redis/redis/v8 v8.11.5
This is a simplified version of a problem we encountered when bringing up the OCSP responder for https://github.com/letsencrypt/boulder/ with a set of backends that could not be reached.