Expected Behavior
I have an HTTP server where every request triggers a query to a Redis cluster. When the backends are down, I expect each Get call to return with an error.
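For concreteness, a minimal sketch of that setup (the handler path, key, and addresses are illustrative, not our actual service): every request does a cluster Get and should surface any Redis error to the client.

```go
package main

import (
	"net/http"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		// Illustrative addresses; in the failing case none of these are reachable.
		Addrs: []string{"redis-1:6379", "redis-2:6379", "redis-3:6379"},
	})

	http.HandleFunc("/lookup", func(w http.ResponseWriter, r *http.Request) {
		val, err := rdb.Get(r.Context(), "some-key").Result()
		if err != nil {
			// Expected: when the backends are down, every request ends up here.
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		_, _ = w.Write([]byte(val))
	})
	_ = http.ListenAndServe(":8080", nil)
}
```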
Current Behavior
Calls to Get block seemingly indefinitely, eventually causing an out-of-memory condition and a crash.
Possible Solution
The blocking appears to be happening because of this call to cmdsInfoCache.Get:
https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/cluster.go#L1606
Which calls internal.Once.Do: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/command.go#L3431
internal.Once.Do locks a mutex. All other requests that come in while f is running will block on that Mutex: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/internal/once.go#L50
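For reference, that Once has roughly the following shape (a simplified sketch, not the library's exact source): a done flag plus a mutex, with no way for a waiter to give up.

```go
package internal

import (
	"sync"
	"sync/atomic"
)

// Once runs f until it succeeds exactly once. Every caller that arrives while
// f is running queues on the mutex, with no timeout and no Context.
type Once struct {
	mu   sync.Mutex
	done uint32
}

func (o *Once) Do(f func() error) error {
	if atomic.LoadUint32(&o.done) == 1 {
		return nil // fast path after a successful initialization
	}
	o.mu.Lock() // every concurrent caller blocks here while f runs
	defer o.mu.Unlock()
	var err error
	if o.done == 0 {
		err = f()
		if err == nil {
			atomic.StoreUint32(&o.done, 1)
		}
		// On error, done stays 0, so the next waiter in line runs f again.
	}
	return err
}
```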
When the backends are down, f returns an error. That results in unlocking the Mutex, allowing the next query to try and run f, also resulting in a failure. So all the queries are stacked up waiting on a single Mutex, which they take in turn.
Some options to fix:
internal.Once could respect a Context, so the waiters would at least time out rather than blocking indefinitely (a sketch of this option follows the list).
cmdsInfoCache could be treated as best-effort: If the cache is not yet available, queries would go ahead and execute without it. This would result in read-only commands only being sent to masters, at least until the cache became available.
The func that fills the cache could distinguish between temporary and permanent errors. On permanent errors it would log the error and return nil, which would indicate to the internal.Once that initialization was complete and should not be tried again.
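Here is a sketch of the first option: a context-aware Once variant that uses a buffered channel as the lock, so waiters can bail out when their Context is done. The CtxOnce name and API are illustrative, not an existing go-redis type.

```go
package internal

import (
	"context"
	"sync/atomic"
)

// CtxOnce is a hypothetical context-aware variant of internal.Once.
// A buffered channel stands in for the mutex so that waiters can give up
// when their Context is done instead of queueing forever.
type CtxOnce struct {
	sem  chan struct{} // capacity 1: holding a token means holding the lock
	done uint32
}

func NewCtxOnce() *CtxOnce {
	return &CtxOnce{sem: make(chan struct{}, 1)}
}

func (o *CtxOnce) Do(ctx context.Context, f func() error) error {
	if atomic.LoadUint32(&o.done) == 1 {
		return nil
	}
	select {
	case o.sem <- struct{}{}: // acquired the lock
	case <-ctx.Done():
		return ctx.Err() // waiters time out instead of piling up
	}
	defer func() { <-o.sem }()

	if atomic.LoadUint32(&o.done) == 0 {
		if err := f(); err != nil {
			return err // initialization failed; a later caller may retry
		}
		atomic.StoreUint32(&o.done, 1)
	}
	return nil
}
```

The cluster client would pass the per-command Context down to Do, so a request whose deadline expires returns ctx.Err() instead of joining the queue.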
Steps to Reproduce
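The reproduction is essentially: point a ClusterClient at backends that cannot be reached and fire a batch of concurrent Gets. A minimal sketch along those lines (the address, key, goroutine count, and timeout are illustrative, not the original program):

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"127.0.0.1:1"}, // illustrative unreachable backend
	})

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()
			// Expected: an error once the backend proves unreachable. Reported:
			// the call blocks inside cmdsInfoCache.Get and never returns.
			_, _ = rdb.Get(ctx, "some-key").Result()
		}()
	}

	time.Sleep(5 * time.Second)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
	wg.Wait() // in the failing case this never completes
}
```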
This demonstrates that all 100 goroutines are effectively leaked. The numbers can be adjusted to demonstrate any size of leak. In our HTTP server, 4000 requests per second equates to 4000 goroutines leaked per second.
Context (Environment)
go version go1.18 linux/amd64
github.com/go-redis/redis/v8 v8.11.5
This is a simplified version of a problem we encountered when bringing up the OCSP responder for https://github.com/letsencrypt/boulder/ with a set of backends that could not be reached.