
goroutine leak when backends are down #2147

Open
jsha opened this issue Jul 6, 2022 · 3 comments

jsha commented Jul 6, 2022

Expected Behavior

I have an HTTP server where every request triggers a query to a Redis cluster. When the backends are down, I expect each Get call to return with an error.

Current Behavior

Calls to Get block seemingly indefinitely, eventually causing an out-of-memory condition and a crash.

Possible Solution

The blocking appears to be happening because of this call to cmdsInfoCache.Get:

https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/cluster.go#L1606

cmdsInfoCache.Get in turn calls internal.Once.Do: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/command.go#L3431

internal.Once.Do locks a mutex, and every other request that arrives while f is running blocks on that mutex:
https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/internal/once.go#L50

When the backends are down, f returns an error, so the once is never marked done. The mutex is released, the next queued query takes it, runs f, and fails the same way. All pending queries therefore stack up waiting on a single mutex, which they take in turn.
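
For reference, internal.Once.Do has roughly the following shape (a paraphrase based on the description above and the linked file, not the exact source):

package internal

import (
	"sync"
	"sync/atomic"
)

type Once struct {
	m    sync.Mutex
	done uint32
}

// Do runs f until it succeeds once. The mutex is held for the full duration
// of f, and done is only set when f returns nil, so while the backends are
// down every caller queues on m and then re-runs the failing f in turn.
func (once *Once) Do(f func() error) error {
	if atomic.LoadUint32(&once.done) == 1 {
		return nil // fast path: a previous call already succeeded
	}
	once.m.Lock()
	defer once.m.Unlock()
	var err error
	if once.done == 0 {
		err = f()
		if err == nil {
			atomic.StoreUint32(&once.done, 1)
		}
	}
	return err
}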

Some options to fix:

  • internal.Once could respect a Context, so the waiters would at least time out rather than blocking indefinitely (see the sketch after this list).
  • cmdsInfoCache could be treated as best-effort: If the cache is not yet available, queries would go ahead and execute without it. This would result in read-only commands only being sent to masters, at least until the cache became available.
  • The func that fills the cache could distinguish between temporary and permanent errors. On permanent errors it would log the error and return nil, which would indicate to the internal.Once that initialization was complete and should not be tried again.
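
As a rough illustration of the first option, here is a minimal sketch of a hypothetical context-aware Once. The ctxOnce/newCtxOnce names and the one-slot channel standing in for the mutex are illustrative assumptions, not part of go-redis:

package once

import (
	"context"
	"sync/atomic"
)

// ctxOnce is a hypothetical context-aware Once: a one-slot channel acts as
// the lock so that waiters can give up when their Context expires instead
// of queueing forever.
type ctxOnce struct {
	sem  chan struct{} // capacity 1; holding a token means holding the lock
	done uint32
}

func newCtxOnce() *ctxOnce {
	return &ctxOnce{sem: make(chan struct{}, 1)}
}

func (o *ctxOnce) Do(ctx context.Context, f func() error) error {
	if atomic.LoadUint32(&o.done) == 1 {
		return nil // initialization already succeeded
	}
	select {
	case o.sem <- struct{}{}: // acquired the lock
	case <-ctx.Done():
		return ctx.Err() // time out instead of blocking indefinitely
	}
	defer func() { <-o.sem }()
	if atomic.LoadUint32(&o.done) == 0 {
		if err := f(); err != nil {
			return err
		}
		atomic.StoreUint32(&o.done, 1)
	}
	return nil
}

With something along these lines, a Get whose Context had already expired would fail fast with ctx.Err() instead of joining the queue.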

Steps to Reproduce

package main

import (
	"context"
	"fmt"
	"runtime"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{
			"203.0.113.91:999",
		},
		DialTimeout:  time.Second,
		ReadTimeout:  time.Second,
		WriteTimeout: time.Second,
	})
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
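		// Each Get should return by its 200ms deadline; instead it blocks
		// on the cmdsInfoCache mutex, effectively leaking the goroutine.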
		go rdb.Get(ctx, "example")
		time.Sleep(200 * time.Millisecond)
	}
	fmt.Println(runtime.NumGoroutine())
}
$ go run main.go 
101

This demonstrates that all 100 goroutines are effectively leaked. The numbers can be adjusted to demonstrate any size of leak. In our HTTP server, 4000 requests per second equates to 4000 goroutines leaked per second.

Context (Environment)

go version go1.18 linux/amd64
github.com/go-redis/redis/v8 v8.11.5

This is a simplified version of a problem we encountered when bringing up the OCSP responder for https://github.com/letsencrypt/boulder/ with a set of backends that could not be reached.

@jiayiming001

go version: 1.18.3 mac/Intel
github.com/go-redis/redis/v8 v8.11.5

go run main.go
2

@mufaddaltahir

go version go1.18.2 darwin/amd64
github.com/go-redis/redis/v8 v8.11.5

go run main.go
3


Oneill38 commented Oct 5, 2023

go version go1.19 mac/Intel
github.com/go-redis/redis/v8 v8.11.

go run main.go
101

go version go1.19 mac/Intel
github.com/go-redis/redis/v9 v9.2.1

go run main.go
4
