Bug Report: leaking tablet client connections in poolDialer #15563

Closed · shlomi-noach (Contributor) opened this issue Mar 25, 2024 · 1 comment

Overview of the Issue

This piece of code in the tablet manager client causes connection leaks:

    if client.rpcClientMap == nil {
        client.rpcClientMap = make(map[string]chan *tmc)
    }
    c, ok := client.rpcClientMap[addr]
    if !ok {
        // No cached connections for this address yet: create a channel that
        // will hold up to concurrency cached connections, and register it in
        // the map. Nothing ever removes this entry again.
        c = make(chan *tmc, concurrency)
        client.rpcClientMap[addr] = c
        client.mu.Unlock()
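
For context, the rest of that path roughly dials concurrency connections up front and parks them in the channel; nothing ever closes them or evicts the map entry. A simplified sketch (based on the snippet above; the actual Vitess code may differ in detail):

    // Fill the channel with concurrency connections to this address. These
    // live for the lifetime of the process; there is no eviction path.
    for i := 0; i < cap(c); i++ {
        cc, err := grpcclient.Dial(addr, grpcclient.FailFast(false), opt)
        if err != nil {
            return nil, err
        }
        c <- &tmc{
            cc:     cc,
            client: tabletmanagerservicepb.NewTabletManagerClient(cc),
        }
    }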

The scenario is:

  • We make a gRPC call to some tablet; say, a primary's throttler is running CheckThrottler against one of the replicas.
  • CheckThrottler uses a poolDialer:

        if poolDialer, ok := client.dialer.(poolDialer); ok {
            c, err = poolDialer.dialPool(ctx, tablet)
            if err != nil {
                return nil, err
            }
        }

    which means it uses cached tablet manager client connections.
  • Say the replica tablet goes down (e.g. in a k8s cluster, the pod goes down).
  • There is nothing to remove the cached client connection.
  • Say a new tablet comes up, potentially in a different shard/keyspace/cluster, with the same IP (as happens in k8s).
  • The cached connection now attempts to reconnect to the new tablet. This is the connection leak.

There needs to be a cleanup/invalidation mechanism that drops cached connections upon gRPC error (or some other indication that the connection is broken).
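
One possible shape for such invalidation, as a minimal sketch (the invalidatePool method is hypothetical, not an existing Vitess API; tmc and rpcClientMap follow the snippet above):

    // invalidatePool is a hypothetical cleanup hook: drop the cached channel
    // for an address once a call through one of its connections fails with a
    // transport-level error, so the next dialPool re-dials fresh connections.
    func (client *Client) invalidatePool(addr string) {
        client.mu.Lock()
        defer client.mu.Unlock()
        c, ok := client.rpcClientMap[addr]
        if !ok {
            return
        }
        delete(client.rpcClientMap, addr)
        // Close whatever connections are currently parked in the channel.
        for {
            select {
            case conn := <-c:
                conn.cc.Close()
            default:
                return
            }
        }
    }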

Some function calls can be expected to return error codes because they are invoked by users (ExecuteFetchAsApp). Others would only return error codes on an internal or network problem (CheckThrottler, FullStatus). These two classes will likely benefit from separate handling logic.
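
One way to draw that distinction, sketched with the gRPC status package (shouldInvalidate is a hypothetical helper, and the exact set of codes is a policy choice):

    import (
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // shouldInvalidate reports whether an RPC error looks like a broken
    // connection rather than an application-level error. User-invoked calls
    // such as ExecuteFetchAsApp routinely return application errors and should
    // not tear down cached connections; calls like CheckThrottler or
    // FullStatus mostly fail only when the transport itself is broken.
    func shouldInvalidate(err error) bool {
        switch status.Code(err) {
        case codes.Unavailable, codes.DeadlineExceeded:
            return true
        default:
            return false
        }
    }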

Side discussion

To be discussed in a spin-off issue: the current mechanism uses misleading terminology. concurrency and pool are inappropriate terms for the mechanism. While the cache holds, say, 8 cached connections to a tablet, these 8 connections are shared without limit among as many users as ask for them, with no concurrency limitation. A single connection can be held by, say, a dozen users, each invoking any queries without locking. This is fine, but it's not a "pool", and "concurrency" is incorrect.
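
Roughly, the dial path hands out a cached connection like this (simplified; the actual code may differ):

    // Take a cached connection off the channel and immediately put it back:
    // the channel's capacity ("concurrency") bounds how many distinct
    // connections exist per tablet, not how many callers may share one.
    result := <-c
    c <- result
    return result.client, nil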

Reproduction Steps

Binary Version

v19

Operating System and Environment details

-

Log Fragments

No response
