Bug Report: leaking tablet client connections in poolDialer #15563

Closed · shlomi-noach (Contributor) opened this issue Mar 25, 2024 · 1 comment

Overview of the Issue

This piece of code in the tablet manager client causes connection leaks:

    if client.rpcClientMap == nil {
        client.rpcClientMap = make(map[string]chan *tmc)
    }
    c, ok := client.rpcClientMap[addr]
    if !ok {
        // No cached connections for this address yet: create a channel that
        // will hold up to concurrency cached connections, and register it in
        // the map. Nothing ever removes this entry again.
        c = make(chan *tmc, concurrency)
        client.rpcClientMap[addr] = c
        client.mu.Unlock()
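
For context, the rest of that path roughly dials concurrency connections up front and parks them in the channel; nothing ever closes them or evicts the map entry. A simplified sketch (based on the snippet above; the actual Vitess code may differ in detail):

    // Fill the channel with concurrency connections to this address. These
    // live for the lifetime of the process; there is no eviction path.
    for i := 0; i < cap(c); i++ {
        cc, err := grpcclient.Dial(addr, grpcclient.FailFast(false), opt)
        if err != nil {
            return nil, err
        }
        c <- &tmc{
            cc:     cc,
            client: tabletmanagerservicepb.NewTabletManagerClient(cc),
        }
    }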

The scenario is:

  • We make a gRPC call to some tablet; say, a primary's throttler is running CheckThrottler against one of the replicas.
  • CheckThrottler uses a poolDialer:

        if poolDialer, ok := client.dialer.(poolDialer); ok {
            c, err = poolDialer.dialPool(ctx, tablet)
            if err != nil {
                return nil, err
            }
        }

    which means it uses cached tablet manager client connections.
  • Say the replica tablet goes down (e.g. in a k8s cluster, the pod goes down).
  • There is nothing to remove the cached client connection.
  • Say a new tablet comes up, potentially in a different shard/keyspace/cluster, with the same IP (as happens in k8s).
  • The cached connection now attempts to reconnect to the new tablet. This is the connection leak.

There needs to be a cleanup/invalidation mechanism that drops cached connections upon gRPC error (or some other indication that the connection is broken).
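
One possible shape for such invalidation, as a minimal sketch (the invalidatePool method is hypothetical, not an existing Vitess API; tmc and rpcClientMap follow the snippet above):

    // invalidatePool is a hypothetical cleanup hook: drop the cached channel
    // for an address once a call through one of its connections fails with a
    // transport-level error, so the next dialPool re-dials fresh connections.
    func (client *Client) invalidatePool(addr string) {
        client.mu.Lock()
        defer client.mu.Unlock()
        c, ok := client.rpcClientMap[addr]
        if !ok {
            return
        }
        delete(client.rpcClientMap, addr)
        // Close whatever connections are currently parked in the channel.
        for {
            select {
            case conn := <-c:
                conn.cc.Close()
            default:
                return
            }
        }
    }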

Some function calls can be expected to return error codes because they are invoked by users (ExecuteFetchAsApp). Others would only return error codes on an internal or network problem (CheckThrottler, FullStatus). These two classes will likely benefit from separate handling logic.
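
One way to draw that distinction, sketched with the gRPC status package (shouldInvalidate is a hypothetical helper, and the exact set of codes is a policy choice):

    import (
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // shouldInvalidate reports whether an RPC error looks like a broken
    // connection rather than an application-level error. User-invoked calls
    // such as ExecuteFetchAsApp routinely return application errors and should
    // not tear down cached connections; calls like CheckThrottler or
    // FullStatus mostly fail only when the transport itself is broken.
    func shouldInvalidate(err error) bool {
        switch status.Code(err) {
        case codes.Unavailable, codes.DeadlineExceeded:
            return true
        default:
            return false
        }
    }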

Side discussion

To be discussed in a spin-off issue: the current mechanism uses misleading terminology. concurrency and pool are inappropriate terms for the mechanism. While the cache holds, say, 8 cached connections to a tablet, these 8 connections are shared without limit among as many users as ask for them, with no concurrency limitation. A single connection can be held by, say, a dozen users, each invoking any queries without locking. This is fine, but it's not a "pool", and "concurrency" is incorrect.
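
Roughly, the dial path hands out a cached connection like this (simplified; the actual code may differ):

    // Take a cached connection off the channel and immediately put it back:
    // the channel's capacity ("concurrency") bounds how many distinct
    // connections exist per tablet, not how many callers may share one.
    result := <-c
    c <- result
    return result.client, nil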

Reproduction Steps

Binary Version

v19

Operating System and Environment details

-

Log Fragments

No response
