Is it appropriate for the entire connection pool to become unavailable? #199

Open
xiaolipro opened this issue Aug 6, 2024 · 9 comments
@xiaolipro

1 Exception stack trace

😭 Track middle-platform test environment exception_Error

cap-msg-id:316993851092066309

System.Exception: 【redis-test.default.svc.cluster.local:6379/6】Status unavailable, waiting for recovery. Connect to redis-server(redis-test.default.svc.cluster.local:6379 -> Unspecified/redis-test.default.svc.cluster.local:6379) timeout, DEBUG: Dns.GetHostEntry(redis-test.default.svc.cluster.local)=System.Net.IPHostEntry

---> System.TimeoutException: Connect to redis-server(redis-test.default.svc.cluster.local:6379 -> Unspecified/redis-test.default.svc.cluster.local:6379) timeout, DEBUG: Dns.GetHostEntry(redis-test.default.svc.cluster.local)=System.Net.IPHostEntry

at FreeRedis.Internal.DefaultRedisSocket.Connect()

at FreeRedis.Internal.DefaultRedisSocket.Write(CommandPacket cmd)

at FreeRedis.RedisClient.SingleInsideAdapter.<>c__DisplayClass5_0`1.b__0()

at FreeRedis.RedisClient.LogCallCtrl[T](CommandPacket cmd, Func`1 func, Boolean aopBefore, Boolean aopAfter)

at FreeRedis.RedisClient.LogCall[T](CommandPacket cmd, Func`1 func)

at FreeRedis.RedisClient.SingleInsideAdapter.AdapterCall[TValue](CommandPacket cmd, Func`2 parse)

at FreeRedis.RedisClient.Call(CommandPacket cmd)

at FreeRedis.Internal.RedisClientPoolPolicy.PrevReheatConnectionPool(ObjectPool`1 pool, Int32 minPoolSize)

--- End of inner exception stack trace ---

at FreeRedis.Internal.ObjectPool.ObjectPool`1.GetFree(Boolean checkAvailable)

at FreeRedis.Internal.ObjectPool.ObjectPool`1.Get(Nullable`1 timeout)

at FreeRedis.RedisClient.PoolingAdapter.GetRedisSocket(CommandPacket cmd)

at FreeRedis.RedisClient.PoolingAdapter.<>c__DisplayClass10_0`1.<b__0>d.MoveNext()

--- End of stack trace from previous location ---

at FreeRedis.RedisClient.LogCallAsync[T](CommandPacket cmd, Func`1 func)

at SJZY.Track.Application.Services.TrackPush.TrackPushCacheService.GetPushContentConfigsAsync(Int32 subSystemId, Int32 customId) in /var/lib/jenkins/workspace/trackapi-test/src/SJZY.Track.Application/Services/TrackPush/TrackPushCacheService.cs:line 24

at SJZY.Track.Application.EventHandlers.TrackPushEventHandler.ConsumerAsync(TrackPushEvent event) in /var/lib/jenkins/workspace/trackapi-test/src/SJZY.Track.Application/EventHandlers/TrackPushEventHandler.cs:line 43

at lambda_method987(Closure, Object)

at Microsoft.Extensions.Internal.ObjectMethodExecutorAwaitable.Awaiter.GetResult()

at DotNetCore.CAP.Internal.SubscribeInvoker.ExecuteWithParameterAsync(ObjectMethodExecutor executor, Object class, Object[] parameter)

at DotNetCore.CAP.Internal.SubscribeInvoker.InvokeAsync(ConsumerContext context, CancellationToken cancellationToken)

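2 Hypotheses

2.1 A connection pool problem

2.2 A network problem

3 Analysis

This problem has no direct connection to the business code, which only performs simple Redis access.

Judging from the exception it looks like a network problem, but our ops team checked several times and found nothing wrong, so the only option left was to read the FreeRedis source.

3.1 Locating the exception

Go straight to Get, the core FreeRedis method through which we obtain a Redis client: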

img

As you can see, it is precisely because UnavailableException != null holds that the exception we saw is thrown:

img

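Next we find where UnavailableException is assigned: the object pool is marked unavailable, every subsequent Get/GetAsync throws, and a background timer is started to check whether the service has recovered.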

img

Tracing the references backwards leads to AdapterCall, at the bottom of the inner exception's stack.

img

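3.2 Ruling out the connection pool

The analysis above already lets us rule out the connection pool, because there is only one place where an rdc is obtained: GetRedisSocket.

If you are interested, dig into the details yourself. Here is a brief introduction to the FreeRedis connection pool design: every Redis operation has to Get an rdc (RedisClient) from the ObjectPool.

// if there is no free rdc and the pool holds fewer rdcs than MaxPoolSize (default 100)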
if ((_freeObjects.TryPop(out var obj) == false || obj == null) && _allObjects.Count < Policy.PoolSize)
{
    lock (_allObjectsLock)
        // DCL (double-checked locking): if the pool still holds fewer rdcs than MaxPoolSize, add one rdc to the pool
        if (_allObjects.Count < Policy.PoolSize)
            _allObjects.Add(obj = new Object<T> { Pool = this, Id = _allObjects.Count + 1 });
}

// ...

// if the rdc has been released, or more than IdleTimeout (default 20s) has passed since it was last returned
if (obj != null && obj.Value == null ||
    obj != null && Policy.IdleTimeout > TimeSpan.Zero && DateTime.Now.Subtract(obj.LastReturnTime) > Policy.IdleTimeout)
{
    try
    {
        // reset the rdc (expensive)
        obj.ResetValue();
    }
    catch
    {
        Return(obj);
        throw;
    }
}

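The pool's capacity is limited (controlled by MaxPoolSize), so a burst of Redis access within a short window can fill the pool.

// if there is no free rdc and no new rdc can be created (pool full)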
if (obj == null)
{
    var queueItem = new GetSyncQueueInfo();

    _getSyncQueue.Enqueue(queueItem);
    _getQueue.Enqueue(false);

    if (timeout == null) timeout = Policy.SyncGetTimeout;

    try
    {
        // wait up to SyncGetTimeout (default 10s) to see whether a free rdc is returned to the pool
        if (queueItem.Wait.Wait(timeout.Value))
            obj = queueItem.ReturnValue;
    }
    catch { }

    if (obj == null) obj = queueItem.ReturnValue;
    if (obj == null) lock (queueItem.Lock) queueItem.IsTimeout = (obj = queueItem.ReturnValue) == null;
    if (obj == null) obj = queueItem.ReturnValue;
    if (queueItem.Exception != null) throw queueItem.Exception;

    if (obj == null)
    {
        // if an rdc still cannot be obtained, throw a timeout
        // however, this exception is caught by the AdapterCall we saw above, which then marks the pool as unavailable
        Policy.OnGetTimeout();
        if (Policy.IsThrowGetTimeoutException)
            throw new TimeoutException($"【{Policy.Name}】ObjectPool.Get() timeout {timeout.Value.TotalSeconds} seconds, see: https://github.com/dotnetcore/FreeSql/discussions/1081");

        return null;
    }
}
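
To make the pool-full path easier to follow, here is a minimal sketch of a bounded pool whose Get waits for a limited time and then times out. It is not FreeRedis's ObjectPool: `BoundedPool<T>`, its constructor parameters, and the use of `SemaphoreSlim` are all assumptions chosen for illustration. The point it shows is that a Get timeout can be a per-caller failure that leaves the pool itself usable.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical, simplified pool for illustration; not FreeRedis's ObjectPool<T>.
public sealed class BoundedPool<T> where T : class
{
    private readonly ConcurrentBag<T> _free = new ConcurrentBag<T>();
    private readonly SemaphoreSlim _slots;   // one permit per object that may be checked out
    private readonly Func<T> _factory;

    public BoundedPool(int maxSize, Func<T> factory)
    {
        if (maxSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxSize));
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
        _slots = new SemaphoreSlim(maxSize, maxSize);
    }

    // Waits up to `timeout` for a permit. A timeout fails only this caller;
    // the pool keeps serving other callers once objects are returned.
    public T Get(TimeSpan timeout)
    {
        if (!_slots.Wait(timeout))
            throw new TimeoutException($"Pool.Get() timeout {timeout.TotalSeconds} seconds");
        try
        {
            // Reuse a free object if one exists, otherwise create a new one.
            return _free.TryTake(out var item) ? item : _factory();
        }
        catch
        {
            _slots.Release(); // the factory failed: give the permit back
            throw;
        }
    }

    // Puts the object back and frees its permit for the next waiter.
    public void Return(T item)
    {
        if (item == null) throw new ArgumentNullException(nameof(item));
        _free.Add(item);
        _slots.Release();
    }
}
```

With a pool of size 1, a second Get while the first object is still checked out throws TimeoutException, and a Get after Return succeeds again and hands back the same object.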

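When an rdc is returned:

  1. An exception in OnGet causes an active return (in this case the rdc is reset)
    1. With the default pooling policy, no error occurs that would trigger this return

    2. // default OnGet implementation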
      if (_pool.IsAvailable)
      {
          // if unused for more than 60s, or the socket is disconnected, try a PING
          if (DateTime.Now.Subtract(obj.LastReturnTime).TotalSeconds > 60 || obj.Value.Adapter.GetRedisSocket(null).IsConnected == false)
          {
              try
              {
                  CommandPacket cmd = "PING";
                  cmd.IsIgnoreAop = true;
                  obj.Value.Call(cmd);
              }
              catch
              {
                  obj.ResetValue();
              }
          }
      }
  2. The rdc is actively returned when it is disposed (the basis of reuse)
    1. public override IRedisSocket GetRedisSocket(CommandPacket cmd)
      {
          var poolkey = GetIdleBusKey(cmd);
          var pool = _ib.Get(poolkey);
          var cli = pool.Get();
          var rds = cli.Value.Adapter.GetRedisSocket(null);
          // actively return it after each operation completes
          var rdsproxy = DefaultRedisSocket.CreateTempProxy(rds, () => pool.Return(cli));
          rdsproxy._poolkey = poolkey;
          rdsproxy._pool = pool;
          return rdsproxy;
      }
  3. The auto-free policy (controllable via IdleTimeout)
    1. According to the author, this one is not used

    2. public void AutoFree()
      {
          if (running == false) return;
          if (UnavailableException != null) return;
      
          var list = new List<Object<T>>();
          while (_freeObjects.TryPop(out var obj))
              list.Add(obj);
          foreach (var obj in list)
          {
              if (obj != null && obj.Value == null ||
                  obj != null && Policy.IdleTimeout > TimeSpan.Zero && DateTime.Now.Subtract(obj.LastReturnTime) > Policy.IdleTimeout)
              {
                  if (obj.Value != null)
                  {
                      Return(obj, true);
                      continue;
                  }
              }
              Return(obj);
          }
      }

3.3 Reproducing the error message

With the connection pool ruled out, we can turn back to the two lines of code that raise the exception:

try
{
    rds.Write(cmd);
    rt = rds.Read(cmd);
}

Whether writing or reading, the internal implementation calls the Connect method:

if (IsConnected == false) Connect();

// Connect implementation
public void Connect()
{
    lock (_connectLock)
    {
        ResetHost(Host);

        EndPoint endpoint = IPAddress.TryParse(_ip, out var tryip) ?
            (EndPoint)new IPEndPoint(tryip, _port) :
            new DnsEndPoint(_ip, _port);

        var localSocket = endpoint.AddressFamily == AddressFamily.InterNetworkV6 ? 
            new Socket(AddressFamily.InterNetworkV6, SocketType.Stream, ProtocolType.Tcp):
            new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

        try
        {
            var asyncResult = localSocket.BeginConnect(endpoint, null, null);
            if (!asyncResult.AsyncWaitHandle.WaitOne(ConnectTimeout, true))
            {
                var endpointString = endpoint.ToString();
                if (endpointString != $"{_ip}:{_port}") endpointString = $"{_ip}:{_port} -> {endpointString}";
                var debugString = "";
                if (endpoint is DnsEndPoint)
                {
                    try { debugString = $", DEBUG: Dns.GetHostEntry({_ip})={Dns.GetHostEntry(_ip)}"; }
                    catch (Exception ex) { debugString = $", DEBUG: {ex.Message}"; }
                }
                throw new TimeoutException($"Connect to redis-server({endpointString}) timeout{debugString}");
            }
            localSocket.EndConnect(asyncResult);
        }
        catch
        {
            ReleaseSocket(localSocket);
            throw;
        }
        //...
    }
}
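
For experimenting outside FreeRedis, the same connect-with-timeout idea can be sketched with the Task-based socket API. This is an assumption-laden illustration, not the library's code: `ConnectProbe` and its `Connect` method are hypothetical names, and `Socket.ConnectAsync(EndPoint)` stands in for the `BeginConnect`/`WaitOne` pair above. Running it against the Redis host from inside the affected pod is one way to check whether the TCP connect itself is slow.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of FreeRedis.
public static class ConnectProbe
{
    // Returns a connected socket, or throws TimeoutException when the TCP
    // connect does not complete within `timeout`. Other connect failures
    // (for example a refused connection) surface as AggregateException.
    public static Socket Connect(string host, int port, TimeSpan timeout)
    {
        // Same endpoint selection as above: literal IP, otherwise a DNS name.
        EndPoint endpoint = IPAddress.TryParse(host, out var ip)
            ? (EndPoint)new IPEndPoint(ip, port)
            : new DnsEndPoint(host, port);

        var socket = endpoint.AddressFamily == AddressFamily.InterNetworkV6
            ? new Socket(AddressFamily.InterNetworkV6, SocketType.Stream, ProtocolType.Tcp)
            : new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        try
        {
            Task connect = socket.ConnectAsync(endpoint);
            if (!connect.Wait(timeout))
                throw new TimeoutException(
                    $"Connect to {host}:{port} timed out after {timeout.TotalMilliseconds} ms");
            return socket;
        }
        catch
        {
            socket.Dispose(); // never leak the socket on failure
            throw;
        }
    }
}
```

The caller owns the returned socket and should dispose it. A timeout here concerns one connection attempt to the server, which is a different failure from waiting for a free object in the pool.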

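At this point we have reconstructed the entire exception stack trace and message.

4 Solution

Modify the FreeRedis source. For now we maintain our own version, and we may submit it as a PR to FreeRedis later.

If your project has already hit this problem and you cannot solve it yourself, switch to our package.

5 Summary

  1. The FreeRedis connection pool design is flawed: under heavy concurrency the pool fills up and rdcs are not returned fast enough, so subsequent requests have no rdc available and an exception is thrown. FreeRedis handles this exception very bluntly: it marks the entire rdc pool as unavailable (and then scans for recovery in the background on a long-running task thread, every 3s by default), which is what produces the flood of the above exceptions under high concurrency.
  2. To save socket resources, FreeRedis applies an idle policy: if an rdc has not been used within IdleTimeout, it is reset, meaning its socket is rebuilt. I think this is acceptable; the key factor is the first point.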
@2881099
Owner

2881099 commented Aug 6, 2024

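When concurrency is high and the connection pool runs short, the resulting connection-pool timeout does not make the entire connection pool unavailable.

Under concurrency, a connection-pool timeout caused by connections not being returned in time may be an I/O problem (network or disk). You can work around it by configuring a larger connection pool, or fix the I/O problem at its root.

Reminder: a connect timeout and a connection-pool timeout are two different concepts.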
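
The distinction can be pictured with a small sketch. It is only an illustration under stated assumptions, not FreeRedis's implementation: `AvailabilityGate`, `ReportFailure`, `ThrowIfUnavailable`, and `TryRecover` are hypothetical names. It models a pool-wait timeout as a failure of one caller, while a connect failure marks the whole resource unavailable until a background probe succeeds.

```csharp
using System;

// Hypothetical availability gate for illustration; not FreeRedis code.
public sealed class AvailabilityGate
{
    private readonly object _lock = new object();
    private Exception _unavailableException; // non-null while marked unavailable

    public bool IsAvailable
    {
        get { lock (_lock) return _unavailableException == null; }
    }

    // Connect failures mark the whole resource unavailable;
    // pool-wait timeouts are left to the individual caller.
    public void ReportFailure(Exception ex, bool isConnectFailure)
    {
        if (ex == null) throw new ArgumentNullException(nameof(ex));
        if (!isConnectFailure) return;
        lock (_lock)
        {
            if (_unavailableException == null) _unavailableException = ex;
        }
    }

    // Fails fast while unavailable, wrapping the original cause.
    public void ThrowIfUnavailable()
    {
        Exception cause;
        lock (_lock) cause = _unavailableException;
        if (cause != null)
            throw new Exception("Status unavailable, waiting for recovery. " + cause.Message, cause);
    }

    // Meant to be called periodically by a background checker. `probe` returns
    // true when the server answers again (for example, a successful PING).
    public bool TryRecover(Func<bool> probe)
    {
        if (probe == null) throw new ArgumentNullException(nameof(probe));
        if (IsAvailable) return true;
        bool ok;
        try { ok = probe(); } catch { ok = false; }
        if (ok) lock (_lock) _unavailableException = null;
        return ok;
    }
}
```

In this model, a "Status unavailable, waiting for recovery" error points to a failed connection attempt rather than to an exhausted pool.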

@xubinhua888

Where can we download your modified package?

@xubinhua888

Does StackExchange.Redis have this problem?

@nodyang

nodyang commented Aug 27, 2024

How do we reference the new package? We are seeing DNS resolution failures, followed by widespread timeouts.

@nodyang

nodyang commented Aug 27, 2024

{"Success":false,"Message":"【redis.time.com:6399/0】Status unavailable, waiting for recovery. Connect to redis-server(redis.time.com:6399-> Unspecifiedredis.time.com:6399) timeout, DEBUG: Dns.GetHostEntry(redis.time.com)=System.Net.IPHostEntry"}
[ELAPSED] Under high concurrency, the API stalls for about 2 minutes before returning to normal.

@ebplayer

We are seeing this too. How do we reference the new package? @xiaolipro

@nodyang

nodyang commented Oct 5, 2024

We are seeing this too. How do we reference the new package? @xiaolipro

Still no reply.

@nodyang

nodyang commented Oct 15, 2024

We are seeing this too. How do we reference the new package? @xiaolipro

Has this been solved? It crashed again this afternoon.

@nodyang

nodyang commented Oct 15, 2024

@xiaolipro
