Skip to content

Commit

Permalink
Connections: detect and handle the Linux dead socket case (#2610)
Browse files Browse the repository at this point in the history
In Linux, we see 15 minute socket stalls due to OS-level TCP retries. What this means is the PhysicalConnection detects no issues on the pipe that's retrying, but is also not receiving data at all leading to long stalls in client applications. The goal here is to detect that we're timing out commands people have issued to the connection but we're getting _NOTHING_ back on the socket at all. In this case, we should assume the socket is dead and issue a reconnect so that we get out of the hung situation much faster.

For an initial go at this, we've chosen 4x the timeout interval as a threshold, but could make this configurable if needed.
  • Loading branch information
NickCraver authored Dec 8, 2023
1 parent 76f5205 commit 9a3003f
Show file tree
Hide file tree
Showing 3 changed files with 24 additions and 4 deletions.
5 changes: 3 additions & 2 deletions docs/ReleaseNotes.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,9 @@ Current package versions:

## Unreleased

- Fix [#2593](https://github.com/StackExchange/StackExchange.Redis/pull/2593): `EXPIRETIME` and `PEXPIRETIME` miscategorized as `PrimaryOnly` commands causing them to fail when issued against a read-only replica ([#2593 by slorello89](https://github.com/StackExchange/StackExchange.Redis/pull/2593))
- Fix [#2591](https://github.com/StackExchange/StackExchange.Redis/pull/2591): Add `HELLO` to Sentinel connections so they can support RESP3 ([#2601 by NickCraver](https://github.com/StackExchange/StackExchange.Redis/pull/2601))
- Fix [#2593](https://github.com/StackExchange/StackExchange.Redis/issues/2593): `EXPIRETIME` and `PEXPIRETIME` miscategorized as `PrimaryOnly` commands causing them to fail when issued against a read-only replica ([#2593 by slorello89](https://github.com/StackExchange/StackExchange.Redis/pull/2593))
- Fix [#2591](https://github.com/StackExchange/StackExchange.Redis/issues/2591): Add `HELLO` to Sentinel connections so they can support RESP3 ([#2601 by NickCraver](https://github.com/StackExchange/StackExchange.Redis/pull/2601))
- Fix [#2595](https://github.com/StackExchange/StackExchange.Redis/issues/2595): Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments (https://github.com/StackExchange/StackExchange.Redis/pull/2610)

## 2.7.4

Expand Down
13 changes: 12 additions & 1 deletion src/StackExchange.Redis/PhysicalBridge.cs
Original file line number Diff line number Diff line change
Expand Up @@ -591,7 +591,7 @@ internal void OnHeartbeat(bool ifConnectedOnly)
Interlocked.Exchange(ref connectTimeoutRetryCount, 0);
tmp.BridgeCouldBeNull?.ServerEndPoint?.ClearUnselectable(UnselectableFlags.DidNotRespond);
}
tmp.OnBridgeHeartbeat();
int timedOutThisHeartbeat = tmp.OnBridgeHeartbeat();
int writeEverySeconds = ServerEndPoint.WriteEverySeconds,
checkConfigSeconds = ServerEndPoint.ConfigCheckSeconds;

Expand Down Expand Up @@ -623,6 +623,17 @@ internal void OnHeartbeat(bool ifConnectedOnly)
// queue, test the socket
KeepAlive();
}
else if (timedOutThisHeartbeat > 0
&& tmp.LastReadSecondsAgo * 1_000 > (tmp.BridgeCouldBeNull?.Multiplexer.AsyncTimeoutMilliseconds * 4))
{
// If we've received *NOTHING* on the pipe in 4 timeouts worth of time and we're timing out commands, issue a connection failure so that we reconnect
// This is meant to address the scenario we see often in Linux configs where TCP retries will happen for 15 minutes.
// To us as a client, we'll see the socket as green/open/fine when writing but we'll bet getting nothing back.
// Since we can't depend on the pipe to fail in that case, we want to error here based on the criteria above so we reconnect broken clients much faster.
tmp.BridgeCouldBeNull?.Multiplexer.Logger?.LogWarning($"Dead socket detected, no reads in {tmp.LastReadSecondsAgo} seconds with {timedOutThisHeartbeat} timeouts, issuing disconnect");
OnDisconnected(ConnectionFailureType.SocketFailure, tmp, out _, out State oldState);
tmp.Dispose(); // Cleanup the existing connection/socket if any, otherwise it will wait reading indefinitely
}
}
break;
case (int)State.Disconnected:
Expand Down
10 changes: 9 additions & 1 deletion src/StackExchange.Redis/PhysicalConnection.cs
Original file line number Diff line number Diff line change
Expand Up @@ -248,6 +248,7 @@ private enum ReadMode : byte
private readonly WeakReference _bridge;
public PhysicalBridge? BridgeCouldBeNull => (PhysicalBridge?)_bridge.Target;

public long LastReadSecondsAgo => unchecked(Environment.TickCount - Thread.VolatileRead(ref lastReadTickCount)) / 1000;
public long LastWriteSecondsAgo => unchecked(Environment.TickCount - Thread.VolatileRead(ref lastWriteTickCount)) / 1000;

private bool IncludeDetailInExceptions => BridgeCouldBeNull?.Multiplexer.RawConfig.IncludeDetailInExceptions ?? false;
Expand Down Expand Up @@ -720,8 +721,13 @@ internal void GetStormLog(StringBuilder sb)
}
}

internal void OnBridgeHeartbeat()
/// <summary>
/// Runs on every heartbeat for a bridge, timing out any commands that are overdue and returning an integer of how many we timed out.
/// </summary>
/// <returns>How many commands were overdue and threw timeout exceptions.</returns>
internal int OnBridgeHeartbeat()
{
var result = 0;
var now = Environment.TickCount;
Interlocked.Exchange(ref lastBeatTickCount, now);

Expand All @@ -747,6 +753,7 @@ internal void OnBridgeHeartbeat()
multiplexer.OnMessageFaulted(msg, timeoutEx);
msg.SetExceptionAndComplete(timeoutEx, bridge); // tell the message that it is doomed
multiplexer.OnAsyncTimeout();
result++;
}
}
else
Expand All @@ -761,6 +768,7 @@ internal void OnBridgeHeartbeat()
}
}
}
return result;
}

internal void OnInternalError(Exception exception, [CallerMemberName] string? origin = null)
Expand Down

0 comments on commit 9a3003f

Please sign in to comment.