Skip to content

Commit

Permalink
Merge pull request #109724 from cockroachdb/blathers/backport-release…
Browse files Browse the repository at this point in the history
…-23.1.9-rc-109715

release-23.1.9-rc: rpc: increase gRPC server keepalive interval to 2x PingInterval
  • Loading branch information
erikgrinaker authored Aug 30, 2023
2 parents 939c497 + 7b47f78 commit efe33a4
Show file tree
Hide file tree
Showing 2 changed files with 38 additions and 17 deletions.
8 changes: 4 additions & 4 deletions pkg/base/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -148,10 +148,10 @@ var (
// 2 * NetworkTimeout is sufficient.
DialTimeout = envutil.EnvOrDefaultDuration("COCKROACH_RPC_DIAL_TIMEOUT", 2*NetworkTimeout)

// PingInterval is the interval between network heartbeat pings. It is used
// both for RPC heartbeat intervals and gRPC server keepalive pings. It is
// set to 1 second in order to fail fast, but with large default timeouts
// to tolerate high-latency multiregion clusters.
// PingInterval is the interval between RPC heartbeat pings. It is set to 1
// second in order to fail fast, but with large default timeouts to tolerate
// high-latency multiregion clusters. The gRPC server keepalive interval is
// also affected by this.
PingInterval = envutil.EnvOrDefaultDuration("COCKROACH_PING_INTERVAL", time.Second)

// defaultRangeLeaseDuration specifies the default range lease duration.
Expand Down
47 changes: 34 additions & 13 deletions pkg/rpc/keepalive.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,20 +18,40 @@ import (
"google.golang.org/grpc/keepalive"
)

// serverTimeout is how long the server will wait for a TCP send to receive a
// TCP ack before automatically closing the connection. gRPC enforces this via
// the OS TCP stack by setting TCP_USER_TIMEOUT on the network socket.
// serverTimeout is how long the server will wait for a send to be acknowledged
// before automatically closing the connection. This is enforced at two levels:
// by gRPC itself as the timeout for keepalive pings on idle connections, and
// via the OS TCP stack by gRPC setting TCP_USER_TIMEOUT on the network socket.
//
// While NetworkTimeout should be sufficient here, we have seen instances where
// this is affected by node load or other factors, so we set it to 2x
// NetworkTimeout to avoid spurious closed connections. An aggressive timeout is
// not particularly beneficial here, because the client-side timeout (in our
// case the CRDB RPC heartbeat) is what matters for recovery time following
// network or node outages -- the server side doesn't really care if the
// connection remains open for a bit longer.
// NetworkTimeout should be sufficient here for TCP_USER_TIMEOUT, but the
// keepalive pings are processed above the Go runtime and are therefore more
// sensitive to node overload (e.g. scheduling latencies). We therefore set it
// to 2x NetworkTimeout to avoid spurious closed connections.
//
// An aggressive timeout is not particularly beneficial here anyway, because the
// client-side timeout (in our case the CRDB RPC heartbeat) is what mostly
// matters for recovery time following network or node outages -- the server
// side doesn't really care if the connection remains open for a bit longer. The
// exception is an asymmetric partition where inbound packets are dropped but
// outbound packets go through -- in that case, an aggressive connection close
// would allow the client to fail and try a different node before waiting out
// the RPC heartbeat timeout. However, this is a rarer failure mode, and does
// not justify the increased stability risk during node overload.
var serverTimeout = envutil.EnvOrDefaultDuration(
"COCKROACH_RPC_SERVER_TIMEOUT", 2*base.NetworkTimeout)

// serverKeepaliveInterval is the interval between server keepalive pings.
// These are used both to keep the connection alive, and to detect and close
// failed connections from the server-side. Keepalive pings are only sent and
// checked if there is no other activity on the connection.
//
// We set this to 2x PingInterval, since we expect RPC heartbeats to be sent
// regularly (obviating the need for the keepalive ping), and there's no point
// sending keepalive pings until we've waited long enough for the RPC heartbeat
// to show up.
var serverKeepaliveInterval = envutil.EnvOrDefaultDuration(
"COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL", 2*base.PingInterval)

// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
Expand All @@ -53,9 +73,10 @@ var clientKeepalive = keepalive.ClientParameters{
}
var serverKeepalive = keepalive.ServerParameters{
// Send periodic pings on the connection when there is no other traffic.
Time: base.PingInterval,
// Close the connection if a TCP send (including a ping) does not receive a
// TCP ack within the given timeout. Enforced by the OS via TCP_USER_TIMEOUT.
Time: serverKeepaliveInterval,
// Close the connection if either a keepalive ping doesn't receive a response
// within the timeout, or a TCP send doesn't receive a TCP ack within the
// timeout (enforced by the OS via TCP_USER_TIMEOUT).
Timeout: serverTimeout,
}

Expand Down

0 comments on commit efe33a4

Please sign in to comment.