Skip to content

Commit

Permalink
Merge #109715
Browse files Browse the repository at this point in the history
109715: rpc: increase gRPC server keepalive interval to 2x PingInterval r=erikgrinaker a=erikgrinaker

This patch increases the gRPC server keepalive interval from 1x to 2x PingInterval. These keepalive pings are used both to keep the connection alive, and to detect and close failed connections from the server-side. Keepalive pings are only sent and checked if there is no other activity on the connection.

We set this to 2x PingInterval, since we expect RPC heartbeats to be sent regularly (obviating the need for the keepalive ping), and there's no point sending keepalive pings until we've waited long enough for the RPC heartbeat to show up.

An environment variable COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL is also added to tune this.

Touches #109317.
Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
  • Loading branch information
craig[bot] and erikgrinaker committed Aug 30, 2023
2 parents 36bd829 + 2e6eb2c commit 13bf3fa
Show file tree
Hide file tree
Showing 2 changed files with 38 additions and 17 deletions.
8 changes: 4 additions & 4 deletions pkg/base/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -148,10 +148,10 @@ var (
// 2 * NetworkTimeout is sufficient.
DialTimeout = envutil.EnvOrDefaultDuration("COCKROACH_RPC_DIAL_TIMEOUT", 2*NetworkTimeout)

// PingInterval is the interval between network heartbeat pings. It is used
// both for RPC heartbeat intervals and gRPC server keepalive pings. It is
// set to 1 second in order to fail fast, but with large default timeouts
// to tolerate high-latency multiregion clusters.
// PingInterval is the interval between RPC heartbeat pings. It is set to 1
// second in order to fail fast, but with large default timeouts to tolerate
// high-latency multiregion clusters. The gRPC server keepalive interval is
// also affected by this.
PingInterval = envutil.EnvOrDefaultDuration("COCKROACH_PING_INTERVAL", time.Second)

// defaultRangeLeaseDuration specifies the default range lease duration.
Expand Down
47 changes: 34 additions & 13 deletions pkg/rpc/keepalive.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,20 +18,40 @@ import (
"google.golang.org/grpc/keepalive"
)

// serverTimeout is how long the server will wait for a TCP send to receive a
// TCP ack before automatically closing the connection. gRPC enforces this via
// the OS TCP stack by setting TCP_USER_TIMEOUT on the network socket.
// serverTimeout is how long the server will wait for a send to be acknowledged
// before automatically closing the connection. This is enforced at two levels:
// by gRPC itself as the timeout for keepalive pings on idle connections, and
// via the OS TCP stack by gRPC setting TCP_USER_TIMEOUT on the network socket.
//
// While NetworkTimeout should be sufficient here, we have seen instances where
// this is affected by node load or other factors, so we set it to 2x
// NetworkTimeout to avoid spurious closed connections. An aggressive timeout is
// not particularly beneficial here, because the client-side timeout (in our
// case the CRDB RPC heartbeat) is what matters for recovery time following
// network or node outages -- the server side doesn't really care if the
// connection remains open for a bit longer.
// NetworkTimeout should be sufficient here for TCP_USER_TIMEOUT, but the
// keepalive pings are processed above the Go runtime and are therefore more
// sensitive to node overload (e.g. scheduling latencies). We therefore set it
// to 2x NetworkTimeout to avoid spurious closed connections.
//
// An aggressive timeout is not particularly beneficial here anyway, because the
// client-side timeout (in our case the CRDB RPC heartbeat) is what mostly
// matters for recovery time following network or node outages -- the server
// side doesn't really care if the connection remains open for a bit longer. The
// exception is an asymmetric partition where inbound packets are dropped but
// outbound packets go through -- in that case, an aggressive connection close
// would allow the client to fail and try a different node before waiting out
// the RPC heartbeat timeout. However, this is a rarer failure mode, and does
// not justify the increased stability risk during node overload.
var serverTimeout = envutil.EnvOrDefaultDuration(
"COCKROACH_RPC_SERVER_TIMEOUT", 2*base.NetworkTimeout)

// serverKeepaliveInterval is the interval between server keepalive pings.
// These are used both to keep the connection alive, and to detect and close
// failed connections from the server-side. Keepalive pings are only sent and
// checked if there is no other activity on the connection.
//
// We set this to 2x PingInterval, since we expect RPC heartbeats to be sent
// regularly (obviating the need for the keepalive ping), and there's no point
// sending keepalive pings until we've waited long enough for the RPC heartbeat
// to show up.
var serverKeepaliveInterval = envutil.EnvOrDefaultDuration(
"COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL", 2*base.PingInterval)

// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
Expand All @@ -53,9 +73,10 @@ var clientKeepalive = keepalive.ClientParameters{
}
var serverKeepalive = keepalive.ServerParameters{
// Send periodic pings on the connection when there is no other traffic.
Time: base.PingInterval,
// Close the connection if a TCP send (including a ping) does not receive a
// TCP ack within the given timeout. Enforced by the OS via TCP_USER_TIMEOUT.
Time: serverKeepaliveInterval,
// Close the connection if either a keepalive ping doesn't receive a response
// within the timeout, or a TCP send doesn't receive a TCP ack within the
// timeout (enforced by the OS via TCP_USER_TIMEOUT).
Timeout: serverTimeout,
}

Expand Down

0 comments on commit 13bf3fa

Please sign in to comment.