rpc: avoid RPC heartbeat head-of-line blocking #93397
Labels
A-kv-server: Relating to the KV-level RPC server
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv: KV Team
Comments
erikgrinaker added the C-bug, A-kv-server, and T-kv labels on Dec 10, 2022
erikgrinaker added the C-performance and C-enhancement labels and removed the C-bug and C-performance labels on Dec 10, 2022
One alternative here could be to use a lower heartbeat timeout for the SystemClass connection, which is less susceptible to network congestion, and terminate all RPC connections to the node when the system connection fails.
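As a rough illustration of that alternative, here is a minimal Go sketch. The `ConnectionClass`, `peer`, and timeout values are hypothetical and not CockroachDB's actual `rpc` package; it only shows the shape of the idea: a lower heartbeat timeout for the system class, and node-wide teardown of every connection class once the system connection's heartbeat fails.

```go
// Minimal sketch (hypothetical names, not CockroachDB's rpc package) of the
// idea above: a tighter heartbeat timeout for the system connection, and
// teardown of all connections to a node when its system heartbeat fails.
package main

import (
	"fmt"
	"sync"
	"time"
)

type ConnectionClass int

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// heartbeatTimeout returns a per-class timeout: the system connection carries
// little traffic and is less prone to head-of-line blocking, so it can use a
// much lower timeout than the default (data) connection.
func heartbeatTimeout(class ConnectionClass) time.Duration {
	switch class {
	case SystemClass:
		return 2 * time.Second // assumed value for illustration
	default:
		return 6 * time.Second // the high timeout discussed in this issue
	}
}

type peer struct {
	mu    sync.Mutex
	conns map[ConnectionClass]chan struct{} // closed to signal "terminate"
}

// onHeartbeatFailure is called when a heartbeat on the given class times out.
// If the system connection failed, we assume the node is unreachable and close
// every connection class to it, rather than waiting for each class's own
// (higher) timeout to fire.
func (p *peer) onHeartbeatFailure(class ConnectionClass) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if class != SystemClass {
		return // only the system connection drives node-wide teardown here
	}
	for c, done := range p.conns {
		close(done)
		delete(p.conns, c)
		fmt.Printf("closed connection class %d after system heartbeat failure\n", c)
	}
}

func main() {
	p := &peer{conns: map[ConnectionClass]chan struct{}{
		DefaultClass: make(chan struct{}),
		SystemClass:  make(chan struct{}),
	}}
	fmt.Println("system timeout:", heartbeatTimeout(SystemClass))
	fmt.Println("default timeout:", heartbeatTimeout(DefaultClass))
	p.onHeartbeatFailure(SystemClass)
}
```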
craig bot pushed a commit that referenced this issue on Dec 12, 2022:
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough. However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.

Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <[email protected]>
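The interval/timeout split in that commit can be illustrated with a small Go sketch. This is not the actual implementation; `runHeartbeatLoop` and its parameters are assumptions for illustration. Pings fire frequently (every 1 second by default via `COCKROACH_PING_INTERVAL` per the commit above), but the connection only fails once no ping has succeeded within the much longer 6 second timeout, so a single slow or dropped ping does not kill the connection by itself.

```go
// Sketch of a heartbeat loop with a short ping interval and a long failure
// timeout, as described in #93399. Illustrative only, not CockroachDB code.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runHeartbeatLoop sends pings every interval and returns an error once the
// time since the last successful ping exceeds timeout.
func runHeartbeatLoop(ctx context.Context, interval, timeout time.Duration,
	ping func(context.Context) error) error {
	lastSuccess := time.Now()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, interval)
			err := ping(pingCtx)
			cancel()
			if err == nil {
				lastSuccess = time.Now()
			} else if since := time.Since(lastSuccess); since > timeout {
				return fmt.Errorf("no successful heartbeat in %s: %w", since, err)
			}
		}
	}
}

func main() {
	// Production values would be interval=1s and timeout=6s; scaled down here
	// so the demo finishes quickly. The fake peer stops responding after three
	// pings, and the loop fails only once the timeout has elapsed.
	var calls int
	err := runHeartbeatLoop(context.Background(), 10*time.Millisecond, 60*time.Millisecond,
		func(context.Context) error {
			calls++
			if calls > 3 {
				return errors.New("peer unreachable")
			}
			return nil
		})
	fmt.Println("heartbeat loop exited:", err)
}
```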
RPC heartbeats are important for detecting peer failures and failing over to other nodes. However, they currently need very high timeouts (6 seconds) because they can be head-of-line blocked by other RPC traffic. For example, in an experiment with a 500ms RTT cluster running a TPCC import, heartbeats would frequently hit the 6 second timeout, even though the network latency was only a fraction of that.
Furthermore, on idle clusters heartbeats were occasionally seen to take 3 RTTs rather than 1, long after the connection had initially been established (the handshake itself takes 3 RTTs). The cause of this is unclear; packet dumps showed that the TCP connection was intact throughout, so further analysis is needed.
We should avoid head-of-line blocking and other interference with RPC heartbeats, bringing them closer to the basic network RTT, so that we can reduce the heartbeat timeout further. This may require switching the gRPC transport, e.g. to QUIC.
Jira issue: CRDB-22311
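For illustration only, here is one minimal Go sketch of keeping heartbeats off the shared data path: probe liveness over a dedicated, otherwise idle connection so pings are not queued behind bulk RPC traffic on the main gRPC/HTTP2 connection. This is not CockroachDB's design, and a QUIC-based transport would address head-of-line blocking differently (independent streams); the echo server and `pingOnce` helper are hypothetical.

```go
// Sketch: heartbeat probes over a dedicated connection, so the measured
// latency approximates the raw network RTT rather than RTT plus queueing
// delay behind large RPC payloads. Illustrative only.
package main

import (
	"fmt"
	"net"
	"time"
)

// pingOnce writes a one-byte probe on a connection reserved for heartbeats
// and waits for the echo, returning the elapsed round-trip time.
func pingOnce(conn net.Conn, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	if err := conn.SetDeadline(start.Add(timeout)); err != nil {
		return 0, err
	}
	if _, err := conn.Write([]byte{1}); err != nil {
		return 0, err
	}
	buf := make([]byte, 1)
	if _, err := conn.Read(buf); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	// Toy echo server standing in for a peer node's heartbeat endpoint.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 1)
				for {
					if _, err := c.Read(buf); err != nil {
						return
					}
					if _, err := c.Write(buf); err != nil {
						return
					}
				}
			}(c)
		}
	}()

	// Dial once and reuse the idle connection for repeated probes.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	for i := 0; i < 3; i++ {
		rtt, err := pingOnce(conn, time.Second)
		fmt.Println("heartbeat RTT:", rtt, "err:", err)
	}
}
```

Because nothing else competes for the heartbeat connection in this sketch, its probes stay close to the basic network RTT, which is the property this issue asks for.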