rpc: avoid RPC heartbeat head-of-line blocking #93397
Labels
A-kv-server: Relating to the KV-level RPC server
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv: KV Team
Comments
erikgrinaker added the C-bug, A-kv-server, and T-kv labels on Dec 10, 2022
erikgrinaker added the C-performance and C-enhancement labels and removed the C-bug and C-performance labels on Dec 10, 2022
One alternative here could be to use a lower heartbeat timeout for the SystemClass connection, which is less susceptible to network congestion, and terminate all RPC connections to the node when the system connection fails.
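As a rough illustration of that alternative, here is a minimal Go sketch. The `ConnectionClass`, `peer`, and timeout values are hypothetical and not CockroachDB's actual `rpc` package; it only shows the shape of the idea: a lower heartbeat timeout for the system class, and node-wide teardown of every connection class once the system connection's heartbeat fails.

```go
// Minimal sketch (hypothetical names, not CockroachDB's rpc package) of the
// idea above: a tighter heartbeat timeout for the system connection, and
// teardown of all connections to a node when its system heartbeat fails.
package main

import (
	"fmt"
	"sync"
	"time"
)

type ConnectionClass int

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// heartbeatTimeout returns a per-class timeout: the system connection carries
// little traffic and is less prone to head-of-line blocking, so it can use a
// much lower timeout than the default (data) connection.
func heartbeatTimeout(class ConnectionClass) time.Duration {
	switch class {
	case SystemClass:
		return 2 * time.Second // assumed value for illustration
	default:
		return 6 * time.Second // the high timeout discussed in this issue
	}
}

type peer struct {
	mu    sync.Mutex
	conns map[ConnectionClass]chan struct{} // closed to signal "terminate"
}

// onHeartbeatFailure is called when a heartbeat on the given class times out.
// If the system connection failed, we assume the node is unreachable and close
// every connection class to it, rather than waiting for each class's own
// (higher) timeout to fire.
func (p *peer) onHeartbeatFailure(class ConnectionClass) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if class != SystemClass {
		return // only the system connection drives node-wide teardown here
	}
	for c, done := range p.conns {
		close(done)
		delete(p.conns, c)
		fmt.Printf("closed connection class %d after system heartbeat failure\n", c)
	}
}

func main() {
	p := &peer{conns: map[ConnectionClass]chan struct{}{
		DefaultClass: make(chan struct{}),
		SystemClass:  make(chan struct{}),
	}}
	fmt.Println("system timeout:", heartbeatTimeout(SystemClass))
	fmt.Println("default timeout:", heartbeatTimeout(DefaultClass))
	p.onHeartbeatFailure(SystemClass)
}
```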
craig bot pushed a commit that referenced this issue on Dec 12, 2022:
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough. However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.

Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <[email protected]>
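The interval/timeout split in that commit can be illustrated with a small Go sketch. This is not the actual implementation; `runHeartbeatLoop` and its parameters are assumptions for illustration. Pings fire frequently (every 1 second by default via `COCKROACH_PING_INTERVAL` per the commit above), but the connection only fails once no ping has succeeded within the much longer 6 second timeout, so a single slow or dropped ping does not kill the connection by itself.

```go
// Sketch of a heartbeat loop with a short ping interval and a long failure
// timeout, as described in #93399. Illustrative only, not CockroachDB code.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runHeartbeatLoop sends pings every interval and returns an error once the
// time since the last successful ping exceeds timeout.
func runHeartbeatLoop(ctx context.Context, interval, timeout time.Duration,
	ping func(context.Context) error) error {
	lastSuccess := time.Now()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, interval)
			err := ping(pingCtx)
			cancel()
			if err == nil {
				lastSuccess = time.Now()
			} else if since := time.Since(lastSuccess); since > timeout {
				return fmt.Errorf("no successful heartbeat in %s: %w", since, err)
			}
		}
	}
}

func main() {
	// Production values would be interval=1s and timeout=6s; scaled down here
	// so the demo finishes quickly. The fake peer stops responding after three
	// pings, and the loop fails only once the timeout has elapsed.
	var calls int
	err := runHeartbeatLoop(context.Background(), 10*time.Millisecond, 60*time.Millisecond,
		func(context.Context) error {
			calls++
			if calls > 3 {
				return errors.New("peer unreachable")
			}
			return nil
		})
	fmt.Println("heartbeat loop exited:", err)
}
```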
RPC heartbeats are important for detecting peer failures and failing over to other nodes. However, they currently need very high timeouts (6 seconds) because they can be head-of-line blocked by other RPC traffic. For example, in an experiment with a 500ms RTT cluster running a TPCC import, heartbeats would frequently hit the 6 second timeout, even though the network latency was only a fraction of that.
Furthermore, on idle clusters heartbeats were occasionally seen to take 3 RTTs rather than 1, long after the connection had initially been established (the handshake itself takes 3 RTTs). The cause of this is unclear; packet dumps showed that the TCP connection was intact throughout, so further analysis is needed.
We should avoid head-of-line blocking and other interference with RPC heartbeats, bringing them closer to the basic network RTT, so that we can reduce the heartbeat timeout further. This may require switching the gRPC transport, e.g. to QUIC.
Jira issue: CRDB-22311
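For illustration only, here is one minimal Go sketch of keeping heartbeats off the shared data path: probe liveness over a dedicated, otherwise idle connection so pings are not queued behind bulk RPC traffic on the main gRPC/HTTP2 connection. This is not CockroachDB's design, and a QUIC-based transport would address head-of-line blocking differently (independent streams); the echo server and `pingOnce` helper are hypothetical.

```go
// Sketch: heartbeat probes over a dedicated connection, so the measured
// latency approximates the raw network RTT rather than RTT plus queueing
// delay behind large RPC payloads. Illustrative only.
package main

import (
	"fmt"
	"net"
	"time"
)

// pingOnce writes a one-byte probe on a connection reserved for heartbeats
// and waits for the echo, returning the elapsed round-trip time.
func pingOnce(conn net.Conn, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	if err := conn.SetDeadline(start.Add(timeout)); err != nil {
		return 0, err
	}
	if _, err := conn.Write([]byte{1}); err != nil {
		return 0, err
	}
	buf := make([]byte, 1)
	if _, err := conn.Read(buf); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	// Toy echo server standing in for a peer node's heartbeat endpoint.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 1)
				for {
					if _, err := c.Read(buf); err != nil {
						return
					}
					if _, err := c.Write(buf); err != nil {
						return
					}
				}
			}(c)
		}
	}()

	// Dial once and reuse the idle connection for repeated probes.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	for i := 0; i < 3; i++ {
		rtt, err := pingOnce(conn, time.Second)
		fmt.Println("heartbeat RTT:", rtt, "err:", err)
	}
}
```

Because nothing else competes for the heartbeat connection in this sketch, its probes stay close to the basic network RTT, which is the property this issue asks for.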