
rpc: increase gRPC server timeout from 1x to 2x NetworkTimeout #109578

Merged 1 commit on Aug 28, 2023

Conversation

@erikgrinaker (Contributor) commented Aug 28, 2023

This is intended as a conservative backport that changes as little as possible. For 23.2, we should restructure these settings a bit, possibly by removing NetworkTimeout and using independent timeouts for each component/parameter, since they have unique considerations (e.g. whether they are enforced above the Go runtime or by the OS, to what extent they are subject to RPC head-of-line blocking, etc).


This patch increases the gRPC server timeout from 1x to 2x NetworkTimeout. This timeout determines how long the server will wait for a TCP send to receive a TCP ack before automatically closing the connection. gRPC enforces this via the OS TCP stack by setting TCP_USER_TIMEOUT on the network socket.

While NetworkTimeout should be sufficient here, we have seen instances where this is affected by node load or other factors, so we set it to 2x NetworkTimeout to avoid spurious closed connections. An aggressive timeout is not particularly beneficial here, because the client-side timeout (in our case the CRDB RPC heartbeat) is what matters for recovery time following network or node outages -- the server side doesn't really care if the connection remains open for a bit longer.

Touches #109317.

Epic: none
Release note (ops change): The default gRPC server-side send timeout has been increased from 2 seconds to 4 seconds (1x to 2x of COCKROACH_NETWORK_TIMEOUT), to avoid spurious connection failures in certain scenarios. This can be controlled via the new environment variable COCKROACH_RPC_SERVER_TIMEOUT.

@erikgrinaker added the backport-23.1.x label (Flags PRs that need to be backported to 23.1) Aug 28, 2023
@erikgrinaker requested review from @andrewbaptist and a team August 28, 2023 08:56
@erikgrinaker self-assigned this Aug 28, 2023
@erikgrinaker requested a review from a team as a code owner August 28, 2023 08:56
@blathers-crl (bot) commented Aug 28, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@erikgrinaker (Contributor, Author) commented:
I tried this out on cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null, a changefeed benchmark with 100k ranges where we've previously seen connection closures under severe node overload. With this change, we no longer see such closures (the benchmark still fails to complete due to the overload, but that's orthogonal).

@sean- (Collaborator) left a comment:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)

@andrewbaptist (Collaborator) left a comment:

:lgtm:

@erikgrinaker (Contributor, Author) commented:

TFTR! Going to look at the overnight failover test results after this merges before backporting, just to make sure we're not regressing on something unexpected.

bors r+

@andrewbaptist (Collaborator) commented:

bors r+

@craig (bot) commented Aug 28, 2023

Build succeeded:

@erikgrinaker (Contributor, Author) commented:

I think this results in a recovery time regression for certain kinds of asymmetric partitions, see #109317 (comment). The change is still justified, considering we found in #109317 (comment) that this timeout is also sensitive to node overload.
