
rpc: increase gRPC server timeout from 1x to 2x NetworkTimeout #109578

Merged 1 commit on Aug 28, 2023

Conversation

@erikgrinaker (Contributor) commented Aug 28, 2023

This is intended as a conservative backport that changes as little as possible. For 23.2, we should restructure these settings a bit, possibly by removing NetworkTimeout and using independent timeouts for each component/parameter, since they have unique considerations (e.g. whether they are enforced above the Go runtime or by the OS, to what extent they are subject to RPC head-of-line blocking, etc).


This patch increases the gRPC server timeout from 1x to 2x NetworkTimeout. This timeout determines how long the server will wait for a TCP send to receive a TCP ack before automatically closing the connection. gRPC enforces this via the OS TCP stack by setting TCP_USER_TIMEOUT on the network socket.

While NetworkTimeout should be sufficient here, we have seen instances where this is affected by node load or other factors, so we set it to 2x NetworkTimeout to avoid spurious closed connections. An aggressive timeout is not particularly beneficial here, because the client-side timeout (in our case the CRDB RPC heartbeat) is what matters for recovery time following network or node outages -- the server side doesn't really care if the connection remains open for a bit longer.

Touches #109317.

Epic: none
Release note (ops change): The default gRPC server-side send timeout has been increased from 2 seconds to 4 seconds (1x to 2x of COCKROACH_NETWORK_TIMEOUT), to avoid spurious connection failures in certain scenarios. This can be controlled via the new environment variable COCKROACH_RPC_SERVER_TIMEOUT.

@erikgrinaker added the backport-23.1.x label (Flags PRs that need to be backported to 23.1) Aug 28, 2023
@erikgrinaker requested review from @andrewbaptist and a team August 28, 2023 08:56
@erikgrinaker self-assigned this Aug 28, 2023
@erikgrinaker requested a review from a team as a code owner August 28, 2023 08:56
@blathers-crl (bot) commented Aug 28, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@erikgrinaker (Contributor, Author) commented:
I tried this out on cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null, a changefeed benchmark with 100k ranges where we've previously seen connection closures under severe node overload. With this change, we no longer see such closures (the benchmark still fails to complete due to the overload, but that's orthogonal).

@sean- (Collaborator) left a comment:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)

@andrewbaptist (Collaborator) left a comment:

:lgtm:

@erikgrinaker (Contributor, Author) commented:

TFTR! Going to look at the overnight failover test results after this merges before backporting, just to make sure we're not regressing on something unexpected.

bors r+

@andrewbaptist (Collaborator) commented:

bors r+

@craig (bot) commented Aug 28, 2023

Build succeeded:

@erikgrinaker (Contributor, Author) commented:

I think this results in a recovery time regression for certain kinds of asymmetric partitions, see #109317 (comment). The change is still justified, considering we found in #109317 (comment) that this timeout is also sensitive to node overload.
