Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keepalive: apply minimum ping time of 10s to client and 1s to server #2642

Merged
merged 2 commits into from
Feb 21, 2019

Conversation

dfawley
Copy link
Member

@dfawley dfawley commented Feb 13, 2019

Also fix tests that relied upon setting lower values.

This is in preparation for the fix to issue #2638

@dfawley dfawley added the Type: Behavior Change Behavior changes not categorized as bugs label Feb 13, 2019
@dfawley dfawley added this to the 1.19 Release milestone Feb 13, 2019
@dfawley dfawley requested a review from lyuxuan February 13, 2019 22:59
keepalive/keepalive.go Show resolved Hide resolved
test/channelz_test.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
test/channelz_test.go Outdated Show resolved Hide resolved
clientconn_test.go Outdated Show resolved Hide resolved
@lyuxuan lyuxuan merged commit ed70822 into grpc:master Feb 21, 2019
@JelteF
Copy link
Contributor

JelteF commented Mar 11, 2019

@dfawley Could you explain why it's necessary to disallow any lower ping than 10s/1s? We've been using a keep alive interval of 100ms (both Timeout and Time) successfully for a long time now. We use it detect connection failures quickly. To give you some indication of our setup, it's within one datacenter with a limited number of clients. This change is blocking for us to upgrade to grpc 1.19.0.

@dfawley
Copy link
Member Author

dfawley commented Mar 11, 2019

@JelteF The 10s client-side limit is a fix to bring us in line with the gRPC spec for keepalives here:

https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md#extending-for-basic-health-checking

Server-side, there is no minimum defined in the spec. However, we intend to implement second-level precision as part of a fix for #2638, which would mean sub-second times could not be honored.

@JelteF
Copy link
Contributor

JelteF commented Mar 11, 2019

Thanks for the quick response. To clear up one thing, we use 100ms as the client keep alive. This is to very quickly detect a broken server process so we can failover to another server.

It makes sense to follow the spec better, but in this case it would be really nice if the minimum keepalive could be configured by the library user as well. Especially since lower keep alives were possible before. And also because the server already has the minimum time enforcement policy.

@dfawley
Copy link
Member Author

dfawley commented Mar 12, 2019

@JelteF We could potentially allow a lower limit to be configured, given that we had unintentional support for this in the past. If this is a valid use case, however, we should think about how we would support similar scenarios in Java and C, which currently also restrict clients to 10s keepalive ping intervals. We have another client-side health checking mechanism (gRFC), but that will only quickly detect a problem where the server is still alive but reports itself as unhealthy, not in the case where the server becomes unavailable.

cc @ejona86 who designed client-side keepalives. He had concerns about this being used to detect network outages (where 100ms is probably too low to be reliable), but it sounds like you are mostly concerned about the server itself becoming wedged.

@ejona86
Copy link
Member

ejona86 commented Mar 12, 2019

Our servers have a minimum time enforcement policy, but not all do. The worry is that people really like setting "really low" values without understanding the network/load implications. For example, if the server is overloaded, you don't want the client to disconnect and reconnect. It is a dangerous tool to wield at 100ms, as it can amplify problems.

@dfawley dfawley deleted the minpingtime branch May 30, 2019 18:26
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 25, 2019
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this
upgrade is to eliminate the disconnections caused by
grpc/grpc-go#1882. These failures manifest themselves
as the following set of errors:

```
ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE
```
Which then lead to tripped breakers and general badness. I suspect that there
are several other good bug fixes in here, including some purported leaks and
correctness fixes on shutdown.

I have verified that with this upgrade I no longer see connections break in
overload scenarios which reliably reproduced the situation in the above log.

This commit removes one condition from grpcutil.IsClosedConnection which should
be subsumed by the status check above. The `transport` subpackage has not been
around for many releases.

This does not upgrade to the current release 1.22.0 because the maintainer
mentions that it contains a bug
(grpc/grpc-go#2663 (comment)).

This change also unfortunately updates the keepalive behavior to be more spec
compliant (grpc/grpc-go#2642). This change mandates
a minimum ping time of 10s to the client. Given grpc/grpc-go#2638
this means that the rpc test for keepalives now takes over 20s.

I would be okay skipping it but leave that discussion for review.

Also updated the acceptance test to look out for an HTTP/2.0 header because
grpc now does not send RPCs until after the HTTP handshake has completed
(see grpc/grpc-go#2406).

Release note (bug fix): Upgrade grpc library to fix connection state management
bug.
@lock lock bot locked as resolved and limited conversation to collaborators Nov 26, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Behavior Change Behavior changes not categorized as bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants