
(Experienced on Cloud Run) - a connection terminated at the TCP level leaves a gRPC client channel in a zombie state #2397

Closed
SoftMemes opened this issue Mar 17, 2023 · 2 comments


@SoftMemes

Problem description

I have experienced this problem in a production Cloud Run deployment that uses gRPC to communicate between Cloud Run services. In my case, service A calls service B as part of fulfilling inbound requests to A. On a number of occasions, I have seen service A end up in a zombie state where inbound requests are still accepted and processed, but the outbound calls to B never receive a response.

The logs show a "read ECONNRESET" error for one call. This is somewhat expected, as Cloud Run reserves the right to terminate outbound connections at will (https://cloud.google.com/run/docs/container-contract#outbound-reset); however, I would have expected a gRPC channel with default settings to detect the killed socket and recover from it.

With the current behaviour and default channel options, the client channel keeps sending requests to the other service but never receives responses, so every call times out until the service is restarted.

Reproduction steps

  • Deploy a pair of Node-based gRPC services A and B to Cloud Run, with a method in A that calls a method in B (see the sketch after these steps). In my configuration, CPU throttling is enabled but one instance is required to be kept alive; the result may be atypical in that CPU is severely throttled between requests being actively handled. All communication is done over HTTP/2.
  • Regularly call the method in A (in our case this is driven by user activity, but a scheduled job would presumably work as well) until Cloud Run decides to terminate the connection.
  • Observe that all calls to A now time out, and that A no longer receives any responses from B until the A service is manually restarted.
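
The sketch below shows the shape of the call path; the proto definition, names, port and Cloud Run hostname are placeholders for illustration rather than our actual code:

```typescript
// Minimal sketch of the A -> B call path described above.
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Hypothetical echo.proto containing:
//   service Echo { rpc Ping (PingRequest) returns (PingReply); }
const packageDefinition = protoLoader.loadSync('echo.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Client in service A pointing at service B's Cloud Run URL (HTTP/2 over TLS).
const bClient = new proto.echo.Echo(
  'service-b-placeholder.a.run.app:443',
  grpc.credentials.createSsl(),
);

// Service A's handler: fulfils an inbound request by calling service B.
const server = new grpc.Server();
server.addService(proto.echo.Echo.service, {
  Ping(
    call: grpc.ServerUnaryCall<unknown, unknown>,
    callback: grpc.sendUnaryData<unknown>,
  ) {
    // Deadline so the inbound call eventually fails instead of hanging forever
    // when B's response never arrives.
    const deadline = new Date(Date.now() + 5_000);
    bClient.Ping(
      call.request,
      { deadline },
      (err: grpc.ServiceError | null, reply: unknown) => {
        if (err) {
          callback(err);
          return;
        }
        callback(null, reply);
      },
    );
  },
});

server.bindAsync('0.0.0.0:8080', grpc.ServerCredentials.createInsecure(), () => {
  server.start();
});
```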

Environment

  • Google Cloud Run, Docker image based on node:16.14.2-alpine3.15
  • grpc-js 1.8.12 (via nice-grpc)

Additional context

Based on other issues here, I have tried enabling keepalives every 60 seconds with a 5 second timeout, to see whether these settings allow the dead channel to be detected.
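
For reference, this is roughly how those keepalive settings map onto grpc-js channel options. The client construction and target address below are placeholders (we actually create the channel through nice-grpc, which as far as I can tell passes these options through to grpc-js):

```typescript
import * as grpc from '@grpc/grpc-js';

// Ping the server every 60s and consider the connection dead if the ping is
// not acknowledged within 5s. 'grpc.keepalive_permit_without_calls' also
// allows pings to be sent while no call is in flight.
const channelOptions: grpc.ChannelOptions = {
  'grpc.keepalive_time_ms': 60_000,
  'grpc.keepalive_timeout_ms': 5_000,
  'grpc.keepalive_permit_without_calls': 1,
};

// The Cloud Run hostname below is a placeholder for service B's URL.
const client = new grpc.Client(
  'service-b-placeholder.a.run.app:443',
  grpc.credentials.createSsl(),
  channelOptions,
);
```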

Is it expected that, with the default channel settings, a client gRPC channel would be unable to recover from the underlying socket being terminated?


raythurnvoid commented Aug 30, 2023

Experiencing the same, but we spotted it in mid-July. According to the GRPC_TRACE logs, it seems the callback passed to http2Stream.write is never called.

@murgatroid99 (Member)

I didn't realize this before, but I think this is the same bug as #2502. Let's consolidate discussion of it there.
