
(Experienced on Cloud Run) - a connection terminated at the TCP level leaves a gRPC client channel in a zombie state #2397

Closed
SoftMemes opened this issue Mar 17, 2023 · 2 comments


@SoftMemes

Problem description

I have experienced this problem in a production Cloud Run deployment that uses gRPC to communicate between Cloud Run services. In my case, service A calls service B as part of fulfilling inbound requests to A. On a number of occasions, I have seen service A end up in a zombie state where inbound requests are still accepted and processed, but the outbound calls to B never receive a response.

The logs show a "read ECONNRESET" error for one call. This is somewhat expected, as Cloud Run reserves the right to terminate outbound connections at will (https://cloud.google.com/run/docs/container-contract#outbound-reset); however, I would have expected a gRPC channel with default settings to detect the killed socket and recover from it.

With the current behaviour and default channel options, the client channel keeps sending requests to the other service but never receives responses, so every call times out until the service is restarted.

Reproduction steps

  • Deploy a pair of Node-based gRPC services A and B to Cloud Run, with a method in A that calls a method in B (see the sketch after these steps). In my configuration, CPU throttling is enabled but one instance is required to be kept alive; the result may be atypical in that CPU is severely throttled between requests being actively handled. All communication is done over HTTP/2.
  • Regularly call the method in A (in our case this is driven by user activity, but a scheduled job would presumably work as well) until Cloud Run decides to terminate the connection.
  • Observe that all calls to A now time out, and that A no longer receives any responses from B until the A service is manually restarted.
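
The sketch below shows the shape of the call path; the proto definition, names, port and Cloud Run hostname are placeholders for illustration rather than our actual code:

```typescript
// Minimal sketch of the A -> B call path described above.
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Hypothetical echo.proto containing:
//   service Echo { rpc Ping (PingRequest) returns (PingReply); }
const packageDefinition = protoLoader.loadSync('echo.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Client in service A pointing at service B's Cloud Run URL (HTTP/2 over TLS).
const bClient = new proto.echo.Echo(
  'service-b-placeholder.a.run.app:443',
  grpc.credentials.createSsl(),
);

// Service A's handler: fulfils an inbound request by calling service B.
const server = new grpc.Server();
server.addService(proto.echo.Echo.service, {
  Ping(
    call: grpc.ServerUnaryCall<unknown, unknown>,
    callback: grpc.sendUnaryData<unknown>,
  ) {
    // Deadline so the inbound call eventually fails instead of hanging forever
    // when B's response never arrives.
    const deadline = new Date(Date.now() + 5_000);
    bClient.Ping(
      call.request,
      { deadline },
      (err: grpc.ServiceError | null, reply: unknown) => {
        if (err) {
          callback(err);
          return;
        }
        callback(null, reply);
      },
    );
  },
});

server.bindAsync('0.0.0.0:8080', grpc.ServerCredentials.createInsecure(), () => {
  server.start();
});
```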

Environment

  • Google Cloud Run, Docker image based on node:16.14.2-alpine3.15
  • grpc-js 1.8.12 (via nice-grpc)

Additional context

Based on other issues here, I have tried enabling keepalives every 60 seconds with a 5 second timeout, to see whether these settings allow the dead channel to be detected.
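
For reference, this is roughly how those keepalive settings map onto grpc-js channel options. The client construction and target address below are placeholders (we actually create the channel through nice-grpc, which as far as I can tell passes these options through to grpc-js):

```typescript
import * as grpc from '@grpc/grpc-js';

// Ping the server every 60s and consider the connection dead if the ping is
// not acknowledged within 5s. 'grpc.keepalive_permit_without_calls' also
// allows pings to be sent while no call is in flight.
const channelOptions: grpc.ChannelOptions = {
  'grpc.keepalive_time_ms': 60_000,
  'grpc.keepalive_timeout_ms': 5_000,
  'grpc.keepalive_permit_without_calls': 1,
};

// The Cloud Run hostname below is a placeholder for service B's URL.
const client = new grpc.Client(
  'service-b-placeholder.a.run.app:443',
  grpc.credentials.createSsl(),
  channelOptions,
);
```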

Is it expected that, with the default channel settings, a client gRPC channel would be unable to recover from the underlying socket being terminated?


raythurnvoid commented Aug 30, 2023

Experiencing the same, but we spotted it in mid-July. According to the GRPC_TRACE logs, it seems the callback passed to http2Stream.write is never called.

@murgatroid99 (Member)

I didn't realize this before, but I think this is the same bug as #2502. Let's consolidate discussion of it there.
