Problem description
I have experienced this problem in a production Cloud Run deployment that uses gRPC for communication between Cloud Run services. In my case, service A calls service B as part of fulfilling inbound requests to A. On a number of occasions, I have seen service A end up in a zombie state where inbound requests are still accepted and processed, but outbound calls to B never receive a response.
The logs show a "read ECONNRESET" error for one call. This is somewhat expected, as Cloud Run reserves the right to terminate connections at will (https://cloud.google.com/run/docs/container-contract#outbound-reset); however, I would have expected a gRPC channel with default settings to detect the killed socket and recover from it.
With the current behaviour and default channel options, the client channel keeps sending requests to the other service but never receives responses, so every call times out until the service is restarted.
Reproduction steps
1. Deploy a pair of Node-based gRPC services, A and B, to Cloud Run, with a method in A that calls another method in B (a minimal sketch of this call pattern follows below). In my configuration, CPU throttling is enabled, but one instance is required to be kept alive; the result may be atypical in that the CPU is severely throttled between requests being actively handled. All communication is done over HTTP/2.
2. Regularly call the method in A (in our case this is driven by user activity, but a scheduled job would presumably work too) until Cloud Run decides to terminate the connection.
3. Observe that all calls to A now start to time out, and that A receives no responses from B until the A service is manually restarted.
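For reference, a minimal sketch of the call pattern in step 1, written against @grpc/grpc-js with a proto-loader-generated client; service_b.proto, example.ServiceB, doWork and the *.a.run.app target are placeholders rather than the actual deployment:

```ts
// Minimal sketch of service A forwarding work to service B over gRPC.
// service_b.proto, example.ServiceB, doWork and the target hostname are placeholders.
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

const packageDefinition = protoLoader.loadSync('service_b.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Long-lived client channel from A to B; Cloud Run service-to-service
// calls go over TLS (HTTP/2) on port 443.
const bClient = new proto.example.ServiceB(
  'service-b-xyz-uc.a.run.app:443',
  grpc.credentials.createSsl(),
);

// Handler in A that calls B with a per-call deadline, so a hung call
// surfaces as DEADLINE_EXCEEDED instead of hanging forever.
function handleRequest(
  call: grpc.ServerUnaryCall<any, any>,
  callback: grpc.sendUnaryData<any>,
) {
  const deadline = new Date(Date.now() + 10_000); // 10-second deadline
  bClient.doWork(call.request, { deadline }, (err: grpc.ServiceError | null, response: any) => {
    callback(err, response);
  });
}
```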
Environment
Google Cloud Run, Docker image based on node:16.14.2-alpine3.15
grpc-js 1.8.12 (via nice-grpc)
Additional context
Based on other issues here, I have tried enabling keepalives every 60 seconds with a 5-second timeout, to see whether these settings allow the dead channel to be detected.
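A sketch of how those settings map onto @grpc/grpc-js channel options when constructing the client (the target address is a placeholder, and I am assuming nice-grpc forwards these channel options to grpc-js unchanged):

```ts
// Sketch: the keepalive settings described above, expressed as grpc-js channel options.
import * as grpc from '@grpc/grpc-js';

const channelOptions: grpc.ChannelOptions = {
  // Send an HTTP/2 PING every 60 seconds while the connection is active.
  'grpc.keepalive_time_ms': 60_000,
  // Treat the connection as dead if the PING ack does not arrive within 5 seconds.
  'grpc.keepalive_timeout_ms': 5_000,
};

const client = new grpc.Client(
  'service-b-xyz-uc.a.run.app:443', // placeholder target
  grpc.credentials.createSsl(),
  channelOptions,
);
```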
Is it expected that, with the default channel settings, a client gRPC channel would be unable to recover from the underlying socket being terminated?
Experiencing the same, but we spotted it in mid-July. According to the logs from GRPC_TRACE, it seems like the callback passed to http2Stream.write is never called.
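For reference, a minimal sketch of enabling that tracing; GRPC_VERBOSITY and GRPC_TRACE are standard grpc-js environment variables, and setting them in the Cloud Run service configuration instead of in code works just as well:

```ts
// Sketch: enabling grpc-js internal tracing to capture this kind of log.
// These variables must be set before @grpc/grpc-js is first loaded.
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'all'; // or a comma-separated list of tracer names

// Loaded only after the environment variables are set.
// eslint-disable-next-line @typescript-eslint/no-var-requires
const grpc = require('@grpc/grpc-js');
```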