gRPC Server RST_STREAM issues with Envoy proxy #8041
Comments
@inrdkec I spent some time understanding the issue here. "Received RST_STREAM with code 0" typically means the stream was abruptly closed, possibly because the client or server reset the connection. The logs show the server sending a RST_STREAM after sending DATA and HEADERS. A RST_STREAM with NO_ERROR may indicate that the server considers the stream done while the client is still expecting more data, i.e. the server may be closing the stream too early. The protocol errors about missing messages or mismatched byte counts suggest that the gRPC messages are not being framed correctly, or that parts of a message are getting lost, which can happen if the stream is reset before all data is sent. You mentioned that everything works fine without Istio, so the issue is likely related to the Istio/Envoy configuration. The fact that the problem occurs under concurrency and with larger payloads suggests it is related to flow control, buffer limits, or timeouts in Envoy. Not sure if you have tried playing around with these settings.
It could also be that the server-side Envoy is having trouble communicating with the upstream service, perhaps due to timeouts or resource exhaustion. Without a repro it is hard to debug further and provide help. I will ask the other maintainers what they think about this.
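For reference, these are the knobs on the grpc-go server side that relate to HTTP/2 flow control and connection lifetime. This is only an illustrative sketch with placeholder values, not a recommendation, and the Envoy-side settings are the more likely place to tune:

```go
// Sketch of grpc-go server options that influence HTTP/2 flow control and
// connection lifetime. The values are illustrative, not recommendations.
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		// Per-stream and per-connection HTTP/2 flow-control windows.
		grpc.InitialWindowSize(1<<20),     // 1 MiB per stream
		grpc.InitialConnWindowSize(4<<20), // 4 MiB per connection
		// Keepalive settings that can close idle connections behind the proxy.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionIdle: 5 * time.Minute,
			Time:              2 * time.Minute,
			Timeout:           20 * time.Second,
		}),
	)
}
```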
Hi @purnesh42H, thanks for taking a look so quickly.
Taking the failed request, correct me if I am wrong: the server was in the open state while sending the DATA frame and the HEADERS frame.
Afterwards, it seems there are two options that could explain the RST_STREAM:
Either way, do you think the server initiated the RST_STREAM, or was it Envoy?
Yes I did. For instance: 1 pod on each side, vanilla Istio enabled (no restrictions), enough resources (CPU and memory) to scale. A test of 100 requests, 20 concurrently, with a payload of 250 KB showed 2-4 failed requests. The gRPC timeout is 2s, the timeouts on client and server are 5s, and requests/responses complete within 50-80 ms. The flow-control buffers on Envoy for HTTP/2 connections also seem to be large enough: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#envoy-v3-api-field-config-core-v3-http2protocoloptions-initial-connection-window-size
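For context, this is roughly the shape of the test, sketched in Go (our real clients are NodeJS; the import path, service/method names, and address below are placeholders, not our actual API):

```go
// Rough sketch of the test: 100 unary calls, 20 in flight at a time, each
// with a 2s deadline, counting failures. Service and method names are
// placeholders standing in for our actual proto-generated client.
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/gen/data/v1" // placeholder module
)

func main() {
	conn, err := grpc.NewClient("server.namespace.svc:8080",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	client := pb.NewDataServiceClient(conn)

	const total, concurrency = 100, 20
	var failed atomic.Int64
	sem := make(chan struct{}, concurrency) // limits in-flight requests
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			defer cancel()
			if _, err := client.GetData(ctx, &pb.GetDataRequest{}); err != nil {
				failed.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("failed %d/%d requests\n", failed.Load(), total)
}
```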
I can confirm that it is not the case as pointed out above.
Ok thanks. Much appreciated.
@inrdkec I got a chance to discuss with the other maintainers and we think the gRPC server is doing the right thing here. As I mentioned earlier, the server thinks the stream is done, so it sends the trailers; since the client hadn't closed, the server then sent a RST_STREAM with NO_ERROR indicating the stream is concluded. We only send the RST_STREAM if the client hasn't closed. So it looks like the client didn't see the trailers? That means they were dropped somewhere, which seems related to the Envoy bug of not relaying the trailers. In the success logs, the client sends END_STREAM on its final request, which suggests the client did see the trailers, so there is no RST_STREAM in the success case. I guess we need to figure out what triggers Envoy's broken behavior so someone can get Envoy to fix it. You mentioned it happens when the amount of data fetched is not limited; I think providing Envoy with this repro should help them understand and take it further.
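To spell out the behavior I'm describing, here is an illustrative sketch. This is not the actual grpc-go transport code; the two helper functions are stand-ins for the real frame writes:

```go
// Illustrative sketch only: NOT grpc-go source, just the decision described above.
package example

// Stand-in for writing the gRPC trailers (grpc-status) with END_STREAM.
func writeTrailersWithEndStream() {}

// Stand-in for sending RST_STREAM with the given HTTP/2 error code (0 = NO_ERROR).
func sendRSTStream(code uint32) {}

// finishStream mirrors the explanation: the server always sends trailers; it
// only resets the stream (with NO_ERROR) when the client has not half-closed.
func finishStream(clientHalfClosed bool) {
	writeTrailersWithEndStream()
	if !clientHalfClosed {
		// A proxy in the middle must relay the trailers before acting on this
		// reset; if it drops them, the client sees "RST_STREAM with code 0"
		// or "missing status".
		sendRSTStream(0)
	}
}
```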
Hi @purnesh42H, thanks a lot for the clarifications and the time invested, much appreciated. Regards.
NOTE: if you are reporting a potential security vulnerability or a crash,
please follow our CVE process at
https://github.com/grpc/proposal/blob/master/P4-grpc-cve-process.md instead of
filing an issue here.
Please see the FAQ in our main README.md, then answer the questions below
before submitting your issue.
What version of gRPC are you using?
google.golang.org/grpc v1.69.2
What version of Go are you using (go version)?
go 1.23
What operating system (Linux, Windows, …) and version?
Linux (Kubernetes with Alpine 3.20 image)
What did you do?
If possible, provide a recipe for reproducing the error.
We experienced the same situation as this issue: #7623
Client unary requests (NodeJS clients) failed randomly while fetching data from a Golang gRPC server. Both use Istio (Envoy proxies). Failures happen randomly, typically for around 1-10% of requests.
With Istio (Envoy) we see the issues.
Without Istio (Envoy) we do not experience the issues.
With Istio (Envoy) but limiting the amount of data fetched, it seems we do not experience the problems.
We noted that the issues pop up easily when requests are made concurrently and the amount of data fetched is not limited.
Client errors:
A) grpc-js library:
Error: 13 INTERNAL: Received RST_STREAM with code 0 (Call ended without gRPC status)
at callErrorFromStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
at Object.onReceiveStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/client.js:193:76)
B) connectrpc library:
connecterror: [invalid_argument] protocol error: missing output message for unary method
or
connecterror: [internal] protocol error: missing status
at validateTrailer (/usr/src/app/node_modules/@connectrpc/connect/dist/cjs/protocol-grpc/validate-trailer.js:42:15)
at next (/usr/src/app/node_modules/@connectrpc/connect/dist/cjs/protocol-grpc/transport.js:114:63)
or
ConnectError: [invalid_argument] protocol error: promised 11117 bytes in enveloped message, got 9920 bytes
at /usr/src/app/node_modules/@connectrpc/connect/dist/cjs/protocol/async-iterable.js:633:23
at Generator.next ()
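For context, the "promised ... bytes in enveloped message" error refers to the gRPC length-prefixed message framing: each message on the wire is a 1-byte compressed flag plus a 4-byte big-endian length, followed by the payload. A minimal Go sketch of that read path (readGRPCMessage is purely illustrative, not library code) shows where this error comes from when the stream is cut short:

```go
// Minimal sketch of gRPC message framing: a 5-byte prefix (1 byte compressed
// flag + 4 byte big-endian length) followed by the payload. If the stream is
// reset early, the reader gets fewer payload bytes than the prefix promised.
package example

import (
	"encoding/binary"
	"fmt"
	"io"
)

func readGRPCMessage(r io.Reader) ([]byte, error) {
	var prefix [5]byte
	if _, err := io.ReadFull(r, prefix[:]); err != nil {
		return nil, fmt.Errorf("reading message prefix: %w", err)
	}
	length := binary.BigEndian.Uint32(prefix[1:5])
	msg := make([]byte, length)
	if n, err := io.ReadFull(r, msg); err != nil {
		// This is the situation behind "promised X bytes ... got Y bytes".
		return nil, fmt.Errorf("promised %d bytes, got %d: %w", length, n, err)
	}
	return msg, nil
}
```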
Envoy errors:
No errors were reported on either the Envoy client side or the Envoy server side.
Server errors:
No errors were reported.
Example of test:
Complete requests: 500
Concurrency Level: 50
Failed requests: 41
Typically the data to fetch is around 250 KB and it is served in 1 ms.
First we thought the issue was on the client side:
@grpc/grpc-js throw 'Received RST_STREAM with code 0' with retry enabled grpc-node#2569 (comment)
But it was confirmed that the issue is likely on the Envoy or server side.
Later we tried a new feature on the envoy side regarding half-close which apparently fixed an issue of not sending the trailers: envoyproxy/envoy#30149
But this did not help.
Checking Envoy metrics we can see some upstream resets.
Response flags by gRPC code (% of requests):
Code  Flags  % Req
0     -      18.3
0     UR     77.8   (UR - UpstreamRemoteReset, in addition to a 503 response code)
2     DR     3.9    (DR - DownstreamRemoteReset; the response details are http2.remote_reset or http2.remote_refuse)
Debug logs from the gRPC golang server:
Example of a successful request:
Example of a failed request:
Any idea what could cause this? Any help would be much appreciated.
What did you expect to see?
No failed requests.
What did you see instead?
Requests failed randomly, typically around 1% of them.