
gRPC Server RST_STREAM issues with Envoy proxy #8041

Closed
inrdkec opened this issue Jan 27, 2025 · 4 comments
Labels: Area: Server (Includes Server, Streams and Server Options.), Type: Bug


inrdkec commented Jan 27, 2025


What version of gRPC are you using?

google.golang.org/grpc v1.69.2

What version of Go are you using (go version)?

go 1.23

What operating system (Linux, Windows, …) and version?

Linux (Kubernetes with Alpine 3.20 image)

What did you do?

If possible, provide a recipe for reproducing the error.

We experienced the same situation as this issue: #7623

Client unary requests (NodeJS clients) randomly failed while fetching data from a Go gRPC server. Both sides run Istio (Envoy sidecar proxies). Failures happen randomly, typically for around 1-10% of requests.

- With Istio (Envoy) we see failures.
- Without Istio (Envoy) we do not see failures.
- With Istio (Envoy) but with the amount of data fetched limited, we do not seem to see the failures.
- Failures show up much more easily when requests are concurrent and the amount of data fetched is not limited.

Client errors:
A) grpc-js library:
Error: 13 INTERNAL: Received RST_STREAM with code 0 (Call ended without gRPC status)
at callErrorFromStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
at Object.onReceiveStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/client.js:193:76)

B) connectrpc library:
connecterror: [invalid_argument] protocol error: missing output message for unary method
or
connecterror: [internal] protocol error: missing status
at validateTrailer (/usr/src/app/node_modules/@connectrpc/connect/dist/cjs/protocol-grpc/validate-trailer.js:42:15)
at next (/usr/src/app/node_modules/@connectrpc/connect/dist/cjs/protocol-grpc/transport.js:114:63)
or
ConnectError: [invalid_argument] protocol error: promised 11117 bytes in enveloped message, got 9920 bytes
at /usr/src/app/node_modules/@connectrpc/connect/dist/cjs/protocol/async-iterable.js:633:23
at Generator.next ()

Envoy errors:
No errors were reported on either the client-side or the server-side Envoy.

Server errors:
No errors were reported.

Example of test:
Complete requests: 500
Concurrency Level: 50
Failed requests: 41

Typically the data to fetch is around 250 KB and it is served in 1 ms.
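The load pattern above can be driven from Go with a small harness like the sketch below. It is only a sketch: `call` is a stand-in for whichever generated client method is under test (the real service names are redacted in the logs), and the 2 s deadline mirrors the grpc-timeout seen in the request headers.

```go
package main

import (
	"context"
	"log"
	"sync"
	"sync/atomic"
	"time"
)

// runLoad issues `total` unary calls with at most `concurrency` in flight.
// `call` stands for the generated client method under test.
func runLoad(total, concurrency int, call func(context.Context) error) {
	var failed atomic.Int64
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup

	for i := 0; i < total; i++ {
		sem <- struct{}{} // acquire a concurrency slot
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()

			// Mirror the client's 2 s gRPC deadline (grpc-timeout: 2000m in the logs).
			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			defer cancel()

			if err := call(ctx); err != nil {
				failed.Add(1)
				log.Printf("request failed: %v", err)
			}
		}()
	}
	wg.Wait()
	log.Printf("complete=%d concurrency=%d failed=%d", total, concurrency, failed.Load())
}
```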

First we thought the issue was on the client side.

Later we tried a new feature on the Envoy side regarding half-close, which apparently fixed an issue of not sending the trailers: envoyproxy/envoy#30149
But this did not help.

Checking Envoy metrics we can see some upstream resets. Response flags by gRPC code:

| Code | Flags | % Req | Notes |
|------|-------|-------|-------|
| 0 | - | 18.3 | |
| 0 | UR | 77.8 | UR - UpstreamRemoteReset, in addition to a 503 response code |
| 2 | DR | 3.9 | DR - DownstreamRemoteReset - the response details are http2.remote_reset or http2.remote_refuse |

Debug logs from the Go gRPC server:
Example of a successful request:

2025-01-22T12:59:33.929200238Z,"http2: Framer 0xc001192000: wrote HEADERS flags=END_STREAM|END_HEADERS stream=425 len=2"
2025-01-22T12:59:33.929118668Z,"http2: Framer 0xc001192000: wrote DATA stream=425 len=11122 data=""\x00\x00\x00xxxxx"" (10866 bytes omitted)"
2025-01-22T12:59:33.929092908Z,"http2: Framer 0xc001192000: wrote HEADERS flags=END_HEADERS stream=425 len=2"
2025-01-22T12:59:33.925353278Z,"http2: Framer 0xc001192000: read PING flags=ACK len=8 ping=""\x02\x04\x10\x10\t\x0e\a\a""
2025-01-22T12:59:33.924593249Z,"http2: Framer 0xc001192000: wrote PING len=8 ping=""\x02\x04\x10\x10\t\x0e\a\a
2025-01-22T12:59:33.924577269Z,"http2: Framer 0xc001192000: wrote WINDOW_UPDATE len=4 (conn) incr=144"
2025-01-22T12:59:33.924501198Z,"http2: Framer 0xc001192000: read DATA flags=END_STREAM stream=425 len=0 data=""
2025-01-22T12:59:33.924444109Z,"http2: Framer 0xc001192000: read DATA stream=425 len=16 data=""\x00\x00\x00xxxx
2025-01-22T12:59:33.924386618Z,"http2: decoded hpack field header field ""x-envoy-attempt-count"" = ""1
2025-01-22T12:59:33.924383198Z,"http2: decoded hpack field header field ""x-request-id"" = ""xxx
2025-01-22T12:59:33.924378498Z,"http2: decoded hpack field header field ""x-forwarded-proto"" = ""http
2025-01-22T12:59:33.924375409Z,"http2: decoded hpack field header field ""te"" = ""trailers
2025-01-22T12:59:33.924371898Z,"http2: decoded hpack field header field ""content-type"" = ""application/grpc
2025-01-22T12:59:33.924368589Z,"http2: decoded hpack field header field ""user-agent"" = ""grpc-node-js/1.12.5
2025-01-22T12:59:33.924365389Z,"http2: decoded hpack field header field ""grpc-timeout"" = ""2000m
2025-01-22T12:59:33.924362558Z,"http2: decoded hpack field header field ""accept-encoding"" = ""identity
2025-01-22T12:59:33.924358429Z,"http2: decoded hpack field header field ""grpc-accept-encoding"" = ""identity,deflate,gzip
2025-01-22T12:59:33.924353909Z,"http2: decoded hpack field header field "":scheme"" = ""http
2025-01-22T12:59:33.924350618Z,"http2: decoded hpack field header field "":path"" = ""/api.xxx_backend.xxxDataService/GetxxxData
2025-01-22T12:59:33.924345129Z,"http2: decoded hpack field header field "":method"" = ""POST
2025-01-22T12:59:33.924341729Z,"http2: decoded hpack field header field "":authority"" = ""xxx.xxx-backend:6002

Example of a failed request

2025-01-22T12:59:33.868642633Z,"http2: Framer 0xc001192000: wrote RST_STREAM stream=423 len=4 ErrCode=NO_ERROR"
2025-01-22T12:59:33.868631073Z,"http2: Framer 0xc001192000: wrote HEADERS flags=END_STREAM|END_HEADERS stream=423 len=2"
2025-01-22T12:59:33.868614723Z,"http2: Framer 0xc001192000: wrote DATA stream=423 len=11122 data=""\x00\x00\x00xxxxx"" (10866 bytes omitted)"
2025-01-22T12:59:33.868556433Z,"http2: Framer 0xc001192000: wrote HEADERS flags=END_HEADERS stream=423 len=2"
2025-01-22T12:59:33.863568134Z,"http2: Framer 0xc001192000: read DATA stream=423 len=16 data=""\x00\x00\x00xxxx""
2025-01-22T12:59:33.863273354Z,"http2: Framer 0xc001192000: read HEADERS flags=END_HEADERS stream=423 len=213"
2025-01-22T12:59:33.863196083Z,"http2: decoded hpack field header field ""x-forwarded-client-cert"" = ""xxxxxxxxxxxxx
2025-01-22T12:59:33.863187903Z,"http2: decoded hpack field header field ""x-envoy-attempt-count"" = ""1
2025-01-22T12:59:33.863176054Z,"http2: decoded hpack field header field ""x-forwarded-proto"" = ""http
2025-01-22T12:59:33.863170943Z,"http2: decoded hpack field header field ""te"" = ""trailers
2025-01-22T12:59:33.863165394Z,"http2: decoded hpack field header field ""content-type"" = ""application/grpc
2025-01-22T12:59:33.863160394Z,"http2: decoded hpack field header field ""user-agent"" = ""grpc-node-js/1.12.5
2025-01-22T12:59:33.863155023Z,"http2: decoded hpack field header field ""grpc-timeout"" = ""1999m
2025-01-22T12:59:33.863148454Z,"http2: decoded hpack field header field ""accept-encoding"" = ""identity
2025-01-22T12:59:33.863143443Z,"http2: decoded hpack field header field ""grpc-accept-encoding"" = ""identity,deflate,gzip
2025-01-22T12:59:33.863137983Z,"http2: decoded hpack field header field "":scheme"" = ""http
2025-01-22T12:59:33.863132514Z,"http2: decoded hpack field header field "":path"" = ""/api.xxx_backend.xxxDataService/GetxxxData
2025-01-22T12:59:33.863125494Z,"http2: decoded hpack field header field "":method"" = ""POST
2025-01-22T12:59:33.863120914Z,"http2: decoded hpack field header field "":authority"" = ""xxx.xxx-backend:6002  

Any idea what could cause this? Any help would be much appreciated.

What did you expect to see?

No failed requests.

What did you see instead?

Requests failed randomly, typically around 1% of them.

@purnesh42H purnesh42H self-assigned this Jan 28, 2025
@purnesh42H purnesh42H added the Area: Server Includes Server, Streams and Server Options. label Jan 28, 2025

purnesh42H commented Jan 30, 2025

@inrdkec I spent some time understanding the issue here. "Received RST_STREAM with code 0" typically means the stream was abruptly closed, possibly due to the client or server resetting the connection. The logs show the server sending a RST_STREAM after sending DATA and HEADERS. The RST_STREAM with NO_ERROR might indicate that the server thinks the stream is done, but the client is expecting more data. Maybe the server is closing the stream too early.

The protocol errors about missing messages or byte counts indicate that the gRPC messages might not be framed correctly, or parts of the message are getting lost, which could happen if the stream is reset before all data is sent.

You mentioned that without Istio everything works fine, so the issue is likely related to the Istio/Envoy configuration. The fact that the problem occurs under concurrency and with larger data sizes suggests it's related to flow control, buffer limits, or timeouts in Envoy. I'm not sure if you have tried to play around with these settings.
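For experimentation, the corresponding knobs on the gRPC-Go server side (separate from Envoy's own settings) can be set via server options. A minimal sketch, with placeholder values rather than recommendations:

```go
package main

import "google.golang.org/grpc"

// newServer shows the grpc-go server options that map to the flow-control and
// concurrency settings discussed above. The values are placeholders.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.InitialWindowSize(1<<20),     // per-stream HTTP/2 flow-control window (1 MiB)
		grpc.InitialConnWindowSize(4<<20), // per-connection HTTP/2 flow-control window (4 MiB)
		grpc.MaxConcurrentStreams(100),    // cap concurrent streams per connection
		grpc.WriteBufferSize(64*1024),     // write buffer size before flushing to the wire
	)
}
```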

> 0 UR. 77.8 (UR - Upstream Remote reset in addition to 503 response code)

This could mean the server-side Envoy is having trouble communicating with the upstream service, maybe due to timeouts or resource exhaustion.

Without a repro, it's hard to debug further and provide help. I will ask other maintainers what they think about this.


inrdkec commented Jan 30, 2025

Hi @purnesh42H,

thanks for taking a look so quickly.

> The logs show the server sending a RST_STREAM after sending DATA and HEADERS. The RST_STREAM with NO_ERROR might indicate that the server thinks the stream is done, but the client is expecting more data. Maybe the server is closing the stream too early.

Looking at the failed request (correct me if I am wrong), the server was in the open state while sending the DATA and HEADERS frames:

2025-01-22T12:59:33.868631073Z,"http2: Framer 0xc001192000: wrote HEADERS flags=END_STREAM|END_HEADERS stream=423 len=2"
2025-01-22T12:59:33.868614723Z,"http2: Framer 0xc001192000: wrote DATA stream=423 len=11122 data=""\x00\x00\x00xxxxx"" (10866 bytes omitted)"
2025-01-22T12:59:33.868556433Z,"http2: Framer 0xc001192000: wrote HEADERS flags=END_HEADERS stream=423 len=2"

Afterwards, it seems there are 2 options to explain the RST_STREAM:

2025-01-22T12:59:33.868642633Z,"http2: Framer 0xc001192000: wrote RST_STREAM stream=423 len=4 ErrCode=NO_ERROR"

  1. HEADERS with END_STREAM: the server transitioned to the half-closed (local) stream state. It then did not receive an ES (END_STREAM) nor an R (RST_STREAM) from the client or Envoy; instead the server sent an R (RST_STREAM) to close. I guess this matches your theory that the server may have closed too fast (any idea what could cause it)?
  2. HEADERS with END_HEADERS only: the server stays in the open state and, for some unknown reason, moves to closed by sending a RST_STREAM abruptly.

Either way, do you think the server initiated the RST_STREAM, or was it Envoy?
Is there any way to debug with more details (code and conditions)?
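One way to gather more server-side detail (a sketch, assuming a plain grpc-go server setup) is to attach a stats.Handler so that every RPC logs how it concluded, alongside the GODEBUG=http2debug=2 framer logs already captured above:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/stats"
)

// rpcLogger is a minimal stats.Handler that logs how each server-side RPC ends.
type rpcLogger struct{}

func (rpcLogger) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context   { return ctx }
func (rpcLogger) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (rpcLogger) HandleConn(context.Context, stats.ConnStats)                       {}

func (rpcLogger) HandleRPC(ctx context.Context, s stats.RPCStats) {
	// stats.End fires once per RPC with the final error (nil on success).
	if end, ok := s.(*stats.End); ok {
		log.Printf("rpc finished: err=%v duration=%s", end.Error, end.EndTime.Sub(end.BeginTime))
	}
}

func main() {
	// Register the handler when constructing the server; service registration
	// and Serve() are omitted here.
	srv := grpc.NewServer(grpc.StatsHandler(rpcLogger{}))
	_ = srv
}
```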

> The protocol errors about missing messages or byte counts indicate that the gRPC messages might not be framed correctly, or parts of the message are getting lost, which could happen if the stream is reset before all data is sent.
> You mentioned that without Istio everything works fine, so the issue is likely related to the Istio/Envoy configuration. The fact that the problem occurs under concurrency and with larger data sizes suggests it's related to flow control, buffer limits, or timeouts in Envoy. I'm not sure if you have tried to play around with these settings.

Yes, I did. For instance: one pod on each side, vanilla Istio enabled (no restrictions), and enough resources (CPU and memory) to scale. A test of 100 requests with a concurrency of 20 and a payload of 250 KB showed 2-4 failed requests. The gRPC timeout is 2 s and the client/server timeout is 5 s, while requests complete within 50-80 ms. Also, the flow-control buffers on Envoy for HTTP/2 connections seem to be large enough: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#envoy-v3-api-field-config-core-v3-http2protocoloptions-initial-connection-window-size

> 0 UR. 77.8 (UR - Upstream Remote reset in addition to 503 response code)
> This could mean the server-side Envoy is having trouble communicating with the upstream service, maybe due to timeouts or resource exhaustion.

I can confirm that it is not the case as pointed out above.

> Without a repro, it's hard to debug further and provide help. I will ask other maintainers what they think about this.

Ok thanks. Much appreciated.

purnesh42H commented:

@inrdkec I got a chance to discuss with other maintainers and we think that the gRPC server is doing the right thing here. As I mentioned earlier, the server thinks the stream is done, so it sends the trailers, but since the client didn't close, the server sent the RST_STREAM with NO_ERROR indicating the stream is concluded. We only send RST_STREAM if the client hasn't closed. So it looks like the client didn't see the trailers? That means they were dropped somewhere, which seems to be related to the Envoy bug of not relaying the trailers.

In the success logs, the client sends END_STREAM on its final request, which suggests that the client did see the trailers. So there is no RST_STREAM in the success case.
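To make that frame sequence concrete, here is an illustrative sketch written directly against golang.org/x/net/http2. It is not grpc-go's actual transport code, and respHeaders/trailers are assumed to be pre-encoded HPACK block fragments:

```go
package main

import (
	"io"

	"golang.org/x/net/http2"
)

// writeUnaryResponse illustrates the frame sequence from the failed-request
// log above: HEADERS, DATA, trailers with END_STREAM, and finally
// RST_STREAM(NO_ERROR) when the client has not yet half-closed its side.
func writeUnaryResponse(conn io.ReadWriter, streamID uint32,
	respHeaders, framedMessage, trailers []byte, clientHalfClosed bool) error {

	fr := http2.NewFramer(conn, conn)

	// 1. Response headers (":status: 200", "content-type: application/grpc", ...).
	if err := fr.WriteHeaders(http2.HeadersFrameParam{
		StreamID: streamID, BlockFragment: respHeaders, EndHeaders: true,
	}); err != nil {
		return err
	}
	// 2. The length-prefixed gRPC message.
	if err := fr.WriteData(streamID, false, framedMessage); err != nil {
		return err
	}
	// 3. Trailers ("grpc-status: 0", ...) carry END_STREAM: the server is done.
	if err := fr.WriteHeaders(http2.HeadersFrameParam{
		StreamID: streamID, BlockFragment: trailers, EndStream: true, EndHeaders: true,
	}); err != nil {
		return err
	}
	// 4. If the client never sent END_STREAM on its request, reset the stream
	//    with NO_ERROR so the finished RPC does not keep consuming resources.
	if !clientHalfClosed {
		return fr.WriteRSTStream(streamID, http2.ErrCodeNo)
	}
	return nil
}
```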

So I guess we need to figure out what triggers Envoy's broken behavior, so that someone can get Envoy to fix it. You mentioned it happens when the amount of data fetched is not limited. I think providing Envoy with this repro should help them understand and help further.


inrdkec commented Feb 6, 2025

Hi @purnesh42H ,

thanks a lot for the clarifications and the time invested, much appreciated.
I will check with the Envoy folks (envoyproxy/envoy#38206).

Regards.

@inrdkec inrdkec closed this as completed Feb 6, 2025