Skip to content

Commit

Permalink
retry gRPC requests are immediately terminated by trailers (#1706)
Browse files Browse the repository at this point in the history
Currently, gRPC responses are classified for metrics based on the value
of the `grpc-status` header in either the initial `HEADERS` frame *or*
in the `TRAILERS` frame that terminates the response body. However, this
is *not* the case for retries. Because we don't want to buffer the
entire response body before forwarding it to the client (introducing
latency, and potentially breaking client behavior in the case of
long-running streams), we make retry decisions for gRPC based only on
`grpc-status` in the initial `HEADERS` frame.

Unfortunately, some gRPC implementations (such as Akka) will send the
`grpc-status` header in a `TRAILERS` frame even when the body is empty
(see linkerd/linkerd2#7701). In that case, Linkerd won't retry the
request, even though it wouldn't actually have to wait to buffer a long
body before making a retry decision.

This branch resolves this issue by changing the retry implementation so
that the proxy will wait for a single additional frame after the first
`HEADERS` frame before making a retry decision. If the next frame is a
`TRAILERS` frame that terminates the response stream, we can include the
value of the `grpc-status` in that `TRAILERS` frame in the retry
decision. Otherwise, if the next frame is a `DATA` frame, we buffer only
a single frame, give up on trying to retry, and forward that frame to
the client, followed by the rest of the body.

Unfortunately, this change is a bit complicated, as it was necessary to
introduce the ability for retries to modify the response body as well as
the request body. This then implies that we need to be able to erase
response body types as well, when they may differ based on whether or
not retries are enabled for a route. I tried a number of different ways
of structuring this change, such as passing an additional `Layer` into
the `Retry` service that wraps the service constructed for retries, but
this was kind of a mess. Ultimately, I felt like the cleanest way of
doing this was changing the `linkerd_retry::PrepareRequest` trait to a
`PrepareRetry` trait that can also modify response bodies.

I'm open to suggestions for a nicer way of structuing this change. I
also considered the alternative of giving up on `tower::retry` entirely
and writing our own retry service from scratch, but I wanted to try to
avoid that if it was possible.

Depends on #1705 and #1726.

Fixes linkerd/linkerd2#7701.

Signed-off-by: Eliza Weisman <[email protected]>
  • Loading branch information
hawkw authored Jun 22, 2022
1 parent d013345 commit 7f817b5
Show file tree
Hide file tree
Showing 10 changed files with 639 additions and 95 deletions.
1 change: 1 addition & 0 deletions Cargo.lock
Original file line number Diff line number Diff line change
Expand Up @@ -1132,6 +1132,7 @@ dependencies = [
"http-body",
"hyper",
"linkerd-error",
"linkerd-stack",
"linkerd-tracing",
"parking_lot",
"thiserror",
Expand Down
Loading

0 comments on commit 7f817b5

Please sign in to comment.