Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
retry gRPC requests are immediately terminated by trailers (#1706)
Currently, gRPC responses are classified for metrics based on the value of the `grpc-status` header in either the initial `HEADERS` frame *or* in the `TRAILERS` frame that terminates the response body. However, this is *not* the case for retries. Because we don't want to buffer the entire response body before forwarding it to the client (introducing latency, and potentially breaking client behavior in the case of long-running streams), we make retry decisions for gRPC based only on `grpc-status` in the initial `HEADERS` frame. Unfortunately, some gRPC implementations (such as Akka) will send the `grpc-status` header in a `TRAILERS` frame even when the body is empty (see linkerd/linkerd2#7701). In that case, Linkerd won't retry the request, even though it wouldn't actually have to wait to buffer a long body before making a retry decision. This branch resolves this issue by changing the retry implementation so that the proxy will wait for a single additional frame after the first `HEADERS` frame before making a retry decision. If the next frame is a `TRAILERS` frame that terminates the response stream, we can include the value of the `grpc-status` in that `TRAILERS` frame in the retry decision. Otherwise, if the next frame is a `DATA` frame, we buffer only a single frame, give up on trying to retry, and forward that frame to the client, followed by the rest of the body. Unfortunately, this change is a bit complicated, as it was necessary to introduce the ability for retries to modify the response body as well as the request body. This then implies that we need to be able to erase response body types as well, when they may differ based on whether or not retries are enabled for a route. I tried a number of different ways of structuring this change, such as passing an additional `Layer` into the `Retry` service that wraps the service constructed for retries, but this was kind of a mess. Ultimately, I felt like the cleanest way of doing this was changing the `linkerd_retry::PrepareRequest` trait to a `PrepareRetry` trait that can also modify response bodies. I'm open to suggestions for a nicer way of structuing this change. I also considered the alternative of giving up on `tower::retry` entirely and writing our own retry service from scratch, but I wanted to try to avoid that if it was possible. Depends on #1705 and #1726. Fixes linkerd/linkerd2#7701. Signed-off-by: Eliza Weisman <[email protected]>
- Loading branch information