-
Notifications
You must be signed in to change notification settings - Fork 271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
errors: Support contextual error handling strategies #1246
Conversation
Our error handlers need to account for the fact that HTTP/1 requests may fail with HTTP/2 errors due to orig-proto upgrading. To simplify error handling (in an upcoming change), the `orig_proto::Upgrade` service introduces a new error type, `orig_proto::DowngradedH2Error` that hides the original error source so that these errors are not detected as being an `h2::Error`.
Currently `app_core::errors` handles synthesis of all error responses. This means that this module needs to know about all possible error types. To work around this, we wrap some errors with a special `HttpError` type so this module can create a synthetic response. Furthermore, this means that we can't handle errors differently in inbound/outbound contexts. This change modies the `error::respond` module with a new strategy trait--`HttpRescue`--that is responsible for producing a `SyntheticHttpResponse` if the error can be handled and returning an error if it should not be handled gracefully. A `DefaultHttpRescue` type encodes default behavior and inbound- and outbound-specific types extend this functionality with appropriate error handling. This change will enable us to perform error handling in more places in the stack so that we can be more tactical about which errors are handled in various parts of the stack--for example, so that error responses are included in respose metrics. This change modifies HTTP/1 error responses to include a `connection: close` header when the server-side connection is being torn down.
When an HTTP client throws an error--usually and I/O or connection timeout error, these metrics are not recorded with the appropriate 5XX status or reported to Tap. This change synthesizes responses for these errors before incrementing HTTP response counters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Have two smaller questions before approving.
return Ok(errors::SyntheticHttpResponse { | ||
http_status: http::StatusCode::GATEWAY_TIMEOUT, | ||
grpc_status: errors::Grpc::DeadlineExceeded, | ||
close_connection: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like the only rescue that doesn't close the connection. Is this because we expect the proxy to retry in this case and possibly establish a connection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other errors can be indicative of the target going away. with response timeouts, we're dealing with some user-configured timeout. there doesn't seem to be any reason to tear down the connection in this case--the request has probably reached the application, it's just slow.
… by internally generated errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall, this looks good to me! i picked a bunch of obnoxious minor nits, but you can feel free to ignore them!
pub grpc_status: tonic::Code, | ||
pub http_status: http::StatusCode, | ||
pub close_connection: bool, | ||
pub message: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (possibly a premature optimization): I notice sometimes these strings are formatted dynamically, but sometimes they are fixed static strings that are converted to String
s. I wonder if we want to use Cow<'static, str>
here so that we don't have to allocate a bunch of copies of the same string when it isn't dynamic?
we would then match on the Cow
and call HeaderValue::from_static
when the Cow
has a static string in it (as a side note, i floated the idea to Sean of adding a TryFrom<Cow<'static, str>>
to http::header::HeaderValue
that would do that for us).
fn grpc_response<B: Default>(code: tonic::Code, message: &str) -> http::Response<B> { | ||
info!(%code, %message, "Handling error on gRPC connection"); | ||
|
||
http::Response::builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit/TIOLI (possibly a premature optimization): we always add exactly 5 headers to the headermap here, so I wonder if we want to consider pre-allocating enough capacity for them? that way, we know it won't allocate multiple times...
http::Response::builder() | |
let mut builder = http::Response::builder(); | |
// We always add 5 headers, so preallocate enough capacity for them. | |
builder | |
.headers_mut() | |
.expect("builder will not have an error") | |
.reserve(5); | |
builder |
This release improves error handling so that HTTP metrics include 5XX responses for common errors. Logging has also been improved to ensure `inbound` and `outbound` spans are always present in log messages. Outbound tap has been fixed to include route labels when service profiles are configured. --- * Set tracing spans on policy client (linkerd/linkerd2-proxy#1241) * build(deps): bump tokio-util from 0.6.7 to 0.6.8 (linkerd/linkerd2-proxy#1240) * core: fix missing spans in `serve` tasks (linkerd/linkerd2-proxy#1243) * build(deps): bump thiserror from 1.0.28 to 1.0.29 (linkerd/linkerd2-proxy#1244) * orig-proto: Avoid emiting HTTP/2 errors for upgraded requests (linkerd/linkerd2-proxy#1245) * Fix route labels on outbound tap metadata (linkerd/linkerd2-proxy#1247) * errors: Support contextual error handling strategies (linkerd/linkerd2-proxy#1246)
This release improves error handling so that HTTP metrics include 5XX responses for common errors. Logging has also been improved to ensure `inbound` and `outbound` spans are always present in log messages. Outbound tap has been fixed to include route labels when service profiles are configured. --- * Set tracing spans on policy client (linkerd/linkerd2-proxy#1241) * build(deps): bump tokio-util from 0.6.7 to 0.6.8 (linkerd/linkerd2-proxy#1240) * core: fix missing spans in `serve` tasks (linkerd/linkerd2-proxy#1243) * build(deps): bump thiserror from 1.0.28 to 1.0.29 (linkerd/linkerd2-proxy#1244) * orig-proto: Avoid emiting HTTP/2 errors for upgraded requests (linkerd/linkerd2-proxy#1245) * Fix route labels on outbound tap metadata (linkerd/linkerd2-proxy#1247) * errors: Support contextual error handling strategies (linkerd/linkerd2-proxy#1246)
`linkerd-app-core` includes an error recovery body middleware. this middleware will gracefully catch and report errors encountered when polling an inner body, and via an `R`-typed recovery strategy provided by the caller, will attempt to map the error to a gRPC status code denoting an error. before we upgrade to hyper 1.0 in service of linkerd/linkerd2#8733, we add some test coverage to ensure that we preserve the behavior of this middleware. see: * linkerd/linkerd2#8733 * #3614. for historical context on this tower layer, see: * #222 * #1246 * #1282 --- * refactor(http/retry): outline `ForwardCompatibleBody<B>` in #3559 (4b53081), we introduced a backported `Frame<T>` type, and a `ForwardCompatibleBody<B>` type that allows us to interact with a `http_body::Body` circa 0.4.6 in terms of frame-based interfaces that match those of the 1.0 interface. see linkerd/linkerd2#8733 for more information on upgrading hyper. in #3559, we narrowly added this as an internal submodule of the `linkerd-http-retry` library. these facilities however, would have utility in other places such as `linkerd-app-core`. this commit pulls these compatibility shims out into a `linkerd-http-body-compat` library so that they can be imported and reused elsewhere. Signed-off-by: katelyn martin <[email protected]> * nit(http/body-compat): tidy `combinators` imports Signed-off-by: katelyn martin <[email protected]> * refactor(app/core): hoist `errors::code_header` helper Signed-off-by: katelyn martin <[email protected]> * refactor(app/core): `l5d-*` constants are headers these are header values. `http::HeaderName` has a const fn constructor, so let's use that. Signed-off-by: katelyn martin <[email protected]> * refactor(app/core): grpc constants are headers Signed-off-by: katelyn martin <[email protected]> * refactor(app/core): hoist `l5d-` and `grpc-` constants Signed-off-by: katelyn martin <[email protected]> * refactor(app/core): outline `ResponseBody` middleware we'll add a few tests for this middleware shortly. this commit moves this middleware out into its own submodule. Signed-off-by: katelyn martin <[email protected]> * refactor(app/core): encapsulate `ResponseBody` enum for other body middleware, we hide inner enum variants and their constituent members by using the "inner" pattern. this commit tweaks `ResponseBody` to follow suit, such that it now holds an `Inner`, but does not expose its passthrough and rescue variants to callers. Signed-off-by: katelyn martin <[email protected]> * docs(app/core): document `ResponseBody<R, B>` this adds a small documentation comment describing what this type does. Signed-off-by: katelyn martin <[email protected]> * refactor(app/core): a unit test suite for rescue body this commit introduces a test suite for our error recovery middleware. this body middleware provides a mechanism to "rescue" errors, gracefully mapping an error encountered when polling a gRPC body into e.g. trailers with a gRPC status code. before we upgrade this middleware in service of linkerd/linkerd2#8733, we add some test coverage to ensure that we preserve this middleware. Signed-off-by: katelyn martin <[email protected]> --------- Signed-off-by: katelyn martin <[email protected]>
Currently
app_core::errors
handles synthesis of all error responses.This means that this module needs to know about all possible error
types. To work around this, we wrap some errors with a special
HttpError
type so this module can create a synthetic response.Furthermore, this means that we can't handle errors differently in
inbound/outbound contexts.
This change modies the
error::respond
module with a new strategytrait--
HttpRescue
--that is responsible for producing aSyntheticHttpResponse
if the error can be handled and returning anerror if it should not be handled gracefully. A
DefaultHttpRescue
typeencodes default behavior and inbound- and outbound-specific types extend
this functionality with appropriate error handling.
This change enables us to perform narrower error handling in more places
in the stack--for example, so that some error responses are included in
response metrics.
This change modifies HTTP/1 error responses to include a
connection: close
header when the server-side connection is being torndown.