Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

errors: Support contextual error handling strategies #1246

Merged
merged 28 commits into from
Sep 9, 2021
Merged

Conversation

olix0r
Copy link
Member

@olix0r olix0r commented Sep 7, 2021

Currently app_core::errors handles synthesis of all error responses.
This means that this module needs to know about all possible error
types. To work around this, we wrap some errors with a special
HttpError type so this module can create a synthetic response.
Furthermore, this means that we can't handle errors differently in
inbound/outbound contexts.

This change modies the error::respond module with a new strategy
trait--HttpRescue--that is responsible for producing a
SyntheticHttpResponse if the error can be handled and returning an
error if it should not be handled gracefully. A DefaultHttpRescue type
encodes default behavior and inbound- and outbound-specific types extend
this functionality with appropriate error handling.

This change enables us to perform narrower error handling in more places
in the stack--for example, so that some error responses are included in
response metrics.

This change modifies HTTP/1 error responses to include a
connection: close header when the server-side connection is being torn
down.

Our error handlers need to account for the fact that HTTP/1 requests may
fail with HTTP/2 errors due to orig-proto upgrading.

To simplify error handling (in an upcoming change), the
`orig_proto::Upgrade` service introduces a new error type,
`orig_proto::DowngradedH2Error` that hides the original error source so
that these errors are not detected as being an `h2::Error`.
Currently `app_core::errors` handles synthesis of all error responses.
This means that this module needs to know about all possible error
types. To work around this, we wrap some errors with a special
`HttpError` type so this module can create a synthetic response.
Furthermore, this means that we can't handle errors differently in
inbound/outbound contexts.

This change modies the `error::respond` module with a new strategy
trait--`HttpRescue`--that is responsible for producing a
`SyntheticHttpResponse` if the error can be handled and returning an
error if it should not be handled gracefully. A `DefaultHttpRescue` type
encodes default behavior and inbound- and outbound-specific types extend
this functionality with appropriate error handling.

This change will enable us to perform error handling in more places in
the stack so that we can be more tactical about which errors are handled
in various parts of the stack--for example, so that error responses are
included in respose metrics.

This change modifies HTTP/1 error responses to include a `connection:
close` header when the server-side connection is being torn down.
@olix0r olix0r requested a review from a team September 7, 2021 20:42
When an HTTP client throws an error--usually and I/O or connection
timeout error, these metrics are not recorded with the appropriate 5XX
status or reported to Tap. This change synthesizes responses for these
errors before incrementing HTTP response counters.
Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Have two smaller questions before approving.

linkerd/app/core/src/errors/respond.rs Outdated Show resolved Hide resolved
return Ok(errors::SyntheticHttpResponse {
http_status: http::StatusCode::GATEWAY_TIMEOUT,
grpc_status: errors::Grpc::DeadlineExceeded,
close_connection: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the only rescue that doesn't close the connection. Is this because we expect the proxy to retry in this case and possibly establish a connection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other errors can be indicative of the target going away. with response timeouts, we're dealing with some user-configured timeout. there doesn't seem to be any reason to tear down the connection in this case--the request has probably reached the application, it's just slow.

Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

Copy link
Contributor

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall, this looks good to me! i picked a bunch of obnoxious minor nits, but you can feel free to ignore them!

linkerd/app/core/src/errors/mod.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/errors/respond.rs Outdated Show resolved Hide resolved
pub grpc_status: tonic::Code,
pub http_status: http::StatusCode,
pub close_connection: bool,
pub message: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (possibly a premature optimization): I notice sometimes these strings are formatted dynamically, but sometimes they are fixed static strings that are converted to Strings. I wonder if we want to use Cow<'static, str> here so that we don't have to allocate a bunch of copies of the same string when it isn't dynamic?

we would then match on the Cow and call HeaderValue::from_static when the Cow has a static string in it (as a side note, i floated the idea to Sean of adding a TryFrom<Cow<'static, str>> to http::header::HeaderValue that would do that for us).

linkerd/app/core/src/errors/respond.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/errors/respond.rs Outdated Show resolved Hide resolved
fn grpc_response<B: Default>(code: tonic::Code, message: &str) -> http::Response<B> {
info!(%code, %message, "Handling error on gRPC connection");

http::Response::builder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/TIOLI (possibly a premature optimization): we always add exactly 5 headers to the headermap here, so I wonder if we want to consider pre-allocating enough capacity for them? that way, we know it won't allocate multiple times...

Suggested change
http::Response::builder()
let mut builder = http::Response::builder();
// We always add 5 headers, so preallocate enough capacity for them.
builder
.headers_mut()
.expect("builder will not have an error")
.reserve(5);
builder

linkerd/app/core/src/errors/respond.rs Show resolved Hide resolved
linkerd/app/inbound/src/http/router.rs Outdated Show resolved Hide resolved
linkerd/app/inbound/src/http/server.rs Outdated Show resolved Hide resolved
@olix0r olix0r merged commit 556cc61 into main Sep 9, 2021
@olix0r olix0r deleted the ver/error-mas branch September 9, 2021 02:39
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Sep 9, 2021
This release improves error handling so that HTTP metrics include 5XX
responses for common errors.

Logging has also been improved to ensure `inbound` and `outbound` spans
are always present in log messages.

Outbound tap has been fixed to include route labels when service
profiles are configured.

---

* Set tracing spans on policy client (linkerd/linkerd2-proxy#1241)
* build(deps): bump tokio-util from 0.6.7 to 0.6.8 (linkerd/linkerd2-proxy#1240)
* core: fix missing spans in `serve` tasks (linkerd/linkerd2-proxy#1243)
* build(deps): bump thiserror from 1.0.28 to 1.0.29 (linkerd/linkerd2-proxy#1244)
* orig-proto: Avoid emiting HTTP/2 errors for upgraded requests (linkerd/linkerd2-proxy#1245)
* Fix route labels on outbound tap metadata (linkerd/linkerd2-proxy#1247)
* errors: Support contextual error handling strategies (linkerd/linkerd2-proxy#1246)
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Sep 9, 2021
This release improves error handling so that HTTP metrics include 5XX
responses for common errors.

Logging has also been improved to ensure `inbound` and `outbound` spans
are always present in log messages.

Outbound tap has been fixed to include route labels when service
profiles are configured.

---

* Set tracing spans on policy client (linkerd/linkerd2-proxy#1241)
* build(deps): bump tokio-util from 0.6.7 to 0.6.8 (linkerd/linkerd2-proxy#1240)
* core: fix missing spans in `serve` tasks (linkerd/linkerd2-proxy#1243)
* build(deps): bump thiserror from 1.0.28 to 1.0.29 (linkerd/linkerd2-proxy#1244)
* orig-proto: Avoid emiting HTTP/2 errors for upgraded requests (linkerd/linkerd2-proxy#1245)
* Fix route labels on outbound tap metadata (linkerd/linkerd2-proxy#1247)
* errors: Support contextual error handling strategies (linkerd/linkerd2-proxy#1246)
cratelyn added a commit that referenced this pull request Feb 13, 2025
`linkerd-app-core` includes an error recovery body middleware. this middleware will gracefully catch and report errors encountered when polling an inner body, and via an `R`-typed recovery strategy provided by the caller, will attempt to map the error to a gRPC status code denoting an error.

before we upgrade to hyper 1.0 in service of linkerd/linkerd2#8733, we add some test coverage to ensure that we preserve the behavior of this middleware.

see:
* linkerd/linkerd2#8733
* #3614.

for historical context on this tower layer, see:
* #222
* #1246
* #1282

---

* refactor(http/retry): outline `ForwardCompatibleBody<B>`

in #3559 (4b53081), we introduced a backported `Frame<T>` type, and a
`ForwardCompatibleBody<B>` type that allows us to interact with a
`http_body::Body` circa 0.4.6 in terms of frame-based interfaces that
match those of the 1.0 interface.

see linkerd/linkerd2#8733 for more information on upgrading hyper.

in #3559, we narrowly added this as an internal submodule of the
`linkerd-http-retry` library. these facilities however, would have
utility in other places such as `linkerd-app-core`.

this commit pulls these compatibility shims out into a
`linkerd-http-body-compat` library so that they can be imported and
reused elsewhere.

Signed-off-by: katelyn martin <[email protected]>

* nit(http/body-compat): tidy `combinators` imports

Signed-off-by: katelyn martin <[email protected]>

* refactor(app/core): hoist `errors::code_header` helper

Signed-off-by: katelyn martin <[email protected]>

* refactor(app/core): `l5d-*` constants are headers

these are header values. `http::HeaderName` has a const fn constructor,
so let's use that.

Signed-off-by: katelyn martin <[email protected]>

* refactor(app/core): grpc constants are headers

Signed-off-by: katelyn martin <[email protected]>

* refactor(app/core): hoist `l5d-` and `grpc-` constants

Signed-off-by: katelyn martin <[email protected]>

* refactor(app/core): outline `ResponseBody` middleware

we'll add a few tests for this middleware shortly.

this commit moves this middleware out into its own submodule.

Signed-off-by: katelyn martin <[email protected]>

* refactor(app/core): encapsulate `ResponseBody` enum

for other body middleware, we hide inner enum variants and their
constituent members by using the "inner" pattern.

this commit tweaks `ResponseBody` to follow suit, such that it now holds
an `Inner`, but does not expose its passthrough and rescue variants to
callers.

Signed-off-by: katelyn martin <[email protected]>

* docs(app/core): document `ResponseBody<R, B>`

this adds a small documentation comment describing what this type does.

Signed-off-by: katelyn martin <[email protected]>

* refactor(app/core): a unit test suite for rescue body

this commit introduces a test suite for our error recovery middleware.

this body middleware provides a mechanism to "rescue" errors, gracefully
mapping an error encountered when polling a gRPC body into e.g. trailers
with a gRPC status code.

before we upgrade this middleware in service of linkerd/linkerd2#8733,
we add some test coverage to ensure that we preserve this middleware.

Signed-off-by: katelyn martin <[email protected]>

---------

Signed-off-by: katelyn martin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants