Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

health check: gracefully handle GOAWAY in grpc health checker #11324

Merged
merged 20 commits into from
Jun 24, 2020

Conversation

jmuia
Copy link
Contributor

@jmuia jmuia commented May 26, 2020

Commit Message:
Gracefully handle GOAWAY in gRPC health checker, allowing in-progress healthchecks to complete before closing the connection.

Additional Description:
We've observed failures in our production and integration testing environments due to the gRPC health checker interpreting GOAWAYs as a health check failure. As an example, if all endpoints in an upstream cluster drain their incoming connections at the same time (eg. listener config change) a grpc healthcheck client with unhealthy_threshold=1 will see the entire cluster become unhealthy, which is undesirable.

This is my first contribution to Envoy, and would appreciate a close eye for edge cases I may have missed!

Risk Level: low/medium
Testing: added unit tests, ran (failing) grpc hc internal integration test with this patch which passed
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joey Muia [email protected]

@murray-stripe
Copy link
Contributor

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #11324 (comment) was created by @murray-stripe.

see: more, trace.

@jmuia jmuia force-pushed the jmuia.grpc-hc-goaway branch from dae6ee5 to 518fe68 Compare May 27, 2020 19:07
@murray-stripe
Copy link
Contributor

/azp run envoy-presubmit

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 11324 in repo envoyproxy/envoy

@murray-stripe
Copy link
Contributor

murray-stripe commented May 28, 2020

@lizan These tests seem to be a flake. Or at least I can't seem to reproduce the failure locally with either

bazel test //test/common/config:subscription_factory_impl_test
# or
./ci/do_ci.sh 'bazel.release' //test/common/config:subscription_factory_impl_test

Would you mind taking a look, or kicking off the azure-pipeline tests again?

Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! A few small comments and an open question, otherwise this looks good. I would probably want a release note for this, especially if we end up making this configurable.

handleFailure(envoy::data::core::v3::NETWORK);
expect_reset_ = true;
request_encoder_->getStream().resetStream(Http::StreamResetReason::LocalReset);
received_goaway_ = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A one liner comment here about when this connection would eventually get closed would be nice. IIUC it's either on the next timeout or when the remote closes it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment. It would be closed when the active check completes (onRpcComplete) or other conditions like onResetStream and onTimeout.

@@ -757,6 +763,8 @@ void GrpcHealthCheckerImpl::GrpcActiveHealthCheckSession::onRpcComplete(
handleFailure(envoy::data::core::v3::ACTIVE);
}

const bool goaway = received_goaway_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind adding a comment here explaining why we have to read the value here? Its a bit less clear than on L705 since we're not unconditionally calling resetState

@@ -4529,21 +4529,176 @@ TEST_F(GrpcHealthCheckerImplTest, GrpcFailUnknownHealthStatus) {
cluster_->prioritySet().getMockHostSet(0)->hosts_[0]->health());
}

// Test receiving GOAWAY is interpreted as connection close event.
// Test receiving GOAWAY is handled gracefully while a check is in progress.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells me that the code intentionally treated GOAWAY as failures, which makes me wonder if this is the desired behavior for some? Does this mean that this should be configurable?

@baranov1ch @htuch @mattklein123 WDYT

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might depend on the reason for the GOAWAY? I.e. a NO_ERROR mid check might be fine as it's telling to drain, but not if it was an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense.

It looks like the GOAWAY error code isn't provided in the callback at the moment, but I think it would be reasonable to do that?

It seems that the onGoAway() callbacks are only fired once (see here), and - with error codes - we'd want to call them for each GOAWAY frame received. I'm not sure if that would be problematic. For reference it looks the "fire once" behaviour was added in #103.

It may also be worth mentioning that the http2 connection pool implementation doesn't change behaviour based on the GOAWAY error code at the moment (see here). It transitions the active client to the draining state if it has active requests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, if the GOAWAY error code is NO_ERROR we should just treat it as "don't reuse connection", not an error. We may have to add some additional error code info to the callback. Note that @antoniovicente just opened #11420, though IIRC nghttp2 does provide the reason code inside the frame here:

if (frame->hd.type == NGHTTP2_GOAWAY && !raised_goaway_) {
, we just need to store it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IRC nghttp2 does provide the reason code inside the frame here:

Yep - it looks like it's provided in frame->goaway.error_code and the debug data in frame->goaway.opaque_data

I'm happy to update the onGoAway() interface as part of this PR if @antoniovicente doesn't already have something started.

Does something along these lines look right?

  # Maybe pass an Envoy-internal enum rather than the nghttp2 enum?
  virtual void onGoAway(nghttp2_error_code error_code, absl::Span<uint8_t> debug_data) PURE;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not leak nghttp2 specific structs to the public interface, so at least for the error code we should do a mapping. For your use case maybe just have an enum with {NO_ERROR, OTHER} and @antoniovicente can fill out more stuff later as needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum with {NO_ERROR, OTHER} sounds like a good start

murray-stripe and others added 2 commits June 9, 2020 11:07
Signed-off-by: Joey Muia <[email protected]>
Signed-off-by: Joey Muia <[email protected]>
@jmuia jmuia requested a review from alyssawilk as a code owner June 9, 2020 19:12
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/.

🐱

Caused by: #11324 was synchronize by jmuia.

see: more, trace.

murray-stripe and others added 5 commits June 10, 2020 08:20
Revert "fix format with 'tools/code_format/check_format.py fix'"
Signed-off-by: Joey Muia <[email protected]>
Copy link
Contributor Author

@jmuia jmuia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snowp this is ready for another review.

(sorry for pinging everyone with the formatting change! 😬)

// notifications are the same as a normal GOAWAY.
if (frame->hd.type == NGHTTP2_GOAWAY && !raised_goaway_) {
// Shutdown notifications are the same as a normal GOAWAY.
if (frame->hd.type == NGHTTP2_GOAWAY) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is fine or if we want to track which goaway error codes we've already seen, and run the callbacks once per error code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected to get multiple GOAWAY frames with different error codes? If yes, it seems bad only look at the first one, if no then we can probably keep the old logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to receive GOAWAY frames with different error codes:

An endpoint MAY send multiple GOAWAY frames if circumstances change. For instance, an endpoint that sends GOAWAY with NO_ERROR during graceful shutdown could subsequently encounter a condition that requires immediate termination of the connection.

(from https://http2.github.io/http2-spec/#GOAWAY)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm in that case should the logic for failing the HC be that we see a NO_ERROR + no error GOAWAYs?

Also I'm not sure what the implications are of making onGoAway trigger multiple times per stream, maybe @mattklein123 or @alyssawilk has an idea? I'd be worried about invariants elsewhere relying on this being called at most once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm in that case should the logic for failing the HC be that we see a NO_ERROR + no error GOAWAYs?

Hm, I don't quite follow, could you rephrase?

I'd be worried about invariants elsewhere relying on this being called at most once.

Agreed. It seems it was added in #103. I'm not sure if that was an optimization back then. I'm also not sure if it has become an expected behaviour (intentionally or not) since then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  if (request_encoder_ && error_code == Http::GoAwayErrorCode::NoError) {
    received_no_error_goaway_ = true;
    return;
  }

if a GOAWAY error follows a NO_ERROR then received_no_error_goaway_ would still be true. I was thinking that the logic should reset this flag if it saw an error frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I don't expect that to change the behaviour as it stands:

  • If there isn't an active probe, the first (NO_ERROR) GOAWAY would've caused the connection to be closed (and we only set received_no_error_goaway_ = true; when there's an active probe).
  • If there is an active probe, the stream reset will end up resetting state (resetState() will be called by onResetStream()).
  • In general, I believe a connection close will end up resetting state because the onResetStream() callbacks will run (from here).

I admit this isn't obvious / does feel a bit brittle -- happy to add a comment and/or reset the flag anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Friendly ping @mattklein123 @alyssawilk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. My preference would be to not raise multiple goaways from the codec as I'm guessing there will be code that won't expect this and will break.

Can we just keep the previous behavior and not handle the case where multiple go away frames are received? We can potentially deal with this in a follow up if someone requests it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback - that seems reasonable to me. I've reverted that specific commit and added a TODO instead.

@@ -1456,7 +1457,7 @@ TEST_P(Http2CodecImplTest, LargeRequestHeadersExceedPerHeaderLimit) {
request_headers.addCopy("big", long_string);

EXPECT_CALL(request_decoder_, decodeHeaders_(_, _)).Times(0);
EXPECT_CALL(client_callbacks_, onGoAway());
EXPECT_CALL(client_callbacks_, onGoAway(_));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't completely follow why this test doesn't trigger onGoAway(_) to be called twice like the other one.

@@ -36,6 +36,11 @@ const char MaxResponseHeadersCountOverrideKey[] =

class Stream;

enum class ErrorCode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this GoAwayErrorCode? to make it clearer what kind of error codes are being modeled here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 will update. Initially this enumerated all the error codes here, which are shared with RST_STREAM. But we've since undone that and there's already a separate Envoy-internal enum for stream resets.

// notifications are the same as a normal GOAWAY.
if (frame->hd.type == NGHTTP2_GOAWAY && !raised_goaway_) {
// Shutdown notifications are the same as a normal GOAWAY.
if (frame->hd.type == NGHTTP2_GOAWAY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected to get multiple GOAWAY frames with different error codes? If yes, it seems bad only look at the first one, if no then we can probably keep the old logic.

// The connection will be closed when the active check completes or another
// terminal condition occurs, such as a timeout or stream reset.
if (request_encoder_ && error_code == Http::ErrorCode::NoError) {
received_goaway_ = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename this to make it clear that we received a non-error GOAWAY?

@@ -36,6 +36,11 @@ const char MaxResponseHeadersCountOverrideKey[] =

class Stream;

enum class GoAwayErrorCode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry should have mention this but this should have a doc comment as its in include/

// notifications are the same as a normal GOAWAY.
if (frame->hd.type == NGHTTP2_GOAWAY && !raised_goaway_) {
// Shutdown notifications are the same as a normal GOAWAY.
if (frame->hd.type == NGHTTP2_GOAWAY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm in that case should the logic for failing the HC be that we see a NO_ERROR + no error GOAWAYs?

Also I'm not sure what the implications are of making onGoAway trigger multiple times per stream, maybe @mattklein123 or @alyssawilk has an idea? I'd be worried about invariants elsewhere relying on this being called at most once.

@jmuia
Copy link
Contributor Author

jmuia commented Jun 17, 2020

@snowp the CI failures look like flakes. Can you re-run them?

@mattklein123 mattklein123 self-assigned this Jun 20, 2020
@jmuia
Copy link
Contributor Author

jmuia commented Jun 24, 2020

/retest

@repokitteh-read-only
Copy link

🤷‍♀️ nothing to rebuild.

🐱

Caused by: a #11324 (comment) was created by @jmuia.

see: more, trace.

@jmuia
Copy link
Contributor Author

jmuia commented Jun 24, 2020

/azp run envoy-presubmit

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 11324 in repo envoyproxy/envoy

Signed-off-by: Joey Muia <[email protected]>
@jmuia
Copy link
Contributor Author

jmuia commented Jun 24, 2020

@snowp ready for another review. I reverted the commit that allowed multiple GOAWAYs and added a TODO instead.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks great. Will defer to @snowp for final review/merge.

Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating on this!

@snowp snowp merged commit 4c4fc05 into envoyproxy:master Jun 24, 2020
songhu pushed a commit to songhu/envoy that referenced this pull request Jun 25, 2020
…roxy#11324)

Gracefully handle GOAWAY in gRPC health checker, allowing in-progress health checks to complete before closing the connection.

Signed-off-by: Joey Muia <[email protected]>
Co-authored-by: John Murray <[email protected]>
yashwant121 pushed a commit to yashwant121/envoy that referenced this pull request Jul 24, 2020
…roxy#11324)

Gracefully handle GOAWAY in gRPC health checker, allowing in-progress health checks to complete before closing the connection.

Signed-off-by: Joey Muia <[email protected]>
Co-authored-by: John Murray <[email protected]>
Signed-off-by: yashwant121 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants