Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http2: add optional connection ping/keepalive #13152

Merged
merged 29 commits into from
Sep 28, 2020

Conversation

ggreenway
Copy link
Contributor

Signed-off-by: Greg Greenway [email protected]

Commit Message:
Additional Description:
Risk Level: low (disabled by default)
Testing: Added new tests
Docs Changes: documented
Release Notes: added
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #13152 was opened by ggreenway.

see: more, trace.

@ggreenway
Copy link
Contributor Author

Should this be noted in the XDS docs somewhere as "we recommend setting this"? I didn't see a really good place to add it.

Any idea how to make the upstream integration test work? I expected what I wrote to work, but the test waits forever without the connection getting closed.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for working on this. Just starting with a few API comments before diving into the code.

/wait

Comment on lines 357 to 358
// If configured, this value must be less than :ref:`connection_keepalive_interval
// <envoy_v3_api_field_config.core.v3.Http2ProtocolOptions.connection_keepalive_interval>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why must this be less? Can't it work similar to health checking where the interval and the timeout are disjoint? It seems like that might be what we want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this for ease of implementation (so there can't be two outstanding pings at the same time). I figured it would easy to relax this at some point in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how you can guarantee that no matter what you do, due to event loop delay, etc. It seems like it would be better if it worked like health checking where there is an outstanding ping with a timeout, and then an interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that doing it like the health checker will be better (and more consistent). I'll go that way.

Comment on lines 349 to 350
// Send HTTP/2 PING frames at this period, in order to test that the connection is still alive.
// If not specified, no keepalive PING frames will be sent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have any built-in jitter here, either just implicit or defined as a param?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; we can probably borrow some of the stuff from health checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -345,6 +345,22 @@ message Http2ProtocolOptions {
// <https://www.iana.org/assignments/http2-parameters/http2-parameters.xhtml#settings>`_ for
// standardized identifiers.
repeated SettingsParameter custom_settings_parameters = 13;

// Send HTTP/2 PING frames at this period, in order to test that the connection is still alive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on making this a dedicated message, with the sub-fields all required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't care either way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would recommend that if you don't mind either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, +1 here.

ENVOY_CONN_LOG(trace, "Sending keepalive PING {}", connection_, ms_since_epoch);

// The last parameter is an opaque 8-byte buffer, so this cast is safe.
int rc = nghttp2_submit_ping(session_, 0 /*flags*/, reinterpret_cast<uint8_t*>(&ms_since_epoch));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like processing of the PING ACK is missing in the legacy codec.

@yanavlasov
Copy link
Contributor

It would also be good to add codec_impl_test.cc tests for the PINGs. There is a bit of an overlap with the integration tests, but having these lower level tests would be good too.

@yanavlasov yanavlasov self-assigned this Sep 19, 2020

// The last parameter is an opaque 8-byte buffer, so this cast is safe.
int rc = nghttp2_submit_ping(session_, 0 /*flags*/, reinterpret_cast<uint8_t*>(&ms_since_epoch));
ASSERT(rc == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How confident are you that this is always true?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this is only NOMEM. In that case, I think a RELEASE_ASSERT is warranted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the pattern from elsewhere in the file; it's pervasive. I personally have no confidence that it's true.

* use same timer approach as for health checks

Signed-off-by: Greg Greenway <[email protected]>
@ggreenway
Copy link
Contributor Author

@mattklein123 I added jitter, but it blew up the size of the PR, due to needing to pass RandomGenerator to the codec now. But I think it's worth it; jitter is rather important.

@mattklein123
Copy link
Member

Can you merge main and check CI and I can take a look?

/wait

@ggreenway
Copy link
Contributor Author

Can you merge main and check CI and I can take a look?

/wait

I'm still trying to figure out a unit test (but I keep getting distracted by other things).

@ggreenway
Copy link
Contributor Author

It would also be good to add codec_impl_test.cc tests for the PINGs. There is a bit of an overlap with the integration tests, but having these lower level tests would be good too.

@yanavlasov I tried, but I can't figure out what/how the tests work. I think I can test the jitter, and that the initial timer is set, and what happens when the timeout timer fires, but I don't have any idea how to verify that a PING was either sent, or how to send/not-send a response PING.

Signed-off-by: Greg Greenway <[email protected]>
@mattklein123
Copy link
Member

@yanavlasov I tried, but I can't figure out what/how the tests work. I think I can test the jitter, and that the initial timer is set, and what happens when the timeout timer fires, but I don't have any idea how to verify that a PING was either sent, or how to send/not-send a response PING.

The codec tests use a fake codec client for the other half of the connection. It should respond to pings just like a real connection does. Thus, if you don't hold any data, it should respond, and you for example held the the response data, it should not send any ping response. Whether this is worth it or not I will defer to both of you.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API LGTM modulo small comments. I will take a look at the code once you sort out the CI issues.

/wait

gt {nanos: 1000000}
}];

// An optional jitter amount as a percentage of interval_ms. If specified,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/interval_ms/interval here and below

api/envoy/config/core/v3/protocol.proto Outdated Show resolved Hide resolved
@@ -345,6 +365,9 @@ message Http2ProtocolOptions {
// <https://www.iana.org/assignments/http2-parameters/http2-parameters.xhtml#settings>`_ for
// standardized identifiers.
repeated SettingsParameter custom_settings_parameters = 13;

// Send HTTP/2 PING frames to verify that the connection is still healthy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you clarify what happens if the keep alive fails?

@ggreenway
Copy link
Contributor Author

Got another clang_tidy error about not being able to find a header on the last run. Hoping that merging master will make it pass. No idea why it failed, since no related files changed.

@ggreenway
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines, to retry CircleCI checks, use /retest-circle.
Cannot retry non-completed check: envoy-presubmit (Linux-x64 compile_time_options), please wait.

🐱

Caused by: a #13152 (comment) was created by @ggreenway.

see: more, trace.

@mattklein123
Copy link
Member

If you merge main it should fix gcc. I think your other issues are an AZP outage.

/wait

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm api

docs/root/version_history/current.rst Show resolved Hide resolved
Signed-off-by: Greg Greenway <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with small comments.

/wait

Comment on lines 702 to 703
interval: 1s
timeout: 1s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem pretty aggressive. I would go either a longer interval and longer timeout? Same elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just an example; people should come up with their own values. But if you want it changed, what values should we use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People just copy the docs... I would probably use 30s interval with 5s timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 700 to 703
# configure a TCP keep-alive to detect and reconnect to the admin
# server in the event of a TCP socket disconnection
tcp_keepalive:
...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to also demonstrate TCP keep alive and describe both options? I think TCP keep alive would work fine in many cases and is lower overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. I'll update.

@ggreenway
Copy link
Contributor Author

I don't understand why, but //test/config_test:example_configs_test starting failing after I touched the examples, because they use deprecated field watchdog. That field was deprecated a week ago; why did it start failing now? Was bazel caching the test result based only on the content of the example .yaml, and not re-running when Envoy changed? @htuch

@mattklein123
Copy link
Member

I don't understand why, but //test/config_test:example_configs_test starting failing after I touched the examples, because they use deprecated field watchdog. That field was deprecated a week ago; why did it start failing now? Was bazel caching the test result based only on the content of the example .yaml, and not re-running when Envoy changed? @htuch

Hmm interesting. cc @phlax who has been working on this.

@mattklein123
Copy link
Member

LGTM. Can you merge main one more time which should fix ASAN? I don't understand the example config issue but we can look at that separately. I think it must have to do with the fact that watchdog is a bootstrap field, but I don't understand why the failure would not be deterministic.

/wait

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants