http2: add optional connection ping/keepalive #13152
Signed-off-by: Greg Greenway <[email protected]>
Should this be noted in the xDS docs somewhere as "we recommend setting this"? I didn't see a good place to add it. Also, any idea how to make the upstream integration test work? I expected what I wrote to work, but the test waits forever without the connection ever being closed.
Nice, thanks for working on this. Just starting with a few API comments before diving into the code.
/wait
// If configured, this value must be less than :ref:`connection_keepalive_interval
// <envoy_v3_api_field_config.core.v3.Http2ProtocolOptions.connection_keepalive_interval>`.
Why must this be less? Can't it work similarly to health checking, where the interval and the timeout are disjoint? It seems like that might be what we want.
I did this for ease of implementation (so there can't be two outstanding pings at the same time). I figured it would be easy to relax this at some point in the future.
I'm not sure how you can guarantee that, no matter what you do, given event loop delay, etc. It seems like it would be better if it worked like health checking, where there is an outstanding ping with a timeout, and then an interval.
Yeah, I agree that doing it like the health checker will be better (and more consistent). I'll go that way.
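For readers following the thread, the health-checker-style approach agreed on here can be sketched roughly as below. This is a simplified model, not the code in this PR: `KeepaliveSketch`, `onTimerFired`, `onPingAck`, and the two callbacks are hypothetical names, and a real implementation would use the dispatcher's timers rather than recording which timer is armed. The point it illustrates is that only one timer is armed at a time, so a second PING is never sent while one is outstanding, regardless of how the interval and timeout compare.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical, simplified model of an interval timer plus a separate timeout
// timer. Exactly one of the two is armed at any moment.
class KeepaliveSketch {
public:
  using Ms = std::chrono::milliseconds;
  enum class Armed { None, Interval, Timeout };

  KeepaliveSketch(Ms interval, Ms timeout, std::function<void()> send_ping,
                  std::function<void()> close_connection)
      : interval_(interval), timeout_(timeout), send_ping_(std::move(send_ping)),
        close_connection_(std::move(close_connection)) {
    arm(Armed::Interval, interval_);
  }

  // Called by the (assumed) event loop when the currently armed timer expires.
  void onTimerFired() {
    if (armed_ == Armed::Interval) {
      // Time to probe: send a PING and start waiting for its ACK.
      send_ping_();
      arm(Armed::Timeout, timeout_);
    } else if (armed_ == Armed::Timeout) {
      // No ACK arrived in time: treat the connection as dead.
      arm(Armed::None, Ms(0));
      close_connection_();
    }
  }

  // Called when a PING ACK is received from the peer.
  void onPingAck() {
    if (armed_ == Armed::Timeout) {
      // The peer answered: cancel the timeout and wait a full interval again.
      arm(Armed::Interval, interval_);
    }
  }

  Armed armed() const { return armed_; }
  Ms armedDelay() const { return delay_; }

private:
  void arm(Armed which, Ms delay) {
    armed_ = which;
    delay_ = delay;
  }

  const Ms interval_;
  const Ms timeout_;
  std::function<void()> send_ping_;
  std::function<void()> close_connection_;
  Armed armed_{Armed::None};
  Ms delay_{0};
};
```

Because the interval timer is only re-armed after an ACK, event loop delay can postpone a PING but cannot cause two to be in flight at once.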
// Send HTTP/2 PING frames at this period, in order to test that the connection is still alive.
// If not specified, no keepalive PING frames will be sent.
Should we have any built-in jitter here, either just implicit or defined as a param?
Good point; we can probably borrow some of the stuff from health checks.
+1
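As an illustration of the jitter being discussed, here is a minimal sketch assuming jitter is expressed as a percentage of the interval and is only ever added to it. `jitteredInterval` is a hypothetical helper name, and the choice of `std::mt19937_64` is an assumption made for this example, not necessarily what the health checker or this PR uses.

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical helper: returns `base` plus a uniformly distributed extra delay
// in [0, base * jitter_percent / 100] milliseconds.
std::chrono::milliseconds jitteredInterval(std::chrono::milliseconds base,
                                           uint32_t jitter_percent, std::mt19937_64& rng) {
  const int64_t max_jitter_ms =
      static_cast<int64_t>(base.count()) * static_cast<int64_t>(jitter_percent) / 100;
  if (max_jitter_ms <= 0) {
    // No jitter configured (or interval too small for it to matter).
    return base;
  }
  std::uniform_int_distribution<int64_t> dist(0, max_jitter_ms);
  return base + std::chrono::milliseconds(dist(rng));
}
```

Spreading the PINGs this way keeps many connections that were opened at the same moment from all probing their peer in lockstep.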
@@ -345,6 +345,22 @@ message Http2ProtocolOptions {
  // <https://www.iana.org/assignments/http2-parameters/http2-parameters.xhtml#settings>`_ for
  // standardized identifiers.
  repeated SettingsParameter custom_settings_parameters = 13;

  // Send HTTP/2 PING frames at this period, in order to test that the connection is still alive.
Thoughts on making this a dedicated message, with the sub-fields all required?
I don't care either way
I think I would recommend that if you don't mind either way.
Yeah, +1 here.
ENVOY_CONN_LOG(trace, "Sending keepalive PING {}", connection_, ms_since_epoch);

// The last parameter is an opaque 8-byte buffer, so this cast is safe.
int rc = nghttp2_submit_ping(session_, 0 /*flags*/, reinterpret_cast<uint8_t*>(&ms_since_epoch));
Looks like processing of the PING ACK is missing in the legacy codec.
It would also be good to add codec_impl_test.cc tests for the PINGs. There is a bit of an overlap with the integration tests, but having these lower level tests would be good too.
// The last parameter is an opaque 8-byte buffer, so this cast is safe.
int rc = nghttp2_submit_ping(session_, 0 /*flags*/, reinterpret_cast<uint8_t*>(&ms_since_epoch));
ASSERT(rc == 0);
How confident are you that this is always true?
I see, this is only NOMEM. In that case, I think a RELEASE_ASSERT is warranted.
I copied the pattern from elsewhere in the file; it's pervasive. I personally have no confidence that it's true.
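For context on the opaque payload in the snippet under review: an HTTP/2 PING frame carries 8 octets of opaque data that the peer echoes back in its ACK, so a send timestamp can ride along and be used to measure round-trip time. The sketch below shows one way to pack and unpack such a timestamp with `std::memcpy` instead of a pointer cast. The helper names (`encodePingPayload`, `decodePingPayload`, `roundTrip`) are hypothetical, and the example deliberately does not call nghttp2, so it says nothing about how the PR handles the ACK.

```cpp
#include <array>
#include <chrono>
#include <cstdint>
#include <cstring>

// The 8-octet opaque payload of an HTTP/2 PING frame.
using PingPayload = std::array<uint8_t, 8>;

// Hypothetical helper: copy a millisecond timestamp into the payload. Host byte
// order is fine here because only the sender ever interprets the echoed bytes.
PingPayload encodePingPayload(uint64_t ms_since_epoch) {
  PingPayload out{};
  std::memcpy(out.data(), &ms_since_epoch, out.size());
  return out;
}

// Hypothetical helper: recover the timestamp from the payload echoed in the ACK.
uint64_t decodePingPayload(const uint8_t* opaque_data) {
  uint64_t ms_since_epoch = 0;
  std::memcpy(&ms_since_epoch, opaque_data, sizeof(ms_since_epoch));
  return ms_since_epoch;
}

// Round-trip time, given the echoed payload and the current time in ms.
std::chrono::milliseconds roundTrip(const uint8_t* opaque_data, uint64_t now_ms) {
  return std::chrono::milliseconds(now_ms - decodePingPayload(opaque_data));
}
```

In a codec, the array's `data()` pointer would be what gets handed to the library's PING submission call, and the decode step would run in whatever callback observes the PING ACK.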
* use same timer approach as for health checks Signed-off-by: Greg Greenway <[email protected]>
@mattklein123 I added jitter, but it blew up the size of the PR, due to needing to pass |
Can you merge main and check CI, and then I can take a look? /wait
I'm still trying to figure out a unit test (but I keep getting distracted by other things).
@yanavlasov I tried, but I can't figure out how the tests work. I think I can test the jitter, that the initial timer is set, and what happens when the timeout timer fires, but I don't have any idea how to verify that a PING was sent, or how to send or withhold a response PING.
The codec tests use a fake codec client for the other half of the connection. It should respond to pings just like a real connection does. Thus, if you don't hold any data, it should respond, and if you, for example, held the response data, it should not send any ping response. Whether this is worth it or not I will defer to both of you.
API LGTM modulo small comments. I will take a look at the code once you sort out the CI issues.
/wait
gt {nanos: 1000000}
}];

// An optional jitter amount as a percentage of interval_ms. If specified,
s/interval_ms/interval here and below
@@ -345,6 +365,9 @@ message Http2ProtocolOptions {
  // <https://www.iana.org/assignments/http2-parameters/http2-parameters.xhtml#settings>`_ for
  // standardized identifiers.
  repeated SettingsParameter custom_settings_parameters = 13;

  // Send HTTP/2 PING frames to verify that the connection is still healthy.
nit: can you clarify what happens if the keep alive fails?
Got another clang_tidy error about not being able to find a header on the last run. Hoping that merging master will make it pass. No idea why it failed, since no related files changed.
/retest
Retrying Azure Pipelines, to retry CircleCI checks, use |
If you merge main it should fix gcc. I think your other issues are an AZP outage. /wait
/lgtm api
LGTM with small comments.
/wait
api/xds_protocol.rst
interval: 1s
timeout: 1s
These seem pretty aggressive. I would go with a longer interval and a longer timeout. Same elsewhere.
It's just an example; people should come up with their own values. But if you want it changed, what values should we use?
People just copy the docs... I would probably use 30s interval with 5s timeout?
updated
api/xds_protocol.rst
# configure a TCP keep-alive to detect and reconnect to the admin
# server in the event of a TCP socket disconnection
tcp_keepalive:
...
Do we want to also demonstrate TCP keep alive and describe both options? I think TCP keep alive would work fine in many cases and is lower overhead.
Yeah, that makes sense. I'll update.
I don't understand why, but |
Hmm interesting. cc @phlax who has been working on this.
LGTM. Can you merge main one more time which should fix ASAN? I don't understand the example config issue but we can look at that separately. I think it must have to do with the fact that watchdog is a bootstrap field, but I don't understand why the failure would not be deterministic. /wait
Signed-off-by: Greg Greenway [email protected]
Commit Message:
Additional Description:
Risk Level: low (disabled by default)
Testing: Added new tests
Docs Changes: documented
Release Notes: added
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]