-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
http: alpn upstream #13922
http: alpn upstream #13922
Conversation
Signed-off-by: Alyssa Wilk <[email protected]>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Unit tests are expected to fail - there's some mock clean up around timers and connection.connected() to do, but it's in pretty good shape stability-wise (beyond the new integration test, I've run the full http2_upstream_integration_test and http_integration_test with alpn forced true and they pass) I wanted to get include-API thumbs up before I spent too much time reordering mocks and writing the new nuit tests. |
Signed-off-by: Alyssa Wilk <[email protected]>
Whoops, meant to cc @mattklein123 and @snowp last week. Still got 1 of 2, not bad :-P |
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this long requested feature. I have some high level API/usability comments to get started. I have major regrets about how upstream h2 is configured and I fear this is only going to dig the whole deeper so want to discuss if there are other options we should consider.
/wait-any
// configured, Envoy will attempt to do ALPN negotiation for TLS connections, failing | ||
// over to HTTP/1.1 if ALPN negotiation fails. | ||
// If only one protocol option is present it will be used as the hard-coded | ||
// protocol. If neither is present, HTTP/1.1 will be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat orthogonal, while talking config though, I was of the mind to refuse to allow the new config unless TLS was explicitly configured - you can't do ALPN without TLS and given the ALPN pool "needs" to fail over to HTTP/1 I think it'd be easy to accidentally configure ALPN, forget TLS, and get locked into HTTP/1
We can't require TLS though, because there's other ALPN (say ALTS ALPN), which we use internally.
I was thinking we could make transport sockets register if they do ALPN, and reject config which enables H1/H2 without ALPN. That doesn't extend well to HTTP/3 which AFIK requires TLS/ALTS but doesn't actually do ALPN.
Worst case we could just comment a warning, and increment a stat of ALPN fails, but I'm wondering if you have ideas to make borked configs more obvious here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above. This is what I was going to suggest and I think we should do this.
Signed-off-by: Alyssa Wilk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this LGTM at a high level. It's going to need a ton of fixes around docs. We need to be super clear on how to configure things for the average user, so we will need to update all the sandboxes, probably ref link from the typed_protocol_config field, etc.
But if you are up for all of this, I think this will be a massive improvement. Thank you!
/wait
api/envoy/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto
Outdated
Show resolved
Hide resolved
|
||
config.core.v3.UpstreamHttpProtocolOptions upstream_http_protocol_options = 2; | ||
|
||
oneof upstream_protocol_options { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one can have a required validation.
Ok, I'll go beat the config loading into shape, and see how much of the config the deprecated CI gets, then ping for another pass. So waiting w.r.t Matt, but I would appreciate Snow's toughts w.r.t. alpn decoration. I'm not sure of the motivation for explicitly configured transport ALPN vs cluster failover ALPN so I'm not sure how to address Matt's question there. |
Signed-off-by: Alyssa Wilk <[email protected]>
new conflicts are just two build file additions and two include additions. Will push another update but it should have no impact on tests. |
Signed-off-by: Alyssa Wilk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool. Flushing out some comments to get started. Thank you!
/wait
// protocol is negotiated by ALPN with the upstream. | ||
// Clusters configured with *AutoHttpConfig* will use the highest available | ||
// protocol; HTTP/2 if supported, otherwise HTTP/1. | ||
// If the upstream does not support ALPN, *AutoHttpConfig* will will fail over to HTTP/1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If the upstream does not support ALPN, *AutoHttpConfig* will will fail over to HTTP/1. | |
// If the upstream does not support ALPN, *AutoHttpConfig* will fail over to HTTP/1. |
@@ -64,6 +64,7 @@ New Features | |||
* health_check: added option to use :ref:`no_traffic_healthy_interval <envoy_v3_api_field_config.core.v3.HealthCheck.no_traffic_healthy_interval>` which allows a different no traffic interval when the host is healthy. | |||
* http: added HCM :ref:`timeout config field <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.request_headers_timeout>` to control how long a downstream has to finish sending headers before the stream is cancelled. | |||
* http: added frame flood and abuse checks to the upstream HTTP/2 codec. This check is off by default and can be enabled by setting the `envoy.reloadable_features.upstream_http2_flood_checks` runtime key to true. | |||
* http: alpn is now supported upstream, configurable via `alpn_config <envoy_v3_api_field_extensions.upstreams.http.v3.HttpProtocolOptions.alpn_config>` in the :ref:`http_protocol_options <envoy_v3_api_msg_extensions.upstreams.http.v3.HttpProtocolOptions>` message. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: alpn was already supported upstream, just manually configured. Rephrase slightly? Also, I think the field names here are out of date so I think you are also missing a :ref
.
include/envoy/upstream/upstream.h
Outdated
@@ -711,6 +711,10 @@ class ClusterInfo { | |||
static const uint64_t USE_DOWNSTREAM_PROTOCOL = 0x2; | |||
// Whether connections should be immediately closed upon health failure. | |||
static const uint64_t CLOSE_CONNECTIONS_ON_HOST_HEALTH_FAILURE = 0x4; | |||
// If HTTP2 is true, the upstream protocol will be negotiated using ALPN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this true? Don't we still support H2 with prior knowledge explicitly configured?
// When we upgrade from a TCP client to non-TCP we get a spurious onConnected | ||
// from the new client. Ignore that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems broken. TODO somewhere to not raise onConnected() multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here is that the client is subscribed to network callbacks. When under the call stack of the network raising the connection event, we detach the TCP client from the network callbacks, and attach the HTTP client.
I don't think it's safe to delay subscribing the HTTP client to the network callbacks until not under the stack of raising the connected event. We could avoid the detach and reattach by having a shim event handler in the connection class which originally pass onEvent to the TCP client, and is std::moved over to pass onEvent to the HTTP client. But the problem with that is that for TCP, the connection pool client subscribes directly to the network's callbacks. For HTTP, the connection pool client asks the codec to add it to network callbacks, and doesn't do so to the connection directly. Today that's the same code, but if we decide to do custom event work in the codec client, we could end up introducing some really weird bugs.
Long story short I actually think this is cleaner than any other option. If I can sell you on that I'd be happy to add inline comments on why it's safer than other options. Otherwise I'd prefer nailing down something which isn't uglier before I add a TODO :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry are you saying that because we are under the same call stack the event gets added and then it gets iterated to and called? I'm surprised that actually works but I forget what data structure is used. I believe you that this is the cleanest way so yeah more comments would be good.
// If an old TLS stack does not negotiate alpn, it likely does not support | ||
// HTTP/2. Fail over to HTTP/1. | ||
protocol_ = Protocol::Http11; | ||
auto tcp_client = static_cast<Tcp::ActiveTcpClient*>(&client); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there by a dynamic_cast ASSERT here? Is there any way to avoid the effective dynamic_cast? Seems like this could be an interface function of some type?
!raw_factory_pointer->supportsAlpn()) { | ||
throw EnvoyException( | ||
fmt::format("ALPN configured for a cluster which has a non-ALPN transport socket: {}", | ||
cluster.DebugString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would probably print the cluster name here vs. debug string but up to you.
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
PTAL? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very cool. Some random comments but generally LGTM.
/wait
api/envoy/extensions/upstreams/http/v3/http_protocol_options.proto
Outdated
Show resolved
Hide resolved
api/envoy/extensions/upstreams/http/v3/http_protocol_options.proto
Outdated
Show resolved
Hide resolved
source/common/http/mixed_conn_pool.h
Outdated
Http::Protocol protocol() { return protocol_; } | ||
|
||
private: | ||
bool connected_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used? Do you maybe mean to guard re-checking protocol if we previously had a connection? If so should this be saw_first_connection_
or something like that?
// If an old TLS stack does not negotiate alpn, it likely does not support | ||
// HTTP/2. Fail over to HTTP/1. | ||
protocol_ = Protocol::Http11; | ||
auto tcp_client = dynamic_cast<Tcp::ActiveTcpClient*>(&client); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For perf do you want to static cast this and then assert the dynamic cast? Also, I think I asked this before, but is it possible to not have casts here by just having an interface function that returns nextProtocol()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so - we still need to move the connection to the new class. We could have an interface function for "detach and return the network::connection" if you prefer that to the cast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have an interface function for "detach and return the network::connection" if you prefer that to the cast?
Personally yes but up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bah, checked this branch back out to start on that, but of course the HTTP and HTTP2 clients can't have a function that rips off the network::connection because it's owned by the codec. given it was optional I'm going to leave as is since I think the cast is less ugly than a "hand off if possible but mostly it isn't"
if (protocols.size() == 2 && | ||
((protocols[0] == Http::Protocol::Http2 && protocols[1] == Http::Protocol::Http11) || | ||
(protocols[1] == Http::Protocol::Http2 && protocols[0] == Http::Protocol::Http11))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems kind of fragile and also won't obviously support H3. Can we at least have a TODO about H3 here. I wonder also is there some way of pre-selecting the conn pool somewhere else and then passing an enum in here? I'm not sure what is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is limited by design. I don't think we know if when we add HTTP/3 it'll be into the mixed pool - I suspect we'd have a wrapper pool around a mixed pool and a probing-h3 pool but not sure yet. Whoever adds that (which may yet be me) will surely be adding code and tests, so can add the relevant extra code then, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK that's fine maybe add a TODO but up to you.
// onConnected is called under the stack of the Network::Connection raising | ||
// the Connected event. The first time it is called, it's called for a TCP | ||
// client, the TCP client is detached from the connection and discarded, and an | ||
// HTTP client is associated with that connection. When the first call returns, the | ||
// Network::Connection will inform the new callback (the HTTP client) that it | ||
// is connected. The early return is to ignore that second call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK this makes sense to me now. My only concern is this assumes that adding/removing callbacks is safe in the context of callbacks being called. This is true with the current implementation inside ConnectionImpl, but do you know if we have explicit unit tests there about that? I'm mainly concerned about this being somewhat fragile and we might break it in a non-obvious way. If you think we have good coverage that's fine, but maybe update this comment to mention the implementation assumption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's fragile given it's pretty deterministic and regression tested. The only bit that could change is if someone fixed the extra callback under the hood and that'd be a no-op (just make the comment obsolete). Added some unit tests of unregister/reregister mid-callback as that's totally a good thing to test.
Signed-off-by: Alyssa Wilk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo remaining optional comments.
Signed-off-by: Alyssa Wilk <[email protected]>
* master: (49 commits) sds: allow multiple init managers share sds target (envoyproxy#14357) [http] Remove legacy codecs (envoyproxy#14381) http2: Add integration tests for METADATA and RST_STREAM frame flood mitigation for upstream servers (envoyproxy#14365) test: start dissolving :printers_include rule. (envoyproxy#14429) integration tests: re-enable set_node_on_first_message_only (envoyproxy#14270) formatter: add a formatter that returns a google::protobuf::Struct rather than a string (envoyproxy#14258) ratelimit: support returning custom response bodies for non-OK responses from the external ratelimit service (envoyproxy#14189) deps: update protobuf to 3.14 (envoyproxy#14253) stream_info: add setResponseCode and update local_reply to take a normal StreamInfo (envoyproxy#14402) http: alpn upstream (envoyproxy#13922) Moved starttls integration test to test/extensions/transport_sockets/starttls. (envoyproxy#14425) generic conn pool: directly use thread local cluster (envoyproxy#14423) wasm: add mathetake to CODEOWNERS (envoyproxy#14427) wasm: clear route cache when modifying HTTP request headers. (envoyproxy#14318) tls: disable TLS inspector injection (envoyproxy#14404) aggregate cluster: cleanups (envoyproxy#14411) Mark starttls_integration_test flaky on Windows (envoyproxy#14419) tcp: improved unit testing (envoyproxy#14415) config: making protocol config explicit (envoyproxy#14362) wasm: dead code (envoyproxy#14407) ... Signed-off-by: Michael Puncel <[email protected]>
Allows configuring both HTTP/1 and HTTP/2 options for a cluster, which triggers TLS alpn negotiation.
Risk Level: Medium (a bunch of little refactors)
Testing: integration tests. NEEDS UNIT TESTS
Docs Changes: n.a
Release Notes: inline
Fixes #3431