x/net/http2: regression in Transport's reuse of connections #60818
Comments
See golang/go#60818 Updates tailscale/corp#12296 Signed-off-by: Brad Fitzpatrick <[email protected]>
An option like this sounds great. What's not to love? One of the great strengths of the HTTP/2 protocol is that implementations can cancel a request without throwing away a TCP+TLS connection. But one of the great strengths of HTTP/1.x is that it minimizes the blast radius of a single bad connection, allowing it to harm only a single request. Reusing connections is important for application stability—otherwise when CPU starvation in a client causes it to perceive its outbound calls as timing out, an application that reacts by closing its TCP+TLS connections and then dialing new ones will make its problem worse by spending any remaining CPU time on new TLS handshakes. I think Go's implementation can give the best of both, and through that reduce the operational risks associated with moving from HTTP/1 to HTTP/2. Here's how I'd like to deploy HTTP/2:
And:
Maybe those settings aren't the right choice for everyone, but today it's very hard to configure that behavior.
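As an aside, to make the cancelation mechanics discussed above concrete: canceling a request's context looks the same in Go regardless of protocol, but over HTTP/2 the transport is expected to reset only the affected stream and keep the TCP+TLS connection reusable, which is the property this issue is about preserving. A minimal, illustrative sketch (the URL and timeout are placeholders, not taken from this thread):

```go
// Sketch of per-request cancelation with net/http. Over HTTP/2 a canceled
// request should only reset its own stream; the underlying connection should
// remain eligible for reuse.
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Placeholder URL; any slow endpoint would demonstrate the timeout path.
	req, err := http.NewRequestWithContext(ctx, "GET", "https://example.com/slow", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// On timeout/cancel, the HTTP/2 transport resets this stream
		// (RST_STREAM); the TCP+TLS connection itself stays usable.
		log.Printf("request canceled or failed: %v", err)
		return
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}
```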
That sounds like a reasonable fix to me too. The only thing I would note - I don't know if it's changed - but as recently as 2021, AWS Application Load Balancers lacked support for HTTP/2 PING. So a ping-based mitigation won't work when writing to something fronted by an ALB. My original PR had checked whether the failed stream was the only one for the connection; that (I think) would also have addressed this issue, but adding pings should be much more robust if there's upstream support.
Just checking, have you also considered the option of rolling back CL 486156 here in x/net, reopening #59690, and trying to fix again? Or is this issue comparatively smaller than #59690 and better to fix forward? @neild There are currently open backport requests of that issue to Go 1.20 and 1.19 (#60301 and #60662). Should the decision to backport be revisited given the new information in this report, or are they okay to proceed as is? Thanks.
I kicked together a quick Python script to test:

```python
import socket
import ssl
import sys

import certifi
import h2.connection
import h2.events

SERVER_NAME = sys.argv[1]
SERVER_PORT = 443

socket.setdefaulttimeout(15)

# TLS with ALPN set to "h2" so the server negotiates HTTP/2.
ctx = ssl.create_default_context(cafile=certifi.where())
ctx.set_alpn_protocols(['h2'])

s = socket.create_connection((SERVER_NAME, SERVER_PORT))
s = ctx.wrap_socket(s, server_hostname=SERVER_NAME)

# Open the HTTP/2 connection and send a PING frame.
c = h2.connection.H2Connection()
c.initiate_connection()
c.ping(b'ffffffff')
s.sendall(c.data_to_send())

# Read until the PING ack arrives (or the peer closes the connection).
ping_acked = False
while not ping_acked:
    data = s.recv(65536 * 1024)
    if not data:
        break
    events = c.receive_data(data)
    for event in events:
        if isinstance(event, h2.events.PingAckReceived):
            print(event)
            ping_acked = True
            break
```

I'm getting ACKs back from AWS ALBs, so it looks like they've added support. A lack of upstream support would still be a potential issue, but it at least wouldn't be quite so widespread as if ALBs didn't support it.
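For a Go equivalent of the same probe, x/net/http2 exposes ClientConn.Ping, so the check can be done directly with the package this issue concerns. A rough sketch, assuming the target speaks TLS on port 443 and negotiates h2 via ALPN (the host is taken from the command line):

```go
// Probe whether a server (or the load balancer in front of it) answers
// HTTP/2 PING frames, using x/net/http2 directly.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"log"
	"os"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	host := os.Args[1]

	// Dial TLS and request HTTP/2 via ALPN.
	conn, err := tls.Dial("tcp", host+":443", &tls.Config{
		ServerName: host,
		NextProtos: []string{"h2"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if p := conn.ConnectionState().NegotiatedProtocol; p != "h2" {
		log.Fatalf("negotiated %q, not h2", p)
	}

	// Hand the raw connection to x/net/http2 and send a PING frame.
	cc, err := (&http2.Transport{}).NewClientConn(conn)
	if err != nil {
		log.Fatal(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	if err := cc.Ping(ctx); err != nil {
		log.Fatalf("no PING ack: %v", err)
	}
	fmt.Println("PING acknowledged")
}
```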
Even the part we previously kept was bad. Revert all of 82780d6 but keep 6826f5a (which depended on bits of 82780d6). So this is a mix. TestTransportRetryAfterRefusedStream fails with this change, but only because it was adjusted in 82780d6 to pass, and this change doesn't revert all the test changes. I just skip that test instead, because it doesn't really affect us. Updates tailscale/corp#12296 Updates golang/go#60818 Signed-off-by: Brad Fitzpatrick <[email protected]>
Change https://go.dev/cl/507395 mentions this issue:
Theory is that our long lived http2 connection to control would get tainted by _something_ (unclear what) and would get closed. This picks up the fix for golang/go#60818. Updates tailscale/corp#5761 Signed-off-by: Maisem Ali <[email protected]>
**Description:** This PR enables the HTTP/2 health check to work around the issue described in open-telemetry/opentelemetry-collector#9022.

As to why I chose 10 seconds for `HTTP2ReadIdleTimeout` and ~~5 seconds~~ 10 seconds (see review comment) for `HTTP2PingTimeout`: those values have been tested in production, and in an active environment (with the default HTTP timeout of 10 seconds and default retry settings) they will result in at most a single export failure before the health check detects the corrupted TCP connection and closes it. The only drawback is that if the connection was not used for over 10 seconds, we might end up sending unnecessary ping frames; that should not be an issue, and if it becomes one, we can tune those settings.

The SFX exporter has multiple HTTP clients:
- Metric client, Trace client, and Event client. These clients will have the HTTP/2 health check enabled by default as they share the same default config.
- Correlation client and Dimension client will NOT have the HTTP/2 health check enabled. We can revisit this if needed.

**Testing:**
- Run OTEL with one of the exporters that uses the HTTP/2 client, for example the `signalfx` exporter.
- For simplicity, use a single pipeline/exporter.
- In a different shell, run this to watch the TCP state of the established connection:
```
while (true); do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
```
- From the netstat output, take note of the source port and the source IP address.
- Replace the `<>` placeholders from the previous step: `sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter's exports start timing out.

Expected result:
- A new connection should be established, similarly to HTTP/1, and exports should succeed.

Actual result:
- The exports keep failing for ~15 minutes, or for whatever the OS `tcp_retries2` is configured to.
- After 15 minutes, a new TCP connection is created and exports start working.

**Documentation:** Readme is updated.

**Disclaimer:** Not all HTTP/2 servers support H2 ping; however, this should not be a concern as our ingest servers do support H2 ping. But if you are routing elsewhere, you can check whether H2 ping is supported using this script: golang/go#60818 (comment)

Signed-off-by: Dani Louca <[email protected]>
The fix to #59690 (https://go-review.googlesource.com/c/net/+/486156) broke our load balancer.
With that change, any RoundTripper caller's cancelation ends up tainting the underlying HTTP/2 connection (setting its do-not-reuse bit), causing our load balancer to create new conns forever rather than reuse existing, totally fine ones.
Of note: our application only uses ClientConn.RoundTrip (not Transport.RoundTrip), so moving the connection tainting up a level (to where it arguably belongs, in Transport and its conn pool) would fix us, but still isn't the proper fix for others.
The original issue was about dead backend connections and Transport reusing them forever. That change fixed it, but a bit too aggressively. We shouldn't mark good connections (in use, recent traffic, able to reply to pings, etc.) as do-not-reuse. Dead connections are already handled if you enable pings on your backend connections, but that's not the default, probably because it would be too network-chatty to do by default for idle connections.
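For reference, the pings mentioned here are x/net/http2's connection health check, which is off unless ReadIdleTimeout is set on the http2.Transport. A minimal sketch of enabling it on a standard net/http client; the timeout values and URL are illustrative, not recommendations from this issue:

```go
// Enable x/net/http2's ping-based connection health check on a net/http client.
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t1 := &http.Transport{}
	// ConfigureTransports wires HTTP/2 into t1 and returns the http2.Transport
	// so its fields can be set directly.
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		log.Fatal(err)
	}
	// If no frame is received on a connection for ReadIdleTimeout, the
	// transport sends a PING; if the PING isn't acknowledged within
	// PingTimeout, the connection is closed instead of being reused.
	t2.ReadIdleTimeout = 30 * time.Second
	t2.PingTimeout = 15 * time.Second

	client := &http.Client{Transport: t1}
	resp, err := client.Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println(resp.Status)
}
```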
@neild and I discussed a handful of options, but I'm not sure we've decided on any particular path(s). Some: