HttpStress: connection failures with HTTP 1.1 caused by Windows Firewall #50854
Tagging subscribers to this area: @dotnet/ncl

Issue Details

Currently, the vast majority of our HttpStress failures (#42211) are HTTP 1.1 Socket connection failures (Socket error 10060 running inside, 10061 running outside of Docker):

https://dev.azure.com/dnceng/public/_build/results?buildId=1045441&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=ac0b0b0f-051f-52f7-8fb3-a7e384b0dde9&l=1244
https://gist.github.com/antonfirsov/dc7af2d9cb4213c835a41f59f09a0775#file-type1-txt

Update 1: I no longer think this is an issue with TIME_WAIT, see #50854 (comment).
Update 2: Disabling the Windows Firewall seems to fix the issue.

While running these tests, there are thousands of Kestrel sockets lingering in TIME_WAIT. This may lead to port exhaustion, which I believe explains the CI failures.

It doesn't look like the stress client's ClientOperations are doing anything unconventional -- HttpRequestMessage-es seem to be properly disposed.

Things I tried:
- Setting -cancelRate to 0, but it did not change the amount of lingering sockets
- Setting MaxConnectionsPerServer to 100; there were still thousands of lingering sockets

I'm not familiar enough with the SocketsHttpHandler codebase and Kestrel (or webservers in general), so I can't assess if what I'm seeing is extreme or not in a stress scenario.

@wfurt @geoffkizer @scalablecory @halter73 @stephentoub thoughts?
Non-clean shutdown of the sockets from the client? Does the …
All the code lives in https://github.com/dotnet/runtime/tree/main/src/libraries/System.Net.Http/tests/StressTests/HttpStress @benaadams. I assume the …
As I understand it, the closer of the socket is the one that moves to TIME_WAIT.
Hmmm, could the partial reads on the client side be a culprit here? (runtime/src/libraries/System.Net.Http/tests/StressTests/HttpStress/ClientOperations.cs, lines 197 to 209 @ 740121d)

I'd maybe try the test without these "ugly" operations and see if it reproduces. Edit: not saying the code is wrong per se, just that there might be an error in the client or server that manifests under these conditions.
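For reference, a minimal sketch of the kind of partial-read operation being discussed (this is not the actual ClientOperations code, just an illustration of the pattern): the response is requested with ResponseHeadersRead, only part of the body is read, and then the response is disposed, so the handler cannot return the connection to the pool and has to close it.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class PartialReadSketch
{
    // Illustration only: read part of a response body, then dispose the response.
    // A response disposed with unread content generally cannot be returned to the
    // connection pool, so the handler closes the connection (a client-initiated close,
    // which leaves the client-side socket in TIME_WAIT).
    public static async Task PartialReadAsync(HttpClient client, Uri uri)
    {
        using HttpResponseMessage response = await client.GetAsync(
            uri, HttpCompletionOption.ResponseHeadersRead);

        using var stream = await response.Content.ReadAsStreamAsync();
        byte[] buffer = new byte[512];
        int read = await stream.ReadAsync(buffer, 0, buffer.Length); // read only the first chunk

        Console.WriteLine($"Read {read} bytes, abandoning the rest of the body.");
        // Returning here disposes both the stream and the response with data still pending.
    }
}
```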
Yeah, exactly. We need to understand why Kestrel is closing the connections on its end.
Ok, seeing the server sockets at the first part of … at my …

It seems like there are always around 16k sockets lingering in total, which matches the size of the ephemeral port range on Windows. Seeing both server and client sockets linger for a while after the connections have been terminated is normal AFAIK, even if the sockets are properly closed. I can't decide whether what we are seeing is normal, and the CI machine is simply unable to handle this kind of load, or whether there is actual room for improvement, either in the stress test implementation, in SocketsHttpHandler itself, or maybe even in Kestrel. @geoffkizer @benaadams @scalablecory @ManickaP any further insights / ideas / suggestions?

Edit: Removing …
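As a diagnostic aid (not part of the stress suite), something like the following could be used to count how many sockets the OS currently reports in TIME_WAIT and compare that against the ~16k ephemeral port range; this is a sketch assuming System.Net.NetworkInformation reports TIME_WAIT entries on the machine in question.

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

static class TimeWaitCounter
{
    // Diagnostic sketch: ask the OS how many TCP connections are currently in TIME_WAIT,
    // to compare against the ~16k ephemeral port range mentioned above.
    public static void PrintTimeWaitCount()
    {
        TcpConnectionInformation[] connections =
            IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        int timeWait = connections.Count(c => c.State == TcpState.TimeWait);
        Console.WriteLine($"TIME_WAIT sockets: {timeWait} (total TCP connections reported: {connections.Length})");
    }
}
```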
Could you increase the number of IP addresses used, to increase the available ports? e.g. if it's localhost …
@benaadams not a bad idea. Note that in our stress test design there is only one Kestrel instance on a pre-configured endpoint, and only one … As a first step, I'm really looking for confirmation from more experienced folks: is this just normal HTTP 1.1 behavior we need to accept and work around in the tests, or do we think this might be a potential customer issue we need to address in the product?
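If we ever wanted to experiment with the multiple-address idea on the client side, one hypothetical direction is a SocketsHttpHandler.ConnectCallback that rotates the local source address across several loopback IPs, multiplying the available (source address, source port) tuples. This is only a sketch, the addresses are made up, and it is not something the stress suite does today:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;

static class MultiLoopbackHandlerSketch
{
    // Made-up list of loopback source addresses; the whole 127/8 block is loopback,
    // but whether this is worth doing in the stress suite is exactly the open question above.
    private static readonly IPAddress[] s_sourceAddresses =
    {
        IPAddress.Parse("127.0.0.1"),
        IPAddress.Parse("127.0.0.2"),
        IPAddress.Parse("127.0.0.3"),
        IPAddress.Parse("127.0.0.4"),
    };

    private static int s_next;

    public static SocketsHttpHandler Create() => new SocketsHttpHandler
    {
        // Bind every new connection to a rotating local address so each address
        // gets its own ephemeral port range; the server endpoint stays unchanged.
        ConnectCallback = async (context, cancellationToken) =>
        {
            IPAddress source = s_sourceAddresses[
                (Interlocked.Increment(ref s_next) & int.MaxValue) % s_sourceAddresses.Length];

            var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
            {
                NoDelay = true
            };

            try
            {
                socket.Bind(new IPEndPoint(source, 0)); // port 0: let the OS pick an ephemeral port
                await socket.ConnectAsync(context.DnsEndPoint, cancellationToken);
                return new NetworkStream(socket, ownsSocket: true);
            }
            catch
            {
                socket.Dispose();
                throw;
            }
        }
    };
}
```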
If it's doing keep-alives (the HTTP 1.1 default) it should be reusing the same connection, so not creating lots of sockets; and it doesn't look like it's setting Connection: close. (runtime/src/libraries/System.Net.Http/tests/StressTests/HttpStress/ClientOperations.cs, lines 64 to 69 @ 103ab87)
I agree with @benaadams that this means that Kestrel initiated the FIN: see https://en.wikipedia.org/wiki/Transmission_Control_Protocol#/media/File:TCP_CLOSE.svg. For HTTP/1.1, this should only happen if the app explicitly set the "Connection: close" header, the app called …

Port exhaustion is a well known issue with quickly opening and closing TCP connections to a single endpoint. This (along with improved perf in general) is partly why HTTP/1.1 keeps the TCP connection alive by default, and why Kestrel only closes the connection if it's explicitly told to by the app or if there's an error that makes reusing the connection infeasible.
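For illustration only (this is not the stress server's code), the two app-initiated cases mentioned above look roughly like this in a minimal Kestrel app: setting the Connection: close response header, or calling HttpContext.Abort():

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

// Minimal Kestrel app sketching the app-initiated ways a connection gets closed:
// an explicit "Connection: close" response header, or HttpContext.Abort().
var host = Host.CreateDefaultBuilder(args)
    .ConfigureWebHostDefaults(web => web.Configure(app =>
    {
        app.Run(async context =>
        {
            if (context.Request.Path == "/abort")
            {
                context.Abort(); // hard-abort the connection, no HTTP response at all
                return;
            }

            if (context.Request.Path == "/close")
            {
                // Graceful: respond normally, then Kestrel closes the connection
                // instead of keeping it alive for the next request.
                context.Response.Headers["Connection"] = "close";
            }

            await context.Response.WriteAsync("hello");
        });
    }))
    .Build();

host.Run();
```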
Wondering if that's the …

Wouldn't the …
🤷♂️ From a brief glance at the code it didn't look like any of the requests were bad requests, and it didn't look like close was being sent; so just guessing that the cancel without reading the response might be something to look at?
I think I managed to track it down: client-side lingering is caused entirely by cancellation. Setting … Server-side lingering is caused by the … In short: it seems to me that nothing unexpected is happening, and we need to work around this in the stress test code.
Sounds right to me.
I think there are two things we can do: …

(1) is most likely enough to address this, but I'm curious about your thoughts regarding (2).
I would try a minimum working solution, which means starting with (1) and seeing if it's enough. Only go to (2) if it's not.
This is blocked on #51117 now 😞
+1 to what @alnikola said. Also, re (1): I think it's fine to reduce the cancel rate even further, like just make it 5% or whatever. We should still get plenty of coverage of cancellation, and having a bit of extra room here is probably a good thing.
For context, many of these defaults were informed by stress testing specific to HTTP/2, which multiplexes requests. There, adding frequent cancellation was a relatively inexpensive way of introducing entropy into the socket handler, which is not necessarily the case for HTTP 1.1.
And HTTP/2 would not have the same TIME_WAIT issue on cancellation, because it is multiplexed. So perhaps we should have a higher cancellation rate for HTTP/2 than for HTTP/1.1.
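A hypothetical way to express that in the stress client's configuration; the method name and the rates are made up, the point is only that the cancellation rate could be keyed off the HTTP version:

```csharp
using System;

static class StressRateSketch
{
    // Hypothetical helper (name and numbers made up): choose the cancellation rate based on
    // the HTTP version under test. Cancelling an HTTP/2 request only resets one stream on a
    // multiplexed connection; cancelling an HTTP/1.1 request tears down the whole TCP
    // connection and leaves a TIME_WAIT socket behind.
    public static double GetCancellationRate(Version httpVersion) =>
        httpVersion.Major >= 2 ? 0.10 : 0.02;
}
```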
Unfortunately, reducing the client cancellation is not enough on the actual CI machine, but disabling the … @geoffkizer @ManickaP @alnikola how important is that in its ability to find bugs in HTTP 1.1? A relatively easy fix could be to introduce a switch to completely disable it.
I don't think we should disable it entirely. It does seem like it induces code paths that aren't otherwise hit. Can we just reduce it significantly?
Yes, but I wouldn't pollute the code with hacks specific to …
Yeah, I like the weights per operation and, through that, reducing the …
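A rough sketch of what per-operation weights could look like; the shape and the weights are hypothetical, not how the stress client currently wires its operations:

```csharp
using System;
using System.Linq;

// Hypothetical sketch of weighted operation selection: "expensive" operations such as
// cancellation or server-side abort get a small weight, cheap GET/POST operations a large one.
class WeightedOperationPicker
{
    private readonly (string Name, double Weight)[] _operations;
    private readonly double _totalWeight;
    private readonly Random _random = new Random();

    public WeightedOperationPicker(params (string Name, double Weight)[] operations)
    {
        _operations = operations;
        _totalWeight = operations.Sum(o => o.Weight);
    }

    public string Pick()
    {
        double roll = _random.NextDouble() * _totalWeight;
        foreach (var (name, weight) in _operations)
        {
            if (roll < weight) return name;
            roll -= weight;
        }
        return _operations[^1].Name; // floating-point edge case: fall back to the last operation
    }
}

// Usage (illustrative weights only):
// var picker = new WeightedOperationPicker(("GET", 40), ("POST", 40), ("Cancel", 1), ("ServerAbort", 1));
// string op = picker.Pick();
```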
According to experiments in #51233, failures may still happen even with … Summary of the symptoms again: …

Feels like the problem is rooted in some weird OS behavior I don't understand. @wfurt do you have any ideas?
I know this is late in the discussion, but why does HttpClient keep the half-closed sockets after cancellation?
@wfurt it does dispose (close) the sockets. The sockets then …
It's the operating system that keeps them open as part of the normal operation of TCP.
Yah, I misread the conversation @benaadams. /proc/sys/net/ipv4/tcp_tw_recycle should help with lingering sockets and port exhaustion.
The problem is that I'm starting to question my original assumptions. Port exhaustion should end with … What made me think it's port exhaustion is that if I get rid of the connection-abortion scenarios, the issue disappears. But the whole picture tells me that there is some further weirdness in the OS that I fail to understand.
I think I have proof now that this is very unlikely to be an issue with …

The next step would be to collect ETW traces and packet captures from the test lab VM; however, this is going to be difficult chore work, since disk space is limited there, and the data we need to analyze is on the scale of several tens of GB.
@antonfirsov just an idea: what about collecting ETW traces (and hopefully packet captures as well) into a circular buffer? I know ETW has the option …
Disabling the Windows firewall makes the connection failures disappear, which practically rules out the possibility of a product bug for this failure type. The probability of the issue does not scale linearly with the amount of connection recreation, so even with a very low (~0.5%) cancellation / server-abortion rate there is a significant chance of connection failures. We need a quick, temporary workaround to remove this failure type ASAP, since the high number of these false negatives renders our stress test reports practically useless. Considering other priorities, I would go with the easiest solution and disable cancellation and …
Instead of completely turning off those scenarios, could we rather catch and filter the exceptions by the known error codes for this issue? So that we don't miss potentially different problems.
@ManickaP this is the second thing I was thinking about; it might actually be a better solution. Could be done with a switch … I wonder about other opinions?
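A sketch of what such filtering could look like, using the two socket error codes reported above (10060/10061); the helper name is made up and the exact exception shape in the stress client may differ:

```csharp
using System.Net.Http;
using System.Net.Sockets;

static class KnownFailureFilterSketch
{
    // Hypothetical helper: treat the two connection-level socket errors seen in these runs
    // (10060 = WSAETIMEDOUT, 10061 = WSAECONNREFUSED) as "known" failures that get logged
    // separately instead of failing the stress report. Anything else still counts as a bug.
    public static bool IsKnownConnectionFailure(HttpRequestException ex) =>
        ex.InnerException is SocketException se &&
        (se.SocketErrorCode == SocketError.TimedOut ||
         se.SocketErrorCode == SocketError.ConnectionRefused);
}
```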
Reminds me of this: SignalR/SignalR/../WebSocketHandler.cs#L245-L255

```csharp
// If this exception is due to the underlying TCP connection going away, treat as a normal close
// rather than a fatal exception.
COMException ce = ex as COMException;
if (ce != null)
{
    switch ((uint)ce.ErrorCode)
    {
        // These are the three error codes we've seen in testing which can be caused by the TCP connection going away unexpectedly.
        case 0x800703e3:
        case 0x800704cd:
        case 0x80070026:
            return false;
    }
}
```
@benaadams this is interesting, do you know more details about that story? Do those COMException error codes map to WSA error codes? /cc @davidfowl

@GrabYourPitchforks wrote that code a long time ago 😄
#50854 seems to be rooted in an unknown firewall problem, and from the POV of HTTP stress testing it's a false negative. We need a quick solution to remove this failure type, because the high number of failures renders our stress test reports useless. Since the build agents do not have a public IP, it seems OK to disable the Windows firewall for the runs.
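A workaround along these lines could simply shell out to netsh during stress-run setup; this is only a sketch (the actual change may well do the equivalent in the CI/Docker scripts instead), and it requires an elevated process:

```csharp
using System.Diagnostics;

static class FirewallWorkaroundSketch
{
    // Sketch only: switch the Windows firewall off for the duration of a stress run.
    // Needs an elevated process; the real change may do this in machine setup scripts.
    public static void DisableWindowsFirewall()
    {
        using Process process = Process.Start(new ProcessStartInfo
        {
            FileName = "netsh",
            Arguments = "advfirewall set allprofiles state off",
            UseShellExecute = false,
        });
        process.WaitForExit();
    }
}
```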