TCP ephemeral ports exhausted and network broken after lots of early-closed non-blocking connections #2913
Comments
The bugcheck should be resolved as of Insider Build 17083. This is unlikely to be related to the reported issue of resource exhaustion resulting in failure to open browser. I am assuming netstat.exe will fill up many pages with information, but it would be interesting to output that information (e.g. netstat.exe -aboq) to a file and see if anything stands out, and if so which process is responsible. |
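If the netstat.exe output is saved to a file as suggested, one way to see whether anything stands out is to tally how many lines mention a given TCP state. The sketch below is a hypothetical helper (the function name is made up here). It assumes only that the state appears as a whitespace-separated word on the line, not any particular column layout.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the lines in `text` that contain `state`
 * (for example "BOUND" or "TIME_WAIT") as a whole whitespace-delimited word.
 * `text` is the full contents of a saved netstat dump; CRLF line endings
 * are fine because '\r' is treated as whitespace.
 */
size_t count_lines_with_state(const char *text, const char *state)
{
    size_t state_len = strlen(state);
    size_t count = 0;
    const char *line = text;

    if (state_len == 0)
        return 0;

    while (*line != '\0') {
        const char *newline = strchr(line, '\n');
        const char *line_end = newline ? newline : line + strlen(line);
        const char *p = line;
        int found = 0;

        while (!found && p < line_end) {
            /* Skip leading whitespace, then measure the next word. */
            while (p < line_end && isspace((unsigned char)*p))
                p++;
            const char *word = p;
            while (p < line_end && !isspace((unsigned char)*p))
                p++;
            if ((size_t)(p - word) == state_len &&
                memcmp(word, state, state_len) == 0)
                found = 1;
        }

        count += (size_t)found;
        line = newline ? newline + 1 : line_end;
    }
    return count;
}
```

Reading the dump into a buffer and calling this once per state of interest gives a quick per-state count to compare before and after the problem appears.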
Yes, it "unlikely". It IS.
I am very troubled. Because of this problem, I can no longer use WSL as my development environment. Otherwise, I have to reboot the computer every half an hour. This is not interesting. @Brian-Perkins |
I have sent the core dump files to [email protected]. These core dumps were generated on Insider Build 17074.1000. Currently there is no BSOD in 17083.1000, just the network connection problem. STACK_TEXT:
It seems to be an issue in |
After the issue was triggered, a Win32 program tried to listen on a TCP port and failed. Here are its logs:
|
@YihaoPeng - Also, if there is a targeted repro, do share that out. |
I can add to this. I have a Python script in Ubuntu that scrapes data from webpages. After a couple of days, without fail, my computer loses (practically) all internet access and Computer Management shows a report of ephemeral port exhaustion. Normally this should resolve itself if you just back off on your outgoing requests. As stated above, netstat -aboq shows only a few entries, nowhere near what would be expected in a normal case. I believe that the subsystem is not releasing the ephemeral ports, so once I go through all of them once, I'm done until a reboot. |
Do you have any plan to fix this? |
Like Suvega, I have the same problem... can't do Linux development until this is fixed. Any plans to address ephemeral port handling properly within WSL? This is still happening in Windows 10 Pro version 1803. |
Can't speak for the devs, but personally I read:
...and pretty much stopped there. This has been basically blocked on "if there is a targeted repro, do share that out" since February. If someone has a tight repro that can be cut-and-pasted into WSL and Real Linux that demonstrates ephemeral port exhaustion on WSL but not RL I am sure it will get looked at. There is possibly something to this, but absent a tight repro I doubt this is being looked at. [Noting, importantly, I have no idea what the MSFT devs look at or don't look at, but speaking for myself.] |
@therealkenc I tried many times to reproduce the problem with simple code, but none of the attempts succeeded. Currently I can only reproduce this problem consistently in a complex system as described above. I can't do anything about it at present. I can only use virtual machines to avoid this issue. |
@therealkenc Can you clarify what a "targeted repro" is? |
A repro is a list of CLI commands that can be cut-and-pasted into a Real Linux terminal on the left and a WSL terminal on the right. Targeted means the steps are limited to the least number of moving parts possible to demonstrate a divergence in an |
FWIW I am getting this with WSL apache + php-fpm + mysql. No full repro yet but what i've seen so far:
|
Thanks. To be clear, I have no doubt there is something to this. You aren't imagining the problem. There just hasn't been a "targeted repro" that Sunil (who I haven't seen around for ages) or one of the other devs is likely to cut-and-paste into their WSL terminal and Real Linux terminal to triage. "WSL apache + php-fpm + mysql". Even if you gave the two dozen steps from clean install to set it all up just-so, there are way too many moving parts to triage at the syscall level. No one is going to chase that. That's the problem, not you. More likely than not the bug is legit. |
Hey, I think I may be able to reliably reproduce this with a single NodeJS app. Is there any interest in fixing this problem? |
@oxygen - currently the issue is not understood, so repro steps would be very helpful. |
apt install android-tools-adb, wait for a few minutes, and it's reproduced. Even the Win32/64 subsystem is affected: it can't connect over TCP to any server, including all HTTP/HTTPS. |
I can reliably reproduce this issue using the attached program (~80 lines of C): wsl-issue-2913-repro.c.txt The program performs these steps in a loop:
On a Linux VM I can run the loop several hundred thousand times and see the ports being used cycle through the entire ephemeral range multiple times. On my Windows laptop it loops about 16,000 times and then EINVAL is returned from connect(). At this point the symptoms described in previous comments appear: Win32 programs such as web browsers fail to connect and the output of "netstat -anoq" in a command prompt shows many connections stuck in the "BOUND" state. The only way to get network connections working again is to restart the LxssManager service. Versions used to reproduce this: |
@sunilmut Can you take a look at this issue again? A simple method of reproduction is given above. I am still experiencing this issue. Although restarting the LxssManager service is effective, it is very inconvenient. |
I haven't seen Sunil around in over a year. It might be worth extracting that test case into a new issue with a better title. [The signal to noise ratio around here has dropped to the point that even if that test case is a good reproducer it is probably buried.] Bonus points if you cut the number of lines in half and embed into the post itself, with actual copy-pasteable CLI repro steps and failing |
I have one more case for it. I'm using Android ADB over IP. So my case on WSL Ubuntu 18.04: Now suspend Windows. |
I have been seeing this issue and it's easily reproducible with Expo.
To fix it, I do Ctrl-C everything that I have open, |
I have been having the same issue using Expo too.
I restart the LXSS services so networking works again, but it's very annoying because all the processes you were running on WSL are killed, so you have to run everything again.
…On Sun, Mar 10, 2019 at 6:04 AM Chris Hammond ***@***.***> wrote:
I have been seeing this issue and it's easily reproducible with Expo
<https://github.com/expo/expo-cli/>.
1. Start an Expo build / publish
2. After some time the networking both on WSL and Windows stop
allowing fresh connections, so webpages already open will continue to work
and you can establish fresh connections to the same server, but any
external assets referenced on other pages that need to load from other URLs
will fail to load
To fix it, I do Ctrl-C everything that I have open, killall adb to get
rid of the background adb processes, exit out of every tmux window and
fully close the Ubuntu WSL terminal. Then networking will work properly
again in Windows, without needing to restart the LXSS process as mentioned
above.
|
find my family |
My development environment is a mix of Windows and Linux-based backend applications and a frontend WinForms application. Previously, I would compile the Linux-based applications to target Windows. Once I got Docker working on WSL, I decided to run those applications in Linux containers. I decided against Docker for Windows in any form because it lacks the host networking mode, which is necessary for my frontend application to connect to these backend services. Unfortunately, after running my environment for a little while, Chrome was unable to connect in a new tab to any website that I wasn't already connected to. At first, I thought this might be DNS, but my existing containers could ping any website I could think of. strace and netcat led me to a TCP issue. I then followed the repro above to reproduce it without Docker running. My hope was that eventually we would migrate our Windows-based backend applications to .NET Core 3.0 and redeploy them on Linux, so that everything in my development environment could run in Docker, without a VM. Unfortunately, this bug now means I will have to install a VM to run my Linux applications, or go back to targeting them against Windows. |
At the suggestion of therealkenc, the title and description of the issue have been completely updated. I hope that a Microsoft employee can pay attention to this issue again. I also tried to create a new Issue: #3951 |
Rebooting this into the new one. Thanks. |
I am having the same issue, and it is reproducible. |
Fixed in Windows Insider Build 18890 |
Since WSL2 will be released, it is not even necessary to fix it. |
Started occurring again! |
At the suggestion of therealkenc, the title and description of the issue have been completely updated.
See more progress in #3951.
Description
After creating and closing (before they are established) a large number of non-blocking connections in WSL, all TCP ephemeral ports are exhausted, and no new TCP connections can then be established from WSL or Win32. Closing the related processes in WSL does not release these ports. All new TCP connections and listens will fail, and the LxssManager service must be restarted to recover.
Reproducible Demo
The demo is from philip-searle's comment of #2913 on 18 Jan.
You can reliably reproduce this issue using the attached program (~80 lines of C): wsl-issue-2913-repro.c.txt
Output from strace looks normal to me and is attached as wsl-issue-2913-repro.strace.zip
The program performs these steps in a loop:
Environment
In Ubuntu in WSL, build and run the demo with these commands:
Expected Behavior
On a Linux VM I can run the loop several hundred thousand times and see the ports being used cycle through the entire ephemeral range multiple times.
In addition, even if the program has a bug that does not properly release the occupied port, these ports should be automatically released after the program exits.
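For readers without the attachment, the pattern named in the Description (open a non-blocking socket, start a connect, close it before the connection is established) might look roughly like the sketch below. This is not the attached wsl-issue-2913-repro.c. The function name, its parameters, and the choice to tolerate ECONNREFUSED are assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from wsl-issue-2913-repro.c.
 * Each round opens a TCP socket, makes it non-blocking, starts a connect()
 * and closes the socket straight away, without waiting for the handshake.
 * Returns the number of rounds completed, or -1 if `ip` is not a valid
 * IPv4 address. If the loop stops early, *err_out holds the errno that
 * stopped it; otherwise *err_out is 0.
 */
long churn_early_closed_connections(const char *ip, unsigned short port,
                                    long rounds, int *err_out)
{
    struct sockaddr_in addr;
    long done;

    *err_out = 0;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        *err_out = EINVAL;
        return -1;
    }

    for (done = 0; done < rounds; done++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            *err_out = errno;
            break;
        }

        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
            *err_out = errno;
            close(fd);
            break;
        }

        /* EINPROGRESS is the normal result of a non-blocking connect();
         * ECONNREFUSED is tolerated here (an assumption) because it comes
         * from the peer rather than from local port exhaustion. */
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0 &&
            errno != EINPROGRESS && errno != ECONNREFUSED) {
            *err_out = errno;
            close(fd);
            break;
        }

        close(fd); /* closed before the connection is established */
    }
    return done;
}
```

A caller would pass a reachable address and a large round count, then print the returned count and `strerror(*err_out)`. On an affected system the thread reports connect() eventually failing with EINVAL, which this sketch would surface through `err_out`.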
Observed Behavior
On philip-searle's Windows laptop it loops about 16,000 times and then EINVAL is returned from connect(). At this point the symptoms described in previous comments appear: Win32 programs such as web browsers fail to connect and the output of "netstat -anoq" in a command prompt shows many connections stuck in the "BOUND" state. The only way to get network connections working again is to restart the LxssManager service.
On YihaoPeng's PC with Windows 1809 build 17763.379, the ports are exhausted after 2899 rounds:
No ports are released after the program exits. If you let the program run repeatedly (so that it immediately takes up any ephemeral port released by other programs), you will find that no TCP connections can be established anywhere in Windows. For example, the Edge browser will not be able to load any page.
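As a rough sanity check on the "about 16,000" figure: the stock Windows dynamic TCP port range is 49152 to 65535, which is 16384 ports, while the stock Linux range in /proc/sys/net/ipv4/ip_local_port_range is 32768 to 60999. Whether WSL draws from the Windows range on a given machine is an assumption here, and the 2899 figure is not explained by this arithmetic alone (some ports may already have been in use). The helper below is hypothetical and only computes the size of a "LOW HIGH" range.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse a "LOW HIGH" port range, such as the contents
 * of /proc/sys/net/ipv4/ip_local_port_range, and return how many ports it
 * covers (inclusive). Returns -1 if the text is malformed or out of bounds.
 */
long ephemeral_port_count(const char *range_text)
{
    long low, high;

    if (sscanf(range_text, "%ld %ld", &low, &high) != 2)
        return -1;
    if (low < 1 || high > 65535 || low > high)
        return -1;
    return high - low + 1;
}
```

With the stock ranges mentioned above, `ephemeral_port_count("49152 65535")` gives 16384 and `ephemeral_port_count("32768 60999")` gives 28232.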
Use the following commands to run the program repeatedly: