TCP ephemeral ports exhausted and network broken after lots of early-closed non-blocking connections #2913

Closed
YihaoPeng opened this issue Feb 4, 2018 · 33 comments

Comments

@YihaoPeng

YihaoPeng commented Feb 4, 2018

At the suggestion of therealkenc, the title and description of the issue have been completely updated.

See more progress in #3951.

Description

After creating a large number of non-blocking connections in WSL and closing them before they are established, all TCP ephemeral ports are exhausted, and no new TCP connections can be established from either WSL or Win32. Closing the related processes in WSL does not release these ports. All new TCP connections and listens fail until the LxssManager service is restarted.

Reproducible Demo

The demo is from philip-searle's comment on #2913 from 18 Jan.

You can reliably reproduce this issue using the attached program (~80 lines of C): wsl-issue-2913-repro.c.txt

Output from strace looks normal to me and is attached as wsl-issue-2913-repro.strace.zip

The program performs these steps in a loop:

  1. Allocate a TCP socket for IPv4 (AF_INET) family.
  2. Make the socket non-blocking with O_NONBLOCK.
  3. Attempt to connect() the socket to 127.0.0.1:1234
  4. Use getsockname() to obtain the port number and output it to the console.
  5. Close the socket
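
For readers who want the loop inline rather than as an attachment, the five steps above can be sketched in Python roughly as follows. This is not the attached C program: the names `attempt_once` and `attempts` and the 20,000-round count are illustrative choices, and the stop-on-EINVAL check simply mirrors the failure reported in this thread.

```python
import errno
import socket

TARGET = ("127.0.0.1", 1234)  # address and port used by the repro in this thread


def attempt_once(target=TARGET):
    """Run steps 1-5 once and return (connect errno, local port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # step 1
    try:
        sock.setblocking(False)        # step 2: puts the socket in O_NONBLOCK mode
        err = sock.connect_ex(target)  # step 3: returns an errno instead of raising
        port = sock.getsockname()[1]   # step 4: local port reported for this socket
        return err, port
    finally:
        sock.close()                   # step 5: closed right after the connect attempt


def attempts(rounds, target=TARGET):
    """Yield (errno, port) for each of `rounds` connection attempts."""
    for _ in range(rounds):
        yield attempt_once(target)


if __name__ == "__main__":
    ROUNDS = 20_000  # illustrative; more than the ~16,000 rounds reported in this thread
    for i, (err, port) in enumerate(attempts(ROUNDS)):
        if err == errno.EINVAL:
            print(f"connect() returned EINVAL after {i} rounds")
            break
        print(f"Socket {i} is using port {port}")
```

Running the same script in a regular Linux terminal and in a WSL terminal side by side is the comparison that matters; the sketch only prints what each environment does and makes no attempt to clean up or recover.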

Environment

In Ubuntu in WSL, build and run the demo with these commands:

apt update
apt install gcc wget
wget -O wsl-issue-2913-repro.c https://github.com/Microsoft/WSL/files/2769821/wsl-issue-2913-repro.c.txt
gcc -o wsl-issue-2913-repro wsl-issue-2913-repro.c
./wsl-issue-2913-repro

Expected Behavior

On a Linux VM I can run the loop several hundred thousand times and see the ports being used cycle through the entire ephemeral range multiple times.

In addition, even if the program has a bug that does not properly release the occupied port, these ports should be automatically released after the program exits.

Observed Behavior

On philip-searle's Windows laptop it loops about 16,000 times and then EINVAL is returned from connect(). At this point the symptoms described in previous comments appear: Win32 programs such as web browsers fail to connect and the output of "netstat -anoq" in a command prompt shows many connections stuck in the "BOUND" state. The only way to get network connections working again is to restart the LxssManager service.
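
To check for the "BOUND" symptom without scrolling through pages of output, the netstat text can be tallied by state. Below is a minimal Python sketch, assuming the usual five-column TCP rows (Proto, Local Address, Foreign Address, State, PID) that "netstat -anoq" prints. The `SAMPLE` rows and the function name `count_tcp_states` are made up for illustration and are not output from any machine in this thread.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP rows per connection state in `netstat -anoq` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected TCP row: Proto, Local Address, Foreign Address, State, PID
        if len(fields) == 5 and fields[0] == "TCP":
            states[fields[3]] += 1
    return states


# Hypothetical sample rows, written by hand for this example.
SAMPLE = """\
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       1000
  TCP    0.0.0.0:49701          0.0.0.0:0              BOUND           4321
  TCP    0.0.0.0:49702          0.0.0.0:0              BOUND           4321
  TCP    127.0.0.1:5354         127.0.0.1:49669        ESTABLISHED     2000
"""

print(count_tcp_states(SAMPLE)["BOUND"])  # 2
```

In practice one would save the real output first (for example `netstat -anoq > netstat.txt` in a command prompt) and pass the file contents to `count_tcp_states`; a BOUND count in the thousands would match the symptom described here.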

On YihaoPeng's PC with Windows 1809 build 17763.379, the ports are exhausted after 2899 rounds:

...
Socket 2893 is using port 4930
Socket 2894 is using port 4943
Socket 2895 is using port 4944
Socket 2896 is using port 4945
Socket 2897 is using port 4952
Socket 2898 is using port 4953
Invalid argument returned from connect() - BOUND socket exhaustion likely after 2899 rounds
Exiting (failed)

No ports are released after the program exits. If you run the program repeatedly (so that it immediately takes any ephemeral port released by other programs), no TCP connections can be established anywhere in Windows. For example, the Edge browser will not be able to load any page.

Use the following commands to run the program repeatedly:

while true; do ./wsl-issue-2913-repro; done

@Brian-Perkins

The bugcheck should be resolved as of Insider Build 17083. This is unlikely to be related to the reported issue of resource exhaustion resulting in failure to open browser. I am assuming netstat.exe will fill up many pages with information, but it would be interesting to output that information (e.g. netstat.exe -aboq) to a file and see if anything stands out, and if so which process is responsible.

@YihaoPeng
Author

YihaoPeng commented Feb 6, 2018

It is not "unlikely". It IS related.
I had to REBOOT my Windows 10 Insider Build 17083.1000 before I could reply to the post.
Otherwise, I cannot open ANY web page.

netstat.exe -aboqn displays ONLY 400 lines (200 connections) of output. Each of those connections looks legitimate and would not cause such a problem.
The file: netstat-aboqn.txt

I am very troubled. Because of this problem, I can no longer use WSL as my development environment. Otherwise, I have to reboot the computer every half an hour.

This is not interesting. @Brian-Perkins

@YihaoPeng changed the title from "TCP user port exhaust and/or BSOD after lots of network operates" to "TCP user port exhaust and/or BSOD after lots of network/epoll operates" on Feb 6, 2018

@YihaoPeng
Author

YihaoPeng commented Feb 6, 2018

I have sent the coredump files to [email protected]. These coredumps were generated on Insider Build 17074.1000. Currently there is no BSOD on 17083.1000, just the network connection problem.

STACK_TEXT:

ffffe90b`88e37530 fffff80c`e620b457 : 00000000`00000033 ffffe90b`88e375e9 ffffc60b`329221d0 ffffc60b`3288f3b0 : LXCORE!LxpDagEnumerateNextOutgoingVertex+0xb0
ffffe90b`88e37580 fffff80c`e620b3b0 : 00000000`00000031 00000000`00000000 00000000`00000033 00000000`00000001 : LXCORE!LxpEpollFileStateUpdateEpollEntries+0x83
ffffe90b`88e37650 fffff80c`e6212006 : 00000000`00000000 ffffe90b`88e37769 00000000`00000001 ffffe90b`88e37769 : LXCORE!LxpEpollFileStateUpdate+0x60
ffffe90b`88e37680 fffff80c`e6212f50 : 00000000`00000000 ffffdcff`00000000 00000000`00000000 ffffc60b`2ccc9880 : LXCORE!LxpPipeReadReady+0x82
ffffe90b`88e376b0 fffff80c`e624912f : ffffe90b`88e37808 00000000`00000000 ffffe90b`88e378a0 00000000`00000000 : LXCORE!LxPipeFsFileWriteVector+0x1b0
ffffe90b`88e377c0 fffff80c`e6248f07 : ffffe90b`88e37950 ffffc60b`2fd12dc0 00000000`00000000 fffff802`c1c77e27 : LXCORE!VfsFileWriteVector+0x11b
ffffe90b`88e37860 fffff80c`e622d777 : 00000000`000000a4 ffffe90b`88e37a39 00007fe1`0ce4f6a0 fffff80c`e61cd720 : LXCORE!VfsFileWrite+0x97
ffffe90b`88e37900 fffff80c`e62220a4 : 00000000`00000000 ffffc60b`313aa000 00000000`00000000 ffffc60b`2fd12dc0 : LXCORE!LxpSyscall_WRITE+0xa7
ffffe90b`88e379d0 fffff80c`e623d2b6 : 00000000`00000001 ffffe90b`88e37b00 00007fe1`0ce4f6a0 00000000`000000a4 : LXCORE!LxpSysDispatch+0x184
ffffe90b`88e37aa0 fffff802`c237bc0f : 00000000`00000000 00000000`00000000 00000000`ed6a0c14 fffff802`c1dcbfb5 : LXCORE!PicoSystemCallDispatch+0x16
ffffe90b`88e37ad0 fffff802`c1dda58c : ffffe90b`88e37b00 ffff8c0e`77afdd00 00000000`00000000 00000000`edd94d50 : nt!PsPicoSystemCallDispatch+0x1f
ffffe90b`88e37b00 00007fe1`ee7172ad : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceUser+0x76
00007fe1`0ce4f690 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007fe1`ee7172ad

It seems to be an issue in the epoll operation.

@YihaoPeng
Author

YihaoPeng commented Feb 6, 2018

After the issue was triggered, a Win32 program tried to listen on a TCP port and failed. Here are its logs:

2018-02-06 06:29:10 Namecoin version nc0.13.99-name-tab-beta1
2018-02-06 06:29:10 InitParameterInteraction: parameter interaction: -whitelistforcerelay=1 -> setting -whitelistrelay=1
2018-02-06 06:29:10 GUI: "registerShutdownBlockReason: Successfully registered: Namecoin Core didn't yet exit safely..."
2018-02-06 06:29:10 Default data directory C:\Users\hu60\AppData\Roaming\Namecoin
2018-02-06 06:29:10 Using data directory C:\Tools\namecoin-0.13.99\data9\testnet
2018-02-06 06:29:10 Using config file C:\Tools\namecoin-0.13.99\data9\namecoin.conf
2018-02-06 06:29:10 Using at most 125 connections (2048 file descriptors available)
2018-02-06 06:29:10 Using 2 threads for script verification
2018-02-06 06:29:10 scheduler thread start
2018-02-06 06:29:10 libevent: evsig_init: socketpair: No buffer space available [WSAENOBUFS ]
2018-02-06 06:29:10 libevent: evthread_make_base_notifiable: socketpair: No buffer space available [WSAENOBUFS ]
2018-02-06 06:29:10 libevent: event_base_new_with_config: Unable to make base notifiable.
2018-02-06 06:29:10 Couldn't create an event_base: exiting
2018-02-06 06:29:12 scheduler thread interrupt
2018-02-06 06:29:12 Shutdown: In progress...
2018-02-06 06:29:12 StopNode()
2018-02-06 06:29:12 Shutdown: done

@sunilmut
Member

sunilmut commented Feb 6, 2018

@YihaoPeng - WSAENOBUFS is no buffer space or out of memory. That could happen if there is a memory leak. On build 17083, can you watch the memory of the process in the process manager or through some other tool?

Also, if there is a targeted repro, do share that out.

@YihaoPeng
Author

The first time I could not open any webpage, memory usage was at 58%:
memory

I rebooted. And the second time, 67% of memory usage.
memory2

There seems to be no memory leak.


You can follow the steps that I wrote in the first post; they are a targeted repro of the issue. I am trying to reproduce it on my other computer using the package I pasted in the first post.

@Suvega

Suvega commented Mar 22, 2018

I can add to this. I have a Python script in Ubuntu that scrapes data from webpages. After a couple of days, without fail, my computer loses (practically) all internet access and Computer Management shows reports of ephemeral port exhaustion.

Normally this should resolve itself if you just back off on your outgoing requests.
In this case it does not. Only a reboot fixes the problem.

As stated above, Netstat -aboq shows only a few entries, nowhere near what would be expected in a normal case.

I believe that the subsystem is not releasing the ephemeral ports, so once I go through all of them once, I'm done until a reboot.

@YihaoPeng changed the title from "TCP user port exhaust and/or BSOD after lots of network/epoll operates" to "TCP user port exhaust after lots of network/epoll operates" on Jul 13, 2018

@YihaoPeng
Author

YihaoPeng commented Jul 13, 2018

Do you have any plan to fix this?
Even in Windows 10 version 1803 I still have this problem. I have encountered this problem twice today (every time I have to restart).

@YihaoPeng changed the title from "TCP user port exhaust after lots of network/epoll operates" to "TCP user port exhaust and network broken after lots of network/epoll operates" on Jul 13, 2018

@rcoulsell

rcoulsell commented Jul 29, 2018

Like Suvega, I have the same problem... can't do Linux development until this is fixed. Any plans to address ephemeral port handling properly within WSL? This is still happening in Windows 10 Pro version 1803.

@therealkenc
Collaborator

Can't speak for the devs, but personally I read:

If everything is ok, it will install supervisor, zookeeper, kafka & btcpool, then run them. It will run two bitcoin-qt.exe too (they are win32 programs).

...and pretty much stopped there. This has been basically blocked on "if there is a targeted repro, do share that out" since February. If someone has a tight repro that can be cut-and-pasted into WSL and Real Linux that demonstrates ephemeral port exhaustion on WSL but not RL I am sure it will get looked at. There is possibly something to this, but absent a tight repro I doubt this is being looked at. [Noting, importantly, I have no idea what the MSFT devs look at or don't look at, but speaking for myself.]

@SwimmingTiger

SwimmingTiger commented Aug 29, 2018

@therealkenc I tried to reproduce the problem with simple code many times, but none of the attempts succeeded. Currently I can only reproduce this problem reliably in a complex system as described above.

I can't do anything about it at present. I can only use virtual machines to avoid this issue.

@emontnemery

@therealkenc Can you clarify what a "targeted repro" is?
I see a similar issue as Suvega: A Python script doing some periodic polling eventually leads to events 4227 and 4231.
There is no issue when running the same script in Linux.

@therealkenc
Collaborator

Can you clarify what a "targeted repro" is?

A repro is a list of CLI commands that can be cut-and-pasted into a Real Linux terminal on the left and a WSL terminal on the right. Targeted means the steps are limited to the fewest moving parts possible to demonstrate a divergence in an strace log. For example, if your Python script required (entirely hypothetically) MongoDB in order to run, the likelihood of it being looked at is low(er) because MongoDB works and can't be the cause. It's a stripping-down exercise. The most targeted repro possible is a small C program that exercises the minimum Linux syscall surface necessary to demonstrate a divergence. Hope that helps.

@xyphoid

xyphoid commented Sep 24, 2018

FWIW I am getting this with WSL apache + php-fpm + mysql. No full repro yet, but here is what I've seen so far:

  • netstat -qno shows buffers up to 0.0.0.0:65535 in use by missing processes.
  • I added process creation logging as listed at https://superuser.com/questions/1052541/how-can-i-get-a-history-of-running-processes/1052593 , and was able to match up the process IDs in netstat with php-fpm (or apache, when using mod_php) processes such as
    C:\Users\tim\AppData\Local\Packages\CanonicalGroupLimited.Ubuntu18.04onWindows_79rhkp1fndgsc\LocalState\rootfs\usr\sbin\php-fpm7.2
  • Restarting the LXSS service clears all the hanging buffers without having to reboot windows.

@xyphoid

xyphoid commented Sep 24, 2018

image

@therealkenc
Collaborator

Restarting the LXSS service clears all the hanging buffers without having to reboot windows

Thanks. To be clear, I have no doubt there is something to this. You aren't imagining the problem. There just hasn't been a "targeted repro" that Sunil (who I haven't seen around for ages) or one of the other devs is likely to cut-and-paste into their WSL terminal and Real Linux terminal to triage. "WSL apache + php-fpm + mysql". Even if you gave the two dozen steps from clean install to set it all up just-so, there are way too many moving parts to triage at the syscall level. No one is going to chase that. That's the problem, not you. More likely than not the bug is legit.

@oxygen

oxygen commented Oct 5, 2018

Hey, I think I may be able to reliably reproduce this with a single NodeJS app.
#3591

Is there any interest in fixing this problem?

@Brian-Perkins

@oxygen - currently the issue is not understood, so repro steps would be very helpful.

@xuefer

xuefer commented Dec 3, 2018

apt install android-tools-adb
adb connect android-device-ip
ssh some-linux-host

Wait a few minutes and it's reproduced. Even the win32/64 subsystem is affected: it can't connect over TCP to any server, including all HTTP/HTTPS.
The problem persists until I close all WSL processes, including the background adb daemon.

@philip-searle

I can reliably reproduce this issue using the attached program (~80 lines of C): wsl-issue-2913-repro.c.txt
Output from strace looks normal to me and is attached as wsl-issue-2913-repro.strace.zip

The program performs these steps in a loop:

  1. Allocate a TCP socket for IPv4 (AF_INET) family.
  2. Make the socket non-blocking with O_NONBLOCK.
  3. Attempt to connect() the socket to 127.0.0.1:1234
  4. Use getsockname() to obtain the port number and output it to the console.
  5. Close the socket

On a Linux VM I can run the loop several hundred thousand times and see the ports being used cycle through the entire ephemeral range multiple times.

On my Windows laptop it loops about 16,000 times and then EINVAL is returned from connect(). At this point the symptoms described in previous comments appear: Win32 programs such as web browsers fail to connect and the output of "netstat -anoq" in a command prompt shows many connections stuck in the "BOUND" state. The only way to get network connections working again is to restart the LxssManager service.

Versions used to reproduce this:
Output from ver in command prompt: Microsoft Windows [Version 10.0.17134.472]
Output from uname -a in WSL: Linux SAG-FNH45S2 4.4.0-17134-Microsoft #471-Microsoft Fri Dec 07 20:04:00 PST 2018 x86_64 GNU/Linux
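
A rough sanity check on the "about 16,000" figure: assuming the laptop used the default Windows dynamic port range of 49152 to 65535 (the thread does not say which range was configured), the number of ephemeral ports would be close to the iteration count at which connect() started failing. The helper name `port_count` below is hypothetical.

```python
def port_count(first, last):
    """Number of ports in the inclusive range [first, last]."""
    return last - first + 1


# Assumed default dynamic (ephemeral) TCP port range on current Windows versions.
DEFAULT_FIRST, DEFAULT_LAST = 49152, 65535

print(port_count(DEFAULT_FIRST, DEFAULT_LAST))  # 16384
```

The range actually configured on a given machine can be checked with `netsh int ipv4 show dynamicport tcp` in a command prompt; a different range would explain a different failure point.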

@SwimmingTiger

SwimmingTiger commented Mar 6, 2019

@sunilmut Can you take a look at this issue again? A simple method of reproduction is given above.

I am still experiencing this issue recently. Although restarting the LxssManager service is effective, it is very inconvenient.

@therealkenc
Collaborator

I haven't seen Sunil around in over a year. It might be worth extracting that test case into a new issue with a better title. [The signal to noise ratio around here has dropped to the point that even if that test case is a good reproducer it is probably buried.]

Bonus points if you cut the number of lines in half and embed it into the post itself, with actual copy-pasteable CLI repro steps and a failing strace snippet for the pedantry (so no one can claim they're missing). I can't guarantee that doing all that will help, but it can't hurt either. Bonne chance.

@slonm

slonm commented Mar 7, 2019

I have one more case for it. I'm using Android ADB over IP. So my case on WSL Ubuntu 18.04:
Run in the bash:
sudo apt install android-tools-adb
adb connect
adb shell

Now suspend Windows.
Wake it up after a 30-minute timeout.
Result: unable to open a new port. Netstat shows all dynamic ports in the BOUND state.

@ChristopherHammond13

I have been seeing this issue and it's easily reproducible with Expo.

  1. Start an Expo build / publish
  2. After some time, networking on both WSL and Windows stops allowing fresh connections: webpages already open continue to work and you can still establish fresh connections to the same server, but any external assets referenced on other pages that need to load from other URLs fail to load

To fix it, I do Ctrl-C everything that I have open, killall adb to get rid of the background adb processes, exit out of every tmux window and fully close the Ubuntu WSL terminal. Then networking will work properly again in Windows, without needing to restart the LXSS process as mentioned above.

@KinIcy

KinIcy commented Mar 11, 2019 via email

@whitehatboxer

find my family

@atniomn

atniomn commented Mar 26, 2019

My development environment is a mix of Windows and Linux-based backend applications and a frontend WinForm application. Previously, I would compile the Linux-based applications to target Windows. Once I got Docker working on WSL, I decided to run those applications in Linux containers. I decided against Docker for Windows in any form because it lacks the host networking mode, which is necessary for my frontend application to connect to these backend services.

Unfortunately, after running my environment for a little bit, my Chrome was unable to connect to any website in a new tab that I wasn't already connected to. At first, I thought this may be DNS, but my existing containers could ping any website I could think of. strace and netcat led me to a TCP issue. I then followed the repro above to reproduce without Docker running.

My hope was that eventually we would migrate our Windows-based backend applications to .NET Core 3.0 and redeploy them on Linux, so that everything in my development environment could run in Docker, without a VM. Unfortunately, this bug now means I will have to install a VM to run my Linux applications, or go back to targeting them against Windows.

@YihaoPeng changed the title from "TCP user port exhaust and network broken after lots of network/epoll operates" to "TCP user port exhaust and network broken after lots of non-blocking loopback connections" on Apr 1, 2019
@YihaoPeng changed the title from "TCP user port exhaust and network broken after lots of non-blocking loopback connections" to "TCP user port exhaust and network broken after lots of non-blocking connections" on Apr 1, 2019

@YihaoPeng
Author

YihaoPeng commented Apr 1, 2019

At the suggestion of therealkenc, the title and description of the issue have been completely updated.

I hope that a Microsoft employee can pay attention to this issue again.

I also tried to create a new Issue: #3951

@YihaoPeng changed the title from "TCP user port exhaust and network broken after lots of non-blocking connections" to "TCP user ports exhausted and network broken after lots of early-closing non-blocking connections" on Apr 1, 2019
@YihaoPeng changed the title from "TCP user ports exhausted and network broken after lots of early-closing non-blocking connections" to "TCP ephemeral ports exhausted and network broken after lots of early-closing non-blocking connections" on Apr 1, 2019

@therealkenc
Collaborator

Rebooting this into the new one. Thanks.

@YihaoPeng changed the title from "TCP ephemeral ports exhausted and network broken after lots of early-closing non-blocking connections" to "TCP ephemeral ports exhausted and network broken after lots of early-closed non-blocking connections" on Apr 2, 2019

@attilastrba

attilastrba commented Apr 25, 2019

I am having the same issue, and it is reproducible.
When I run byobu, wget is for some reason called every second and killed, allocating TCP ports each time, until I run out of ports.
What I don't get is why this issue and all the related ones are closed. Is there a fix for it, or am I missing something?

@Brian-Perkins

Fixed in Windows Insider Build 18890

@whitehatboxer

Since WSL2 will be released, it is not even necessary to fix it.

@Wladastic

Started occurring again!
Whenever I close Gradio on wsl2.1.4.0, port redirects (of 7860, for example) randomly stop working and I get this error message in the Event Log:
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.
