Mirrored network becomes unavailable 24h after session start #10587
Comments
Hi there. Since this takes a very long time to repro, could you run the following until you get a repro? It will capture minimal traces (just the WSL traces for managing the Linux network settings). E.g.:

C:>logman start wsl_trace -p {b99cdb5a-039c-5046-e672-1a0de0a40211} -o wsl_trace.etl -ets
<<<<<<<< Now Repro >>>>>>>>
C:>logman stop wsl-trace -ets
Error:
C:>logman stop wsl_trace -ets
C:>dir *.etl
Directory of C:\
10/04/2023 07:53 PM 368,640 wsl_trace.etl

Please send back the generated ETL file.

Once you have a repro, could you then run a very short repro attempting to make a network connection from the WSL container? The below will be a much deeper trace to try to collect where data is getting lost.

powershell .\collect-wsl-logs.ps1 .\wsl_networking.wprp
(from https://github.com/microsoft/WSL/tree/master/diagnostics)

Thanks!
ETS and WSL log traces attached. Started the ETS approx. 2 hours before predicted loss of networking, and ended the ETS afterwards. WSL log trace was performed immediately after loss of networking. Loss occurred at 24h+10m... and here's the juicy new tidbit: I just happened to notice that when the event occurred, the interfaces don't lose IPs or anything at the interface level. However, the in-WSL Linux routing table is completely wiped clean. No default routes, nothing. That may be the smoking gun or where to point the Eye of Sauron next.
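Because the failure takes about a day to appear, it can help to record exactly when the IPv4 default route disappears. The following is a minimal Python sketch, not part of the WSL diagnostics: it polls `ip -4 route show default` inside the distro and reports how long the route survived. The function names, the 60-second polling interval, and the `max_polls` limit are assumptions made for illustration.

```python
import subprocess
import time
from datetime import datetime, timedelta
from typing import Callable, Optional


def default_route_present() -> bool:
    """Return True if `ip -4 route show default` prints at least one route."""
    result = subprocess.run(
        ["ip", "-4", "route", "show", "default"],
        capture_output=True,
        text=True,
        check=False,
    )
    return bool(result.stdout.strip())


def wait_for_route_loss(
    check: Callable[[], bool] = default_route_present,
    interval_s: float = 60.0,  # assumed polling interval, not from the thread
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], datetime] = datetime.now,
    max_polls: Optional[int] = None,
) -> Optional[timedelta]:
    """Poll until the default route vanishes; return the time elapsed.

    Returns None if max_polls checks all found the route still present.
    """
    started = now()
    polls = 0
    while max_polls is None or polls < max_polls:
        if not check():
            return now() - started
        polls += 1
        sleep(interval_s)
    return None
```

Calling `wait_for_route_loss()` right after `wsl --shutdown` and a fresh start would block until the route is gone, then return the session age at the moment of loss.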
Thanks. I can see from the trace that our code in WSL has been successfully pushing IP updates into the container. There weren't errors setting things up with Linux. It doesn't look like the wsl_networking.wprp created a trace to observe traffic failing.

While it's in this bad state, can you dump out the Linux state (https://github.com/microsoft/WSL/blob/master/diagnostics/networking.sh), then run:

wpr.exe -start wsl_networking.wprp -filemode
(then generate network traffic from the Linux container, like trying to ping an address, or wget bing.com a few times)
wpr.exe -stop wsl_networking.etl

Please let me know what traffic you tried to send, and send that ETL file. If you could also cat /etc/resolv.conf, we can see what the DNS configuration is.
The ETL file is approximately 300 megs, zipped to 70, too big for an attachment. I have made it available at https://1drv.ms/u/s!AtUhMGXKAUHRgqFFUgXnAVLNKoHZmA?e=uLvtda

For giggles and completeness, networking-good.txt is a run of the networking shell script while everything is okay. networking-bad.txt is a dump of the script in the bad state. The ETL is also attached. Some pings against 1.1.1.1, local router 192.168.0.1, bing, google, etc. were all attempted.

Interestingly, looking at the networking-bad dump and some other observations, the IPv4 default route and subnets are nuked. But IPv6 remains up and available. In fact, if I know the IPv6 address of some services, new traffic is passed. Existing traffic was dropped: at the time of the event, I had an IPv6 connection open to one of the Libera IRC network servers, which was dropped. But I was able to successfully ping the IPv6 addresses of both Google and the Libera IRC server I was connected to at the time. All of those pings, both successful and unsuccessful, were captured in the ETL file.
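The split between working IPv6 and failing IPv4 can be checked without sending any packets. The sketch below is an illustration rather than anything from the WSL diagnostics: it relies on the fact that on Linux, `connect()` on a UDP socket performs a route lookup and fails with `ENETUNREACH` when no route matches, which is the same error `ping` prints as "Network is unreachable". The function name and the default port are assumptions.

```python
import errno
import socket


def route_exists(address: str, port: int = 53) -> bool:
    """Return True if the kernel can find a route to the given IP literal.

    A UDP connect() only selects a route and source address; nothing is
    transmitted, so this is safe to run repeatedly.
    """
    family = socket.AF_INET6 if ":" in address else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            sock.connect((address, port))
        except OSError as exc:
            # No matching route in the table for this destination.
            if exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
                return False
            raise
    return True
```

In the bad state described above, `route_exists("1.1.1.1")` would be expected to return False while a call with a reachable IPv6 literal would still return True.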
Thanks. The traffic over IPv6 is working (because there's a v6 route), but IPv4 doesn't have a route, so all of that traffic is failing. We have now heard a couple of instances where something running on the Windows host is affecting the vNIC that we use, causing the vmNIC in the container to go down and up again, at which point Linux will delete the IPv4 route (that's just Linux stack behavior, for whatever reasons). It's not clear what is changing the state of the vNIC on the host though. Nothing in WSL indicates that it changed (if HNS, the component that creates the vNICs, had changed it, we would get a callback notification). We are going to talk more internally about better responding to this and syncing IP state in Linux when we see changes occur unexpectedly.
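To confirm that a missing IPv4 default route is what is being hit, the kernel's IPv4 routing table can be inspected directly. This is a minimal sketch, assuming the standard `/proc/net/route` layout (whitespace-separated columns with hexadecimal Destination, Flags, and Mask fields); the function name is hypothetical and this is not how WSL itself checks the state.

```python
RTF_UP = 0x0001  # route is usable (flag value from <linux/route.h>)


def has_ipv4_default_route(route_table: str) -> bool:
    """Return True if the /proc/net/route text contains an active default route.

    A default route has Destination 00000000 and Mask 00000000.
    """
    lines = route_table.splitlines()
    for line in lines[1:]:  # skip the column header row
        fields = line.split()
        if len(fields) < 8:
            continue  # ignore blank or truncated rows
        destination = fields[1]
        flags = int(fields[3], 16)
        mask = fields[7]
        if destination == "00000000" and mask == "00000000" and flags & RTF_UP:
            return True
    return False
```

Inside the distro, passing the contents of `/proc/net/route` to this function (for example via `pathlib.Path("/proc/net/route").read_text()`) should return False once the vmNIC has bounced and the route has been deleted.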
Yup, thanks Keith! I read the other thread, and that one's author is a lot more thorough than I am. I just confirmed over on that thread that the IPv6 Temporary IP behavior he suspected is also what triggers it for me.
Thank you all for your help debugging this. I was able to reproduce this and I have a fix which will hopefully be out with the next update.
The preview release should have the fix for this, which will hopefully be going to the public release soon.

wsl --update --pre-release

Thanks again!
Closing since the issue is fixed. If you still encounter the problem, please open a new issue. Thanks.
Windows Version
Microsoft Windows [Version 10.0.22621.2361]
WSL Version
2.0.1.0
Are you using WSL 1 or WSL 2?
Kernel Version
5.15.123.1-1
Distro Version
Debian 11
Other Software
1Password SSH agent relay tunneling using npiperelay.
Repro Steps
Within 24 hours of starting a Debian 11 WSL session running with mirrored networking, the network becomes unavailable. All existing connections are dropped, and all attempts to use non-loopback IPs return Network is unreachable. Remediation requires completely exiting WSL and performing a full shutdown of the WSL environment through wsl.exe --shutdown.
Expected Behavior
Networking remains available throughout the life of the session.
Actual Behavior
Networking becomes unavailable the next day.
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 192.168.0.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.036 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.036/0.036/0.036/0.000 ms
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ exit
logout
PS C:\Users\sbalmos> wsl
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ exit
logout
PS C:\Users\sbalmos> wsl --shutdown
PS C:\Users\sbalmos> wsl
removing previous socket...
Starting SSH-Agent relay...
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=14.1 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=11.5 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=56 time=11.9 ms
^C
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 11.511/12.520/14.105/1.134 ms
sbalmos@stormfront:/mnt/c/Users/sbalmos$
Current time is 10:48am ET. Since I restarted WSL here to regain networking, I expect to see networking become unavailable again somewhere around 10:45am ET tomorrow, give or take a few minutes.
Diagnostic Logs
No response