
WSL hang, arm64 cpu, not solved by other solutions #9454

Open
1 of 2 tasks
hwine opened this issue Jan 7, 2023 · 102 comments


@hwine

hwine commented Jan 7, 2023

Version

Windows version: 10.0.22621.963

WSL Version

  • WSL 2
  • WSL 1

Kernel Version

Kernel version: 5.15.79.1

Distro Version

Ubuntu 20.04.5 LTS

Other Software

I'm running on a Microsoft Surface X laptop with a Microsoft SQ2 processor & 16 GB of RAM

C:\Users\hwine> wsl -v -l
WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.963

Repro Steps

Intermittent, but occurs regularly (1-4 times a day).

  1. start Ubuntu in WSL
  2. use it for a while
  3. wait a few hours (leaving open)
  4. try to use it again, find terminals locked up.

Some factors that may make it worse (subjective & anecdotal):

  • sleep/wake PC (but PC is plugged in the entire time).
  • leave "fancy terminal apps" (e.g. vim) running in a non-displayed window (e.g. via tmux)

Some factors that do not appear to change behavior:

  • terminal app (I've used "Windows Terminal", "Windows Terminal Preview", and VSCode terminal)
  • docker running or not
  • use of X or Wayland GUI apps

I have followed suggestions in other "hang issues" such as #9114, #8824, but they have not resolved the issue.

Expected Behavior

Terminals and apps work without hangs. (I have simple needs.)

(I use Windows for all high GUI/network apps, such as Zoom, Web Browsing, etc.)

Actual Behavior

WSL becomes unresponsive:

  • no response to keyboard (powershell tabs in same terminal session work fine)
  • no response to mouse operations, such as clicking 'close' on an Xterm window
  • high CPU on VmmemWSL Windows background process

Trying to regain control in powershell with wsl.exe

  • wsl.exe -l -v executes and output is correct
  • wsl.exe -t work executes but may delay noticeably before returning
    • it "works" (as shown by a 2nd run of wsl.exe -l -v), but cannot start a fresh session
  • wsl.exe --shutdown with some success
    • if it returns, I can open a fresh working wsl session
    • wsl.exe --shutdown hangs about 50% of the time (does not return to powershell prompt)
    • Killing wslservice.exe will reliably allow me to start a new WSL session, but very occasionally the new session is overly sluggish, and I find rebooting clears the problem.
    • When killing wslservice.exe, the wsl.exe --shutdown will terminate, with the message:
C:\Users\hwine> wsl --shutdown
The remote procedure call failed.
Error code: Wsl/0x800706be
C:\Users\hwine>
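The recovery sequence above can be sketched as a single helper. This is a hypothetical sketch, not an official tool: it assumes the distro is named work (as in the report above), it must run from an elevated PowerShell prompt, and killing wslservice.exe remains the last resort.

```powershell
# recover-wsl.ps1 -- hypothetical sketch of the manual recovery steps above.
# Run from an elevated PowerShell prompt.
wsl.exe -l -v        # confirm the service still answers
wsl.exe -t work      # terminate the hung distro (may be slow to return)

# wsl --shutdown hangs roughly half the time, so bound the wait:
$p = Start-Process wsl.exe -ArgumentList '--shutdown' -PassThru
if (-not $p.WaitForExit(30000)) {
    # Last resort: kill the service; a fresh session usually starts afterwards.
    Stop-Process -Name wslservice -Force -ErrorAction SilentlyContinue
}
```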

Diagnostic Logs

I've tried the procedure described here by OneBlue to obtain dumps & other logs. However, I cannot successfully perform the procedure because the system distro shell is also hung. (It is a separate tab in the same Windows Terminal instance.)

I'm hoping you can provide some additional techniques to allow me to gather logs which will be of use in finding and fixing the issue.

@maxboone

maxboone commented Jan 16, 2023

Seconded, I experience the same behavior on my SP X SQ2. Sometimes the threads that are running stay responsive for a while, but generally the entire system / WSL instance just freezes. The only way for me to kill WSL in this situation is to forcefully kill the WSL instance (wsl --shutdown does not do a thing for me), through an elevated PowerShell window using hcsdiag kill @(hcsdiag list | Select-Object -First 1).

Considering the entire WSL instance (i.e. the VM) does not respond at the time of these crashes, it's not possible to run any command in the system distribution, even if it was opened before the crash happened. It feels like something is not entirely right with the process forking / VM on this chip, but it's difficult to do any troubleshooting (at the kernel level) as everything is frozen.

I've learned to live with this for now, and regularly have to clean up corrupted git repositories or files if the crash happens during a write (or git commit). I have noticed that if I just walk away from the machine and let it run for fifteen minutes or so, it often recovers.

@usedbytes

I'm seeing this too - if anything, more frequently than ever.

I was on W10 22H2, now on W11 22H2 22621.1105 and it hasn't made any noticeable difference. Surface Pro X SQ2.

I previously raised #7913 for what could be the same issue - but I no longer see RCU stalls reported in dmesg. I still see the "locked up" processes consuming 100% CPU according to top inside WSL2.

@OneBlue
Collaborator

OneBlue commented Jan 17, 2023

/logs

@ghost

ghost commented Jan 17, 2023

Hello! Could you please provide more logs to help us better diagnose your issue?

To collect WSL logs, download and execute collect-wsl-logs.ps1 in an administrative powershell prompt:

Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1

The script will output the path of the log file once done.

Once completed please upload the output files to this Github issue.

Click here for more info on logging

Thank you!

@ghost

ghost commented Jan 17, 2023

@hwine , do you have the debug shell running? You could try the same steps from there, as far as getting the dumps of init goes.

@ghost

ghost commented Jan 17, 2023

You can enable the debug shell as described here: #7930

debugShell=true
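For reference, that setting goes in .wslconfig in the Windows user profile directory; a minimal sketch, assuming it sits under the [wsl2] section alongside the other VM-level options:

```ini
[wsl2]
debugShell=true
```

Run wsl --shutdown afterwards so the next instance picks the setting up.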

@hwine
Author

hwine commented Jan 17, 2023

@pmartincic - thanks for the pointer -- I'm attempting to enable that, but having issues (or don't understand the full sequence). After adding debugShell=true, the root shell will not stay open.

To be clear, I'm attempting to follow the steps from this prior issue, with the debugShell addition. When I launch the root shell via C:\windows\system32\wsl.exe --user root --system:

  • the root window opens as before (from a non-elevated powershell prompt)
  • a (new) detached terminal window opens, titled WSL Debug Shell (no prompt ever appears in this window)
  • the normal bash prompt appears after about 3 seconds
  • this window closes after about 10 seconds

Is this the expected behavior?

Also, I now have 2 sets of instructions for gathering logs:

I'm assuming only the former logs are being requested. Please correct me if I've misunderstood the process.

@ghost

ghost commented Jan 17, 2023

For this to work you have to leave the detached window titled WSL Debug Shell open while you use WSL as you normally would. Currently it only opens when you launch an instance, and it can't be reopened unless the WSL instance is restarted. Once you experience a hang, you would work inside the WSL Debug Shell, instead of launching a system shell via C:\windows\system32\wsl.exe --user root --system, to attempt debugging the problem. Alternatively, you could just keep a system shell running minimized while you work, waiting for a reproduction of the hang. Make sense?

@ghost

ghost commented Jan 17, 2023

I misread your comment. We'll definitely want the logs as specified in /logs. It is not expected that no prompt appears in the debug shell or that the system shell window closes itself.

To clarify, both #9454 (comment) and #9114 (comment) will give us information, just different information from each.

It may be simpler for you to use the system shell, assuming you remember to launch it and leave it open, because it has /mnt/c mounted and the debug shell does not. However, if you're extracting dumps via the debug shell you won't need the system shell. You'll need one or the other, but not both.


@maxboone

maxboone commented Jan 18, 2023

Same here, if I run the system shell as specified and have the debug shell configured, it just disappears after a couple of seconds.

Of course the system shell stays open, but that one freezes up when my main / working instance hangs.

I think this is also the behavior that OP is talking about, see video:

P.S. same behavior on regular WSL instances (so not the system distro, but e.g. wsl -d teamblue), debug shell pops up and closes itself after a couple seconds (probably due to a timeout connecting to the underlying shell?).

@hwine
Author

hwine commented Jan 18, 2023

I think this is also the behavior that OP is talking about, see video:

Exactly! Thanks @maxboone! (I should learn how to make those videos.)

P.S. same behavior on regular WSL instances (so not the system distro, but e.g. wsl -d teamblue), debug shell pops up and closes itself after a couple seconds (probably due to a timeout connecting to the underlying shell?).

I do not get a debug shell popping up on my normal instances. 🤷

@ghost

ghost commented Jan 18, 2023

Okay, I can reproduce the issue with the debug shell not working on ARM.

@hwine
Author

hwine commented Jan 19, 2023

Okay, I think I have a good set of logs from both collection techniques.
wsl-init-dump-2023-01-18_16-35-12.zip
WslLogs-2023-01-18_16-53-18.zip

Note: for the logs collected from an elevated powershell, the process seemed to hang for a bit, and the screen text didn't give me confidence I did it correctly. :/

screen text
(.venv) C:\Users\hwine\Desktop\plumbum [master ≡ +8 ~2 -0 !]> Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
(.venv) C:\Users\hwine\Desktop\plumbum [master ≡ +8 ~2 -0 !]> Set-ExecutionPolicy Bypass -Scope Process -Force
(.venv) C:\Users\hwine\Desktop\plumbum [master ≡ +8 ~2 -0 !]> .\collect-wsl-logs.ps1
Directory: C:\Users\hwine\Desktop\plumbum

Mode LastWriteTime Length Name


d---- 2023-01-18 16:53 WslLogs-2023-01-18_16-53-18
The operation completed successfully.eted. (Number of bytes processed: 3527) ]
The operation completed successfully.
The operation completed successfully.
Get-WindowsOptionalFeature: C:\Users\hwine\Desktop\plumbum\collect-wsl-logs.ps1:38
Line |
38 | Get-WindowsOptionalFeature -Online > $folder/optional-components.txt
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Class not registered
Log collection is running. Please reproduce the problem and press any key to save the logs.
Saving logs...
Press Ctrl+C to cancel the stop operation. (Number of bytes processed: 3527) ]
100% [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]
Logs saved in: C:\Users\hwine\Desktop\plumbum\WslLogs-2023-01-18_16-53-18.zip. Please attach that file to the GitHub issue.

(.venv) C:\Users\hwine\Desktop\plumbum [master ≡ +9 ~2 -0 !]> ls .\WslLogs-2023-01-18_16-53-18.zip

Directory: C:\Users\hwine\Desktop\plumbum

Mode LastWriteTime Length Name


-a--- 2023-01-18 16:55 36975 WslLogs-2023-01-18_16-53-18.zip

(.venv) C:\Users\hwine\Desktop\plumbum [master ≡ +9 ~2 -0 !]>

@ghost

ghost commented Jan 19, 2023

Thanks! I'm still working on the debug shell issue at the moment.

@hwine
Author

hwine commented Jan 19, 2023

Thanks! I'm still working on the debug shell issue at the moment.

Serendipity -- shortly after reading your response, I noticed an anomaly that might shed some light. I had one of the "WSL Debug Shell" windows display on my screen. What's odd about that is I had closed out of all wsl instances last night (wsl --shutdown), and do not see any of the normal processes I associate with WSL running.

However:

  • I can't "find" that window again, even though I didn't close it
    • it doesn't appear in <alt>-<tab> cycle
    • it doesn't appear in the "new desktop" icon on the task bar
    • it doesn't appear after I've minimized all other windows
  • I do use an X server, and it reports it is still serving 6 clients!!!!

I've not had luck with WSLg -- issues from a number of months ago suggested the SQ2 chip's graphics chip wasn't supported, so poor performance was expected. I haven't tried WSLg since then.

@maxboone are you using WSLg or an X server?

@maxboone

@maxboone are you using WSLg or an X server?

WSLg natively, but I must say that I don't use it often.

@hwine
Author

hwine commented Jan 25, 2023

As I understand it, there's nothing I can do about providing additional logs until @pmartincic resolves the debug shell bug.

For whatever reason, the hanging is now occurring more frequently for me. Anecdotally, I am finding dmesg reports a stalled CPU after a "hang" self resolves.

typical report
[  +0.006416] sd 0:0:0:1: [sdb] Attached SCSI disk
[  +0.005168] sd 0:0:0:2: [sdc] Attached SCSI disk
[  +0.088459] Adding 2097152k swap on /dev/sdb.  Priority:-2 extents:1 across:2097152k
[  +0.011115] EXT4-fs (sdc): recovery complete
[  +0.002790] EXT4-fs (sdc): mounted filesystem with ordered data mode. Opts: discard,errors=remount-ro,data=ordered>
[  +0.077833] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  +0.263242] rsyslog: unrecognized service
[Jan25 07:58] hv_balloon: Max. dynamic memory size: 7972 MB
[Jan25 08:00] TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised.
[Jan25 11:32] rcu: INFO: rcu_sched self-detected stall on CPU
[  +0.001879] rcu:      5-....: (14999 ticks this GP) idle=2b7/1/0x4000000000000002 softirq=54910/54910 fqs=7336
[  +0.000764]   (t=15000 jiffies g=192973 q=27380)
[  +0.000512] Task dump for CPU 5:
[  +0.000316] task:bash            state:R  running task     stack:    0 pid:20006 ppid:   645 flags:0x0000000a
[  +0.000885] Call trace:
[  +0.000222]  dump_backtrace+0x0/0x1c8
[  +0.001334]  show_stack+0x1c/0x28
[  +0.000345]  sched_show_task+0x14c/0x180
[  +0.000376]  dump_cpu_task+0x48/0x54
[  +0.000546]  rcu_dump_cpu_stacks+0xf4/0x138
[  +0.000276]  rcu_sched_clock_irq+0x938/0xaa0
[  +0.000479]  update_process_times+0x9c/0x2e0
[  +0.000373]  tick_sched_handle.isra.0+0x38/0x50
[  +0.000467]  tick_sched_timer+0x50/0xa0
[  +0.000274]  __hrtimer_run_queues+0x11c/0x328
[  +0.000345]  hrtimer_interrupt+0x118/0x300
[  +0.000291]  hv_stimer0_isr+0x28/0x30
[  +0.000516]  hv_stimer0_percpu_isr+0x14/0x20
[  +0.000431]  handle_percpu_devid_irq+0x8c/0x1c0
[  +0.000461]  handle_domain_irq+0x64/0x90
[  +0.000345]  gic_handle_irq+0xb8/0x128
[  +0.000324]  call_on_irq_stack+0x28/0x3c
[  +0.000330]  do_interrupt_handler+0x54/0x5c
[  +0.000397]  el1_interrupt+0x2c/0x40
[  +0.000452]  el1h_64_irq_handler+0x14/0x20
[  +0.000353]  el1h_64_irq+0x74/0x78
[  +0.000319]  clear_rseq_cs.isra.0+0x4c/0x60
[  +0.000483]  do_notify_resume+0xfc/0x3c0
[  +0.000317]  el0_svc+0x3c/0x48
[  +0.000315]  el0t_64_sync_handler+0xa8/0xb0
[  +0.000324]  el0t_64_sync+0x158/0x15c

From scanning the RCU docs and configuration options, I'm hoping that some tuning may reduce the impact of these hangs until a fix is available. Before I start kernel-tuning-by-coincidence (which feels risky), does anyone have any suggestions?

Absent any other guidance, my first experiment will be to change /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout from the current value of 60, to the documented (for kernel 6.2.0) 21 seconds.

@ghost

ghost commented Jan 25, 2023

@hwine , My apologies, I got pre-empted by 9508. After that I can come back to this.

@2hansen

2hansen commented Jan 27, 2023

I experience the same behavior since switching from an Intel-based laptop to my SP 9 SQ3. It happens very frequently and makes it quite hard to use WSL for development work. When it is unresponsive, the VmmemWSL process has high CPU load.
PS C:\WINDOWS\system32> wsl -v
WSL version: 1.1.0.0
Kernel version: 5.15.83.1
WSLg version: 1.0.48
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22623.1245

@maxboone

maxboone commented Jan 27, 2023

Absent any other guidance, my first experiment will be to change /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout from the current value of 60, to the documented (for kernel 6.2.0) 21 seconds.

Good stuff @hwine!

I'll try the same for my instances. Where do you plan on tweaking the kernel settings? Considering it's system-wide (all instances & the --system instance), are you going to try tweaking this on the system distro?

For the non-system distros, I'll set these tweaks (or create a script that does them, if more tweaks are added) using the following in my wsl.conf:

[boot]
command = "echo 21 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout &"

I do see that there's a vmIdleTimeout setting that defaults to 60000 (60s) in .wslconfig. I wonder if that's related.

It's not.
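That boot command can also be written as a small standalone script with a guard, so it no-ops when the sysfs knob is absent (a hypothetical helper; the file only exists on a running Linux kernel with the rcupdate module parameters exposed):

```shell
#!/bin/sh
# set-rcu-stall-timeout.sh -- hypothetical helper: adjust the RCU
# stall-warning timeout, but only if the sysfs knob actually exists.
knob=/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
timeout=${1:-21}   # seconds; 21 is the kernel 6.2 default mentioned above
if [ -w "$knob" ]; then
    echo "$timeout" > "$knob"
    echo "rcu_cpu_stall_timeout now $(cat "$knob")"
else
    echo "$knob not available or not writable; nothing changed"
fi
```

It could be invoked from the [boot] command in wsl.conf instead of the inline echo, which keeps a typo in the one-liner from silently taking effect at every boot.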

@2hansen

2hansen commented Jan 27, 2023

I experience the same behavior since switching from an Intel-based laptop to my SP 9 SQ3. It happens very frequently and makes it quite hard to use WSL for development work. When it is unresponsive, the VmmemWSL process has high CPU load.
PS C:\WINDOWS\system32> wsl -v
WSL version: 1.1.0.0
Kernel version: 5.15.83.1
WSLg version: 1.0.48
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22623.1245

Noted that after switching to the Visual Studio Code Insiders edition the problem seems to be much less prevalent -- now only system sleep seems to make WSL hang.

@hwine
Author

hwine commented Jan 28, 2023

I'll try the same for my instances, where do you plan on tweaking the kernel settings. Considering it's a system-wide (all instances & --system instance), are you going to try tweaking this on the system distro?

Until the debug feature is fixed, I'm assuming that running the system instance won't provide any value, so I'm only running normal distros (I only have one).

For the non-system distros, I'll set these tweaks (or create a script that does them, if more tweaks are added) using the following in my wsl.conf:

@maxboone -- you are far gutsier than I! An error there could prevent booting, and I don't want the recovery hassle. :)

I'm manually running a script after I boot. Based on the docs, the new values "will [affect] the timeout for the next stall". So, as long as the system survives for a minute, I'm good.

I'm just activating the changes -- this week had too many fire drills to try "on the clock".

@david-nordvall

@david-nordvall, sorry for the slow response. I'm currently working on other host/guest interaction issues surrounding timekeeping and hibernation. It is tracked, but its being ARM-specific doesn't help. The one with the ability to bump/reprioritize is @craigloewen-msft.

@craigloewen-msft, any chance to get some focus on this issue? Having purchased a perfectly capable (and expensive) PC on a platform which was very much being positioned as a developer platform at the time (I guess everything is AI nowadays, instead...) just to be forced to use GitHub Codespaces in order to actually get some work done doesn't feel great... What do you need from me to be able to successfully debug this?

@craigloewen-msft
Member

@david-nordvall can you open a new issue for us that describes your specific issue? And can you tag this issue number in it? Lastly, can you get WSL to the point where it's hanging for a long time and then can you collect crash dumps and attach them to the issue? Updated instructions on how to do that can be found here.

Thank you!

@david-nordvall

@craigloewen-msft No problem, have created issue 10667 and made crash dumps available there.

@maxboone

maxboone commented Nov 1, 2023

#10667 (comment)

There is a Windows OS fix that targets an ARM-specific hang that should be going out next month (I think the fourth patch Tuesday of November).

@agascoig

agascoig commented Nov 17, 2023

I am having periodic freezes in WSL2 on a Thinkpad x13s (Qualcomm 8cx Gen 3) ARM64 Windows. I am using the collect-wsl-logs.ps1 script to try to trace what is happening, and I get events in logs.etl like "The description for Event ID 0 from source Microsoft-Windows-Host-Network-Service cannot be found." I have tried editing the /etc/hosts file and turning off regeneration of it, but this issue still persists.

PS C:\Users\alexg\wsl> wsl --version
WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.2715

It appears there is some sort of event that is not getting properly logged. I hope that the upcoming fix will address this issue.
WslLogs-2023-11-16_17-23-50.zip

EDIT: 6:46 PM PST: I see on WSL2 reddit that one may need to re-enable the "Virtual Machine Platform" in Windows Features after the November 2023 Windows 11 updates.

EDIT: 5:17 PM PST Nov 19 2023: I am still getting the freezes, but less often.

EDIT: 5:09 PM PST Nov 22 2023: I should mention I am using VS Code WSL while observing all of these problems.

@f0o

f0o commented Dec 22, 2023

#10667 (comment)

There is a Windows OS fix that targets an ARM-specific hang that should be going out next month (I think the fourth patch Tuesday of November).

Which patch would that be? I'm still experiencing this multiple times per day and have to force-kill wslservice.exe to get it restarted.

@david-nordvall

@f0o I've installed all patch tuesday patches to date and the issue still persists. I've asked in issue #10667 if the mentioned fix has actually been released (or, if it has and doesn't solve the issue, I can provide any more information to help debugging), but no response so far... 😔

@f0o

f0o commented Dec 22, 2023

@david-nordvall Shame; I'm going to get 23H2 update now and see if that solves anything. Will report back in probably an hour or two.

Force-killing wslservice every so often (sometimes once per hour) is getting very tedious when you want to work on the go with your Surface.

@david-nordvall

@f0o I'm already on 23H2 but no luck, for me at least. But please report back.

Once per hour sounds great, actually. If I start a moderately complex development environment in Docker (in WSL2) I can work for about 2-10 minutes before WSL hangs.

@f0o

f0o commented Dec 22, 2023

@david-nordvall yikes.

I'm also on VSCode with WSL2 -> Docker (DevContainers). It feels more or less random when it crashes. Sometimes it just dies instantly other times it survives an hour or two.

Not really running much intensive work, usually just writing code and pushing to git for the CI to do all the heavy lifting of compiling/testing.

Hell, it even crashes when I just use WSL to ssh into some jumpbox for IRC'ing. I somehow don't think it's workload-dependent, but simply some interrupt hangup somewhere that ends up stalling the whole thing.

@f0o

f0o commented Dec 23, 2023

23H2 deffo changed something, because WSL won't start at all anymore 😂

Had to wipe all VMs and set everything up again for some reason.

Issue still persists.

@david-nordvall

Update: I completely wiped my Surface Pro 9 and re-installed Windows (using the restore functionality built in to Windows). The issue still persists :-(

@luozf14

luozf14 commented Feb 2, 2024

Same issue. ThinkPad X13s Gen 1, Win11 23H2 on ARM64. WSL2 hangs after 20 minutes of text-editing work using VS Code. wsl --shutdown takes about 20 min to complete.

@vadika

vadika commented Feb 13, 2024

Thinkpad X13s Gen 1 -- looks like I found the solution, at least it works for me -- no hangs, no excessive CPU load, and it survives sleep/resume.
WSL version: 2.1.1.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3085

I upgraded WSL to the pre-release (wsl --update --pre-release) and disabled DirectX by adding the following to .wslconfig in my user profile directory:

[wsl2]
gpuSupport=false

I did this because there were kernel messages about dx interaction in the Linux kernel log. Restarted WSL and have had no problems since (24 hours stable now).

@awfulcooking

awfulcooking commented Feb 13, 2024

Surface Pro X (SQ1, 23H2, Debian):

[    2.764032] Adding 4194304k swap on /dev/sdb.  Priority:-2 extents:1 across:4194304k
[    3.511084] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[    3.519239] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[    3.520770] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[    3.522474] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[    3.524000] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
Processing LX_GNS_MESSAGE_TYPE LxGnsMessageConnectTestRequest
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET]
CheckConnection: connecting to 90.244.154.42
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET6]
Exception: CheckConnection: getaddrinfo() failed: -2 - Name does not resolve
CheckConnection: arming select for 5 seconds
CheckConnection: v4 succeeded
CheckConnection: returning v4 (1) v6 (2)
LxGnsMessageConnectTestRequest (destination: www.msftconnecttest.com) returning: 0x20001
GNS: Processing Next Message Successful (0x20001)
Returning LxGnsMessageResult [131073 - ]
GNS: Processing Next Message
[   49.974868] hv_balloon: Max. dynamic memory size: 1700 MB
[  152.236427] mini_init (109): drop_caches: 1
Processing LX_GNS_MESSAGE_TYPE LxGnsMessageConnectTestRequest
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET]
CheckConnection: connecting to 90.244.154.42
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET6]
Exception: CheckConnection: getaddrinfo() failed: -2 - Name does not resolve
CheckConnection: arming select for 5 seconds
CheckConnection: v4 succeeded
CheckConnection: returning v4 (1) v6 (2)
LxGnsMessageConnectTestRequest (destination: www.msftconnecttest.com) returning: 0x20001
GNS: Processing Next Message Successful (0x20001)
Returning LxGnsMessageResult [131073 - ]
GNS: Processing Next Message
[  967.094206] rcu: INFO: rcu_sched self-detected stall on CPU
[  967.096424] rcu:     1-....: (14964 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=7411
[  967.097970]  (t=15000 jiffies g=8973 q=536)
[  967.098671] Task dump for CPU 1:
[  967.099278] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[  967.100801] Call trace:
[  967.101219]  dump_backtrace+0x0/0x1b0
[  967.102459]  show_stack+0x1c/0x24
[  967.103047]  sched_show_task+0x164/0x190
[  967.103731]  dump_cpu_task+0x48/0x54
[  967.104653]  rcu_dump_cpu_stacks+0xec/0x130
[  967.105234]  rcu_sched_clock_irq+0x908/0xa40
[  967.106096]  update_process_times+0xa0/0x190
[  967.106904]  tick_sched_timer+0x5c/0xd0
[  967.107540]  __hrtimer_run_queues+0x140/0x32c
[  967.108283]  hrtimer_interrupt+0xf4/0x240
[  967.108817]  hv_stimer0_isr+0x28/0x30
[  967.109868]  hv_stimer0_percpu_isr+0x14/0x20
[  967.110610]  handle_percpu_devid_irq+0x8c/0x1b4
[  967.111397]  handle_domain_irq+0x64/0x90
[  967.111987]  gic_handle_irq+0x58/0x128
[  967.112506]  call_on_irq_stack+0x20/0x38
[  967.113052]  do_interrupt_handler+0x54/0x5c
[  967.113650]  el1_interrupt+0x2c/0x4c
[  967.114441]  el1h_64_irq_handler+0x14/0x20
[  967.115052]  el1h_64_irq+0x74/0x78
[  967.115632]  clear_rseq_cs.isra.0+0x4c/0x60
[  967.116433]  do_notify_resume+0xf8/0xeb0
[  967.116960]  el0_svc+0x3c/0x50
[  967.117537]  el0t_64_sync_handler+0x9c/0x120
[  967.118323]  el0t_64_sync+0x158/0x15c
Processing LX_GNS_MESSAGE_TYPE LxGnsMessageConnectTestRequest
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET]
CheckConnection: connecting to 90.244.154.34
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET6]
Exception: CheckConnection: getaddrinfo() failed: -2 - Name does not resolve
CheckConnection: arming select for 5 seconds
CheckConnection: result.Ipv4Status = ConnCheckStatus::FailureSocketConnect (timeout)
CheckConnection: returning v4 (4) v6 (2)
LxGnsMessageConnectTestRequest (destination: www.msftconnecttest.com) returning: 0x20004
GNS: Processing Next Message Successful (0x20004)
Returning LxGnsMessageResult [131076 - ]
GNS: Processing Next Message
[ 1135.741051] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1135.742230] rcu:     1-....: (59854 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=29408
[ 1135.743672]  (t=60008 jiffies g=8973 q=2022)
[ 1135.744460] Task dump for CPU 1:
[ 1135.745254] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1135.746398] Call trace:
[ 1135.746621]  dump_backtrace+0x0/0x1b0
[ 1135.747115]  show_stack+0x1c/0x24
[ 1135.747660]  sched_show_task+0x164/0x190
[ 1135.748118]  dump_cpu_task+0x48/0x54
[ 1135.748543]  rcu_dump_cpu_stacks+0xec/0x130
[ 1135.749080]  rcu_sched_clock_irq+0x908/0xa40
[ 1135.749650]  update_process_times+0xa0/0x190
[ 1135.750207]  tick_sched_timer+0x5c/0xd0
[ 1135.750728]  __hrtimer_run_queues+0x140/0x32c
[ 1135.751211]  hrtimer_interrupt+0xf4/0x240
[ 1135.751598]  hv_stimer0_isr+0x28/0x30
[ 1135.752030]  hv_stimer0_percpu_isr+0x14/0x20
[ 1135.752591]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1135.753091]  handle_domain_irq+0x64/0x90
[ 1135.753471]  gic_handle_irq+0x58/0x128
[ 1135.753886]  call_on_irq_stack+0x20/0x38
[ 1135.754276]  do_interrupt_handler+0x54/0x5c
[ 1135.754717]  el1_interrupt+0x2c/0x4c
[ 1135.755145]  el1h_64_irq_handler+0x14/0x20
[ 1135.755505]  el1h_64_irq+0x74/0x78
[ 1135.755849]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1135.756319]  do_notify_resume+0xf8/0xeb0
[ 1135.756675]  el0_svc+0x3c/0x50
[ 1135.757065]  el0t_64_sync_handler+0x9c/0x120
[ 1135.757596]  el0t_64_sync+0x158/0x15c
Processing LX_GNS_MESSAGE_TYPE LxGnsMessageConnectTestRequest
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET]
Exception: CheckConnection: getaddrinfo() failed: -3 - Try again
CheckConnection: resolving the name www.msftconnecttest.com [AF_INET6]
Exception: CheckConnection: getaddrinfo() failed: -3 - Try again
CheckConnection: returning v4 (2) v6 (2)
LxGnsMessageConnectTestRequest (destination: www.msftconnecttest.com) returning: 0x20002
GNS: Processing Next Message Successful (0x20002)
Returning LxGnsMessageResult [131074 - ]
GNS: Processing Next Message
[ 1301.906019] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1301.907897] rcu:     1-....: (104816 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=51500
[ 1301.909720]  (t=105011 jiffies g=8973 q=3510)
[ 1301.910689] Task dump for CPU 1:
[ 1301.911380] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1301.913333] Call trace:
[ 1301.913785]  dump_backtrace+0x0/0x1b0
[ 1301.914694]  show_stack+0x1c/0x24
[ 1301.915494]  sched_show_task+0x164/0x190
[ 1301.916105]  dump_cpu_task+0x48/0x54
[ 1301.916837]  rcu_dump_cpu_stacks+0xec/0x130
[ 1301.917522]  rcu_sched_clock_irq+0x908/0xa40
[ 1301.918803]  update_process_times+0xa0/0x190
[ 1301.919619]  tick_sched_timer+0x5c/0xd0
[ 1301.920287]  __hrtimer_run_queues+0x140/0x32c
[ 1301.921390]  hrtimer_interrupt+0xf4/0x240
[ 1301.922079]  hv_stimer0_isr+0x28/0x30
[ 1301.922699]  hv_stimer0_percpu_isr+0x14/0x20
[ 1301.923514]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1301.924360]  handle_domain_irq+0x64/0x90
[ 1301.924952]  gic_handle_irq+0x58/0x128
[ 1301.925609]  call_on_irq_stack+0x20/0x38
[ 1301.926268]  do_interrupt_handler+0x54/0x5c
[ 1301.926894]  el1_interrupt+0x2c/0x4c
[ 1301.927543]  el1h_64_irq_handler+0x14/0x20
[ 1301.928252]  el1h_64_irq+0x74/0x78
[ 1301.928911]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1301.929527]  do_notify_resume+0xf8/0xeb0
[ 1301.931746]  el0_svc+0x3c/0x50
[ 1301.932741]  el0t_64_sync_handler+0x9c/0x120
[ 1301.933676]  el0t_64_sync+0x158/0x15c
[ 1468.089387] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1468.090681] rcu:     1-....: (149813 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=73593
[ 1468.092491]  (t=150019 jiffies g=8973 q=4968)
[ 1468.093259] Task dump for CPU 1:
[ 1468.093982] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1468.095831] Call trace:
[ 1468.096253]  dump_backtrace+0x0/0x1b0
[ 1468.097658]  show_stack+0x1c/0x24
[ 1468.098868]  sched_show_task+0x164/0x190
[ 1468.099487]  dump_cpu_task+0x48/0x54
[ 1468.100903]  rcu_dump_cpu_stacks+0xec/0x130
[ 1468.101611]  rcu_sched_clock_irq+0x908/0xa40
[ 1468.102838]  update_process_times+0xa0/0x190
[ 1468.103719]  tick_sched_timer+0x5c/0xd0
[ 1468.104406]  __hrtimer_run_queues+0x140/0x32c
[ 1468.105287]  hrtimer_interrupt+0xf4/0x240
[ 1468.105953]  hv_stimer0_isr+0x28/0x30
[ 1468.106588]  hv_stimer0_percpu_isr+0x14/0x20
[ 1468.107497]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1468.109079]  handle_domain_irq+0x64/0x90
[ 1468.109684]  gic_handle_irq+0x58/0x128
[ 1468.110504]  call_on_irq_stack+0x20/0x38
[ 1468.111322]  do_interrupt_handler+0x54/0x5c
[ 1468.111945]  el1_interrupt+0x2c/0x4c
[ 1468.112568]  el1h_64_irq_handler+0x14/0x20
[ 1468.114268]  el1h_64_irq+0x74/0x78
[ 1468.115010]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1468.115848]  do_notify_resume+0xf8/0xeb0
[ 1468.116700]  el0_svc+0x3c/0x50
[ 1468.117973]  el0t_64_sync_handler+0x9c/0x120
[ 1468.118875]  el0t_64_sync+0x158/0x15c
[ 1634.272774] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1634.273960] rcu:     1-....: (194688 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=95152
[ 1634.275462]  (t=195027 jiffies g=8973 q=6426)
[ 1634.276232] Task dump for CPU 1:
[ 1634.276900] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1634.278492] Call trace:
[ 1634.278904]  dump_backtrace+0x0/0x1b0
[ 1634.279530]  show_stack+0x1c/0x24
[ 1634.280323]  sched_show_task+0x164/0x190
[ 1634.281025]  dump_cpu_task+0x48/0x54
[ 1634.281731]  rcu_dump_cpu_stacks+0xec/0x130
[ 1634.282321]  rcu_sched_clock_irq+0x908/0xa40
[ 1634.283189]  update_process_times+0xa0/0x190
[ 1634.284083]  tick_sched_timer+0x5c/0xd0
[ 1634.284730]  __hrtimer_run_queues+0x140/0x32c
[ 1634.285560]  hrtimer_interrupt+0xf4/0x240
[ 1634.286247]  hv_stimer0_isr+0x28/0x30
[ 1634.286878]  hv_stimer0_percpu_isr+0x14/0x20
[ 1634.287682]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1634.288558]  handle_domain_irq+0x64/0x90
[ 1634.289195]  gic_handle_irq+0x58/0x128
[ 1634.289787]  call_on_irq_stack+0x20/0x38
[ 1634.290395]  do_interrupt_handler+0x54/0x5c
[ 1634.291090]  el1_interrupt+0x2c/0x4c
[ 1634.291835]  el1h_64_irq_handler+0x14/0x20
[ 1634.292476]  el1h_64_irq+0x74/0x78
[ 1634.293174]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1634.293958]  do_notify_resume+0xf8/0xeb0
[ 1634.294543]  el0_svc+0x3c/0x50
[ 1634.295654]  el0t_64_sync_handler+0x9c/0x120
[ 1634.297186]  el0t_64_sync+0x158/0x15c
[ 1800.463929] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1800.539655] rcu:     1-....: (238052 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=115916
[ 1800.691505]  (t=240096 jiffies g=8973 q=7884)
[ 1800.692592] Task dump for CPU 1:
[ 1800.725924] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1800.789528] Call trace:
[ 1800.791436]  dump_backtrace+0x0/0x1b0
[ 1800.899671]  show_stack+0x1c/0x24
[ 1800.913866]  sched_show_task+0x164/0x190
[ 1800.943585]  dump_cpu_task+0x48/0x54
[ 1800.960038]  rcu_dump_cpu_stacks+0xec/0x130
[ 1800.970630]  rcu_sched_clock_irq+0x908/0xa40
[ 1801.002321]  update_process_times+0xa0/0x190
[ 1801.010376]  tick_sched_timer+0x5c/0xd0
[ 1801.016018]  __hrtimer_run_queues+0x140/0x32c
[ 1801.039996]  hrtimer_interrupt+0xf4/0x240
[ 1801.132611]  hv_stimer0_isr+0x28/0x30
[ 1801.177669]  hv_stimer0_percpu_isr+0x14/0x20
[ 1801.233281]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1801.234490]  handle_domain_irq+0x64/0x90
[ 1801.335334]  gic_handle_irq+0x58/0x128
[ 1801.336804]  call_on_irq_stack+0x20/0x38
[ 1801.338003]  do_interrupt_handler+0x54/0x5c
[ 1801.353157]  el1_interrupt+0x2c/0x4c
[ 1801.403627]  el1h_64_irq_handler+0x14/0x20
[ 1801.467756]  el1h_64_irq+0x74/0x78
[ 1801.491109]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1801.519716]  do_notify_resume+0xf8/0xeb0
[ 1801.570939]  el0_svc+0x3c/0x50
[ 1801.587231]  el0t_64_sync_handler+0x9c/0x120
[ 1801.598718]  el0t_64_sync+0x158/0x15c
[ 1836.057995] hrtimer: interrupt took 44005651 ns
[ 1962.769498] Exception:
[ 1962.770161] Operation canceled @p9io.cpp:258 (AcceptAsync)
[ 1962.944713]
[ 1967.784952] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1967.856279] rcu:     1-....: (276088 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=132893
[ 1968.000378]  (t=285406 jiffies g=8973 q=9331)
[ 1968.103228] rcu: rcu_sched kthread starved for 1379 jiffies! g8973 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=2
[ 1968.320482] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1968.600515] rcu: RCU grace-period kthread stack dump:
[ 1968.683365] task:rcu_sched       state:R  running task     stack:    0 pid:   14 ppid:     2 flags:0x00000008
[ 1968.852308] Call trace:
[ 1968.888048]  __switch_to+0xb4/0xec
[ 1968.999212]  __schedule+0x2cc/0x800
[ 1969.061683]  schedule+0x64/0x100
[ 1969.144007]  schedule_timeout+0x9c/0x184
[ 1969.220361]  rcu_gp_fqs_loop+0x100/0x364
[ 1969.296220]  rcu_gp_kthread+0x108/0x140
[ 1969.388705]  kthread+0x124/0x130
[ 1969.468734]  ret_from_fork+0x10/0x20
[ 1969.557124] rcu: Stack dump where RCU GP kthread last ran:
[ 1969.689180] Task dump for CPU 2:
[ 1969.770725] task:weston          state:R  running task     stack:    0 pid:  145 ppid:   133 flags:0x0000000f
[ 1970.150921] Call trace:
[ 1970.246591]  __switch_to+0xb4/0xec
[ 1970.344106]  0xffff000002cb1d00
[ 1970.437906] Task dump for CPU 1:
[ 1970.528765] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000b
[ 1971.000258] Call trace:
[ 1971.129089]  dump_backtrace+0x0/0x1b0
[ 1971.277596]  show_stack+0x1c/0x24
[ 1971.362667]  sched_show_task+0x164/0x190
[ 1971.434626]  dump_cpu_task+0x48/0x54
[ 1971.554021]  rcu_dump_cpu_stacks+0xec/0x130
[ 1971.691116]  rcu_sched_clock_irq+0x908/0xa40
[ 1971.826492]  update_process_times+0xa0/0x190
[ 1971.899642]  tick_sched_timer+0x5c/0xd0
[ 1971.964201]  __hrtimer_run_queues+0x140/0x32c
[ 1972.133174]  hrtimer_interrupt+0xf4/0x240
[ 1972.185289]  hv_stimer0_isr+0x28/0x30
[ 1972.278692]  hv_stimer0_percpu_isr+0x14/0x20
[ 1972.415058]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1972.495565]  handle_domain_irq+0x64/0x90
[ 1972.577916]  gic_handle_irq+0x58/0x128
[ 1972.695828]  call_on_irq_stack+0x20/0x38
[ 1972.740457]  do_interrupt_handler+0x54/0x5c
[ 1972.783890]  el1_interrupt+0x2c/0x4c
[ 1972.887542]  el1h_64_irq_handler+0x14/0x20
[ 1972.939049]  el1h_64_irq+0x74/0x78
[ 1972.993798]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1973.103571]  do_notify_resume+0xf8/0xeb0
[ 1973.162704]  el0_svc+0x3c/0x50
[ 1973.203463]  el0t_64_sync_handler+0x9c/0x120
[ 1973.282090]  el0t_64_sync+0x158/0x15c
The connection with the virtual machine or container was closed.

WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3085

@maxboone

@pmartincic any updates?

@awfulcooking

Still hangs after `wsl --update` and setting `gpuSupport = false` in `.wslconfig`.
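For anyone wanting to rule out the same thing: `gpuSupport` goes under the `[wsl2]` section of `%USERPROFILE%\.wslconfig` on the Windows side. A minimal sketch (not the exact file used above):

```ini
; %USERPROFILE%\.wslconfig — minimal sketch of the setting tried above
[wsl2]
gpuSupport=false
```

WSL has to be fully restarted (`wsl --shutdown`) before the file is re-read.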

WSL version: 2.0.14.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3155

@maxboone

maxboone commented Mar 2, 2024

I'm feeling like these RCU stalls are mainly happening for me after I've had my screen locked for a while, or even had my device suspend. Does that rhyme with experience others have?

@hwine
Author

hwine commented Mar 2, 2024

I'm feeling like these RCU stalls are mainly happening for me after I've had my screen locked for a while, or even had my device suspend. Does that rhyme with experience others have?

My initial experiences seemed to be linked to "fancy terminal usage" (e.g. tmux, nvim). See also #7913, #10667, and #9454. However, I have seen recent hangs with simply normal bash terminal idling, but have not checked if they are RCU related. (Passing that test is now my first "gate" for any wsl/win11 updates that might improve things enough to warrant more in-depth testing.)
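For anyone who hits a hang and wants to check the RCU angle afterwards, a quick way to count stall reports in the kernel log (a hypothetical helper, not from the thread; pipe `dmesg` into it from inside the distro):

```shell
# Count "rcu_sched self-detected stall" reports in kernel log text on stdin.
# grep -c exits non-zero when there are no matches, so keep the "0" with || true.
count_rcu_stalls() {
  grep -c 'rcu_sched self-detected stall' || true
}

# usage (inside WSL): dmesg | count_rcu_stalls
```

A non-zero count right after a freeze suggests the hang is the same RCU stall discussed here rather than, say, a vmmem or 9p issue.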

fwiw, there are some RCU tuning notes in that thread. IIRC, at the time it helped delay freezes, but not prevent them. Maybe they're worth another try now.
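As a sketch of what such tuning looks like under WSL2 (the exact knobs and values from that thread aren't reproduced here; `rcupdate.rcu_cpu_stall_timeout` is a documented kernel boot parameter, in seconds, and the value below is only an example), kernel parameters are passed via `kernelCommandLine` in `.wslconfig`:

```ini
; %USERPROFILE%\.wslconfig — example only; raises the RCU stall-warning timeout
[wsl2]
kernelCommandLine = rcupdate.rcu_cpu_stall_timeout=120
```

Note this only delays or quiets the stall warnings; it does not address whatever is starving the CPU in the first place, which matches the "delayed but not prevented" experience above.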

@maxboone

maxboone commented Mar 2, 2024

fwiw, there are some RCU tuning notes in that thread. IIRC, at the time it helped delay freezes, but not prevent them. Maybe they're worth another try now.

Good call!

I do indeed experience them often while using tmux, neovim, or vscode (which probably does fancy terminal stuff with its integrated terminal too). I'm hesitant to draw a conclusion there, though, since I'm using at least one of those most of the time I'm working in Linux.

I did that at the time, but it only gave (short-term) placebo results for me. I should note that I've been running some Hyper-V VMs for a couple of days now and haven't seen any issues there; however, I haven't done a lot of work in them either. I'll try working from a VM there for a while and see if similar issues pop up.

I'm currently trying to build a kernel from that VM to see if there might be any cross-compilation shenanigans causing this issue (not that I really expect that). Feel free to try it out too:

If that doesn't work, I'll rebuild it with debugging symbols enabled (and see if that adds more information to the RCU call stacks). I'll report back here in time.

@maxboone

maxboone commented Mar 2, 2024

Well, it's not much better, but it does seem to recover better with a newer kernel (whether locally or cross-compiled). Moreover, it seems I can force it to happen by letting my machine go into standby while it's running something that either polls or does some kind of file watching.

dmesg
[ 3488.823969] rcu: INFO: rcu_sched self-detected stall on CPU
[ 3488.825404] rcu:     6-....: (14979 ticks this GP) idle=125c/1/0x4000000000000000 softirq=34649/34649 fqs=7434
[ 3488.826547] rcu:     (t=15000 jiffies g=204277 q=1309 ncpus=8)
[ 3488.827311] CPU: 6 PID: 2530 Comm: node Not tainted 6.7.7-WSL2-STABLE+ #1
[ 3488.827430] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3488.827486] pc : clear_rseq_cs.isra.0+0x20/0x38
[ 3488.828123] lr : __rseq_handle_notify_resume+0x174/0x498
[ 3488.828126] sp : ffff80008338bc80
[ 3488.828127] x29: ffff80008338bcd0 x28: 000000000000080c x27: 0000000000000000
[ 3488.828130] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 3488.828132] x23: 0000ff03322cb6d8 x22: 0000000000000000 x21: 00000000ffffffff
[ 3488.828135] x20: ffff80008338beb0 x19: ffff000194cde740 x18: 0000000000000000
[ 3488.828137] x17: 0000000000000000 x16: 0000000000000000 x15: 0000fffa2c60c800
[ 3488.828139] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 3488.828141] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080258d8c
[ 3488.828143] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 3488.828145] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000fffffffffff8
[ 3488.828147] x2 : 0000000000000000 x1 : 0000ff0330e0f7e8 x0 : 0000000000000000
[ 3488.828174] Call trace:
[ 3488.828200]  clear_rseq_cs.isra.0+0x20/0x38
[ 3488.828203]  do_notify_resume+0x1d0/0xe70
[ 3488.828286]  el0_svc+0x90/0xb0
[ 3488.828411]  el0t_64_sync_handler+0x138/0x148
[ 3488.828413]  el0t_64_sync+0x14c/0x150
[ 3557.107643] br-1951cbf8ee7a: port 4(vetha101994) entered disabled state
[ 3557.108802] vethf736b24: renamed from eth0
[ 3557.172825] br-1951cbf8ee7a: port 4(vetha101994) entered disabled state
[ 3557.175231] vetha101994 (unregistering): left allmulticast mode
[ 3557.175262] vetha101994 (unregistering): left promiscuous mode
[ 3557.175268] br-1951cbf8ee7a: port 4(vetha101994) entered disabled state
[ 3829.870764] br-1951cbf8ee7a: port 4(veth2d06f9f) entered blocking state
[ 3829.870849] br-1951cbf8ee7a: port 4(veth2d06f9f) entered disabled state
[ 3829.870944] veth2d06f9f: entered allmulticast mode
[ 3829.871296] veth2d06f9f: entered promiscuous mode
[ 3830.501386] eth0: renamed from veth268a22a
[ 3830.534499] br-1951cbf8ee7a: port 4(veth2d06f9f) entered blocking state
[ 3830.534511] br-1951cbf8ee7a: port 4(veth2d06f9f) entered forwarding state
[ 4110.229251] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4110.230914] rcu:     3-....: (14990 ticks this GP) idle=9894/1/0x4000000000000000 softirq=73639/73639 fqs=6232
[ 4110.232326] rcu:     (t=15001 jiffies g=219869 q=1843 ncpus=8)
[ 4110.233456] CPU: 3 PID: 353 Comm: cron Not tainted 6.7.7-WSL2-STABLE+ #1
[ 4110.233462] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 4110.233464] pc : do_notify_resume+0x348/0xe70
[ 4110.233742] lr : do_notify_resume+0x2c8/0xe70
[ 4110.233744] sp : ffff8000834abd30
[ 4110.233745] x29: ffff8000834abe20 x28: 000000000000000d x27: 0000000000000000
[ 4110.233748] x26: 0000ffffcc8272f0 x25: 0000000000000000 x24: ffff000000c0d130
[ 4110.233751] x23: ffff8000834abdc0 x22: 0000000000000041 x21: 0000ffffcc827270
[ 4110.233753] x20: ffff000000c0c9c0 x19: ffff8000834abeb0 x18: 0000000000000000
[ 4110.233756] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000834abcd8
[ 4110.233758] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 4110.233760] x11: 0000000000000000 x10: 0000000000000ac0 x9 : ffff80008001b4f8
[ 4110.233762] x8 : ffff8000834abdb8 x7 : 0000000000000000 x6 : 0000000000000000
[ 4110.233765] x5 : 0001000000000000 x4 : 0000000000001250 x3 : 0000ffffcc829720
[ 4110.233767] x2 : 0000ffffcc8284c0 x1 : 0000ffffcc8284d0 x0 : 0000ffffcc8272f0
[ 4110.233770] Call trace:
[ 4110.233797]  do_notify_resume+0x348/0xe70
[ 4110.233801]  el0_svc+0x90/0xb0
[ 4110.233921]  el0t_64_sync_handler+0x138/0x148
[ 4110.233923]  el0t_64_sync+0x14c/0x150
[ 5001.903196] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5001.905317] rcu:     6-....: (14975 ticks this GP) idle=595c/1/0x4000000000000000 softirq=44771/44771 fqs=6495
[ 5001.906636] rcu:     (t=15001 jiffies g=237549 q=18838 ncpus=8)
[ 5001.907838] CPU: 6 PID: 1321 Comm: nvim Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5001.907843] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5001.907845] pc : do_epoll_wait.part.0+0x2ec/0x6f8
[ 5001.908261] lr : do_epoll_wait.part.0+0x238/0x6f8
[ 5001.908262] sp : ffff80008470bca0
[ 5001.908291] x29: ffff80008470bd40 x28: ffff000055332600 x27: ffff000055332618
[ 5001.908294] x26: 0000fffffffffffc x25: ffff0000193b3dd0 x24: 0000000000000400
[ 5001.908297] x23: 0000ffffc0ebd4a8 x22: ffff80008470bcf0 x21: 0000000000000000
[ 5001.908299] x20: ffff0000193b3d80 x19: ffff80008470bcd8 x18: 0000000000000000
[ 5001.908302] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 5001.908304] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5001.908333] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b5ce1c
[ 5001.908335] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 5001.908337] x5 : 0000ffffc0ebd4a8 x4 : 0000000000000000 x3 : 0000000000000000
[ 5001.908339] x2 : 000000000000001b x1 : 0000000000000000 x0 : 0000000000000004
[ 5001.908342] Call trace:
[ 5001.908370]  do_epoll_wait.part.0+0x2ec/0x6f8
[ 5001.908373]  do_epoll_pwait.part.0+0x38/0x120
[ 5001.908375]  __arm64_sys_epoll_pwait+0x78/0x138
[ 5001.908376]  invoke_syscall.constprop.0+0x54/0x128
[ 5001.908473]  do_el0_svc+0x44/0xf0
[ 5001.908474]  el0_svc+0x24/0xb0
[ 5001.908595]  el0t_64_sync_handler+0x138/0x148
[ 5001.908597]  el0t_64_sync+0x14c/0x150
[ 5168.075433] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5168.088496] rcu:     6-....: (59979 ticks this GP) idle=595c/1/0x4000000000000000 softirq=44771/44771 fqs=25910
[ 5168.091110] rcu:     (t=60010 jiffies g=237549 q=65516 ncpus=8)
[ 5168.092512] CPU: 6 PID: 1321 Comm: nvim Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5168.092518] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5168.092521] pc : do_epoll_wait.part.0+0x2ec/0x6f8
[ 5168.092529] lr : do_epoll_wait.part.0+0x238/0x6f8
[ 5168.092530] sp : ffff80008470bca0
[ 5168.092531] x29: ffff80008470bd40 x28: ffff000055332600 x27: ffff000055332618
[ 5168.092535] x26: 0000fffffffffffc x25: ffff0000193b3dd0 x24: 0000000000000400
[ 5168.092537] x23: 0000ffffc0ebd4a8 x22: ffff80008470bcf0 x21: 0000000000000000
[ 5168.092540] x20: ffff0000193b3d80 x19: ffff80008470bcd8 x18: 0000000000000000
[ 5168.092542] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 5168.092544] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5168.092547] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b5ce1c
[ 5168.092549] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 5168.092552] x5 : 0000ffffc0ebd4a8 x4 : 0000000000000000 x3 : 0000000000000000
[ 5168.092554] x2 : 000000000000001b x1 : 0000000000000000 x0 : 0000000000000004
[ 5168.092556] Call trace:
[ 5168.092558]  do_epoll_wait.part.0+0x2ec/0x6f8
[ 5168.092560]  do_epoll_pwait.part.0+0x38/0x120
[ 5168.092562]  __arm64_sys_epoll_pwait+0x78/0x138
[ 5168.092564]  invoke_syscall.constprop.0+0x54/0x128
[ 5168.092570]  do_el0_svc+0x44/0xf0
[ 5168.092572]  el0_svc+0x24/0xb0
[ 5168.092576]  el0t_64_sync_handler+0x138/0x148
[ 5168.092578]  el0t_64_sync+0x14c/0x150
[ 5334.258791] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5334.260000] rcu:     6-....: (104955 ticks this GP) idle=595c/1/0x4000000000000000 softirq=44771/44771 fqs=44963
[ 5334.261527] rcu:     (t=105014 jiffies g=237549 q=112023 ncpus=8)
[ 5334.262403] CPU: 6 PID: 1321 Comm: nvim Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5334.262408] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5334.262410] pc : do_epoll_wait.part.0+0x2ec/0x6f8
[ 5334.262464] lr : do_epoll_wait.part.0+0x238/0x6f8
[ 5334.262465] sp : ffff80008470bca0
[ 5334.262466] x29: ffff80008470bd40 x28: ffff000055332600 x27: ffff000055332618
[ 5334.262469] x26: 0000fffffffffffc x25: ffff0000193b3dd0 x24: 0000000000000400
[ 5334.262472] x23: 0000ffffc0ebd4a8 x22: ffff80008470bcf0 x21: 0000000000000000
[ 5334.262474] x20: ffff0000193b3d80 x19: ffff80008470bcd8 x18: 0000000000000000
[ 5334.262477] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 5334.262479] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5334.262481] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b5ce1c
[ 5334.262484] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 5334.262486] x5 : 0000ffffc0ebd4a8 x4 : 0000000000000000 x3 : 0000000000000000
[ 5334.262519] x2 : 000000000000001b x1 : 0000000000000000 x0 : 0000000000000004
[ 5334.262522] Call trace:
[ 5334.262524]  do_epoll_wait.part.0+0x2ec/0x6f8
[ 5334.262526]  do_epoll_pwait.part.0+0x38/0x120
[ 5334.262528]  __arm64_sys_epoll_pwait+0x78/0x138
[ 5334.262529]  invoke_syscall.constprop.0+0x54/0x128
[ 5334.262533]  do_el0_svc+0x44/0xf0
[ 5334.262535]  el0_svc+0x24/0xb0
[ 5334.262538]  el0t_64_sync_handler+0x138/0x148
[ 5334.262540]  el0t_64_sync+0x14c/0x150
[ 5438.715315] systemd-journald: systemd-journal: potentially unexpected fatal signal 6.
[ 5438.715470] CPU: 2 PID: 195 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5438.715482] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 5438.715489] pc : 0000fff48dec9df8
[ 5438.715492] lr : 0000fff48dec9dd8
[ 5438.715557] sp : 0000fffff35eee70
[ 5438.715560] x29: 0000fffff35eee70 x28: 0000fffff35ef4d0 x27: 0000fffff35ef110
[ 5438.715571] x26: 0000fffff35eef08 x25: 000000000000004b x24: 0000000000000000
[ 5438.715579] x23: 0000000000000000 x22: 0000000000000000 x21: 0000fff48c20eff0
[ 5438.715588] x20: 0000000000001a99 x19: 0000000000000109 x18: 0000000000000000
[ 5438.715600] x17: 0000fff48deceda0 x16: 0000000000000001 x15: ab4d18e2501f5b66
[ 5438.715610] x14: 000000000038fe48 x13: 084c9107e1e25704 x12: 000000000038fda0
[ 5438.715618] x11: b8e208097994f7d5 x10: 000000000047fa48 x9 : 0000000000000000
[ 5438.715627] x8 : 0000000000000062 x7 : 0000000000000000 x6 : 0000000000000000
[ 5438.715635] x5 : 00000000ffffffff x4 : 0000000000000000 x3 : 0000000000000000
[ 5438.715643] x2 : 0000000000001a99 x1 : 0000000000000109 x0 : 0000fff48c20eff0
[ 5500.427396] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5500.428731] rcu:     6-....: (149957 ticks this GP) idle=595c/1/0x4000000000000000 softirq=44771/44771 fqs=64339
[ 5500.430529] rcu:     (t=150018 jiffies g=237549 q=158161 ncpus=8)
[ 5500.431664] CPU: 6 PID: 1321 Comm: nvim Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5500.431675] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5500.431682] pc : do_epoll_wait.part.0+0x2ec/0x6f8
[ 5500.431696] lr : do_epoll_wait.part.0+0x238/0x6f8
[ 5500.431701] sp : ffff80008470bca0
[ 5500.431704] x29: ffff80008470bd40 x28: ffff000055332600 x27: ffff000055332618
[ 5500.431714] x26: 0000fffffffffffc x25: ffff0000193b3dd0 x24: 0000000000000400
[ 5500.431819] x23: 0000ffffc0ebd4a8 x22: ffff80008470bcf0 x21: 0000000000000000
[ 5500.431828] x20: ffff0000193b3d80 x19: ffff80008470bcd8 x18: 0000000000000000
[ 5500.431837] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 5500.431846] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5500.431854] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b5ce1c
[ 5500.431862] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 5500.431870] x5 : 0000ffffc0ebd4a8 x4 : 0000000000000000 x3 : 0000000000000000
[ 5500.431879] x2 : 000000000000001b x1 : 0000000000000000 x0 : 0000000000000004
[ 5500.431888] Call trace:
[ 5500.431891]  do_epoll_wait.part.0+0x2ec/0x6f8
[ 5500.431899]  do_epoll_pwait.part.0+0x38/0x120
[ 5500.431904]  __arm64_sys_epoll_pwait+0x78/0x138
[ 5500.431909]  invoke_syscall.constprop.0+0x54/0x128
[ 5500.431917]  do_el0_svc+0x44/0xf0
[ 5500.431923]  el0_svc+0x24/0xb0
[ 5500.431929]  el0t_64_sync_handler+0x138/0x148
[ 5500.431935]  el0t_64_sync+0x14c/0x150
[ 5666.599687] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5666.601114] rcu:     6-....: (194955 ticks this GP) idle=595c/1/0x4000000000000000 softirq=44771/44771 fqs=83882
[ 5666.602695] rcu:     (t=195023 jiffies g=237549 q=203648 ncpus=8)
[ 5666.603842] CPU: 6 PID: 1321 Comm: nvim Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5666.603853] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5666.603860] pc : do_epoll_wait.part.0+0x2ec/0x6f8
[ 5666.603874] lr : do_epoll_wait.part.0+0x238/0x6f8
[ 5666.603879] sp : ffff80008470bca0
[ 5666.603882] x29: ffff80008470bd40 x28: ffff000055332600 x27: ffff000055332618
[ 5666.603893] x26: 0000fffffffffffc x25: ffff0000193b3dd0 x24: 0000000000000400
[ 5666.603903] x23: 0000ffffc0ebd4a8 x22: ffff80008470bcf0 x21: 0000000000000000
[ 5666.603911] x20: ffff0000193b3d80 x19: ffff80008470bcd8 x18: 0000000000000000
[ 5666.604048] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 5666.604057] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5666.604066] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b5ce1c
[ 5666.604074] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 5666.604082] x5 : 0000ffffc0ebd4a8 x4 : 0000000000000000 x3 : 0000000000000000
[ 5666.604091] x2 : 000000000000001b x1 : 0000000000000000 x0 : 0000000000000004
[ 5666.604100] Call trace:
[ 5666.604104]  do_epoll_wait.part.0+0x2ec/0x6f8
[ 5666.604110]  do_epoll_pwait.part.0+0x38/0x120
[ 5666.604115]  __arm64_sys_epoll_pwait+0x78/0x138
[ 5666.604122]  invoke_syscall.constprop.0+0x54/0x128
[ 5666.604131]  do_el0_svc+0x44/0xf0
[ 5666.604138]  el0_svc+0x24/0xb0
[ 5666.604145]  el0t_64_sync_handler+0x138/0x148
[ 5666.604151]  el0t_64_sync+0x14c/0x150
[ 5832.768305] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5832.769532] rcu:     6-....: (239703 ticks this GP) idle=595c/1/0x4000000000000000 softirq=44771/44771 fqs=85733
[ 5832.771313] rcu:     (t=240027 jiffies g=237549 q=208202 ncpus=8)
[ 5832.773553] rcu: rcu_sched kthread starved for 40626 jiffies! g237549 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=3
[ 5832.775413] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 5832.776782] rcu: RCU grace-period kthread stack dump:
[ 5832.777585] task:rcu_sched       state:R  running task     stack:0     pid:16    tgid:16    ppid:2      flags:0x00000008
[ 5832.777697] Call trace:
[ 5832.777700]  __switch_to+0xa4/0xe0
[ 5832.777717]  __schedule+0x360/0xd78
[ 5832.777723]  schedule+0x2c/0x110
[ 5832.777729]  schedule_timeout+0xa0/0x1a8
[ 5832.777797]  rcu_gp_fqs_loop+0xf0/0x4c0
[ 5832.777911]  rcu_gp_kthread+0x11c/0x158
[ 5832.777917]  kthread+0xe0/0xf0
[ 5832.777980]  ret_from_fork+0x10/0x20
[ 5832.777988] rcu: Stack dump where RCU GP kthread last ran:
[ 5832.778872] Sending NMI from CPU 6 to CPUs 3:
[ 5832.778926] NMI backtrace for cpu 3
[ 5832.778939] CPU: 3 PID: 430 Comm: containerd Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5832.778947] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5832.778953] pc : clear_rseq_cs.isra.0+0x20/0x38
[ 5832.779136] lr : __rseq_handle_notify_resume+0x174/0x498
[ 5832.779141] sp : ffff80008344bc80
[ 5832.779144] x29: ffff80008344bcd0 x28: 000000000000000c x27: 0000000000000000
[ 5832.779154] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 5832.779163] x23: 0000ab1793fcfe20 x22: 0000000000000000 x21: 00000000ffffffff
[ 5832.779171] x20: ffff80008344beb0 x19: ffff000004610000 x18: 0000000000000000
[ 5832.779180] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 5832.779189] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5832.779197] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080258d8c
[ 5832.779205] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 5832.779213] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000fffffffffff8
[ 5832.779221] x2 : 0000000000000000 x1 : 0000ffff88c0f8a8 x0 : 0000000000000000
[ 5832.779230] Call trace:
[ 5832.779233]  clear_rseq_cs.isra.0+0x20/0x38
[ 5832.779238]  do_notify_resume+0x1d0/0xe70
[ 5832.779246]  el0_svc+0x90/0xb0
[ 5832.779251]  el0t_64_sync_handler+0x138/0x148
[ 5832.779256]  el0t_64_sync+0x14c/0x150
[ 5832.779918] CPU: 6 PID: 1321 Comm: nvim Not tainted 6.7.7-WSL2-STABLE+ #1
[ 5832.779922] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5832.779924] pc : do_epoll_wait.part.0+0x2ec/0x6f8
[ 5832.779929] lr : do_epoll_wait.part.0+0x238/0x6f8
[ 5832.779930] sp : ffff80008470bca0
[ 5832.779931] x29: ffff80008470bd40 x28: ffff000055332600 x27: ffff000055332618
[ 5832.779934] x26: 0000fffffffffffc x25: ffff0000193b3dd0 x24: 0000000000000400
[ 5832.779937] x23: 0000ffffc0ebd4a8 x22: ffff80008470bcf0 x21: 0000000000000000
[ 5832.779939] x20: ffff0000193b3d80 x19: ffff80008470bcd8 x18: 0000000000000000
[ 5832.779974] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 5832.779976] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5832.780009] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b5ce1c
[ 5832.780012] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 5832.780014] x5 : 0000ffffc0ebd4a8 x4 : 0000000000000000 x3 : 0000000000000000
[ 5832.780015] x2 : 000000000000001b x1 : 0000000000000000 x0 : 0000000000000004
[ 5832.780018] Call trace:
[ 5832.780019]  do_epoll_wait.part.0+0x2ec/0x6f8
[ 5832.780021]  do_epoll_pwait.part.0+0x38/0x120
[ 5832.780022]  __arm64_sys_epoll_pwait+0x78/0x138
[ 5832.780024]  invoke_syscall.constprop.0+0x54/0x128
[ 5832.780027]  do_el0_svc+0x44/0xf0
[ 5832.780029]  el0_svc+0x24/0xb0
[ 5832.780031]  el0t_64_sync_handler+0x138/0x148
[ 5832.780032]  el0t_64_sync+0x14c/0x150
[ 5906.221529] systemd-journald[6821]: File /var/log/journal/4f2f70649e9e4776a9f1716c9737e494/system.journal corrupted or uncleanly shut down, renaming and replacing.
[ 6307.581091] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6307.582514] rcu:     4-....: (15001 ticks this GP) idle=a414/1/0x4000000000000000 softirq=46243/46243 fqs=6470
[ 6307.583963] rcu:     (t=15001 jiffies g=249165 q=3624 ncpus=8)
[ 6307.584810] CPU: 4 PID: 6088 Comm: zsh Not tainted 6.7.7-WSL2-STABLE+ #1
[ 6307.584816] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6307.584818] pc : __arch_copy_to_user+0xbc/0x240
[ 6307.585195] lr : tty_ioctl+0x200/0xa20
[ 6307.585422] sp : ffff800084c4bc90
[ 6307.585423] x29: ffff800084c4bd10 x28: ffff0000006be740 x27: 0000000000000000
[ 6307.585426] x26: 0000000000000000 x25: ffff00001904f000 x24: 0000ab36ba6c1a74
[ 6307.585429] x23: 0000000000005413 x22: ffff800080b54ac0 x21: ffff000020dffc00
[ 6307.585432] x20: ffff000020dffc00 x19: ffff00001904f0e8 x18: 0000000000000000
[ 6307.585434] x17: 0000000000000000 x16: 0000000000000000 x15: ffff00001904f1b0
[ 6307.585436] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 6307.585438] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b54fd0
[ 6307.585441] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000ab36ba6c1a74
[ 6307.585443] x5 : 0000ab36ba6c1a7c x4 : 0000000000000000 x3 : 000000000094002e
[ 6307.585446] x2 : 0000000000000008 x1 : ffff00001904f1b8 x0 : 0000ab36ba6c1a74
[ 6307.585448] Call trace:
[ 6307.585476]  __arch_copy_to_user+0xbc/0x240
[ 6307.585480]  __arm64_sys_ioctl+0x38c/0xa70
[ 6307.585696]  invoke_syscall.constprop.0+0x54/0x128
[ 6307.585793]  do_el0_svc+0x44/0xf0
[ 6307.585795]  el0_svc+0x24/0xb0
[ 6307.585821]  el0t_64_sync_handler+0x138/0x148
[ 6307.585823]  el0t_64_sync+0x14c/0x150
[ 6473.753316] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6473.754448] rcu:     4-....: (59932 ticks this GP) idle=a414/1/0x4000000000000000 softirq=46243/46243 fqs=25367
[ 6473.756284] rcu:     (t=60006 jiffies g=249165 q=9163 ncpus=8)
[ 6473.757194] CPU: 4 PID: 6088 Comm: zsh Not tainted 6.7.7-WSL2-STABLE+ #1
[ 6473.757247] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6473.757249] pc : __arch_copy_to_user+0xbc/0x240
[ 6473.757260] lr : tty_ioctl+0x200/0xa20
[ 6473.757266] sp : ffff800084c4bc90
[ 6473.757267] x29: ffff800084c4bd10 x28: ffff0000006be740 x27: 0000000000000000
[ 6473.757271] x26: 0000000000000000 x25: ffff00001904f000 x24: 0000ab36ba6c1a74
[ 6473.757273] x23: 0000000000005413 x22: ffff800080b54ac0 x21: ffff000020dffc00
[ 6473.757276] x20: ffff000020dffc00 x19: ffff00001904f0e8 x18: 0000000000000000
[ 6473.757278] x17: 0000000000000000 x16: 0000000000000000 x15: ffff00001904f1b0
[ 6473.757280] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 6473.757283] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080b54fd0
[ 6473.757313] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000ab36ba6c1a74
[ 6473.757315] x5 : 0000ab36ba6c1a7c x4 : 0000000000000000 x3 : 000000000094002e
[ 6473.757318] x2 : 0000000000000008 x1 : ffff00001904f1b8 x0 : 0000ab36ba6c1a74
[ 6473.757320] Call trace:
[ 6473.757322]  __arch_copy_to_user+0xbc/0x240
[ 6473.757325]  __arm64_sys_ioctl+0x38c/0xa70
[ 6473.757329]  invoke_syscall.constprop.0+0x54/0x128
[ 6473.757333]  do_el0_svc+0x44/0xf0
[ 6473.757335]  el0_svc+0x24/0xb0
[ 6473.757336]  el0t_64_sync_handler+0x138/0x148
[ 6473.757338]  el0t_64_sync+0x14c/0x150
[ 6530.005623] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6530.006820] rcu:     2-....: (15000 ticks this GP) idle=4094/1/0x4000000000000000 softirq=61611/61611 fqs=6377
[ 6530.008333] rcu:     (t=15001 jiffies g=249169 q=11009 ncpus=8)
[ 6530.009444] CPU: 2 PID: 361 Comm: systemd-logind Not tainted 6.7.7-WSL2-STABLE+ #1
[ 6530.009448] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6530.009450] pc : __arch_copy_to_user+0x60/0x240
[ 6530.009461] lr : inotify_read+0x260/0x3d8
[ 6530.009598] sp : ffff800082c4bc70
[ 6530.009599] x29: ffff800082c4bce0 x28: ffff0000049fe740 x27: 0000000000000010
[ 6530.009603] x26: ffff000004cbd220 x25: ffff00001a318e60 x24: 0000000000000110
[ 6530.009605] x23: 0000000000000001 x22: 0000ab287a102ad0 x21: ffff000004cbd200
[ 6530.009608] x20: ffff000004cbd20c x19: 0000ab287a102ad0 x18: 0000000000000000
[ 6530.009610] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800082c4bcc8
[ 6530.009612] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 6530.009615] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000803a05b4
[ 6530.009617] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000ab287a102ad0
[ 6530.009619] x5 : 0000ab287a102ae0 x4 : 0000000000000008 x3 : 0000000200000001
[ 6530.009622] x2 : 0000000000000008 x1 : ffff800082c4bcd0 x0 : 0000ab287a102ad0
[ 6530.009625] Call trace:
[ 6530.009626]  __arch_copy_to_user+0x60/0x240
[ 6530.009629]  vfs_read+0xb0/0x280
[ 6530.009663]  ksys_read+0x74/0x128
[ 6530.009664]  __arm64_sys_read+0x20/0x40
[ 6530.009666]  invoke_syscall.constprop.0+0x54/0x128
[ 6530.009669]  do_el0_svc+0xcc/0xf0
[ 6530.009701]  el0_svc+0x24/0xb0
[ 6530.009703]  el0t_64_sync_handler+0x138/0x148
[ 6530.009705]  el0t_64_sync+0x14c/0x150
[ 6530.009708] Sending NMI from CPU 2 to CPUs 5:
[ 6530.009743] NMI backtrace for cpu 5
[ 6530.009747] CPU: 5 PID: 5120 Comm: node Not tainted 6.7.7-WSL2-STABLE+ #1
[ 6530.009749] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6530.009752] pc : clear_rseq_cs.isra.0+0x20/0x38
[ 6530.009815] lr : __rseq_handle_notify_resume+0x174/0x498
[ 6530.009817] sp : ffff800089cebc80
[ 6530.009818] x29: ffff800089cebcd0 x28: 000000000000000c x27: 0000000000000000
[ 6530.009821] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 6530.009823] x23: 0000ff75f82c9dfc x22: 0000000000000000 x21: 00000000ffffffff
[ 6530.009826] x20: ffff800089cebeb0 x19: ffff000011cfe740 x18: 0000000000000000
[ 6530.009828] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ff6665a0c780
[ 6530.009831] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 6530.009833] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080258d8c
[ 6530.009835] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 6530.009837] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000fffffffffff8
[ 6530.009840] x2 : 0000000000000000 x1 : 0000ff75f6e0f7e8 x0 : 0000000000000000
[ 6530.009842] Call trace:
[ 6530.009843]  clear_rseq_cs.isra.0+0x20/0x38
[ 6530.009845]  do_notify_resume+0x1d0/0xe70
[ 6530.009850]  el0_svc+0x90/0xb0
[ 6530.009852]  el0t_64_sync_handler+0x138/0x148
[ 6530.009853]  el0t_64_sync+0x14c/0x150
[ 6696.177931] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6696.179032] rcu:     5-....: (60005 ticks this GP) idle=3734/1/0x4000000000000000 softirq=60589/60589 fqs=25771
[ 6696.180732] rcu:     (t=60006 jiffies g=249169 q=15239 ncpus=8)
[ 6696.181856] Sending NMI from CPU 5 to CPUs 2:
[ 6696.181867] NMI backtrace for cpu 2
[ 6696.181871] CPU: 2 PID: 361 Comm: systemd-logind Not tainted 6.7.7-WSL2-STABLE+ #1
[ 6696.181875] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6696.181877] pc : __arch_copy_to_user+0x60/0x240
[ 6696.181888] lr : inotify_read+0x260/0x3d8
[ 6696.181896] sp : ffff800082c4bc70
[ 6696.181897] x29: ffff800082c4bce0 x28: ffff0000049fe740 x27: 0000000000000010
[ 6696.181900] x26: ffff000004cbd220 x25: ffff00001a318e60 x24: 0000000000000110
[ 6696.181903] x23: 0000000000000001 x22: 0000ab287a102ad0 x21: ffff000004cbd200
[ 6696.181905] x20: ffff000004cbd20c x19: 0000ab287a102ad0 x18: 0000000000000000
[ 6696.181907] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800082c4bcc8
[ 6696.181960] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 6696.181962] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000803a05b4
[ 6696.181964] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000ab287a102ad0
[ 6696.181966] x5 : 0000ab287a102ae0 x4 : 0000000000000008 x3 : 0000000200000001
[ 6696.181968] x2 : 0000000000000008 x1 : ffff800082c4bcd0 x0 : 0000ab287a102ad0
[ 6696.181971] Call trace:
[ 6696.182001]  __arch_copy_to_user+0x60/0x240
[ 6696.182005]  vfs_read+0xb0/0x280
[ 6696.182008]  ksys_read+0x74/0x128
[ 6696.182009]  __arm64_sys_read+0x20/0x40
[ 6696.182011]  invoke_syscall.constprop.0+0x54/0x128
[ 6696.182015]  do_el0_svc+0xcc/0xf0
[ 6696.182017]  el0_svc+0x24/0xb0
[ 6696.182019]  el0t_64_sync_handler+0x138/0x148
[ 6696.182021]  el0t_64_sync+0x14c/0x150
[ 6696.182865] CPU: 5 PID: 5120 Comm: node Not tainted 6.7.7-WSL2-STABLE+ #1
[ 6696.182868] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6696.182870] pc : clear_rseq_cs.isra.0+0x20/0x38
[ 6696.182874] lr : __rseq_handle_notify_resume+0x174/0x498
[ 6696.182876] sp : ffff800089cebc80
[ 6696.182877] x29: ffff800089cebcd0 x28: 000000000000000c x27: 0000000000000000
[ 6696.182880] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 6696.182883] x23: 0000ff75f82c9dfc x22: 0000000000000000 x21: 00000000ffffffff
[ 6696.182885] x20: ffff800089cebeb0 x19: ffff000011cfe740 x18: 0000000000000000
[ 6696.182888] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ff6665a0c780
[ 6696.182890] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 6696.182892] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080258d8c
[ 6696.182895] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 6696.182897] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000fffffffffff8
[ 6696.182900] x2 : 0000000000000000 x1 : 0000ff75f6e0f7e8 x0 : 0000000000000000
[ 6696.182902] Call trace:
[ 6696.182903]  clear_rseq_cs.isra.0+0x20/0x38
[ 6696.182905]  do_notify_resume+0x1d0/0xe70
[ 6696.182909]  el0_svc+0x90/0xb0
[ 6696.182910]  el0t_64_sync_handler+0x138/0x148
[ 6696.182912]  el0t_64_sync+0x14c/0x150

@david-nordvall

I'm feeling like these RCU stalls are mainly happening for me after I've had my screen locked for a while, or even had my device suspend. Does that rhyme with experience others have?

Yes, if my computer is suspended, I encounter this issue 100% of the time. It is, however, far from the only situation in which I encounter it. My use case is running dev containers in Docker with VS Code (typically C#/.NET plus Node frontend stuff). If I work on really simple things (one project, a couple of source files), I can work maybe an hour before issues (stalls, slowdowns) start to be a real problem. But if I work on more complex stuff (5-10 projects that compile in parallel, tens of thousands of lines of code), I can barely get VS Code to open and activate all extensions before WSL (and the dev container) is completely unusable.

@maxboone

maxboone commented Mar 4, 2024

I cannot seem to reproduce the problem on Hyper-V running stock Ubuntu 23.10, and I'm not sure what the real difference is, other than that WSL uses a different kernel.

It looks like there are some differences in the CPU that each virtual machine exposes: the caches and the listed feature flags differ.

CPU info of WSL2 VM
➜  ~ dmesg | grep CPU
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x51df804e]
[    0.000000] Detected VIPT I-cache on CPU0
[    0.000000] CPU features: detected: GIC system register CPU interface
[    0.000000] CPU features: kernel page table isolation disabled by kernel configuration
[    0.000000] CPU features: detected: ARM errata 1165522, 1319367, or 1530923
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=8.
[    0.000000] GICv3: CPU0: found redistributor 0 region 0:0x00000000effee000
[    0.294328] smp: Bringing up secondary CPUs ...
[    0.307988] CPU features: detected: Spectre-v2
[    0.307995] CPU features: detected: Spectre-v4
[    0.308039] Detected VIPT I-cache on CPU1
[    0.308101] GICv3: CPU1: found redistributor 1 region 1:0x00000000f000e000
[    0.308201] CPU1: Booted secondary processor 0x0000000001 [0x51df804e]
[    0.308540] Detected VIPT I-cache on CPU2
[    0.308572] GICv3: CPU2: found redistributor 2 region 2:0x00000000f002e000
[    0.308661] CPU2: Booted secondary processor 0x0000000002 [0x51df804e]
[    0.308961] Detected VIPT I-cache on CPU3
[    0.308999] GICv3: CPU3: found redistributor 3 region 3:0x00000000f004e000
[    0.309087] CPU3: Booted secondary processor 0x0000000003 [0x51df804e]
[    0.309389] Detected VIPT I-cache on CPU4
[    0.309431] GICv3: CPU4: found redistributor 4 region 4:0x00000000f006e000
[    0.309518] CPU4: Booted secondary processor 0x0000000004 [0x51df804e]
[    0.309825] Detected VIPT I-cache on CPU5
[    0.309874] GICv3: CPU5: found redistributor 5 region 5:0x00000000f008e000
[    0.309960] CPU5: Booted secondary processor 0x0000000005 [0x51df804e]
[    0.310248] Detected VIPT I-cache on CPU6
[    0.310302] GICv3: CPU6: found redistributor 6 region 6:0x00000000f00ae000
[    0.310388] CPU6: Booted secondary processor 0x0000000006 [0x51df804e]
[    0.311743] Detected VIPT I-cache on CPU7
[    0.311804] GICv3: CPU7: found redistributor 7 region 7:0x00000000f00ce000
[    0.311890] CPU7: Booted secondary processor 0x0000000007 [0x51df804e]
[    0.313021] smp: Brought up 1 node, 8 CPUs
[    0.644306] CPU features: detected: 32-bit EL0 Support
[    0.657151] CPU features: detected: CRC32 instructions
[    0.670127] CPU features: detected: RCpc load-acquire (LDAPR)
[    0.684748] CPU features: detected: LSE atomic instructions
[    0.697564] CPU features: detected: Privileged Access Never
[    0.733989] CPU features: detected: Hardware dirty bit management on CPU1-2,4
[    0.751305] CPU: All CPU(s) started at EL1
[    2.963616] No ACPI PMU IRQ for CPU0
[    2.964194] No ACPI PMU IRQ for CPU1
[    2.964841] No ACPI PMU IRQ for CPU2
[    2.965283] No ACPI PMU IRQ for CPU3
[    2.965691] No ACPI PMU IRQ for CPU4
[    2.966302] No ACPI PMU IRQ for CPU5
[    2.966895] No ACPI PMU IRQ for CPU6
[    2.967408] No ACPI PMU IRQ for CPU7
➜  ~ lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              Qualcomm
  Model:                14
  Thread(s) per core:   1
  Core(s) per socket:   1
  Socket(s):            1
  Stepping:             0xd
  BogoMIPS:             38.40
  Flags:                fp asimd aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
  Model:                14
  Thread(s) per core:   1
  Core(s) per socket:   7
  Socket(s):            1
  Stepping:             0xd
  BogoMIPS:             38.40
  Flags:                fp asimd aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
  L1d:                  256 KiB (8 instances)
  L1i:                  256 KiB (8 instances)
  L2:                   1 MiB (8 instances)
  L3:                   4 MiB (1 instance)
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Vulnerable
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; Branch predictor hardening
  Srbds:                Not affected
  Tsx async abort:      Not affected
CPU info of Hyper-V VM
root@ubuntu0:~# dmesg | grep CPU
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x51df804e]
[    0.000000] Detected VIPT I-cache on CPU0
[    0.000000] CPU features: detected: GIC system register CPU interface
[    0.000000] CPU features: detected: Hardware dirty bit management
[    0.000000] CPU features: detected: Spectre-v2
[    0.000000] CPU features: detected: Spectre-v4
[    0.000000] CPU features: kernel page table isolation forced ON by KASLR
[    0.000000] CPU features: detected: Kernel page table isolation (KPTI)
[    0.000000] CPU features: detected: ARM erratum 1418040
[    0.000000] CPU features: detected: Qualcomm erratum 1009, or ARM erratum 1286807, 2441009
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=8.
[    0.000000] GICv3: CPU0: found redistributor 0 region 0:0x00000000effee000
[    0.001753] smp: Bringing up secondary CPUs ...
[    0.002120] Detected VIPT I-cache on CPU1
[    0.002190] GICv3: CPU1: found redistributor 1 region 1:0x00000000f000e000
[    0.002285] CPU1: Booted secondary processor 0x0000000001 [0x51df804e]
[    0.002729] Detected VIPT I-cache on CPU2
[    0.002782] GICv3: CPU2: found redistributor 2 region 2:0x00000000f002e000
[    0.003386] CPU2: Booted secondary processor 0x0000000002 [0x51df805e]
[    0.003823] Detected VIPT I-cache on CPU3
[    0.003880] GICv3: CPU3: found redistributor 3 region 3:0x00000000f004e000
[    0.003961] CPU3: Booted secondary processor 0x0000000003 [0x51df804e]
[    0.004327] Detected VIPT I-cache on CPU4
[    0.004387] GICv3: CPU4: found redistributor 4 region 4:0x00000000f006e000
[    0.004479] CPU4: Booted secondary processor 0x0000000004 [0x51df804e]
[    0.004788] Detected VIPT I-cache on CPU5
[    0.004915] GICv3: CPU5: found redistributor 5 region 5:0x00000000f008e000
[    0.005144] CPU5: Booted secondary processor 0x0000000005 [0x51df805e]
[    0.005658] Detected VIPT I-cache on CPU6
[    0.005789] GICv3: CPU6: found redistributor 6 region 6:0x00000000f00ae000
[    0.005866] CPU6: Booted secondary processor 0x0000000006 [0x51df805e]
[    0.006169] Detected VIPT I-cache on CPU7
[    0.006255] GICv3: CPU7: found redistributor 7 region 7:0x00000000f00ce000
[    0.006322] CPU7: Booted secondary processor 0x0000000007 [0x51df804e]
[    0.012949] smp: Brought up 1 node, 8 CPUs
[    0.012983] CPU features: detected: 32-bit EL0 Support
[    0.012987] CPU features: detected: CRC32 instructions
[    0.012989] CPU features: detected: Data cache clean to Point of Persistence
[    0.012993] CPU features: detected: RCpc load-acquire (LDAPR)
[    0.012996] CPU features: detected: LSE atomic instructions
[    0.012998] CPU features: detected: Privileged Access Never
[    0.024078] CPU: All CPU(s) started at EL1
[    0.570000] ledtrig-cpu: registered to indicate activity on CPUs
[    0.570310] No ACPI PMU IRQ for CPU0
[    0.570315] No ACPI PMU IRQ for CPU1
[    0.570317] No ACPI PMU IRQ for CPU2
[    0.570318] No ACPI PMU IRQ for CPU3
[    0.570320] No ACPI PMU IRQ for CPU4
[    0.570322] No ACPI PMU IRQ for CPU5
[    0.570323] No ACPI PMU IRQ for CPU6
[    0.570325] No ACPI PMU IRQ for CPU7
root@ubuntu0:~# lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              Qualcomm
  BIOS Vendor ID:       Qualcomm Technologies Inc
  Model name:           Kryo-4XX-Gold
    BIOS Model name:    Microsoft SQ2 @ 3.15 GHz None CPU @ 1.5GHz
    BIOS CPU family:    280
    Model:              14
    Thread(s) per core: 1
    Core(s) per socket: 8
    Socket(s):          1
    Stepping:           0xd
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
  L1d:                  512 KiB (8 instances)
  L1i:                  512 KiB (8 instances)
  L2:                   4 MiB (8 instances)
  L3:                   4 MiB (1 instance)
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-7
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Mitigation; PTI
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; Branch predictor hardening
  Srbds:                Not affected
  Tsx async abort:      Not affected
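To compare the two environments systematically rather than by eye, here is a small bash sketch that diffs the feature flags reported by `lscpu`. The filenames `wsl2-lscpu.txt` and `hyperv-lscpu.txt` are assumptions; save the `lscpu` output from each VM there first.

```shell
#!/usr/bin/env bash
# Extract the unique CPU feature flags from a saved `lscpu` dump,
# one flag per line, sorted and deduplicated (the WSL2 dump repeats
# the Flags line for each core group, so dedup matters).
extract_flags() {
    grep 'Flags:' "$1" | sed 's/.*Flags:[[:space:]]*//' | tr ' ' '\n' | sort -u
}

# Usage, after capturing lscpu output from each VM:
#   comm -3 <(extract_flags wsl2-lscpu.txt) <(extract_flags hyperv-lscpu.txt)
# comm -3 suppresses flags common to both files; what remains is the
# per-environment difference (column 1 = first file only, column 2 =
# second file only). In the dumps above, the Hyper-V guest reports
# `evtstrm` while the WSL2 guest does not.
```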

@maxboone

maxboone commented Mar 5, 2024

Edited: it didn't work, I cheered too soon.

On the advice of a thread on the RCU kernel mailing list, I compiled the kernel with CONFIG_RSEQ=n. It has been 20 minutes of heavy tmux and neovim use, going into suspend and resuming, without an RCU stall so far. Too soon to tell, but I'm hopeful.
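For anyone testing a rebuilt kernel, here is a quick sketch to confirm whether restartable sequences were actually compiled out. It assumes the kernel exposes its build config at /proc/config.gz (i.e. CONFIG_IKCONFIG_PROC is enabled, which the WSL2 kernel config does by default); an explicit config file can be passed instead.

```shell
#!/usr/bin/env bash
# Report whether CONFIG_RSEQ is enabled in a kernel config.
# Reads the file given as $1, or falls back to the running
# kernel's /proc/config.gz when no argument is given.
rseq_status() {
    local cfg="${1:-}"
    if [ -n "$cfg" ]; then
        grep -E '^CONFIG_RSEQ=' "$cfg" || echo 'CONFIG_RSEQ is not set'
    else
        zcat /proc/config.gz | grep -E '^CONFIG_RSEQ=' \
            || echo 'CONFIG_RSEQ is not set'
    fi
}
```

Running `rseq_status` inside the guest on the stock kernel should print `CONFIG_RSEQ=y`; on a kernel rebuilt as described above it should report that the option is not set.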

Feel free to test:

With RSEQ disabled, the crashes still occur; however, it takes considerably longer before they do, so it might still be worth trying the linked kernel. Unfortunately, my window with dmesg -w open also froze, so I have no call traces.

[ 1559.425979] rcu:     7-....: (14977 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1559.431367] rcu:     (t=15002 jiffies g=67965 q=36939 ncpus=8)
[ 1559.432083] rcu: rcu_sched kthread starved for 2866 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1559.433645] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1559.434604] rcu: RCU grace-period kthread stack dump:
[ 1559.435549] rcu: Stack dump where RCU GP kthread last ran:
[ 1739.451891] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1739.452511] rcu:     7-....: (59616 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1739.453498] rcu:     (t=60008 jiffies g=67965 q=36939 ncpus=8)
[ 1739.454053] rcu: rcu_sched kthread starved for 47871 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1739.455135] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1739.456110] rcu: RCU grace-period kthread stack dump:
[ 1739.456687] rcu: Stack dump where RCU GP kthread last ran:
[ 1919.467822] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1919.468776] rcu:     7-....: (104010 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1919.470405] rcu:     (t=105012 jiffies g=67965 q=36941 ncpus=8)
[ 1919.472599] rcu: rcu_sched kthread starved for 92875 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1919.474013] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1919.474977] rcu: RCU grace-period kthread stack dump:
[ 1919.475641] rcu: Stack dump where RCU GP kthread last ran:

And now with a trace from a later occurrence:

[  675.812339] rcu: INFO: rcu_sched self-detected stall on CPU
[  675.814587] rcu:     3-....: (14893 ticks this GP) idle=762c/1/0x4000000000000000 softirq=6920/6920 fqs=6610
[  675.815606] rcu:     (t=15001 jiffies g=50497 q=1304 ncpus=8)
[  675.816520] CPU: 3 PID: 232 Comm: snapfuse Not tainted 6.7.7-WSL2-STABLE+ #2
[  675.816550] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  675.816553] pc : __arch_copy_to_user+0x1a0/0x240
[  675.817689] lr : _copy_to_iter+0xf0/0x560
[  675.818069] sp : ffff800082ceba80
[  675.818070] x29: ffff800082cebac0 x28: 0000000001b2c000 x27: 0000000000000005
[  675.818074] x26: 0000000000000000 x25: ffff00004c491000 x24: 0000000000000000
[  675.818076] x23: 0000000000001000 x22: 0000040000000000 x21: ffff800082cebd30
[  675.818079] x20: ffff800082cebd30 x19: 0000000000001000 x18: 0000000000000000
[  675.818081] x17: 0000000000000000 x16: 0000000000000000 x15: ffff00004c491000
[  675.818083] x14: 9887db4ae914c054 x13: 6bcd444ce14effe5 x12: 0b22b481c6001041
[  675.818086] x11: 7513c0250d7df247 x10: b85affa4063b12c7 x9 : 368beb85bc648557
[  675.818088] x8 : 217c88df9795370e x7 : a16d77942052b4ab x6 : 0000aaf844516fff
[  675.818090] x5 : 0000aaf844517e2f x4 : 0000000000000000 x3 : 0000000000003daf
[  675.818092] x2 : 0000000000000dc0 x1 : ffff00004c491210 x0 : 0000aaf844516e2f
[  675.818096] Call trace:
[  675.818143]  __arch_copy_to_user+0x1a0/0x240
[  675.818147]  copy_page_to_iter+0xbc/0x140
[  675.818150]  filemap_read+0x1b0/0x398
[  675.818427]  generic_file_read_iter+0x48/0x168
[  675.818429]  ext4_file_read_iter+0x58/0x288
[  675.818681]  vfs_read+0x1e8/0x280
[  675.818804]  ksys_pread64+0x90/0xf0
[  675.818806]  __arm64_sys_pread64+0x24/0x48
[  675.818807]  invoke_syscall.constprop.0+0x54/0x128
[  675.818912]  do_el0_svc+0x44/0xf0
[  675.818914]  el0_svc+0x24/0xb0
[  675.819041]  el0t_64_sync_handler+0x138/0x148
[  675.819043]  el0t_64_sync+0x14c/0x150
[  681.501178] block sda: the capability attribute has been deprecated.
[  741.700330] rcu: INFO: rcu_sched self-detected stall on CPU
[  741.701707] rcu:     4-....: (14940 ticks this GP) idle=8074/1/0x4000000000000000 softirq=13021/13037 fqs=6400
[  741.703152] rcu:     (t=15001 jiffies g=50713 q=5093 ncpus=8)
[  741.704017] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2
[  741.704047] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  741.704050] pc : __arch_copy_to_user+0x190/0x240
[  741.704424] lr : _copy_to_iter+0xf0/0x560
[  741.704565] sp : ffff800082cfb870
[  741.704566] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085
[  741.704569] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000
[  741.704571] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90
[  741.704574] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000
[  741.704608] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00
[  741.704610] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070
[  741.704612] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937
[  741.704614] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0
[  741.704617] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000
[  741.704619] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0
[  741.704621] Call trace:
[  741.704647]  __arch_copy_to_user+0x190/0x240
[  741.704651]  simple_copy_to_iter+0x48/0x98
[  741.704939]  __skb_datagram_iter+0x7c/0x280
[  741.704941]  skb_copy_datagram_iter+0x48/0xc8
[  741.704943]  unix_stream_read_actor+0x30/0x68
[  741.705137]  unix_stream_read_generic+0x304/0xb70
[  741.705139]  unix_stream_recvmsg+0xc0/0xd0
[  741.705140]  sock_recvmsg+0x88/0x108
[  741.705170]  ____sys_recvmsg+0x78/0x198
[  741.705171]  ___sys_recvmsg+0x80/0xf0
[  741.705173]  __sys_recvmsg+0x5c/0xd0
[  741.705175]  __arm64_sys_recvmsg+0x28/0x50
[  741.705177]  invoke_syscall.constprop.0+0x54/0x128
[  741.705316]  do_el0_svc+0xcc/0xf0
[  741.705317]  el0_svc+0x24/0xb0
[  741.705369]  el0t_64_sync_handler+0x138/0x148
[  741.705371]  el0t_64_sync+0x14c/0x150
[  743.232431] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 4-.... } 15347 jiffies s: 721 root: 0x10/.
[  743.234297] rcu: blocking rcu_node structures (internal RCU debug):
[  743.235477] Sending NMI from CPU 1 to CPUs 4:
[  743.235491] NMI backtrace for cpu 4
[  743.235531] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2
[  743.235535] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  743.235537] pc : __arch_copy_to_user+0x190/0x240
[  743.235598] lr : _copy_to_iter+0xf0/0x560
[  743.235603] sp : ffff800082cfb870
[  743.235604] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085
[  743.235607] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000
[  743.235610] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90
[  743.235612] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000
[  743.235614] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00
[  743.235617] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070
[  743.235619] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937
[  743.235621] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0
[  743.235623] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000
[  743.235626] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0
[  743.235628] Call trace:
[  743.235630]  __arch_copy_to_user+0x190/0x240
[  743.235632]  simple_copy_to_iter+0x48/0x98
[  743.235636]  __skb_datagram_iter+0x7c/0x280
[  743.235639]  skb_copy_datagram_iter+0x48/0xc8
[  743.235641]  unix_stream_read_actor+0x30/0x68
[  743.235644]  unix_stream_read_generic+0x304/0xb70
[  743.235646]  unix_stream_recvmsg+0xc0/0xd0
[  743.235647]  sock_recvmsg+0x88/0x108
[  743.235650]  ____sys_recvmsg+0x78/0x198
[  743.235651]  ___sys_recvmsg+0x80/0xf0
[  743.235653]  __sys_recvmsg+0x5c/0xd0
[  743.235655]  __arm64_sys_recvmsg+0x28/0x50
[  743.235657]  invoke_syscall.constprop.0+0x54/0x128
[  743.235661]  do_el0_svc+0xcc/0xf0
[  743.235663]  el0_svc+0x24/0xb0
[  743.235667]  el0t_64_sync_handler+0x138/0x148
[  743.235668]  el0t_64_sync+0x14c/0x150

@david-nordvall

I did give it a try. I started up a dev container with a rather large code base, which I usually can't compile, much less start a debugging session for, before WSL hangs. With this kernel, however, I was able to compile, debug, and actually run it. As you note, I still got the crashes, but they took much longer to occur, and WSL seemed to recover more quickly. A big step forward, but not quite usable yet :-).

It also didn't solve the problem where WSL always completely hangs and uses 99% CPU when closing VS Code and the dev container.

I will try to collect some logs and post them here.

@maxboone

maxboone commented Mar 5, 2024

@pmartincic further digging shows that the problems are alleviated quite a bit by disabling RSEQ on a recent kernel, and that they are possibly related to copy_to_user fault handling.

See also this thread / reply: https://lore.kernel.org/rcu/[email protected]/

I do see some patches for Hyper-V timer support submitted upstream recently (among them a commit of yours, congrats!); do these fix anything related to this issue?

@maxboone

See #11274 (comment)

This issue is fixed in 24H2.
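Before re-testing, it's worth confirming the host is actually on 24H2. A build-number check is the simplest way; the cutoff of 26100 (the Windows 11 24H2 GA build) is my assumption about where the fix shipped, and the number itself comes from `wsl.exe --version` or `cmd.exe /c ver` on the host.

```shell
#!/usr/bin/env bash
# Classify a Windows build number relative to 24H2 (build 26100).
check_build() {
    if [ "$1" -ge 26100 ]; then
        echo "24H2 or later"
    else
        echo "pre-24H2"
    fi
}
```

For example, the reporter's original `Windows version: 10.0.22621.963` is build 22621, which this classifies as pre-24H2.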
