Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frequent RCU Stalls on ARM64 #11210

Closed
1 of 2 tasks
PeronGH opened this issue Feb 25, 2024 · 9 comments
Closed
1 of 2 tasks

Frequent RCU Stalls on ARM64 #11210

PeronGH opened this issue Feb 25, 2024 · 9 comments

Comments

@PeronGH
Copy link

PeronGH commented Feb 25, 2024

Windows Version

Microsoft Windows [Version 10.0.22635.3209]

WSL Version

2.1.3.0

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

5.15.146.1-2

Distro Version

Debian GNU/Linux trixie trixie/sid aarch64

Other Software

htop 3.3.0
GNU bash, version 5.2.21(1)-release (aarch64-unknown-linux-gnu)
vscode 1.86.2
(Actually not specifc to any of them)

Repro Steps

The trigger is not clear, but it seems related to a process using too much resources or system sleep. It is triggered multilpe times a day, sometimes when compiling some Rust projects, somethimes when closing VS Code. It seems not related to any specific setup, as I encoutered it in the fresh install of different distros, kernels and configs.

Expected Behavior

This should not happen.

Actual Behavior

The process will suddenly become not responsive, and uses 100% CPU. It is also impossible to kill the process, the only way to stop it is to shutdown WSL. I've already tried different distros, but the bug persists on both Ubuntu and Debian. I also tried different branches of Debian, including the testing branch I'm using now. And I used different shells too, zsh is also unable to avoid the bug. I even tried different kernels, including Nevuly/WSL2-Linux-Kernel-Rolling-LTS, but it cannot resolve the issue too. I also disbaled WSLg, swap, systemd and everything suspicious, still failed to solve the issue.

Here is the screenshot of this time:

image

Here are previous screenshots:

image
(Caused by a process in WSLg container)

image

Diagnostic Logs

I tried attaching to it using strace, but the strace became not responsive too. I also tried lsof, which also became not responsive and unable to be killed.

@OneBlue
Copy link
Collaborator

OneBlue commented Feb 26, 2024

That's interesting. Can you try to have an strace running on that process and have its output redirected to a file, then reproduce the problem with the high CPU usage, and then kill the process and see if you get a proper strace that way ?

@PeronGH
Copy link
Author

PeronGH commented Feb 26, 2024

That's interesting. Can you try to have an strace running on that process and have its output redirected to a file, then reproduce the problem with the high CPU usage, and then kill the process and see if you get a proper strace that way ?

I haven't found a way to reproduce the issue reliablely. It is too random. And I just encoutered it again while using VS Code. Some processes of VS Code stopped responding and took 100% CPU. I tried running strace with stdout redirected to a file, the strace was not responsive again, and I could not quit it using control + c. Then I tried starting another shell tab, but the shell didn't start too. So I can only run wsl --shutdown, however, this has took minutes and still unable to finish. the VmmemWSL is still taking 100% of a few CPU cores and I guess my only choice is to restart my PC.

@PeronGH
Copy link
Author

PeronGH commented Feb 26, 2024

After rebooting and entering WSL again, I didn't find the file to be redirected to. I will see what else info I can collect next time I encounter it.

@ericblade
Copy link

I have been encountering something similar when running bitbake, if it runs long enough, eventually WSL grinds to a halt, consuming all available CPUs given to it, and nothing actually runs. wsl --shutdown is the only way I've found to recover, once it's begun, as there's no way to run any diagnostics inside WSL at that point.

@PeronGH
Copy link
Author

PeronGH commented Feb 28, 2024

After some investigation, I find that any process that attaches to the process not responsive will be stuck, and can't be quitted using conrol + c. But I can still quit it using kill command.

I'm looking for some methods that can collect diagnostic logs without attaching to the process.

@maxboone
Copy link

maxboone commented Mar 2, 2024

@PeronGH do you see RCU stalls in your kernel logs / dmesg when this happens?

@PeronGH
Copy link
Author

PeronGH commented Mar 9, 2024

@PeronGH do you see RCU stalls in your kernel logs / dmesg when this happens?

Yes, I do. This happened when I wake up the system from sleep:

Mar 09 17:33:53 Peron-Book kernel:  ____sys_recvmsg+0x74/0x188
Mar 09 17:33:53 Peron-Book kernel:  ___sys_recvmsg+0x7c/0xe8
Mar 09 17:33:53 Peron-Book kernel:  __sys_recvmsg+0x58/0xc8
Mar 09 17:33:53 Peron-Book kernel:  __arm64_sys_recvmsg+0x28/0x50
Mar 09 17:33:53 Peron-Book kernel:  invoke_syscall.constprop.0+0x54/0x128
Mar 09 17:33:53 Peron-Book kernel:  do_el0_svc+0xcc/0xf0
Mar 09 17:33:53 Peron-Book kernel:  el0_svc+0x24/0xb0
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync_handler+0x138/0x148
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync+0x14c/0x150
Mar 09 17:33:53 Peron-Book kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mar 09 17:33:53 Peron-Book kernel: rcu:         7-....: (239894 ticks this GP) idle=5904/1/0x4000000000000000 softirq=4723/4723 fqs=115724
Mar 09 17:33:53 Peron-Book kernel: rcu:         (t=240034 jiffies g=12009 q=734 ncpus=8)
Mar 09 17:33:53 Peron-Book kernel: CPU: 7 PID: 1190 Comm: systemd-journal Not tainted 6.7.9-WSL2-STABLE+ #1
Mar 09 17:33:53 Peron-Book kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mar 09 17:33:53 Peron-Book kernel: pc : __arch_copy_to_user+0x104/0x240
Mar 09 17:33:53 Peron-Book kernel: lr : _copy_to_iter+0xe8/0x558
Mar 09 17:33:53 Peron-Book kernel: sp : ffff80008dfa3910
Mar 09 17:33:53 Peron-Book kernel: x29: ffff80008dfa3910 x28: ffff0001b8cc5880 x27: 0000000000000065
Mar 09 17:33:53 Peron-Book kernel: x26: 0000000000006007 x25: ffff000002246800 x24: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x26: 0000000000006007 x25: ffff000002246800 x24: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x23: 0000000000000065 x22: ffff000000f1f300 x21: ffff80008dfa3dc0
Mar 09 17:33:53 Peron-Book kernel: x20: ffff000002246800 x19: 0000000000000065 x18: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x17: 0000000000000000 x16: 0000000000000000 x15: ffff000002246800
Mar 09 17:33:53 Peron-Book kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffff80008032c618
Mar 09 17:33:53 Peron-Book kernel: x8 : 37313a3731203920 x7 : 2072614d3e36383c x6 : 0000ab81c89cc210
Mar 09 17:33:53 Peron-Book kernel: x5 : 0000ab81c89cc275 x4 : 0000000000000000 x3 : 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x2 : ffffffffffffffe5 x1 : ffff000002246810 x0 : 0000ab81c89cc210
Mar 09 17:33:53 Peron-Book kernel: Call trace:
Mar 09 17:33:53 Peron-Book kernel:  __arch_copy_to_user+0x104/0x240
Mar 09 17:33:53 Peron-Book kernel:  simple_copy_to_iter+0x48/0x98
Mar 09 17:33:53 Peron-Book kernel:  __skb_datagram_iter+0x7c/0x280
Mar 09 17:33:53 Peron-Book kernel:  skb_copy_datagram_iter+0x48/0xc8
Mar 09 17:33:53 Peron-Book kernel:  __unix_dgram_recvmsg+0x164/0x4f0
Mar 09 17:33:53 Peron-Book kernel:  unix_dgram_recvmsg+0x5c/0x88
Mar 09 17:33:53 Peron-Book kernel:  sock_recvmsg+0x88/0x108
Mar 09 17:33:53 Peron-Book kernel:  ____sys_recvmsg+0x74/0x188
Mar 09 17:33:53 Peron-Book kernel:  ___sys_recvmsg+0x7c/0xe8
Mar 09 17:33:53 Peron-Book kernel:  __sys_recvmsg+0x58/0xc8
Mar 09 17:33:53 Peron-Book kernel:  __arm64_sys_recvmsg+0x28/0x50
Mar 09 17:33:53 Peron-Book kernel:  invoke_syscall.constprop.0+0x54/0x128
Mar 09 17:33:53 Peron-Book kernel:  do_el0_svc+0xcc/0xf0
Mar 09 17:33:53 Peron-Book kernel:  el0_svc+0x24/0xb0
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync_handler+0x138/0x148
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync+0x14c/0x150
Mar 09 17:33:53 Peron-Book systemd-journald[444]: Collecting audit messages is disabled.
Mar 09 17:33:53 Peron-Book systemd-journald[444]: Journal started
Mar 09 17:33:53 Peron-Book systemd-journald[444]: System Journal (/var/log/journal/afbad49b043d473bb637af651af5b1dc) is 490.5M, max 4.0G, 3.5G free.
Mar 09 17:17:01 Peron-Book CRON[440]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Mar 09 17:19:51 Peron-Book systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
Mar 09 17:17:01 Peron-Book CRON[439]: pam_unix(cron:session): session closed for user root
Mar 09 17:19:51 Peron-Book systemd[1]: systemd-journald.service: Killing process 36 (systemd-journal) with signal SIGABRT.
Mar 09 17:33:53 Peron-Book systemd[1]: apt-daily.service: Deactivated successfully.
Mar 09 17:33:53 Peron-Book systemd[1]: Finished apt-daily.service - Daily apt download activities.

And this is when compiling an Android App:

Mar 09 17:48:40 Peron-Book kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mar 09 17:48:40 Peron-Book kernel: rcu:         6-....: (14997 ticks this GP) idle=70e4/1/0x4000000000000000 softirq=11943/11943 fqs=7270
Mar 09 17:48:40 Peron-Book kernel: rcu:         (t=15000 jiffies g=31397 q=994318 ncpus=8)
Mar 09 17:48:40 Peron-Book kernel: CPU: 6 PID: 3354 Comm: Monitor Deflati Not tainted 6.7.9-WSL2-STABLE+ #1
Mar 09 17:48:40 Peron-Book kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mar 09 17:48:40 Peron-Book kernel: pc : clear_rseq_cs.isra.0+0x20/0x38
Mar 09 17:48:40 Peron-Book kernel: lr : __rseq_handle_notify_resume+0x16c/0x490
Mar 09 17:48:40 Peron-Book kernel: sp : ffff80008b9abc80
Mar 09 17:48:40 Peron-Book kernel: x29: ffff80008b9abc80 x28: 000000000000080c x27: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x23: 0000ff7fc087dbb8 x22: 0000000000000000 x21: 00000000ffffffff
Mar 09 17:48:40 Peron-Book kernel: x20: ffff80008b9abeb0 x19: ffff000040f1d880 x18: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000ff0b6520c790
Mar 09 17:48:40 Peron-Book kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffff80008025728c
Mar 09 17:48:40 Peron-Book kernel: x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000fffffffffff8
Mar 09 17:48:40 Peron-Book kernel: x2 : 0000000000000000 x1 : 0000ff7f9520d8c8 x0 : 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: Call trace:
Mar 09 17:48:40 Peron-Book kernel:  clear_rseq_cs.isra.0+0x20/0x38
Mar 09 17:48:40 Peron-Book kernel:  do_notify_resume+0x1cc/0xe68
Mar 09 17:48:40 Peron-Book kernel:  el0_svc+0x90/0xb0
Mar 09 17:48:40 Peron-Book kernel:  el0t_64_sync_handler+0x138/0x148
Mar 09 17:48:40 Peron-Book kernel:  el0t_64_sync+0x14c/0x150

The kernel is https://github.com/Nevuly/WSL2-Linux-Kernel-Rolling/releases/tag/linux-wsl-stable-6.7.9 but the issue happened on the official WSL kernel too. If required, I can switch back to official kernel and wait for the issue to happen in order to collect logs.

@PeronGH PeronGH changed the title Random Process Constantly Becomes Not Responsive and Takes 100% CPU Frequent RCU Stalls on ARM64 Mar 9, 2024
@PeronGH
Copy link
Author

PeronGH commented Mar 9, 2024

Closed in favour of #11274

@PeronGH PeronGH closed this as completed Mar 9, 2024
@maxboone
Copy link

maxboone commented Mar 9, 2024

Alright, well unfortunately, welcome to the club. You can try compiling the kernel with CONFIG_RSEQ disabled which seriously alleviates the problem, and on a Hyper-V VM this doesn't happen (however you'll miss all the WSL2 magic).

I see (while typing this reply) that you found my latest issue - hopefully the Microsoft team will respond there soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants