WSL2 vim hang - RCU stall #7913

Closed
1 of 2 tasks
usedbytes opened this issue Jan 10, 2022 · 4 comments
Comments

@usedbytes

Version

Microsoft Windows [Version 10.0.19042.1415]

WSL Version

  • [x] WSL 2
  • [ ] WSL 1

Kernel Version

Linux version 5.10.60.1-microsoft-standard-WSL2 (oe-user@oe-host) (aarch64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Wed Aug 25 23:20:18 UTC 2021

Distro Version

Ubuntu-20.04

Other Software

tmux, vim

Repro Steps

I'm not exactly sure how to reproduce this - I had a vim process (specifically vim.gtk3, but running in a terminal) inside a tmux session. When I came back to it after some hours, it was unresponsive.

I have seen this same issue in the past with the same set-up, but I don't know what triggers it.

Expected Behavior

vim/terminal remains responsive. The kernel doesn't report RCU stalls.

If it becomes unresponsive, I expect to be able to kill the process via kill/killall etc.

Actual Behavior

The process is completely non-responsive.

top and /proc show the process as running at ~100% CPU:

top - 17:46:34 up  7:06,  0 users,  load average: 3.00, 2.95, 2.15
Tasks:  14 total,   2 running,  12 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.4 sy,  0.0 ni, 86.8 id,  0.0 wa,  0.4 hi,  0.3 si,  0.0 st
MiB Mem :  12417.9 total,   8655.4 free,    401.6 used,   3360.8 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  11900.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  6146 XXXXXX  20   0   56080  19136  13436 R  98.7   0.2  23:31.67 vim
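As a sanity check, the R state that top reports can be read straight from /proc. A minimal sketch, using the current shell's own PID as a stand-in for 6146 (which only exists on the affected machine):

```shell
#!/bin/sh
# Read a task's scheduler state from /proc/<pid>/stat; field 3 is the
# state letter (R = running, S = sleeping, D = uninterruptible sleep).
# The shell's own PID stands in for the hung vim (6146 above).
pid=$$
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "state=$state"
```

On the affected machine, the same read against PID 6146 should agree with the R state shown by top.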

dmesg reports RCU stalls:

[25546.968589] rcu: INFO: rcu_sched self-detected stall on CPU
[25546.968599] rcu:     5-....: (327445 ticks this GP) idle=166/1/0x4000000000000002 softirq=92025/92025 fqs=157547
[25546.968602]  (t=330021 jiffies g=430477 q=135508)
[25546.968611] Task dump for CPU 5:
[25546.968612] task:vim             state:R  running task     stack:    0 pid: 6295 ppid:   153 flags:0x00000003
[25546.968689] Call trace:
[25546.968701]  dump_backtrace+0x0/0x1c8
[25546.968704]  show_stack+0x1c/0x28
[25546.968711]  sched_show_task+0x140/0x160
[25546.968718]  dump_cpu_task+0x48/0x54
[25546.968720]  rcu_dump_cpu_stacks+0xc0/0x100
[25546.968723]  rcu_sched_clock_irq+0x7bc/0x9b8
[25546.968726]  update_process_times+0x64/0x2a0
[25546.968729]  tick_sched_handle.isra.0+0x38/0x50
[25546.968731]  tick_sched_timer+0x50/0xa0
[25546.968733]  __hrtimer_run_queues+0x11c/0x3a8
[25546.968735]  hrtimer_interrupt+0x118/0x300
[25546.968739]  hv_stimer0_isr+0x28/0x30
[25546.968741]  hv_stimer0_percpu_isr+0x14/0x20
[25546.968744]  handle_percpu_devid_irq+0x8c/0x210
[25546.968745]  generic_handle_irq+0x34/0x48
[25546.968747]  __handle_domain_irq+0x68/0xc0
[25546.968749]  gic_handle_irq+0x5c/0x120
[25546.968750]  el1_irq+0xc8/0x180
[25546.968752]  setup_rt_frame+0x8d0/0xbf0
[25546.968754]  do_notify_resume+0x294/0x650
[25546.968755]  work_pending+0xc/0x208

I am unable to terminate the process with kill 6146, kill -9 6146, or killall vim.
I am unable to attach to the process with strace -p 6146 or gdb -p 6146 - it just hangs at "Attaching to process 6146".
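For what it's worth, when SIGKILL is ignored and ptrace attach hangs, the task is typically spinning or blocked in kernel mode, and /proc can sometimes show where without a debugger. A hedged sketch (again substituting the current shell's PID for 6146; /proc/<pid>/stack normally needs root):

```shell
#!/bin/sh
# wchan names the kernel function a sleeping task is blocked in ("0" if
# the task is runnable); /proc/<pid>/stack (root only) gives the full
# kernel backtrace. The shell's PID stands in for the hung vim (6146).
pid=$$
echo "wchan: $(cat "/proc/$pid/wchan" 2>/dev/null)"
cat "/proc/$pid/stack" 2>/dev/null || echo "(need root to read /proc/$pid/stack)"
```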

Windows Task Manager shows ~7% CPU usage in Vmmem, on an otherwise idle system.

This is a Surface Pro X with the SQ2 chip (Windows on Arm).

Diagnostic Logs

No response

@benhillis
Member

@tyhicks - we've seen some RCU stalls in our test automation in the past. Is there something that we can do to root cause these?

@usedbytes
Author

More data: I didn't see a hang this time, but I checked dmesg this morning and it's full of RCU stalls - I guess they freed up, because the WSL2 instance is still responsive.

There seems to be a pretty wide range of affected tasks (kworker, rcu_sched, khugepaged, swapper, neomutt, tmux, localhost(?), ksoftirqd...), so I suspect it's not tied to any particular application.
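To see at a glance which tasks show up in the stall reports, the task: fields can be tallied from the kernel log. A small sketch (reading dmesg may need root depending on kernel.dmesg_restrict; a saved log can be piped in via stdin instead):

```shell
#!/bin/sh
# Count how often each task name appears in the "Task dump" lines of the
# kernel log, most frequent first.
dmesg | grep -o 'task:[^ ]*' | sort | uniq -c | sort -rn
```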

[41445.560635] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41445.560647] rcu:     3-...!: (0 ticks this GP) idle=9d2/1/0x4000000000000002 softirq=89675/89675 fqs=2
[41445.560651]  (detected by 2, t=205367 jiffies, g=334085, q=323)
[41445.560658] Task dump for CPU 3:
[41445.560660] task:khugepaged      state:R  running task     stack:    0 pid:   57 ppid:     2 flags:0x00000028
[41445.560664] Call trace:
[41445.560672]  __switch_to+0x84/0xc0
[41445.560675]  0xfffffdffffe00000
[41445.560677] rcu: rcu_sched kthread starved for 205361 jiffies! g334085 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=5
[41445.560678] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[41445.560679] rcu: RCU grace-period kthread stack dump:
[41445.560680] task:rcu_sched       state:I stack:    0 pid:   12 ppid:     2 flags:0x00000028
[41445.560682] Call trace:
[41445.560683]  __switch_to+0x84/0xc0
[41445.560687]  __schedule+0x348/0x730
[41445.560689]  schedule+0x4c/0xd0
[41445.560691]  schedule_timeout+0x1a0/0x328
[41445.560695]  rcu_gp_kthread+0x41c/0xbc8
[41445.560701]  kthread+0x100/0x130
[41445.560704]  ret_from_fork+0x10/0x18
[41445.560752] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41445.560756] rcu:     3-...!: (1 ticks this GP) idle=9d2/1/0x4000000000000000 softirq=89675/89675 fqs=2
[41445.560758]  (detected by 7, t=1854634 jiffies, g=334085, q=323)
[41445.560762] Task dump for CPU 3:
[41445.560763] task:khugepaged      state:R  running task     stack:    0 pid:   57 ppid:     2 flags:0x0000002a
[41445.560766] Call trace:
[41445.560769]  __switch_to+0x84/0xc0
[41445.560771]  0xfffffdffffe00000
[41445.560773] rcu: rcu_sched kthread starved for 1854628 jiffies! g334085 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[41445.560773] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[41445.560774] rcu: RCU grace-period kthread stack dump:
[41445.560775] task:rcu_sched       state:R  running task     stack:    0 pid:   12 ppid:     2 flags:0x00000028
[41445.560781] Call trace:
[41445.560783]  __switch_to+0x84/0xc0
[41445.560809]  __schedule+0x348/0x730
[41445.560811]  schedule+0x4c/0xd0
[41445.560812]  schedule_timeout+0x1a0/0x328
[41445.560814]  rcu_gp_kthread+0x41c/0xbc8
[41445.560816]  kthread+0x100/0x130
[41445.560821]  ret_from_fork+0x10/0x18
[41445.569823] WSL2: Performing memory compaction.
[49548.284171] rcu: INFO: rcu_sched self-detected stall on CPU
[49548.284177] rcu:     6-...!: (1 GPs behind) idle=92e/1/0x4000000000000002 softirq=87466/87471 fqs=0
[49548.284180]  (t=177580 jiffies g=334341 q=264)
[49548.284187] rcu: rcu_sched kthread starved for 177580 jiffies! g334341 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=5
[49548.284188] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[49548.284188] rcu: RCU grace-period kthread stack dump:
[49548.284190] task:rcu_sched       state:I stack:    0 pid:   12 ppid:     2 flags:0x00000028
[49548.284193] Call trace:
[49548.284201]  __switch_to+0x84/0xc0
[49548.284203] rcu: INFO: rcu_sched self-detected stall on CPU
[49548.284206]  __schedule+0x348/0x730
[49548.284208]  schedule+0x4c/0xd0
[49548.284211]  schedule_timeout+0x1a0/0x328
[49548.284214] rcu:     5-...!: (2 ticks this GP) idle=606/0/0x3 softirq=95374/95374 fqs=0
[49548.284217]  rcu_gp_kthread+0x41c/0xbc8
[49548.284218]  (t=727336 jiffies g=334341 q=264)
[49548.284220] rcu: rcu_sched kthread starved for 727336 jiffies! g334341 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=5
...
...
[66128.883585] rcu: rcu_sched kthread starved for 1439749 jiffies! g338001 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[66128.883586] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[66128.883587] rcu: RCU grace-period kthread stack dump:
[66128.883587] task:rcu_sched       state:R  running task     stack:    0 pid:   12 ppid:     2 flags:0x00000028
[66128.883588] Call trace:
[66128.883589]  __switch_to+0x84/0xc0
[66128.883590]  __schedule+0x348/0x730
[66128.883591]  schedule+0x4c/0xd0
[66128.883592]  schedule_timeout+0x1a0/0x328
[66128.883593]  rcu_gp_kthread+0x41c/0xbc8
[66128.883594]  kthread+0x100/0x130
[66128.883595]  ret_from_fork+0x10/0x18
[70256.496667] WSL2: Performing memory compaction.
[71018.375306] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[71018.375312] rcu:     4-...!: (0 ticks this GP) idle=338/0/0x0 softirq=98633/98633 fqs=1  (false positive?)
[71018.375317]  (detected by 0, t=175088 jiffies, g=338801, q=5)
[71018.375322] Task dump for CPU 4:
[71018.375325] task:swapper/4       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x00000028
[71018.375329] Call trace:
[71018.375336]  __switch_to+0x84/0xc0
[71018.375343]  tick_nohz_idle_stop_tick+0x1f4/0x2a8
[71018.375347]  do_idle+0xc0/0x128
[71018.375349] rcu: rcu_sched kthread starved for 175086 jiffies! g338801 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=7
[71018.375350] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[71018.375350] rcu: RCU grace-period kthread stack dump:
[71018.375352] task:rcu_sched       state:I stack:    0 pid:   12 ppid:     2 flags:0x00000028
[71018.375353] Call trace:
[71018.375368]  __switch_to+0x84/0xc0
[71018.375371]  __schedule+0x348/0x730
[71018.375372]  schedule+0x4c/0xd0
[71018.375374]  schedule_timeout+0x1a0/0x328
[71018.375375]  rcu_gp_kthread+0x41c/0xbc8
[71018.375377]  kthread+0x100/0x130
[71018.375379]  ret_from_fork+0x10/0x18

@hwine

hwine commented Jun 14, 2022

I'm also seeing this on aarch64 on a regular basis. A few more general observations, and I'd be happy to provide logs if you point me to instructions.

  • this happens "for sure" with tmux
  • often, only the active tmux window is impacted -- I may be able to do ^B^W and find other working windows.
  • it will also happen with bash
  • I've experienced this in "Windows Terminal", "Windows Terminal Preview", and a terminal window in VSCode
  • Often, after the hang, wsl shutdown will not complete, and I have to terminate all wsl*.exe tasks, then vmmemWSL before I get a successful restart of WSL.
    • after a number of "force quits", I need to reboot (I haven't tried logout).
  • sometimes, wsl shutdown will complete, but VmmemWSL will still be running at 20-30% CPU. I have to wait for that process to terminate (or kill it) before I can get a new working session.

I think it is less likely to hang if the terminal window is not the active window in the app. E.g. if I switch to a power shell tab in windows terminal, or close the terminal pane in vscode. But I haven't been keeping timings, so ....

kernel: Linux TABLET-935H4GMO 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux (Microsoft Surface X)
OS: Ubuntu 20.04.4 LTS

Contributor

This issue has been automatically closed since it has not had any activity for the past year. If you're still experiencing this issue please re-file this as a new issue or feature request.

Thank you!


3 participants