Xen 4.14: X crashes/hangs #6097

Closed
0spinboson opened this issue Oct 1, 2020 · 22 comments
Labels
affects-4.1 This issue affects Qubes OS 4.1. C: Xen P: major Priority: major. Between "default" and "critical" in severity. R: self-closed Voluntarily closed by the person who opened it before another resolution occurred. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@0spinboson

0spinboson commented Oct 1, 2020

Qubes OS version

Qubes 4.1 up to date with current-testing
rx vega56, amdgpu driver, xorg-x11-driver-amdgpu installed
kernel 5.8.10

Affected component(s) or functionality

kde plasma 5.18

Brief summary

When I try to log in to the KDE Plasma desktop, it either hangs for about a minute before crashing back to the login screen (after which I sometimes can no longer connect to running VMs or start new ones), or the whole system hangs or reboots.

To Reproduce

Steps to reproduce the behavior:

  1. Install xen 4.14.0-1 or up, have working kde plasma setup
  2. reboot
  3. try to login

Expected behavior

desktop works as it should

Actual behavior

Hangs, reboots, or X crashes.
Starting KDE applications (including the settings manager) while using Xfce can also lead to an unresponsive system.

Additional context

Xorg.log.txt
xsessold.txt

Solutions you've tried

xfce works fine, reverting to xen 4.13 also works.

@0spinboson 0spinboson added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Oct 1, 2020
@andrewdavidwong andrewdavidwong added C: desktop-linux-kde Support for the K Desktop Environment (KDE) needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Oct 1, 2020
@andrewdavidwong andrewdavidwong added this to the Release 4.1 milestone Oct 1, 2020
@0spinboson
Author

Seems I spoke too soon. xfce also hangs, it just takes longer, and is fully usable for a few hours.

@0spinboson 0spinboson changed the title Xen 4.14 kde plasma crash/hang Xen 4.14 x crash/hang Oct 1, 2020
@0spinboson 0spinboson changed the title Xen 4.14 x crash/hang Xen 4.14: X crashes/hangs Oct 1, 2020
@andrewdavidwong andrewdavidwong added C: Xen and removed C: desktop-linux-kde Support for the K Desktop Environment (KDE) labels Oct 1, 2020
@ghost

ghost commented Oct 5, 2020

It also freezes with an external monitor connected through HDMI.

@tasket

tasket commented Oct 5, 2020

I'm having significant KDE startup issues, with certain desktop components sometimes starting and sometimes not. In particular, the desktop seems to fail to start or to crash, leaving a black area with no contents other than app windows. BTW, installing the amdgpu driver hasn't helped with this (although it seems to help in other areas).

A related problem is that the desktop wallpaper and desktop icon settings are forgotten.

The problem is aggravated by the qubes-vm startup service. I have disabled the instances for sys-net and sys-usb, and added a command to a KDE user startup script: systemd-run sh -c 'sleep 4; qvm-start sys-net'

@tasket

tasket commented Oct 5, 2020

Also, with the qubes-vm service startup, the sys-usb vm would fail to start about half the time and I would need to reboot the system to get it running at all. Definite race condition(s) happening during startup.

Switching to my delayed user script seems to increase the sys-usb success rate a lot.

@andrewdavidwong andrewdavidwong added P: major Priority: major. Between "default" and "critical" in severity. and removed P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Oct 5, 2020
@fepitre
Member

fepitre commented Nov 20, 2020

@0spinboson to confirm whether this is the same issue: a workaround for the one I have is to add dom0=pvh to the Xen command-line boot args. But... then I cannot start any HVM with a PCI device attached...

@DemiMarie

This is eerily reminiscent of a bug I had running kernel-latest on R4.0. I will try to reproduce.

@0spinboson
Author

@0spinboson to confirm whether this is the same issue: a workaround for the one I have is to add dom0=pvh to the Xen command-line boot args. But... then I cannot start any HVM with a PCI device attached...

hm. I tried the Xen 4.14.0-9 build, and I'm now also seeing the lockups.
lockup.txt

@0spinboson
Author

hm. looks like Xen 4.14.1 contains no relevant fixes, unless I'm missing something? Shame.

@marmarek
Member

marmarek commented Jan 6, 2021

Try adding sched=credit to Xen boot options.

@marmarek
Member

marmarek commented Jan 6, 2021

You can also try dropping the smt=0 option (but note that this raises the risk of speculative-execution-type bugs).
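
For anyone following along, the two suggestions above amount to editing the Xen command line. Below is a small Python sketch of that edit as a pure string operation. The helper `adjust_xen_cmdline` and the sample command line are made up for illustration, and where the real command line lives (for example `GRUB_CMDLINE_XEN_DEFAULT` in `/etc/default/grub` on a GRUB-based install) is an assumption about your setup, not something stated in this thread.

```python
def adjust_xen_cmdline(cmdline, add=(), drop=()):
    """Return cmdline with `drop` options removed and `add` options appended once.

    Options are matched by the name before '=', so dropping "smt=0"
    removes any existing smt=... token.
    """
    drop_keys = {opt.split("=", 1)[0] for opt in drop}
    add_keys = {opt.split("=", 1)[0] for opt in add}
    skip = drop_keys | add_keys
    kept = [tok for tok in cmdline.split() if tok.split("=", 1)[0] not in skip]
    return " ".join(kept + list(add))


# Hypothetical starting value; the real one is whatever your boot config contains.
current = "console=none dom0_mem=min:1024M dom0_mem=max:4096M smt=0"
print(adjust_xen_cmdline(current, add=["sched=credit"], drop=["smt=0"]))
# prints: console=none dom0_mem=min:1024M dom0_mem=max:4096M sched=credit
```

The result still has to be written back to the boot configuration and the bootloader config regenerated before rebooting; this sketch deliberately does neither.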

@0spinboson
Author

yeah that isn't too much of a worry for me, since I'm using an AMD cpu. but will try, thank you.

@0spinboson
Author

Try adding sched=credit to Xen boot options.

it at least is better than it was before, haven't seen any CPU hangs yet with about 4h uptime; which is an improvement. Though I understood from fepitre's xen-devel mails that for him it only delayed the inevitable, and I don't see any sched-related fixes for x86? But will let you know how long it lasts. Kernel still 5.8.16, will try 5.10.5 once this proves stable for 30h or so.

@0spinboson
Author

I made it through 24h without any strange issues, which is unheard of compared to before. kernel 5.10 and 5.9 both are unstable as hell though, but idk what to do with that since the crashes don't leave any logs.

@0spinboson
Author

0spinboson commented Jan 11, 2021

Must say the Xen documentation isn't great. https://wiki.xenproject.org/wiki/Xen_Project_4.12_Feature_List says credit2 is the 'new default scheduler', and so does https://xenbits.xen.org/docs/4.14-testing/features/sched_credit2.html, yet https://xenbits.xen.org/docs/4.14-testing/SUPPORT.html#credit-scheduler says credit is the default. I guess the latter isn't updated, but why does this page even exist if it's not used any more? Ugh.

@marmarek
Member

kernel 5.10 and 5.9 both are unstable as hell though

5.10.7 brings quite a few fixes for common regressions in 5.9 and earlier 5.10 (at least on Intel HW). It's in current-testing already, can you give it a try?

@0spinboson
Author

0spinboson commented Jan 13, 2021

okay. it seems to have something to do with my dual monitor setup. if I only have 1 monitor enabled during login, no problem. If I turn the second monitor on and then reboot the VM, it even 'gets' the correct display dimensions so I can click anywhere no matter where I put the VM window. But as soon as I login with 2 4k displays enabled, instant reboot.
Using KDE, and lightdm.

@0spinboson
Author

0spinboson commented Jan 14, 2021

was stable once I got past login until I opened "too many windows" at the same time, at which point I got another hard reboot. Not sure which component / interaction causes this problem, but it's an issue that doesn't occur at all with kernel 5.8.16, and it seems to have to do with drawing windows.

@0spinboson
Author

0spinboson commented Feb 23, 2021

tested this weekend using xen 4.14.1-2, kernel 5.4.98.

Kernel 5.9 and up crash as soon as I open too many windows at the same time, but are stable so long as I'm in dom0 with no VMs spun up. Kernel 5.8.16-1 works fine.

I decided to test kernel 5.4 over the weekend, and I've noticed something similar happening there, although this kernel series does result in a system that's usable/stable for at least a day or two.

I've attached a log file with more info, but this is part of what happens:

Feb 20 22:59:46 dom0 kernel: Xorg: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Feb 20 22:59:46 dom0 kernel: CPU: 1 PID: 3570 Comm: Xorg Not tainted 5.4.98-1.fc32.qubes.x86_64 #1
Feb 20 22:59:46 dom0 kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
Feb 20 22:59:46 dom0 kernel: Call Trace:
Feb 20 22:59:46 dom0 kernel: dump_stack+0x64/0x7c
Feb 20 22:59:46 dom0 kernel: warn_alloc.cold+0x7b/0xdf
Feb 20 22:59:46 dom0 kernel: ? __alloc_pages_direct_compact+0x171/0x180
Feb 20 22:59:46 dom0 kernel: __alloc_pages_slowpath+0xa98/0xae0
Feb 20 22:59:46 dom0 kernel: ? get_page_from_freelist+0x18b/0x340
Feb 20 22:59:46 dom0 kernel: __alloc_pages_nodemask+0x30e/0x360
Feb 20 22:59:46 dom0 kernel: kmalloc_order+0x1b/0x80
Feb 20 22:59:46 dom0 kernel: kmalloc_order_trace+0x1d/0xa0
Feb 20 22:59:46 dom0 kernel: gntdev_alloc_map+0x64/0x250 [xen_gntdev]
Feb 20 22:59:46 dom0 kernel: gntdev_ioctl_map_grant_ref+0x73/0x1d0 [xen_gntdev]
Feb 20 22:59:46 dom0 kernel: do_vfs_ioctl+0x2fb/0x490
Feb 20 22:59:46 dom0 kernel: ? syscall_trace_enter+0x1d1/0x2c0
Feb 20 22:59:46 dom0 kernel: ksys_ioctl+0x5e/0x90
Feb 20 22:59:46 dom0 kernel: __x64_sys_ioctl+0x16/0x20
Feb 20 22:59:46 dom0 kernel: do_syscall_64+0x5b/0xf0
Feb 20 22:59:46 dom0 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 20 22:59:46 dom0 kernel: RIP: 0033:0x74775a7db17b

near-full log: log.txt

(I should probably rename this issue as it seems to be a memory allocation or grant page issue.)
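
To put a number on the first line of that trace: `order:6` means the kernel was asked for 2^6 physically contiguous pages in one piece. A minimal Python sketch that pulls the process name and order out of such a line and converts it to a size, assuming the standard 4 KiB x86-64 page size (the helper name `parse_alloc_failure` is made up for illustration):

```python
import re

PAGE_SIZE = 4096  # bytes; assumes standard 4 KiB x86-64 pages


def parse_alloc_failure(line):
    """Return (process, order, contiguous_bytes) or None if the line does not match."""
    m = re.search(r"kernel: (\S+): page allocation failure: order:(\d+)", line)
    if m is None:
        return None
    order = int(m.group(2))
    return m.group(1), order, PAGE_SIZE << order


# First line of the trace quoted above.
line = (
    "Feb 20 22:59:46 dom0 kernel: Xorg: page allocation failure: order:6, "
    "mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0"
)
proc, order, size = parse_alloc_failure(line)
print(f"{proc} asked for {1 << order} contiguous pages ({size // 1024} KiB)")
# prints: Xorg asked for 64 contiguous pages (256 KiB)
```

A request like this can fail because of fragmentation even when plenty of memory is free in total, which may be why it only shows up after many windows have been opened.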

@marmarek
Member

gntdev_ioctl_map_grant_ref and Comm: Xorg say it is about mapping window contents (composition buffers) from a VM.

It looks like this mapping is using actual dom0 memory, not only mapping the VM's memory into dom0's own address space. This sounds similar to the situation described in https://xenbits.xen.org/xsa/advisory-300.html.
When this happens, can you check grant table usage with xl debug-key g; xl dmesg? I'm mostly interested in headers for each domain (.. frames (... max), ... maptrack frames (... max) lines).
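
A rough Python sketch for picking those header lines out of a saved `xl dmesg` dump. The regular expression is built only from the shape described above, and the sample text is made up, so the real output may need a different pattern; `grant_usage` is a hypothetical helper name.

```python
import re

# Pattern assumed from the description above: "N frames (M max), K maptrack frames (J max)".
HEADER = re.compile(r"(\d+) frames \((\d+) max\), (\d+) maptrack frames \((\d+) max\)")


def grant_usage(dmesg_text):
    """Yield (frames, max_frames, maptrack, max_maptrack) for each matching line."""
    for line in dmesg_text.splitlines():
        m = HEADER.search(line)
        if m:
            yield tuple(int(g) for g in m.groups())


# Made-up sample in the described shape, not real output from any system.
sample = """\
1 frames (64 max), 2 maptrack frames (1024 max)
40 frames (64 max), 0 maptrack frames (1024 max)
"""
for frames, fmax, mt, mtmax in grant_usage(sample):
    print(f"frames {frames}/{fmax}, maptrack {mt}/{mtmax}")
```

In real use the text would come from dom0 after running `xl debug-key g`, for example via `subprocess.run(["xl", "dmesg"], capture_output=True, text=True).stdout`. A domain whose counts sit close to their maximum would be the one to look at first.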

@0spinboson
Author

I changed the buffer size in the manner and to the size suggested by brendan, but I still only get two such headers. Is that normal, or should I increase it further?

marmarek added a commit to marmarek/qubes-linux-kernel that referenced this issue Mar 8, 2021
Allow to not use otherwise usable RAM pages to map foreign
pages (including grant mappings). This is especially useful for dom0 (or
GUI domain) that maps a lot of foreign pages.

QubesOS/qubes-issues#6097
marmarek added a commit to QubesOS/qubes-linux-kernel that referenced this issue Mar 8, 2021
Allow to not use otherwise usable RAM pages to map foreign
pages (including grant mappings). This is especially useful for dom0 (or
GUI domain) that maps a lot of foreign pages.

QubesOS/qubes-issues#6097

(cherry picked from commit 3e71ce9)
@0spinboson
Author

I'm still trying to capture the debug-keys output, but it looks like kernel 5.11(.4-1) is much more stable again, not tried 5.10.21 yet. I've never made it past 4hrs uptime since 5.9.x, and currently i'm at 21h.

jevank pushed a commit to jevank/qubes-linux-kernel-gvt that referenced this issue Feb 2, 2022
Allow to not use otherwise usable RAM pages to map foreign
pages (including grant mappings). This is especially useful for dom0 (or
GUI domain) that maps a lot of foreign pages.

QubesOS/qubes-issues#6097

(cherry picked from commit 3e71ce9ed6f0c57129e767f42d16fce5a3c7b1e9)
@andrewdavidwong andrewdavidwong added the affects-4.1 This issue affects Qubes OS 4.1. label Aug 8, 2023
@andrewdavidwong andrewdavidwong removed this from the Release 4.1 updates milestone Aug 13, 2023
@andrewdavidwong andrewdavidwong added R: self-closed Voluntarily closed by the person who opened it before another resolution occurred. and removed needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Feb 7, 2024

7 participants
@marmarek @DemiMarie @andrewdavidwong @tasket @0spinboson @fepitre and others