Atheros 928x PCI passthrough not working #3609

Closed
awokd opened this issue Feb 19, 2018 · 31 comments
Labels
C: other T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@awokd

awokd commented Feb 19, 2018

Qubes OS version:

R4.0

Affected TemplateVMs:


Steps to reproduce the behavior:

Try to attach AR9280 to sys-net or other HVM. AR9287 also reported to have same behavior.

Expected behavior:

ath9k driver loads without crashing

Actual behavior:

ath9k driver crashes HVM with

(XEN) AMD-Vi: Setup I/O page table: device id = 0x200, type = 0x1, root table = 0x264921000, domain = 9, paging mode = 3
...
(XEN) svm.c:1540:d9v0 SVM violation gpa 0x000000f2020040, mfn 0xf0100, type 5
(XEN) domain_crash called from svm.c:1541

General notes:

Filing this here because passing through the same device on the same hardware to an HVM on Xen 4.8.2 and 4.8.3 on Fedora 26 works, as does using it in dom0 in that configuration and under stock Debian Stretch. Not sure if it affects a "broad range" of users as much as Intel wireless, though if there's a bug in handling this type of PCI device it could also affect other similar devices under Qubes. The fix could well be to buy a new device, but it might be helpful to understand why it doesn't work.

https://stackoverflow.com/questions/38387504/xen-guest-atheros-wifi-driver-load-causes-memory-paging-failure has a good description of the problem. He was encountering it under Xen 4.6 instead of Qubes, but I had the same issue (kernel crash instead of domU) when trying to pass it through to a PV under Qubes:

it seems the iomap of the PCI BAR for the device returns a mapping of which the first 0x1000 bytes are read-only, and that causes an access violation when trying to write registers mapped to this area (all the regs with offset < 0x1000) - why this happens I still don't know. Register writes with offsets > 0x1000 are fine.

According to the datasheet, this device uses a PCI Express 1.0a Configuration space of 0x00-0x62, DMA accessed registers from 0x0000-0x0FFC, and other registers from 0x1000-0x98FC. For example, the offset 0x40 PCI Express Configuration space register is used for Power Management Capability, while offset 0x0040 DMA device register is used for MIB Control. It has a single 64K BAR and no defined I/O port.

It's that first page of DMA registers that is causing problems. From Xen's perspective, the VM is trying to do an IO write to a page flagged as memory mapped (if I understand the error right), so it crashes. I verified this by commenting out the first couple register writes that were to offsets <0x1000 in the ath9k driver and recompiling it. The crash then occurred later in the driver initialization, but at a different <0x1000 location. Multiple writes to >0x1000 locations during driver initialization were processed successfully.
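
The split between failing and working offsets can be illustrated with a short Python sketch. The register ranges are the ones summarized from the datasheet above; the helper name `classify_offset` and the 4 KiB page size are assumptions made for illustration and are not taken from the ath9k driver or from Xen.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB page, the granularity of page permissions

# Register ranges as summarized from the datasheet in this report.
DMA_REGS = range(0x0000, 0x0FFC + 1)
OTHER_REGS = range(0x1000, 0x98FC + 1)


def classify_offset(offset: int) -> str:
    """Describe a BAR-relative register offset (hypothetical helper)."""
    if offset in DMA_REGS:
        kind = "DMA register"
    elif offset in OTHER_REGS:
        kind = "other register"
    else:
        kind = "outside documented ranges"
    # Integer division gives the index of the page inside the BAR.
    return f"{kind}, BAR page {offset // PAGE_SIZE}"


for off in (0x0040, 0x0FFC, 0x1000, 0x98FC):
    print(hex(off), "->", classify_offset(off))
```

Every DMA register lands in BAR page 0, which is consistent with the observation that only writes below 0x1000 fail.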


Related issues:

@awokd
Author

awokd commented Feb 19, 2018

Currently attempting to get Qubes 4.0's xen-hvm-stubdom-linux running on Xen 4.8.3 to see if it's a stubdom issue.

@schnurentwickler

schnurentwickler commented Feb 19, 2018

I have had issues with Atheros as well. The card was unusable after the computer had been in standby, even after reboots. Only a full shutdown, followed by a boot WITH the power supply attached, brought it back.
I could not solve the power supply issue, but I worked around the standby issue with nohwcrypt as a module option. See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/568090
Maybe Xen has even more trouble loading and assigning a device that responds strangely even on normal Linux setups.
I could not get an Atheros device to work in Qubes 3.2. The first information page for a release should note that Atheros devices are best avoided.

@awokd
Author

awokd commented Feb 19, 2018

It's not all Atheros devices; I have a 9565 that works with Qubes 4.0 (although I never tested suspend). But you are probably right and the list of not working ones is longer than just 928x. I know Intel has issues with sleep mode too.

@andrewdavidwong andrewdavidwong added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. C: other labels Feb 20, 2018
@andrewdavidwong andrewdavidwong added this to the Release 4.0 milestone Feb 20, 2018
@marmarek
Member

This is weird. The difference from a plain Fedora setup may be the use of a stubdomain at all. Running a Linux-based stubdomain requires some libxl patching, but the mini-os based one should work out of the box on a non-Qubes system.
Another thing we do differently is enabling the e820_host option in the guest configuration - you can disable it with qvm-features sys-net pci-e820-host ''. I doubt it will help, but those are the differences I'm aware of.

/cc @HW42

@awokd
Author

awokd commented Feb 21, 2018

Tried qvm-features sys-net pci-e820-host '' but unfortunately, no effect.

Not sure if it's relevant, but the working version appears to be using MSI-X, whereas under Qubes it's MSI or legacy. I'm basing this observation on the IRQ numbers only; I'm not entirely sure how to decode the lspci -vvv output.

To summarize:

| Version | Result |
| --- | --- |
| Debian 9 | works |
| Xen 4.6 PV | fails (per Stackoverflow link) |
| Xen 4.8.3 dom0 | works |
| Xen 4.8.3 PV | spent 6 hours trying to get it to boot and xl console to connect, will try again later (not a tech support request but the learning curve sure is steep) |
| Xen 4.8.3 HVM | works except can't scan wireless networks |
| Xen 4.8.3 HVM traditional stubdomain | fault inside the stubdomain even with a very basic config and nothing passed through |
| Qubes 4.0 PV | fails similarly to Stackoverflow link |
| Qubes 4.0 HVM | fails with svm.c domain_crash |
| Qubes 4.0 HVM w/9565 | works |

@HW42

HW42 commented Feb 21, 2018

@awokd: Could you please post lspci -vvv -xxxx -s XX:XX.X (replace XX:XX.X with the device) from both dom0 as well as from inside the VM.

@HW42

HW42 commented Feb 21, 2018

FWIW: The ath9k card I have lying around (AR9287 according to lspci) works for me.

@HW42

HW42 commented Feb 21, 2018

@awokd: You wrote that "Xen 4.8.3 HVM" works. Could you try passing pci=nomsi to the VM kernel and see if it still works?

@awokd
Author

awokd commented Feb 21, 2018

Attached the files- I'm able to boot qubes domu with the ath9k module blacklisted. My AR9280 is on a corebooted AMD and the other user that told me about the AR9287 not working was as well. Tried pci=nomsi on the xen domu and it had no effect- verified it on the boot log options line and the IRQ was still 36. Had also tried that before on the qubes domu with no change, still the svm.c crash.

qdom0.txt
qdomu.txt
xdom0.txt
xdomu.txt

@awokd
Author

awokd commented Feb 21, 2018

I should clarify what I mean by "works" for the Xen HVM- the ath9k driver loads without crashing and I can poke at the card with iw commands and set and get data. Can't actually scan wireless networks but it looks like that's a common problem with multiple possible solutions, so I haven't spent much time on it yet.

@HW42

HW42 commented Feb 22, 2018

My AR9280 is on a corebooted AMD and the other user that told me about the AR9287 not working was as well.

AFAIK @h01ger also has problems with an ath9k card on a coreboot machine. That one has an Intel CPU. So this sounds like a coreboot problem. Can you try this on a non-coreboot machine (or even better, stock BIOS on the same machine)?

@h01ger

h01ger commented Feb 22, 2018 via email

@HW42

HW42 commented Feb 22, 2018

one problem with thinkpads is, that they only allow intel wlan cards with the stock bios.

Ugh.

Let's see what @awokd reports.

@awokd
Author

awokd commented Feb 22, 2018

Yes, it's a Lenovo too with a whitelist firmware, so I couldn't run this card on it if I flash it back. But should the domU's lspci output differ between Qubes and Xen?
qdomu: Capabilities blocks 40, 50, 60, legacy INT(?)
xdomu: Capability block 40, MSI-X
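
A comparison like the one above can be made less error-prone by extracting the capability offsets from two saved lspci -vvv dumps and diffing them. The following Python sketch assumes the usual `Capabilities: [NN] ...` line format of lspci; the function name `capabilities` is hypothetical, and the two sample strings are made-up placeholders that only mirror the offsets mentioned here. They are NOT the contents of the attached qdomu.txt and xdomu.txt.

```python
import re

# Matches e.g. "Capabilities: [40] Power Management version 2"
# and extended entries such as "Capabilities: [100 v1] ...".
CAP_RE = re.compile(r"Capabilities: \[([0-9a-f]+)(?: v\d+)?\]\s*(.*)")


def capabilities(lspci_text: str) -> dict:
    """Map capability offset -> description for one lspci -vvv dump."""
    caps = {}
    for line in lspci_text.splitlines():
        m = CAP_RE.search(line)
        if m:
            caps[int(m.group(1), 16)] = m.group(2).strip()
    return caps


# Illustrative placeholder input only (assumed capability names).
SAMPLE_A = """\
\tCapabilities: [40] Power Management version 2
\tCapabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit-
\tCapabilities: [60] Express (v1) Legacy Endpoint, MSI 00
"""
SAMPLE_B = """\
\tCapabilities: [40] Power Management version 2
"""

caps_a, caps_b = capabilities(SAMPLE_A), capabilities(SAMPLE_B)
only_in_a = sorted(set(caps_a) - set(caps_b))
print("only in first dump:", [hex(off) for off in only_in_a])
```

With real dumps, the text would be read from the saved files instead of the placeholder strings.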

@awokd
Author

awokd commented Feb 22, 2018

This could be an edge case too, in which case I apologize for wasting everyone's time. But I've seen similar reports of MSI interrupts being flaky on some devices under Qubes over the past few months I've been working on this (not solidly, but still...). Maybe it's a duplicate issue?

PS I've edited the test results table above with additional results I forgot to include.

one example
and #3217

[ 2.361791] iwlwifi 0000:00:01.0: Xen PCI mapped GSI17 to IRQ27
[ 2.365431] iwlwifi 0000:00:01.0: pci frontend enable msi failed for dev 0:8
[ 2.365465] iwlwifi 0000:00:01.0: Xen PCI frontend error: -22!
[ 2.365694] iwlwifi 0000:00:01.0: pci_enable_msi failed - -22

and #3235

Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: Xen PCI mapped GSI18 to IRQ26
Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: pci frontend enable msi failed for dev 0:0
Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: Xen PCI frontend error: -22!
Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: pci_enable_msi failed - -22

@HW42

HW42 commented Feb 22, 2018

But should the domU's lspci output differ between Qubes and Xen?

That's expected since vanilla Xen doesn't use a stubdom by default (and we have a custom Linux based stubdom).

qdomu: Capabilities blocks 40, 50, 60, legacy INT(?)
xdomu: Capability block 40, MSI-X

Are you sure you didn't swap xdomu.txt and qdomu.txt? I would expect them the other way around.

Also I don't see MSI-X in either (in Qubes that's expected). Why do you think it's using MSI-X?

@awokd
Author

awokd commented Feb 22, 2018

Yes, I'm sure I didn't swap them. Note the lack of Kernel driver in use: ath9k in the Qubes one.

Because it's on IRQ 36. My understanding is Legacy interrupt values go up to 16, MSI up to 32, and MSI-X up to 2048 (but maybe that is folklore).

@HW42

HW42 commented Feb 22, 2018

Anyway, I think it's not interrupt-related, but rather this:

Region 0: Memory at f0100000 (64-bit, non-prefetchable) [disabled] [size=64K]

Note the [disabled]. Please post xl dmesg (ideally with loglvl=all). Dom0 dmesg also doesn't hurt, but is probably not needed.

@marmarek
Member

I gave one ath9k card to marmarek, but I think he wasn't able to test it just yet.

I've tried and the card isn't even visible on lspci in dom0. But it may be something with my laptop...

@awokd
Author

awokd commented Feb 22, 2018

That [disabled] is interesting. I'd assumed it was an artefact of Qubes hiding PCI devices but when I tested Xen with xen-pciback.hide=(02:00.0) just now, it continued to be enabled.
Attaching the xl dmesg from both.

qdmesg.txt
xdmesg.txt

@h01ger

h01ger commented Feb 22, 2018 via email

@marmarek
Member

Yes. I'll try another card in that slot (the slot that is working with the intel wifi is too small for this one).

@awokd
Author

awokd commented Feb 24, 2018

@HW42 : Noticed something else in that qdomu.txt file- it has the 50 and 60 MSI capabilities but only the standard PCI configuration space (the -xxxx dump only goes up to 0xff). In qdom0 it shows the PCIe extended config space in the dump. Attempting to follow the logic in xen-4.8.3/tools/qemu-xen/hw/pci/pcie.c was uninformative, so not sure if one has anything to do with the other (or if I'm even in the right area). Could this also be related to the [disabled] memory?
The "missing" configuration space also seems to line up with the range of memory registers the driver crashes on when it attempts to write.
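
To check how much configuration space a saved dump actually contains, the hex section of lspci -xxxx output can be measured. Conventional PCI configuration space is 256 bytes and PCIe extended configuration space is 4096 bytes. The sketch below is a minimal Python example under the assumption that each hex line has the form `OFFSET: b0 b1 ... b15`; `dumped_config_bytes` and `fake_dump` are hypothetical names, and `fake_dump` only generates zero-filled test input.

```python
import re

# One hex dump line: an offset, a colon, then up to 16 two-digit bytes.
HEX_LINE = re.compile(r"^([0-9a-f]+):((?: [0-9a-f]{2}){1,16})\s*$")


def dumped_config_bytes(lspci_text: str) -> int:
    """Return the end offset of the highest hex dump line found."""
    end = 0
    for line in lspci_text.splitlines():
        m = HEX_LINE.match(line.strip())
        if m:
            start = int(m.group(1), 16)
            count = len(m.group(2).split())
            end = max(end, start + count)
    return end


def fake_dump(size: int) -> str:
    """Build zero-filled dump text of the given size (test data only)."""
    return "\n".join(f"{off:02x}:" + " 00" * 16 for off in range(0, size, 16))


for size in (256, 4096):
    n = dumped_config_bytes(fake_dump(size))
    kind = "extended (PCIe)" if n > 256 else "standard only"
    print(n, "bytes dumped:", kind)
```

A dump that stops at 0xff would report 256 bytes, matching what is described for qdomu.txt.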

@marmarek
Member

marmarek commented Mar 7, 2018

Ok, I've tried the card in another slot and it is visible. And it crashes sys-net in a very similar way: EPT violation (-w-/r-x). When I switch sys-net to PV, it also crashes, but with a more useful message, very similar to the one from Stackoverflow:

[    4.324539] BUG: unable to handle kernel paging request at ffffc90001c70040
[    4.324585] IP: iowrite32+0x2b/0x30
[    4.324607] PGD 18818067 P4D 18818067 PUD 18817067 PMD 11beb067 PTE 80100000f1500075
[    4.324665] Oops: 0003 [#1] SMP NOPTI
[    4.324688] Modules linked in: ath9k(+) ath9k_common ath9k_hw mac80211 ath cfg80211 rfkill e1000e ptp pps_core intel_rapl x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel intel_rapl_perf pcspkr xen_pcifront xenfs xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn u2mfn(O) xen_blkfront
[    4.324842] CPU: 0 PID: 233 Comm: kworker/0:2 Tainted: G           O    4.14.18-1.pvops.qubes.x86_64 #1
[    4.324891] Workqueue: events work_for_cpu_fn
[    4.324918] task: ffff88001059db80 task.stack: ffffc900019d4000
[    4.324952] RIP: e030:iowrite32+0x2b/0x30
[    4.324973] RSP: e02b:ffffc900019d7cc0 EFLAGS: 00010296
[    4.325008] RAX: 0000000000000000 RBX: ffff880010f78028 RCX: 0000000000000005
[    4.325048] RDX: ffffc90001c70040 RSI: ffffc90001c70040 RDI: 0000000000000000
[    4.325077] RBP: ffff880010f78078 R08: 0000000000000000 R09: 00000000ffffff90
[    4.325090] R10: 000000000000003f R11: 0000000000000000 R12: ffffffffc03467d0
[    4.325104] R13: 0000000000000002 R14: 0000000000000100 R15: ffff880010f78028
[    4.325127] FS:  0000000000000000(0000) GS:ffff880013a00000(0000) knlGS:0000000000000000
[    4.325142] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.325153] CR2: ffff80000078a800 CR3: 0000000010a58000 CR4: 0000000000042660
[    4.325175] Call Trace:
[    4.325195]  ath9k_enable_mib_counters+0x4a/0x80 [ath9k_hw]
[    4.325212]  ath9k_hw_init+0x632/0xb00 [ath9k_hw]
[    4.325226]  ? __queue_work+0x420/0x420
[    4.325241]  ath9k_init_device+0x5fb/0xdb0 [ath9k]
[    4.325256]  ? request_threaded_irq+0xfa/0x160
[    4.325272]  ath_pci_probe+0x20e/0x3d0 [ath9k]
[    4.325287]  local_pci_probe+0x3f/0x90
[    4.325297]  ? __schedule+0x3d3/0x850
[    4.325307]  work_for_cpu_fn+0x10/0x20
[    4.325318]  process_one_work+0x181/0x390
[    4.325328]  worker_thread+0x1d7/0x3c0
[    4.325337]  kthread+0xfc/0x130
[    4.325347]  ? process_one_work+0x390/0x390
[    4.325357]  ? kthread_create_on_node+0x70/0x70
[    4.325368]  ret_from_fork+0x35/0x40
[    4.325378] Code: 48 81 fe ff ff 03 00 48 89 f2 77 1f 48 81 fe 00 00 01 00 76 07 0f b7 d6 89 f8 ef c3 48 c7 c6 5c 8d 0d 82 48 89 d7 e9 95 fe ff ff <89> 3e c3 66 90 48 81 ff ff ff 03 00 77 28 48 81 ff 00 00 01 00 
[    4.325431] RIP: iowrite32+0x2b/0x30 RSP: ffffc900019d7cc0
[    4.325441] CR2: ffffc90001c70040
[    4.325452] ---[ end trace 4c9dd820b875aec9 ]---
[    4.325460] Kernel panic - not syncing: Fatal exception
[    4.325472] Kernel Offset: disabled
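
The PTE value and the Oops error code in this trace can be decoded by hand, which helps confirm what kind of fault it is. The Python sketch below uses the architectural x86-64 bit positions for the low PTE flags and the page-fault error code; the helper names `decode_pte` and `decode_fault_code` are hypothetical, and the sketch ignores software-defined bits such as bit 52.

```python
# Architectural x86-64 page-table-entry flag bits.
PTE_FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
    3: "write-through",
    4: "cache-disable",
    5: "accessed",
    6: "dirty",
    63: "no-execute",
}
ADDR_MASK = 0x000F_FFFF_FFFF_F000  # bits 12..51 hold the frame address


def decode_pte(pte: int) -> dict:
    """Split a PTE into its frame address and the names of set flags."""
    flags = [name for bit, name in PTE_FLAGS.items() if (pte >> bit) & 1]
    return {"frame": pte & ADDR_MASK, "flags": flags}


def decode_fault_code(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 1),  # 0 would mean not-present
        "write": bool(code & 2),
        "user_mode": bool(code & 4),
    }


pte = decode_pte(0x80100000F1500075)   # PTE value from the trace above
fault = decode_fault_code(0x0003)      # "Oops: 0003"
print(hex(pte["frame"]), pte["flags"])
print(fault)
```

Decoded this way, the entry is present but lacks the writable flag, and the error code describes a write to a present page, so the trace reads as a write to a read-only mapping. The faulting address ffffc90001c70040 ends in 0x040, the same page offset as in the HVM crash.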

@h01ger

h01ger commented Mar 10, 2018

I pointed @nbd168 at this and this is what he said:

I don't believe that the driver writes into wrong memory areas
I rather think that the pci ranges are not set up correctly
which is why legitimate accesses are blocked
but I know too little about pci to know what exactly happens there
but the register writes on addr < 0x1000 are definitely valid
who/what is setting up those pci ranges?
I think the BARs, which the pci driver reads from the config registers
so either the BARs are broken themselves, or they are interpreted differently

@awokd
Author

awokd commented Mar 10, 2018

https://lists.gt.net/xen/devel/439033?page=last

Is this BAR the same BAR which has the MSI-X table in? For safety, Xen
has to trap and emulate updates to the MSI/MSI-X configuration. It is
possible that that logic has gone wrong.

Looks like that thread might be from the same Stackoverflow poster. Seems like his MSI-X interrupts might have been disabled as well. Can I force them somewhere in Qubes? Maybe it's an upstream bug that only shows up with legacy interrupts, but I still don't get why my device and others' are falling back to using legacy ints under Qubes HVM but not Xen.

@marmarek
Member

MSI/MSI-X is broken in PV mode (#3217). But on Qubes HVM, MSI should work...
Relevant changes (possibly breaking MSI for PV) were part of XSA-237. But it was only about explicit enabling MSI/MSI-X by a hypercall, not direct config space write. The point about some trap on config space seems plausible.
There is possibly related code in Xen sources in arch/x86/hvm/vmsi.c, especially functions listed in msixtbl_mmio_ops structure.
I don't have that card plugged in anywhere right now to verify that hypothesis or collect more info. If you do, try collecting lspci -vv output before inserting the module. Also look at the address at which the write fails. If that matches the MSI address from the lspci output, that's probably it.

@awokd
Author

awokd commented Mar 14, 2018

Looking at the PCI bridge in front of the empty slot, it says the same thing under Xen and Qubes:

I/O behind bridge: 0000f000-00000fff [empty]
Memory behind bridge: fff00000-000fffff [empty]
Prefetchable memory behind bridge: fff00000-000fffff [empty]

Crash is at:

(XEN) svm.c:1540:d9v0 SVM violation gpa 0x000000f2020040, mfn 0xf0100, type 5
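
As a rough aid for comparing the faulting address with BAR and MSI addresses from lspci, the crash line can be split into its parts. This is a minimal Python sketch: `decode_violation` is a hypothetical helper, the 64K BAR size comes from the datasheet summary at the top of the thread, and the code assumes 4 KiB pages and a BAR naturally aligned to its size.

```python
PAGE_SHIFT = 12        # assumed 4 KiB pages
BAR_SIZE = 64 * 1024   # single 64K BAR, per the datasheet summary


def decode_violation(gpa: int, mfn: int) -> dict:
    """Break a Xen 'violation gpa ..., mfn ...' pair into addresses."""
    page_offset = gpa & ((1 << PAGE_SHIFT) - 1)
    return {
        # Guest-physical address of the start of the faulting page.
        "guest_page": gpa & ~((1 << PAGE_SHIFT) - 1),
        # Offset inside the BAR, valid only if the BAR is size-aligned.
        "bar_offset": gpa & (BAR_SIZE - 1),
        # Machine address: frame number shifted up, plus the page offset.
        "machine_addr": (mfn << PAGE_SHIFT) | page_offset,
    }


info = decode_violation(gpa=0x000000F2020040, mfn=0xF0100)
for key, value in info.items():
    print(f"{key}: {value:#x}")
```

The BAR offset comes out as 0x40, the MIB Control register mentioned in the original report, and the machine address sits 0x40 bytes into the f0100000 region address quoted earlier in this thread.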

A different PCI bridge reports Cap [a0] with an MSI address, but the one associated with 02:00.0 reports no MSI capabilities (at least without a module installed.) Oddly, when I put in a different (Express v2) module, it gets an MSI-X interrupt assigned inside the Qubes HVM, the bridge still reports no MSI capabilities, but the device works perfectly.

I'll keep digging, thank you for the suggestions!

@awokd
Author

awokd commented Feb 6, 2021

Ended up working around the issue by switching to a slightly newer model of Atheros. Suspect this older one has a draft implementation of PCIe which confuses Xen et al.

@awokd awokd closed this as completed Feb 6, 2021
@h4xor666

Ended up working around the issue by switching to a slightly newer model of Atheros. Suspect this older one has a draft implementation of PCIe which confuses Xen et al.

Can I ask which one you got? I'm having literally the exact same issue.

@awokd
Author

awokd commented Jul 11, 2021

Can I ask which one you got? I'm having literally the exact same issue.

AR5BHB116/AR9382
