
Can’t mount XFS filesystem from a partition after upgrade to 4.2 #9293

Closed
duncancmt opened this issue Jun 8, 2024 · 20 comments · Fixed by QubesOS/qubes-linux-kernel#968
Labels
affects-4.2 (This issue affects Qubes OS 4.2.)
C: kernel
C: storage
C: Xen
diagnosed (Technical diagnosis has been performed; see issue comments.)
P: default (Priority: default. Default priority for new issues, to be replaced given sufficient information.)
pr submitted (A pull request has been submitted for this issue.)
r4.1-dom0-stable
r4.2-host-stable
r4.3-host-cur-test
T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)

Comments

@duncancmt

Qubes OS release

4.2.1

Brief summary

Prior to upgrading to 4.2, I was able to mount a partition inside a VM; now I cannot. This is a separate partition on the same NVMe drive as my Qubes BTRFS root. Surprisingly, the partition mounts just fine in dom0.

https://forum.qubes-os.org/t/cant-mount-xfs-filesystem-from-a-partition-after-upgrade-to-4-2/26809

Steps to reproduce

Create a partition, format it as XFS, and attempt to mount it in a VM.
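
Roughly, the commands were along these lines (the exact invocations appear in the comments below; "ethereum" is my VM and nvme0n1p4 the partition):

[user@dom0 ~]$ qvm-block attach ethereum dom0:nvme0n1p4
[user@ethereum ~]$ sudo mkfs.xfs -f -m bigtime=1 -m rmapbt=1 -m reflink=1 /dev/xvdi
[user@ethereum ~]$ sudo mount -t xfs /dev/xvdi /mnt/mountpoint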

Expected behavior

It should mount

Actual behavior

Error thrown

$ sudo mount -t xfs /dev/xvdi /mnt/mountpoint
mount: /mnt/mountpoint: can't read superblock on /dev/xvdi.
       dmesg(1) may have more information after failed mount system call.
$ sudo dmesg
<snip>
[ 1825.885414] blkfront: xvdi: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled; bounce buffer: enabled
[ 1837.175245] SGI XFS with ACLs, security attributes, realtime, scrub, quota, no debug enabled
[ 1837.176949] I/O error, dev xvdi, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[ 1837.176996] XFS (xvdi): SB validate failed with error -5.
@duncancmt added the P: default and T: bug labels on Jun 8, 2024
@andrewdavidwong added the needs diagnosis (Requires technical diagnosis from developer; replace with "diagnosed" or remove if otherwise closed), C: storage, and affects-4.2 labels on Jun 10, 2024
@rustybird

rustybird commented Jun 11, 2024

This is a separate partition on the same NVMe drive as my Qubes BTRFS root.

Using the same setup, I haven't been able to reproduce this problem.

Steps to reproduce

Create partition. Format as XFS.

Can you post the exact dom0 commands? The qvm-block attach command too.

[ 1837.176949] I/O error, dev xvdi, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0

Odd. Are you by any chance involving dm-integrity somehow, especially cryptsetup luksFormat --integrity-no-wipe?
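One way to check, if you're not sure, would be something like this in dom0:

[user@dom0 ~]$ sudo dmsetup table | grep integrity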

@duncancmt
Author

I apologize for the glibness of the steps to reproduce. I partitioned this drive quite a while ago, so I don't have the exact commands I ran. It was something along the lines of using gdisk to partition in dom0 and:

$ sudo mkfs.xfs -f -m bigtime=1 -m rmapbt=1 -m reflink=1 /dev/xvdi

to format the partition inside the VM. This is the same SSD that holds dom0's root and boot (root is btrfs, encrypted of course; in this case LUKS2 with xchacha12,aes-adiantum-plain64, no integrity), but the badly-behaved partition is just that: a partition. It's not a btrfs subvolume.


As for the commands I use to attempt to mount the partition:

dom0

$ qvm-block attach ethereum dom0:nvme0n1p4
$ echo $?
0

ethereum

$ sudo mount -m /dev/xvdi /mnt/ethereum
mount: /mnt/ethereum: can't read superblock on /dev/xvdi.
       dmesg(1) may have more information after failed mount system call.
$ sudo dmesg
<snip>
[   44.598154] SGI XFS with ACLs, security attributes, realtime, scrub, quota, no debug enabled
[   44.602523] I/O error, dev xvdi, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[   44.602582] XFS (xvdi): SB validate failed with error -5.

Then I can detach the volume from the ethereum VM, shut it down, and back in dom0 run:

$ sudo mount -m /dev/nvme0n1p4 /mnt/ethereum
$ echo $?
0
$ ls /mnt/ethereum
<some files>

Odd. Are you by any chance involving dm-integrity somehow, especially cryptsetup luksFormat --integrity-no-wipe?

As implied by the commands above, this is just a bare partition that has been formatted XFS. No RAID. No LUKS. No integrity.

And for the record, I get the same behavior on Fedora 38, Fedora 39, and Debian 12.

@rustybird

Are there any dom0 kernel messages while you are attaching the partition and attempting to mount it inside the VM?
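
For example, watching something like this in dom0 while doing the attach and mount:

[user@dom0 ~]$ sudo dmesg --follow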

I also wonder if it's possible to take XFS out of the equation: if you attach the partition and run something like sudo head -c 100M /dev/xvdi | sha256sum, does it result in the same read error?
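
Something like this, comparing against the same read in dom0 (using your device and VM names from the earlier comment):

[user@ethereum ~]$ sudo head -c 100M /dev/xvdi | sha256sum
[user@dom0 ~]$ sudo head -c 100M /dev/nvme0n1p4 | sha256sum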

@duncancmt
Author

duncancmt commented Jun 15, 2024

$ sudo dd if=/dev/xvdi of=/dev/null bs=1 count=1M status=progress
1048576+0 records in
1048576+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.440201 s, 2.4 MB/s
$ echo $?
0

dmesg shows nothing of note

$ sudo mount -m /dev/xvdi /mnt/ethereum
mount: /mnt/ethereum: can't read superblock on /dev/xvdi.
       dmesg(1) may have more information after failed mount system call.
$ sudo dmesg
<snip>
[   68.055045] SGI XFS with ACLs, security attributes, realtime, scrub, quota, no debug enabled
[   68.057308] I/O error, dev xvdi, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[   68.057333] XFS (xvdi): SB validate failed with error -5.
$ sudo head -c 100M /dev/xvdi | sha256sum
REDACTED  -
$ echo $?
0

So it appears to be XFS-specific.

And for what it's worth, sha256sum-ing the first 100M of the partition in dom0 returns the same hash.

@duncancmt
Author

The only log line that appears in dom0 dmesg during the process is this (sorry, hand-copied):

[206838.262156] xen-blkback: backend/vbd/15/51840: using 4 queues, protocol 1 (x86_64-abi) persistent grants

which seems uninteresting

@rustybird

This is so intriguing! I'm out of ideas though :(

Maybe try the linux-xfs mailing list?

@rustybird

Apparently I'm now experiencing the issue myself: a read error at sector 0 that happens only when I attempt to mount the attached block device, but not otherwise. However, my block device contains an ext4 filesystem instead of XFS.

This has occurred almost all the time with kernel-latest-qubes-vm (6.9.4 and 6.9.2). I haven't been able to reproduce it with kernel-qubes-vm (6.6.33) so far.

@rustybird

rustybird commented Jun 24, 2024

My source device in dom0 is a loop device with a 4096-byte logical+physical block size, which in the failing case is attached to the VM with 512-byte logical+physical blocks. Can you try this in your setup (substituting your NVMe device for my loop12 device), @duncancmt?

[user@dom0 ~]$ head /sys/block/loop12/queue/*_block_size
==> /sys/block/loop12/queue/logical_block_size <==
4096

==> /sys/block/loop12/queue/physical_block_size <==
4096

[user@dom0 bin]$ qvm-run -p the-vm 'head /sys/block/xvdi/queue/*_block_size'
==> /sys/block/xvdi/queue/logical_block_size <==
512

==> /sys/block/xvdi/queue/physical_block_size <==
512

@rustybird

rustybird commented Jun 24, 2024

read error at sector 0 happening only when I attempt to mount the attached block device, but not otherwise

The discrepancy is due to the read during the mount attempt happening with direct I/O turned on. I also get the error for dd if=/dev/xvdi of=/dev/null count=1 with iflag=direct, but not without it.
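
That is, roughly:

$ sudo dd if=/dev/xvdi of=/dev/null count=1                 # buffered read of sector 0 succeeds
$ sudo dd if=/dev/xvdi of=/dev/null count=1 iflag=direct    # direct read of sector 0 fails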

@marmarek
Member

Kernel regression then?

@rustybird

Yeah, the last good one appears to be 6.8.8-1. Unfortunately kernel-latest 6.9.2-1 is already in stable :(

@duncancmt
Author

duncancmt commented Jun 24, 2024

Can you try this in your setup?

[user@dom0 ~]$ head /sys/block/nvme0n1/queue/*_block_size
==> /sys/block/nvme0n1/queue/logical_block_size <==
4096

==> /sys/block/nvme0n1/queue/physical_block_size <==
4096

[user@dom0 ~]$ qvm-run -p ethereum 'head /sys/block/xvdi/queue/*_block_size'
==> /sys/block/xvdi/queue/logical_block_size <==
4096

==> /sys/block/xvdi/queue/physical_block_size <==
4096

so that's different 🤔

I also get it for dd if=/dev/xvdi of=/dev/null count=1 with vs. without iflag=direct

[user@ethereum ~]$ sudo dd if=/dev/xvdi of=/dev/null bs=1 count=1 status=progress
1+0 records in
1+0 records out
1 byte copied, 0.0119412 s, 0.1 kB/s
[user@ethereum ~]$ sudo dd if=/dev/xvdi of=/dev/null bs=1 count=1 status=progress iflag=direct
/usr/bin/dd: error reading '/dev/xvdi': Invalid argument
0+0 records in
0+0 records out
0 bytes copied, 2.7803e-05 s, 0.0 kB/s

so that's different as well

EDIT: I get the same error in dom0 with iflag=direct; setting bs=4096 makes dd happy, but bs=512 does not
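
In other words, roughly:

[user@dom0 ~]$ sudo dd if=/dev/nvme0n1p4 of=/dev/null bs=512 count=1 iflag=direct    # fails with the same "Invalid argument" error
[user@dom0 ~]$ sudo dd if=/dev/nvme0n1p4 of=/dev/null bs=4096 count=1 iflag=direct   # works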

@duncancmt
Author

Oh wait. I got ahead of myself and downgraded the troublesome VM to 6.8.8-1. With the VM on 6.9.2-1, I get:


[user@dom0 ~]$ qvm-run -p ethereum 'head /sys/block/xvdi/queue/*_block_size'
==> /sys/block/xvdi/queue/logical_block_size <==
512

==> /sys/block/xvdi/queue/physical_block_size <==
512
[user@ethereum ~]$ sudo dd if=/dev/xvdi of=/dev/null count=1 status=progress
1+0 records in
1+0 records out
512 bytes copied, 0.00148875 s, 344 kB/s
[user@ethereum ~]$ sudo dd if=/dev/xvdi of=/dev/null count=1 status=progress iflag=direct
/usr/bin/dd: error reading '/dev/xvdi': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.000170858 s, 0.0 kB/s
[user@ethereum ~]$ sudo dd if=/dev/xvdi of=/dev/null bs=4096 count=1 status=progress iflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.011916 s, 344 kB/s

@marmarek
Member

marmarek commented Jun 24, 2024 via email

@rustybird

rustybird commented Jun 24, 2024

Oh wait. I got ahead of myself and downgraded the troublesome VM to 6.8.8-1. With the vm on 6.9.2-1, I get:

Thank God 😆

There were 5 commits to xen-blkfront.c in February that all landed in kernel 6.9. The last one has logical/physical block size stuff in the diff, although the first one is already related to queue limits.
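
For reference, they can be listed from a kernel source checkout with something like:

$ git log --oneline v6.8..v6.9 -- drivers/block/xen-blkfront.c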

@marmarek
Member

Thanks @rustybird! I've forwarded the info to the relevant maintainers: https://lore.kernel.org/xen-devel/Znl5FYI9CC37jJLX@mail-itl/T/#u

@andrewdavidwong added the diagnosed, waiting for upstream (This issue is waiting for something from an upstream project to arrive in Qubes; remove when closed), and C: Xen labels and removed the needs diagnosis label on Jun 24, 2024
@marmarek
Member

@rustybird @duncancmt the above-linked thread already has a proposed fix from Christoph Hellwig; care to try it (and preferably report back in the email thread)?

@rustybird

rustybird commented Jun 24, 2024

Welp, I can't even get builderv2 to download the repo (there's a different number of missing bytes every time):

15:35:15,176 [executor:local:/home/user/tmp/138233767774496c9ed3ed0/builder] output: args: (128, ['git', 'clone', '-n', '-q', '-b', 'main', 'https://github.com/QubesOS/qubes-linux-kernel', '/home/user/tmp/138233767774496c9ed3ed0/builder/linux-kernel-latest'])
15:35:15,177 [executor:local:/home/user/tmp/138233767774496c9ed3ed0/builder] output: stdout: b''
15:35:15,177 [executor:local:/home/user/tmp/138233767774496c9ed3ed0/builder] output: stderr: b'error: 3213 bytes of body are still expected\nfetch-pack: unexpected disconnect while reading sideband packet\nfatal: early EOF\nfatal: fetch-pack: invalid index-pack output\n'

Maybe I can just build the xen_blkfront module manually somehow.
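
Maybe something like this from a kernel source tree with a matching .config (just a guess at the rough shape):

$ make olddefconfig && make modules_prepare
$ make drivers/block/xen-blkfront.ko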

Edit: Managed to download the repo and everything; it's building.

Edit 2: The patch works: https://lore.kernel.org/xen-devel/Znndj9W_bCsFTxkz@mutt/

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-6.9.7-1.qubes.fc32) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@andrewdavidwong added the pr submitted label and removed the waiting for upstream label on Jul 2, 2024
@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-6.9.7-1.qubes.fc32) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update
