
Disable direct IO on Btrfs based storage #9488

Closed
ImBearChild opened this issue Oct 2, 2024 · 17 comments
Labels
affects-4.2 This issue affects Qubes OS 4.2. C: storage diagnosed Technical diagnosis has been performed (see issue comments). P: major Priority: major. Between "default" and "critical" in severity. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@ImBearChild

How to file a helpful issue

Qubes OS release

Qubes Release 4.2.3 (R4.2)

Brief summary

Direct I/O should not be used on Btrfs, because it can lead to various problems.

It can cause CSUM errors, as described in the Proxmox documentation; there is also a related issue in the Fedora Bugzilla, as well as related discussions on the Btrfs mailing list.

Even if direct I/O does not cause errors, this setting also prevents the built-in compression functionality of Btrfs from working properly.

Steps to reproduce

Run sudo losetup --list on a Btrfs-based Qubes OS installation and observe that DIO is listed with a value of 1.

Expected behavior

Direct IO should be disabled.

Actual behavior

It's enabled on Btrfs-based installations.

Note

The commit states that this behavior has been reverted, but I assume direct I/O is still being used, since the not-script.c introduced in R4.2 uses LO_FLAGS_DIRECT_IO. (I'm unfamiliar with the Qubes OS code, so this might be incorrect.)

And maybe it leads to some regression on Btrfs, making qubes on Btrfs slow. I made this conjecture because the file "not-script.c" does not exist in R4.1. I'm not sure about this, because I can't simply disable DIO for testing purposes; if someone could tell me how to do that, I'd be happy to try.

@ImBearChild ImBearChild added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Oct 2, 2024
@andrewdavidwong andrewdavidwong added needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. C: storage affects-4.2 This issue affects Qubes OS 4.2. labels Oct 2, 2024
@marmarek
Member

marmarek commented Oct 2, 2024

@DemiMarie @rustybird ?

@rustybird

rustybird commented Oct 2, 2024

My understanding is that Btrfs used to be more prone to checksum errors with direct I/O mainly because direct I/O exercises substantially different parts of the Btrfs codebase that had some bugs. Since then more attention has been paid to the combination, bugs have been fixed, and it's pretty calm nowadays.

OTOH, besides bugs in the Btrfs codebase itself, these checksum errors also happen(ed?) when a "misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device, so I don't know if that's relevant?

Personally I haven't experienced any such corruption. I've been running Btrfs with direct I/O enabled on my main Qubes OS notebook since R4.1, hacked into the old block script.

And maybe it leads to some regression on Btrfs, making qubes on Btrfs slow.

It's plausible that direct I/O leads to different allocation patterns, which could be worse or even pathological in some cases, maybe linked to specific SSD hardware (handling fragmentation especially badly) or maybe not. But if that's a common problem to the point where it makes direct I/O a bad trade-off, shouldn't we have seen way more reports of such a regression with R4.2 from users who had R4.1 working fine with Btrfs?

Even if direct IO does not cause errors, this setting would also prevent the built-in compression functionality of Btrfs from working properly.

Huh, that's an important caveat (previously stated on the linux-btrfs mailing list) that I didn't have on my radar, because I never use compression.

Maybe the not-script should have a mechanism to disable direct I/O. That same mechanism could also be useful to override the 512 byte logical block size.

Or I guess for disabling direct I/O it could also automatically look at the file to see if it's hosted on Btrfs and has compression enabled?

@ImBearChild
Author

ImBearChild commented Oct 2, 2024

OTOH, besides bugs in the Btrfs codebase itself, these checksum errors also happen(ed?) when a "misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device, so I don't know if that's relevant?

Thank you for your hint. I did indeed encounter this issue. After some testing, it turned out that using the Xen PV Block driver in Windows qubes would cause a CSUM error, but Linux qubes and those not using the Xen PV Block driver wouldn't have such an issue. I have attached the error logs for reference:

dom0 kernel: BTRFS warning (device dm-1): csum failed root 256 ino 255232 off 4297732096 csum 0xc73039cd expected csum 0xaf05393c mirror 1
dom0 kernel: BTRFS warning (device dm-1): csum failed root 256 ino 255232 off 4297728000 csum 0xc73039cd expected csum 0xbea01aaf mirror 1
dom0 kernel: BTRFS error (device dm-1): bdev /dev/mapper/luks-fedc3ffe-b16b-45ec-b968-2a929b8f5a81 errs: wr 0, rd 0, flush 0, corrupt 268806, gen 0
dom0 kernel: BTRFS error (device dm-1): bdev /dev/mapper/luks-fedc3ffe-b16b-45ec-b968-2a929b8f5a81 errs: wr 0, rd 0, flush 0, corrupt 268807, gen 0
dom0 kernel: BTRFS warning (device dm-1): direct IO failed ino 255232 op 0x0 offset 0x1002a2000 len 4096 err no 10
dom0 kernel: BTRFS warning (device dm-1): direct IO failed ino 255232 op 0x0 offset 0x1002a3000 len 4096 err no 10
dom0 kernel: I/O error, dev loop28, sector 8393992 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 2

I've been running Btrfs with direct I/O enabled on my main Qubes OS notebook since R4.1, hacked into the old block script.

Cool! For comparison, can you share your Btrfs configuration? My Btrfs mount uses compress=zstd:1. Additionally, my dom0 shutdown time is longer than with an LVM configuration on the same hardware. Journalctl shows that filesystem synchronization takes a long time and systemd has to kill the corresponding processes in order to shut down. Maybe I should try disabling compression? Or would the autodefrag option help alleviate this problem?

Also, I want to know why the default behavior of Qubes OS is to use COW on disk image files while Fedora and openSUSE disable COW for libvirtd's disk images.

Maybe the not-script should have a mechanism to disable direct I/O. That same mechanism could also be useful to override the 512 byte logical block size.

Giving users more choices would indeed be better. Many virtual machine management programs support adjusting the cache mode, so it would be good to have this feature in Qubes OS.

@rustybird

rustybird commented Oct 2, 2024

"misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device

it turned out that using the Xen PV Block driver in Windows qubes would cause a CSUM error, but Linux qubes and those not using the Xen PV Block driver wouldn't have such an issue.

Looks like the untrusted qube side can cause checksum errors on the trusted dom0 side. I was hoping that the dom0 backend prevents this.

Can you easily reproduce the errors with a new Windows qube?

To compare, can you reveal your Btrfs configuration?

Sure: kernel-latest, default mount options plus noatime, and since R4.2 discard=async (the default on modern kernels, but the installer adds discard, which unfortunately now means discard=sync for Btrfs). Also since R4.2 I create the filesystem with --checksum=xxhash.

I don't use Windows qubes though.

Additionally, my dom0 shutdown time is longer than the LVM configuration on same hardware. Journalctl shows that file system synchronization took a long time and systemd had to kill corresponding processes in order for it to shut down.

Try shutting down the qubes individually and/or take a look at the qubesd journal to see which .img files are causing long delays and could benefit from manual defragmentation. (Definitely don't use autodefrag for file-reflink pools. On Btrfs, defragmentation duplicates data shared across snapshots/reflinks so it must be done selectively.)

Also, I want to know why the default behavior of Qubes OS is to use COW on disk image files while Fedora and openSUSE disable COW for libvirtd's disk images.

There's currently an inherent amount of COW required by the Qubes OS storage API: #8767

So if a nocow file is reflinked into another, Btrfs still has to do a COW operation whenever a block is then modified for the first time in one of the files. nocow only avoids subsequent COW operations after that first write per block - but also totally disables data checksums (protection against bit rot) :(

@DemiMarie

"misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device

it turned out that using the Xen PV Block driver in Windows qubes would cause a CSUM error, but Linux qubes and those not using the Xen PV Block driver wouldn't have such an issue.

Looks like the untrusted qube side can cause checksum errors on the trusted dom0 side. I was hoping that the dom0 backend prevents this.

Is this a security vulnerability from an upstream Xen PoV?

@ImBearChild
Author

Try shutting down the qubes individually and/or take a look at the qubesd journal to see which .img files are causing long delays and could benefit from manual defragmentation.

It seems that the qubesd shutdown process is relatively quick on my machine; from the first "Reflinked file" message to "qubesdb-daemon: terminating," it takes only 8 seconds. However, systemd complains about "Syncing filesystems and block devices" taking too long (30 seconds), which causes the syncing process to be killed. Manual defragmentation does not seem to affect this "syncing" issue, but it has helped shorten my qubesd shutdown time.

dom0 systemd-shutdown[1]: Syncing filesystems and block devices.
dom0 systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 17446.

Is this a security vulnerability from an upstream Xen PoV?

Maybe not that bad? As far as I know, a CSUM error will prevent further read access, making the qube unstable (but it does not damage the filesystem). However, after I shut down the Windows qube, btrfs scrub and btrfs check showed no errors.

Can you easily reproduce the errors with a new Windows qube?

Yes. Here are steps to reproduce:

  1. Prepare a new Windows qube
  2. Install the Xen PV block driver from the Qubes tools installer and reboot
  3. See Btrfs CSUM errors in dmesg.

@apparatius

I can reproduce this as well in a Windows 10 qube with testsigning on:

bcdedit.exe /set testsigning on

And latest xenbus and xenvbd drivers installed from here:
https://xenbits.xenproject.org/pvdrivers/win/

@rustybird

rustybird commented Oct 3, 2024

Looks like the untrusted qube side can cause checksum errors on the trusted dom0 side. I was hoping that the dom0 backend prevents this.

Is this a security vulnerability from an upstream Xen PoV?

It would allow a malicious VM to DoS any backup procedure that (like the Qubes OS one) fails at the first read error and then doesn't back up the rest of the system (e.g. remaining VMs).

Although the more alarming angle might be dom0 causing a VM to unintentionally corrupt its internal data. Even if that's not necessarily a security vulnerability.

@rustybird

rustybird commented Oct 3, 2024

IMHO we should indeed disable direct I/O in the not-script at the moment. @DemiMarie @marmarek

It's difficult to figure out who's technically wrong here in this stack of blkfront-on-Windows, blkback, loop devices, and Btrfs. Not to mention various manpages/docs that could all be more explicit about interactions with direct I/O.

But in practice, it looks like direct I/O can interfere with system backups, nullify compression, and break Windows VMs.

And it can't be ruled out that technically misused (by whatever part of the stack) direct I/O could cause real data corruption beyond a benign(?) checksum mismatch, especially considering that file-reflink can be hosted by XFS/ZFS/bcachefs too, or even any filesystem in degraded mode. There's also the still not quite dead legacy 'file' driver that's compatible with arbitrary filesystems. (Edit: does it use the not-script?)

@DemiMarie

@rustybird I can have the not-script check the filesystem type (via fstatfs(2)) and enable direct I/O only on ext4 or XFS (where it is a free performance win). There might be other filesystems (bcachefs?) where it is safe. ZFS users should be using the ZFS storage driver, which doesn’t have this issue.

@andrewdavidwong andrewdavidwong added P: major Priority: major. Between "default" and "critical" in severity. and removed P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Oct 4, 2024
@rustybird

Do we know that it is always safe for XFS and ext4, though? That Btrfs commit said it's simply application misbehavior to modify the buffer during a direct I/O write. (Which isn't explicitly mentioned for O_DIRECT in open(2), but it makes sense. Come to think of it, blkback modifying the buffer during a write doesn't seem great even without direct I/O, right?) As non-data-checksumming filesystems, they have one less reason to look at the buffer twice. But maybe they do in some circumstances, e.g. depending on journalling options.

It's silly but I like to treat filesystems as if they're developed by ruthless "language lawyers" who will see the slightest incorrect use as an opportunity to deploy an optimization destroying the maximum allowable amount of data. So now I'm a bit spooked about direct I/O until it's been diagnosed what exactly is going on here.

@DemiMarie

Modifying a buffer that is in-flight is a straightforward data race and undefined behavior. No application should ever do this. The only difference here is that when the junk data is read, BTRFS returns an error, while XFS/ext4 return garbage.

Theoretically, what BTRFS is doing is better. However, most applications don’t handle I/O errors well, so the BTRFS behavior is a persistent denial of service.

rustybird added a commit to rustybird/qubes-linux-utils that referenced this issue Oct 8, 2024
marmarek pushed a commit to QubesOS/qubes-linux-utils that referenced this issue Nov 6, 2024
Pending diagnosis of QubesOS/qubes-issues#9488

(cherry picked from commit 8cdd664)
@Atrate

Atrate commented Nov 24, 2024

Will linux-utils v4.3.5 be available for Qubes 4.2, or do we have to wait for 4.3 to be released? I think I just experienced disk corruption (recoverable, fortunately) due to the use of DIO, for the second time in the last month of Qubes usage.

@rustybird

The change was cherry-picked for the R4.2 branch three weeks ago, but the qubes-utils package version hasn't been bumped yet. @marmarek

I think I just experienced disk corruption (recoverable, fortunately) due to the use of DIO, for the second time in the last month of Qubes usage.

This happened with Windows VMs both times?

@Atrate

Atrate commented Nov 25, 2024

The change was cherry-picked for the R4.2 branch three weeks ago, but the qubes-utils package version hasn't been bumped yet. @marmarek

I think I just experienced disk corruption (recoverable, fortunately) due to the use of DIO, for the second time in the last month of Qubes usage.

This happened with Windows VMs both times?

Linux AppVMs. Technically I am not fully sure that this issue is caused by DIO (no meaningful logs), but the AppVMs are pretty standard and I have not made any significant modifications to Qubes other than installing it on BTRFS, so it could be due to that.

Also, possibly related: installing the Xen driver (through QWT) on Windows 10 causes so many r/w errors that the VM bluescreens within a couple of minutes.

@rustybird

rustybird commented Nov 26, 2024

I think I just experienced disk corruption [...]

This happened with Windows VMs both times?

Linux AppVMs.

You might want to keep an eye on this even after updating to the dom0 qubes-utils-4.2.18 package, which disables direct I/O. Maybe Btrfs correctly alerted you to a hardware problem. (Just recently I experienced something similar: A faulty power supply had damaged my mainboard, and Btrfs was the first to complain via checksum errors.)

installing the Xen driver (through QWT) on Windows 10 causes so many r/w errors that the VM bluescreens in a couple of minutes.

That part seems like it could be direct I/O related.

@andrewdavidwong andrewdavidwong removed the needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. label Nov 29, 2024
@andrewdavidwong andrewdavidwong added the diagnosed Technical diagnosis has been performed (see issue comments). label Nov 29, 2024
@Atrate

Atrate commented Dec 10, 2024

A week or two ago I updated by switching the dom0 repositories to testing; the issue seems to be fixed, and compression works. I'll test the Windows bugs once the new QWT is released. Any idea when I'll be able to set updates back to stable or security-testing without risking a downgrade?
