
Today's update to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a (almost identical to current 2.3.0 RC) caused permanent pool corruption #16631

Rudd-O opened this issue Oct 9, 2024 · 42 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

@Rudd-O
Contributor

Rudd-O commented Oct 9, 2024

One of my machines (FC40) received two updates simultaneously earlier today:

  • the kernel, from kernel-6.8.11-300.fc40.x86_64 to kernel-6.10.12-200.fc40.x86_64, and
  • ZFS (DKMS + Dracut, master lineage), from commit 02c5aa9 to ca0141f.

The pool was healthy, in use, and recently scrubbed multiple times. There were no errors anywhere, neither in the kernel log nor in the journal.

Mere minutes after I rebooted to the new kernel and ZFS, my Prometheus setup alerted me to 30 checksum errors, several write errors, and 4 data errors. Upon inspection:

[root@penny ~]# zpool status -v
  pool: chest
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
...abbreviated...

	NAME                                                                                                STATE     READ WRITE CKSUM
...abbreviated...

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x16>
        <metadata>:<0x3c>
        <metadata>:<0x44>
        <metadata>:<0x594>

The kernel also didn't show any hardware errors in the kernel ring buffer.

I rebooted to the older kernel and ZFS module, and started a scrub. It's still ongoing, but it has not found any problem nor produced any WRITE or CKSUM errors.

Interestingly, neither the cache nor the ZIL devices had that error. The boot drive also seemed unaffected.

This, to me, indicates a software issue, probably related to the LUKS write path (we've had issues there before) or to mirroring (only the pool with the mirrored drives was hit; the single-drive boot pool, otherwise configured the same, was not).

The affected LUKS2 devices are all whole-disk formatted with a 4K sector size, and my pool is ashift=12. (The unaffected root pool, and the cache and ZIL devices of the affected pool, are not formatted with a 4K LUKS sector size.)

In the interest of seeing if it makes a difference, the affected LUKS devices are tuned with the following persistent flags:

Flags:       	allow-discards same-cpu-crypt submit-from-crypt-cpus no-read-workqueue no-write-workqueue 
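
(For anyone wanting to reproduce this tuning: flags like these can be set persistently with cryptsetup roughly as follows. The mapping name and device path below are placeholders, not my actual ones.)

# Example only: record the same performance flags in the LUKS2 header of an
# already-open mapping so they apply on every activation.
cryptsetup refresh luks-chest-a \
    --allow-discards \
    --perf-same_cpu_crypt \
    --perf-submit_from_crypt_cpus \
    --perf-no_read_workqueue \
    --perf-no_write_workqueue \
    --persistent

# Verify the flags were stored in the header:
cryptsetup luksDump /dev/sdX | grep Flags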

The pool has the following properties and feature flags:

NAME   PROPERTY                       VALUE                          SOURCE
chest  size                           10.9T                          -
chest  capacity                       67%                            -
chest  altroot                        -                              default
chest  health                         ONLINE                         -
chest  guid                           2537396116593781450            -
chest  version                        -                              default
chest  bootfs                         -                              default
chest  delegation                     on                             default
chest  autoreplace                    off                            default
chest  cachefile                      -                              default
chest  failmode                       wait                           default
chest  listsnapshots                  off                            default
chest  autoexpand                     off                            default
chest  dedupratio                     1.00x                          -
chest  free                           3.55T                          -
chest  allocated                      7.34T                          -
chest  readonly                       off                            -
chest  ashift                         12                             local
chest  comment                        -                              default
chest  expandsize                     -                              -
chest  freeing                        0                              -
chest  fragmentation                  9%                             -
chest  leaked                         0                              -
chest  multihost                      off                            default
chest  checkpoint                     -                              -
chest  load_guid                      16604087848420727134           -
chest  autotrim                       off                            default
chest  compatibility                  off                            default
chest  bcloneused                     0                              -
chest  bclonesaved                    0                              -
chest  bcloneratio                    1.00x                          -
chest  feature@async_destroy          enabled                        local
chest  feature@empty_bpobj            active                         local
chest  feature@lz4_compress           active                         local
chest  feature@multi_vdev_crash_dump  enabled                        local
chest  feature@spacemap_histogram     active                         local
chest  feature@enabled_txg            active                         local
chest  feature@hole_birth             active                         local
chest  feature@extensible_dataset     active                         local
chest  feature@embedded_data          active                         local
chest  feature@bookmarks              enabled                        local
chest  feature@filesystem_limits      enabled                        local
chest  feature@large_blocks           enabled                        local
chest  feature@large_dnode            enabled                        local
chest  feature@sha512                 enabled                        local
chest  feature@skein                  enabled                        local
chest  feature@edonr                  enabled                        local
chest  feature@userobj_accounting     active                         local
chest  feature@encryption             enabled                        local
chest  feature@project_quota          active                         local
chest  feature@device_removal         enabled                        local
chest  feature@obsolete_counts        enabled                        local
chest  feature@zpool_checkpoint       enabled                        local
chest  feature@spacemap_v2            active                         local
chest  feature@allocation_classes     enabled                        local
chest  feature@resilver_defer         enabled                        local
chest  feature@bookmark_v2            enabled                        local
chest  feature@redaction_bookmarks    enabled                        local
chest  feature@redacted_datasets      enabled                        local
chest  feature@bookmark_written       enabled                        local
chest  feature@log_spacemap           active                         local
chest  feature@livelist               enabled                        local
chest  feature@device_rebuild         enabled                        local
chest  feature@zstd_compress          enabled                        local
chest  feature@draid                  enabled                        local
chest  feature@zilsaxattr             disabled                       local
chest  feature@head_errlog            disabled                       local
chest  feature@blake3                 disabled                       local
chest  feature@block_cloning          disabled                       local
chest  feature@vdev_zaps_v2           disabled                       local
chest  feature@redaction_list_spill   disabled                       local
chest  feature@raidz_expansion        disabled                       local

This is where my pool currently sits:

[root@penny ~]# zpool status
  pool: chest
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Oct  9 22:37:54 2024
	900G / 7.35T scanned at 950M/s, 136G / 7.35T issued at 143M/s
	0B repaired, 1.80% done, 14:39:54 to go
config:

	NAME                                                                                                STATE     READ WRITE CKSUM
	chest                                                                                               ONLINE       0     0     0
	  mirror-0                                                                                          ONLINE       0     0     0
	    dm-uuid-CRYPT-LUKS2-sda  ONLINE       0     0     0
	    dm-uuid-CRYPT-LUKS2-sdb  ONLINE       0     0     0
	  mirror-3                                                                                          ONLINE       0     0     0
	    dm-uuid-CRYPT-LUKS2-sdc  ONLINE       0     0     0
	    dm-uuid-CRYPT-LUKS2-sdd  ONLINE       0     0     0
	logs	
	  dm-uuid-CRYPT-LUKS2-sde    ONLINE       0     0     0
	cache
	  dm-uuid-CRYPT-LUKS2-sdf    ONLINE       0     0     0

errors: 4 data errors, use '-v' for a list

Update: got good news! After reverting to the prior kernel + commit mentioned above, I am very happy to report that the scrub found no errors, and the data errors listed previously simply disappeared. So not a single bit of data loss!

The less good news: this strongly indicates that, under these conditions, there is a software defect in OpenZFS.

@Rudd-O Rudd-O added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 9, 2024
@robn
Member

robn commented Oct 9, 2024

I see you've listed pool status for pool chest but pool props for pool penny. Just making sure you've got the right copypasta here.

Next time it happens, it'd be good to see any errors in the kernel log, that is, the zio ... lines, plus any time-correlated entries from drivers etc. Also any relevant entries from zpool events -v (these are lost on reboot, so grab those first).
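
Something along these lines, run before rebooting, should capture it all (the file paths are just examples):

# Save the ZFS event log (cleared on reboot) plus any zio/driver messages.
zpool events -v > /root/zpool-events-$(date +%F).txt
dmesg | grep -iE 'zio|i/o error|dm-' > /root/dmesg-zio-$(date +%F).txt
journalctl -k -b > /root/journal-kernel-$(date +%F).txt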

@Rudd-O
Contributor Author

Rudd-O commented Oct 11, 2024

Good catch, let me fix the comment above. In a minute.

@Rudd-O
Contributor Author

Rudd-O commented Oct 11, 2024

Updated.

@behlendorf
Contributor

Any chance you have a test system available which could be used to help us bisect where this was introduced? I'd suggest starting by testing with ca0141f and the kernel-6.8.11-300.fc40.x86_64 kernel.

@Rudd-O
Contributor Author

Rudd-O commented Oct 12, 2024

Hm! I have a laptop with ZFS atop LUKS and I think it has mirroring set up. It was one of the systems I was ready to upgrade, although that kernel might not boot on Xen (which is what the laptop boots first). Let me get back to you soonish with more info.

@thulle

thulle commented Oct 12, 2024

Tried rc1 with the proposed rc2 patchset on root-on-ZFS-on-LUKS with kernel 6.10.
I got 800 write errors, with lines like this in the log:
zio pool=kc3000 vdev=/dev/mapper/root error=5 type=2 offset=25950006681088 size=4096 flags=317825
The SSD presents 4K sectors and LUKS2 is also using 4K sectors. This looks very much like #14533, with discussion ongoing in #15646, except that it's now failing on metadata writes too.

edit: after rebooting with zfs-2.2.6 and running a scrub, all errors disappeared, so I guess there was at least one working copy of all metadata.

edit 2: after another restart even 2.2.6 generates write errors, the scrub ends in 2 seconds, and there's a permanent error in kc3000/ROOT/gentoo:<0xfffffffffffffffe>.
Seems like something actually broke permanently.

@ochilan

ochilan commented Oct 12, 2024

I was able to reproduce the write errors pretty easily on Arch Linux with kernels 6.6.56 (LTS) and 6.11.3 with zfs 2.3.0 rc1 (commit 3a9fca9) even on a single, non-mirrored LUKS device. I wasn't able to reproduce it without LUKS so far. I used default settings for everything (cryptsetup, zpool) with a freshly created pool; the LUKS device automatically decided to use 4096 byte blocks. I then rsynced my home directory to the pool and observed "zpool status" where write errors started to appear after a few seconds.
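
For reference, the workload side of that reproduction is roughly this (pool name and paths are placeholders):

# Generate steady writes onto the LUKS-backed pool...
rsync -a "$HOME/" /testpool/home-copy/ &
# ...and, in another terminal, watch the error counters:
watch -n 5 zpool status -v testpool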

@ochilan

ochilan commented Oct 13, 2024

So, I tried to find a commit which worked at some point on the master branch, but I haven't been able to find a good one so far. The 2.2 branch does work for me, but it has diverged a lot from master in both directions (ahead/behind).

I have a question: Did the fixes that were performed on 2.2 for issues like #15533 or #15646 find their way back into master and thus into 2.3? If so, I would be interested in a pointer which commit to start from on master to try finding a future commit that breaks, relative to that point in time.

@behlendorf
Contributor

behlendorf commented Oct 15, 2024

> Did the fixes that were performed on 2.2 for issues like #15533 or #15646 find their way back into master and thus into 2.3?

Yes, all of the work for these issues was originally done in master and then backported to the 2.2 releases. So both 2.2 and 2.3 include these changes. Where things do differ is that 2.2 releases default to using the classic disk IO code, while 2.3 defaults to the new method implemented in #15588. Would you mind testing with both the zfs_vdev_disk_classic=1 (2.2 default) and zfs_vdev_disk_classic=0 (2.3 default) module option settings? That may help us narrow it down.

     zfs_vdev_disk_classic=0|1 (uint)
             If set to 1, OpenZFS will submit IO to Linux using the method it used
             in 2.2 and earlier.  This "classic" method has known issues with highly
             fragmented IO requests and is slower on many workloads, but it has been
             in use for many years and is known to be very stable.  If you set this
             parameter, please also open a bug report why you did so, including the
             workload involved and any error messages.

             This parameter and the classic submission method will be removed once
             we have total confidence in the new method.

             This parameter only applies on Linux, and can only be set at module
             load time.

@ochilan

ochilan commented Oct 15, 2024

Thank you for your reply. I tested some combinations with zfs 2.3.0 rc1, and zfs_vdev_disk_classic does indeed make a difference in the number of errors. Also, I was only able to get errors when using LUKS with 4k sectors, not with 512 bytes:

kernel    classic  sector size  errors
6.6.56    0        4k           low
6.6.56    1        4k           high
6.6.56    0        512          none
6.6.56    1        512          none
6.1.122   0        4k           low

where "low" means single-digit write errors in about 30 seconds and "high" means thousands of errors.

@Rudd-O
Contributor Author

Rudd-O commented Oct 15, 2024

Yes, it's only with 4K-formatted LUKS devices. I could not repro with 512-byte LUKS devices. How curious.

With newer LUKS tooling and OS installers defaulting to LUKS2 volumes, and to LUKS sector sizes matching Advanced Format disks (4K minimum), this is going to start biting future ZFS users hard (as well as current ZFS users who buy new drives).

@ochilan

ochilan commented Oct 15, 2024

I re-tested with zfs 2.2.6 quickly just to make sure, and I'm not able to reproduce the issue there with 4K sectors (on a freshly created pool, as with all tests I did), neither with vdev_disk_classic 1 nor with 0.

@mabod

mabod commented Oct 15, 2024

I tested 2.3.0-rc2 with kernel 6.11.3 today on EndeavourOS and got data corruption too. All devices in my pools are LUKS2 devices (sector: 4096 [bytes]). After a few rsync and zfs send/receive operations I got this message from zed:

ZFS has finished a resilver:

   eid: 406
 class: resilver_finish
  host: rakete
  time: 2024-10-15 14:13:52+0200
  pool: zstore
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 16K in 00:00:11 with 0 errors on Tue Oct 15 14:13:52 2024
config:

	NAME                                     STATE     READ WRITE CKSUM
	zstore                                   ONLINE       0     0     0
	  mirror-0                               ONLINE       0     8     0
	    WD40EFRX-68N32N0-WCC7K4JKN36U        ONLINE       0     9     1
	    WD40EFRX-68N32N0-WCC7K6YX6SAA        ONLINE       0     8     0
	  mirror-1                               ONLINE       0     4     0
	    WD40EFRX-68N32N0-WCC7K2YAJA69        ONLINE       0     8     3
	    WD40EFRX-68N32N0-WCC7K6XALFKX        ONLINE       0     4     0
	cache
	  ata-Verbatim_Vi550_S3_493504108895630  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

And in the journal I see plenty of this:

Okt 15 13:38:15 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3449457299456 size=4096 flags=3145856
Okt 15 13:38:15 rakete zed[19356]: eid=44 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K2YAJA69 size=4096 offset=3449457299456 priority=3 err=5 flags=0x300080 bookmark=36123:267484:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6YX6SAA error=5 type=2 offset=3185924694016 size=4096 flags=3145856
Okt 15 13:38:21 rakete zed[19377]: eid=45 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K6YX6SAA size=4096 offset=3185924694016 priority=3 err=5 flags=0x300080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K4JKN36U error=5 type=2 offset=3185924694016 size=4096 flags=3145856
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6YX6SAA error=5 type=2 offset=3185924694016 size=4096 flags=3178625
Okt 15 13:38:21 rakete zed[19382]: eid=46 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K4JKN36U size=4096 offset=3185924694016 priority=3 err=5 flags=0x300080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete zed[19383]: eid=47 class=io pool='zstore' size=4096 offset=3185920499712 priority=3 err=5 flags=0x204080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K4JKN36U error=5 type=2 offset=3185924694016 size=4096 flags=3178625
Okt 15 13:38:21 rakete zed[19387]: eid=48 class=data pool='zstore' priority=3 err=5 flags=0xc001 bookmark=36123:284249:0:0

Also affected is my external backup pool, which is a RAID1 hosted in a QNAP external USB enclosure.

@mabod

mabod commented Oct 16, 2024

I did another test today and could reproduce the data corruption immediately. Interestingly, it only happened on the RAID10 pool "zstore", while my tests with the single-drive pool "ztank" finished without errors. Same as in my previous post: the other affected pool mentioned there ('zsea') is also a RAID1.

I have captured zevents before reboot.
zpool-events-v-2.3.0.txt
In it you see all my pools (zM2, zHome, zstore and ztank). Only zstore and ztank were tested with write operations (zfs send from zM2 to zstore or ztank).

Outcome was:

  pool: zstore
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 3.37G in 00:00:39 with 0 errors on Wed Oct 16 07:59:05 2024
config:

	NAME                                     STATE     READ WRITE CKSUM
	zstore                                   DEGRADED     0     0     0
	  mirror-0                               DEGRADED     0    14     0
	    WD40EFRX-68N32N0-WCC7K4JKN36U        ONLINE       0    16     2
	    WD40EFRX-68N32N0-WCC7K6YX6SAA        DEGRADED     0    15     0  too many errors
	  mirror-1                               DEGRADED     0     9     0
	    WD40EFRX-68N32N0-WCC7K2YAJA69        DEGRADED     0    22    13  too many errors
	    WD40EFRX-68N32N0-WCC7K6XALFKX        ONLINE       0    23     0
	cache
	  ata-Verbatim_Vi550_S3_493504108895630  ONLINE       0     0     0

errors: No known data errors

The journal is full of zio errors:

...
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3646354886656 size=4096 flags=3145856
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6XALFKX error=5 type=2 offset=3646354886656 size=4096 flags=3145856
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6XALFKX error=5 type=2 offset=3646354886656 size=4096 flags=3178625
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3646354886656 size=4096 flags=3178625
...

@mabod

mabod commented Oct 16, 2024

> Yes, it's only with 4K-formatted LUKS devices. I could not repro with 512-byte LUKS devices. How curious.
>
> With newer LUKS tooling and OS installers defaulting to LUKS2 volumes, and to LUKS sector sizes matching Advanced Format disks (4K minimum), this is going to start biting future ZFS users hard (as well as current ZFS users who buy new drives).

Could you reproduce it on a single-disk pool, or is it all mirrors on your side?

@behlendorf
Contributor

> I re-tested with zfs 2.2.6 quickly just to make sure, and I'm not able to reproduce the issue there with 4K sectors (on a freshly created pool, as with all tests I did), neither with vdev_disk_classic 1 nor with 0.

That's helpful to know and actually a bit surprising. Could someone use git bisect to determine where this was introduced on the master branch? I'd suggest starting with commit c6be6ce, which introduced the new vdev disk code.
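
A rough sketch of that bisect workflow, for anyone picking it up (the endpoints and the reproducer invocation are placeholders; substitute whatever known-good/known-bad commits and test script you have):

# Mark the endpoints; git then checks out candidate commits one at a time.
git bisect start
git bisect bad master                  # current master shows the write errors
git bisect good <known-good-commit>    # e.g. a point before the suspect change

# For each candidate: rebuild, reload the freshly built modules, run a
# reproducer (such as the script below), then report the result to git.
./autogen.sh && ./configure && make -j"$(nproc)"
# ...reload modules and run the reproducer here...
git bisect good    # or: git bisect bad
# Repeat until git names the first bad commit, then: git bisect reset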

@tonyhutter
Contributor

I've been able to reproduce 100% on Fedora 40 with master (38a04f0) using this:

#!/bin/bash
# We assume this script is being run from inside a ZFS source dir

VDEV=/tmp/file

truncate -s 2G $VDEV

PASS="fdsjfosdijfsdkjsldfjdlk"

echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile

sudo ./zpool create tank /dev/mapper/tankfile

# Copy ZFS source to /tank to generate writes
sudo cp -a . /tank

sudo ./zpool status -v

sudo ./zpool destroy tank
sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV
Running it:

~/zfs$ sudo ./reproduce.sh 
setting up LUKS
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  tankfile  ONLINE       0     8     0

errors: Permanent errors have been detected in the following files:

        /tank/module/icp/api/kcf_cipher.c
        /tank/man/man8/zfs-snapshot.8
        /tank/module/Makefile.in
        /tank/module/icp/asm-x86_64/aes/aes_aesni.o
        /tank/module/icp/asm-x86_64/aes/libicp_la-aes_aesni.o
        /tank/lib/libzpool/Makefile.am
        /tank/include/os/linux/spl/sys/cred.h

NOTE: I was able to reproduce these errors with zfs_vdev_disk_classic set to either 0 or 1!

Next step is to bisect it down.

@ochilan

ochilan commented Oct 16, 2024

I just tried with c6be6ce and get pretty much the same behavior. With vdev_disk_classic=0, the error counts appear to be even slightly higher compared to 2.3.0-rc1 for me, but that could also be randomness. I don't know if I'm doing something fundamentally wrong; it would be nice if someone else could verify.

Having said that: in the discussion around issue #15533, it was mentioned that reverting commit bd7a02c would be a possible solution, as would the "rewrite of the BIO filling machinery". Taking a look at zio.c, do I see it correctly that the mentioned commit was, and still is, reverted in the 2.2 branch, but not in master? I have not tried reverting that commit on master or 2.3.0-rc yet, but maybe that would be interesting.

Edit: as a matter of fact, @Rudd-O already pointed out in the linked issue back in January that the revert is not in master.

@tonyhutter
Contributor

tonyhutter commented Oct 16, 2024

I just tried my reproducer on AlmaLinux 9 and was able to hit the errors, but it took a few more runs than usual.

@mabod

mabod commented Oct 16, 2024

I was reading issue #15533 mentioned by @ochilan. It seems very much related to this one.

Interestingly, it mentions:

> Also, while looking for a reliable way to reproduce this, I noticed that plain old writes, like:
>
> sudo dd if=/dev/urandom of=/test/test.bin bs=1M count=102400 status=progress
>
> weren't sufficient. That never triggered the WRITE errors. However, I had a 100% reproducibility success rate by receiving zfs snapshots.

This is exactly what I am experiencing. rsync does not reproduce the issue for me, but receiving a snapshot triggers the errors almost immediately.
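
Concretely, the trigger on my side is nothing more than something like this (dataset names are illustrative):

# A plain send/receive into the affected pool starts producing write errors
# within seconds, while dd/rsync onto the same pool does not.
zfs snapshot zM2/data@repro
zfs send zM2/data@repro | zfs receive zstore/data-copy
zpool status -v zstore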

@tonyhutter
Contributor

As others have mentioned, bd7a02c appears to be the issue. Using AlmaLinux 9, I checked out bd7a02c and the reproducer hit immediately. I then checked out the commit before that (e82e684) and ran the reproducer 10 times without hitting any errors.

@ochilan

ochilan commented Oct 16, 2024

Since I wasn't able to quickly and automatically revert bd7a02c in master due to huge changes since then (shouldn't be too difficult, but my brain is almost in shut down mode for today), I at least tried commit c6be6ce plus the revert, and I don't get errors so far. Even though the new vdev_disk stuff improves the situation a lot, it seems like the issues with that commit are still not completely mitigated.

@tonyhutter
Contributor

Note that we did revert bd7a02c in zfs-2.2.2 with 89fcb8c. Unfortunately that revert doesn't apply cleanly to master. Pinging @amotin

@amotin
Member

amotin commented Oct 16, 2024

@tonyhutter Do you see anything wrong with that patch, other than that it makes something else misbehave? If you have a reproducer, then we should try to collect maximum detail about those failed ZIOs and look for anything misaligned. Can we dump full details, including page offsets, lengths, etc.?

It also looks suspicious to me that the problem is reproducible only with LUKS, but not on any actual hardware. If, as reported, it happens only on LUKS with 4KB sectors, does ZFS properly set ashift=12 there?
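
(For whoever hits this next, checking both takes a minute; the pool and mapping names below are placeholders:)

# Sector sizes exposed by the open LUKS mapping:
cryptsetup status tankfile | grep 'sector size'
blockdev --getss --getpbsz /dev/mapper/tankfile

# ashift actually recorded for the vdev (the pool property only reflects an
# explicitly set value; the label always has it):
zpool get ashift tank
zdb -l /dev/mapper/tankfile | grep ashift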

@tonyhutter
Contributor

I updated my reproducer to vary the write size and the write offset. I also manually set ashift=12:

#!/bin/bash

# We assume this script is being run from inside a ZFS source dir

VDEV=/tmp/file

truncate -s 100M $VDEV

PASS="fdsjfosdijfsdkjsldfjdlk"

echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile

sudo ./zpool create -o ashift=12 tank /dev/mapper/tankfile

CPUS=$(nproc)
for SIZE in {1..100} ; do
	for OFF in {1..100} ; do
		for i in $(seq 1 "$CPUS") ; do  # seq, since brace expansion can't take a variable
			dd if=/dev/urandom of=/tank/file$i-bs$SIZE-off$OFF seek=$OFF bs=$SIZE count=1 &>/dev/null &
		done
		wait

		errors=$(sudo ./zpool status -j --json-int --json-flat-vdevs | jq '.pools[].vdevs.tankfile.write_errors')

		if [ "$errors" != "$old_errors" ] && [ "$errors" -gt 0 ] ; then
			echo "ERROR: errors=$errors, Wrote $SIZE bytes to offset $(($OFF * $SIZE))"
		fi
		old_errors=$errors

	done
	sudo rm -fr /tank/*
done

sudo ./zpool status -v
sudo ./zpool destroy tank

sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV

It seems some write sizes and offsets will strongly trigger the errors and others will not. Examples:

ERROR: errors=3, Wrote 75 bytes to offset 5100
ERROR: errors=4, Wrote 77 bytes to offset 5159
ERROR: errors=7, Wrote 79 bytes to offset 4898
ERROR: errors=11, Wrote 81 bytes to offset 4050
ERROR: errors=15, Wrote 84 bytes to offset 6552
ERROR: errors=20, Wrote 86 bytes to offset 5332
ERROR: errors=29, Wrote 89 bytes to offset 7120
ERROR: errors=33, Wrote 91 bytes to offset 5460
ERROR: errors=39, Wrote 94 bytes to offset 7144
ERROR: errors=40, Wrote 97 bytes to offset 7663

ERROR: errors=3, Wrote 73 bytes to offset 5767
ERROR: errors=5, Wrote 74 bytes to offset 6660
ERROR: errors=6, Wrote 75 bytes to offset 7200
ERROR: errors=8, Wrote 85 bytes to offset 5695
ERROR: errors=11, Wrote 86 bytes to offset 5848
ERROR: errors=12, Wrote 87 bytes to offset 5568
ERROR: errors=15, Wrote 90 bytes to offset 5310
ERROR: errors=17, Wrote 91 bytes to offset 5551

ERROR: errors=2, Wrote 70 bytes to offset 5810
ERROR: errors=6, Wrote 78 bytes to offset 6162
ERROR: errors=14, Wrote 79 bytes to offset 6715
ERROR: errors=16, Wrote 80 bytes to offset 6960
ERROR: errors=17, Wrote 83 bytes to offset 7138
ERROR: errors=18, Wrote 87 bytes to offset 6612
ERROR: errors=19, Wrote 88 bytes to offset 6512
ERROR: errors=24, Wrote 89 bytes to offset 6052
ERROR: errors=26, Wrote 90 bytes to offset 6210
ERROR: errors=27, Wrote 91 bytes to offset 6006
ERROR: errors=28, Wrote 92 bytes to offset 5796
ERROR: errors=30, Wrote 94 bytes to offset 4982

@amotin
Member

amotin commented Oct 22, 2024

I'm sorry, Tony, but how do you expect to see a correlation between user file writes and ZFS disk writes, considering the latter are issued asynchronously by the sync threads? If nothing else, I would insert a zpool sync between the user writes and the error check.
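
i.e., in the inner loop of the script above, something like this (a sketch only, reusing the script's own names):

# Flush the queued writes before sampling the error counter, so the reported
# size/offset at least corresponds to data already pushed down to the vdev.
dd if=/dev/urandom of=/tank/file$i-bs$SIZE-off$OFF seek=$OFF bs=$SIZE count=1 &>/dev/null &
wait
sudo ./zpool sync tank
errors=$(sudo ./zpool status -j --json-int --json-flat-vdevs | jq '.pools[].vdevs.tankfile.write_errors')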

@tonyhutter
Contributor

@amotin I'm just reporting what I'm seeing. I found it interesting the reproducer script would generate a lot of write errors in some size/offset ranges, and none in other size/offset ranges. I don't have any insights beyond that.

@robn
Member

robn commented Oct 22, 2024

I did indeed spend the afternoon on it, and the evening, and now it's 2am and I really must go to sleep. Tomorrow (today 😬) I travel home from my vacation, which is not going to be amazing when I'm this groggy.

Good news though. @tonyhutter's repro script is working nicely, and I made it a lot more aggressive for my own testing, and now it starts raining errors pretty much as soon as the loop starts.

Better news though. I think I know what's going on. Short version: disk sectors split across memory pages. Again.

Background: Linux expects to be able to "split" a BIO by picking some point in the segment list, cutting the list in two, and giving one side of it to another BIO. (Ok, not quite like that, but that's the principle). So, there is an expectation that when split, a BIO will retain the correct alignment for the target device's block size.

OpenZFS used to quite aggressively split memory pages across BIO segments, and also had a tendency to overfill BIOs during heavy write loads, requiring the kernel to split the BIO to bring it under its fill limit. For 512-byte sectors, any split is fine, as we never loaded less than 512 bytes into a single BIO segment. For 4K though, an arbitrary split can cut on a non-4K boundary. This misaligned BIO will be rejected by many drivers.

#15588 and its follow-ups tried to address this in a few ways:

  • loading the largest amount of data possible into a single BIO segment
  • not overfilling the BIO, instead creating additional ones and filling them separately
  • inspecting a data allocation up front and, if it is misaligned, linearising it before loading it into the BIO.

There's some amount of belt-and-suspenders here: we try to make sure we create BIOs that are always safe to split, while at the same time trying to never put the kernel in a position where it has to split. This has worked well so far.

That brings us to dm-crypt. Inside, crypt_convert_block_skcipher() opens with:

	/* Reject unexpected unaligned bio. */
	if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
		return -EIO;

I threw a bunch of logging on this, and sure enough, the segment it was looking at was < 4K. I believe this to be the entire issue.

There is one place where we don't load full pages into a BIO: when the data is smaller than the logical block size (say, 4K) and it's already split across two pages of (say) 2K each. This seemed reasonable to me at the time, because it's impossible for a BIO's maximum segment count to be less than 16 (IIRC), so it seemed there would never be a case where the kernel would have to split a two-segment BIO.

Turns out dm-crypt throws a spanner in this. It has pretty exacting requirements about the shape of the data it receives, mostly because size and offset are part of the math. So it will split or remap a BIO to get things into the shape it wants. Except, it seems, it never expects an LBS worth of data to span two pages, so it rejects it.

The big "fix" is probably to always allocate IO memory out of whole LBS-sized chunks, never the smaller packed allocations that we sometimes use. The more targeted one, I believe, is going to be to detect when we're about to load less than an LBS-sized chunk into a BIO, and instead get an LBS-sized allocation directly from the page cache, copy the data in, copy the rest from the next page, and submit that page to the BIO instead. Sort of like covering visible seams with a patch.

Theory untested, but it feels right. I'll sleep on it and take a swing at it tomorrow on the train.

@amotin
Member

amotin commented Oct 22, 2024

Leaving @robn to focus on the Linux-specific side, I've focused my investigation on a possible source of misaligned writes.

Generally, the zio_init() code after bd7a02c tries to align all the linear buffers according to their size, i.e. 512B buffers should always be aligned to 512B, 4KB to 4KB, while 6KB only to 2KB. The last is correct, since 6KB I/Os cannot be issued to a vdev with an ashift above 2KB, and should be fine below that.

But I think I've found one place where it triggers a problem. If a user writes a 6KB file, ZFS allocates a 6KB buffer with the mentioned 2KB alignment. But if compression of the buffer produces <= 4KB, zio_write_compress() uses zio_push_transform() to issue a 4KB I/O from this 6KB buffer. Later code in zio_vdev_io_start() sees that the ZIO is properly sized for the vdev's ashift and does nothing. Were the buffer incompressible, it would be reallocated to 8KB and the problem would not happen. The later code in the Linux vdev_disk.c should have detected the misalignment and reallocated the buffer, but according to @robn's post above the implemented logic is insufficient for LUKS.

While I still think the proper solution should include a fix for the Linux vdev_disk.c, we may be able to work around the problem (at least this part of it) by banning buffer sizes bigger than PAGE_SIZE that are not a multiple of PAGE_SIZE (at this time, for x86, there is only one such size: 6KB) with this patch:

diff --git a/module/zfs/zio.c b/module/zfs/zio.c
index a5daf73d59ba..4541cd919dd5 100644
--- a/module/zfs/zio.c
+++ b/module/zfs/zio.c
@@ -187,6 +187,9 @@ zio_init(void)
                        continue;
 #endif
 
+               if (size > PAGESIZE && !IS_P2ALIGNED(size, PAGESIZE))
+                       continue;
+
                if (IS_P2ALIGNED(size, PAGESIZE))
                        align = PAGESIZE;
                else

@tonyhutter Rather than reverting the whole of bd7a02c, let's try the above patch if you think this must be worked around ASAP. I haven't tested it on Linux, but with it applied I can no longer detect misaligned writes with dtrace on FreeBSD while writing the ZFS sources onto a pool with ashift=12, whereas before I detected a few every time.
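
For anyone who wants to try it from a source checkout, the steps are roughly (the patch file name is arbitrary; adjust to your own packaging or DKMS setup):

# Apply the diff above and rebuild the modules.
git apply zio-align-workaround.patch
./autogen.sh && ./configure && make -j"$(nproc)"
# Install or reload the rebuilt modules (or rebuild your distro packages),
# then re-run the reproducer and watch zpool status for write errors.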

@robn
Member

robn commented Oct 22, 2024

With @amotin's patch I haven't been able to trip this yet. Granted, it's only a handful of runs, but still.

mcmilk pushed a commit to mcmilk/zfs that referenced this issue Oct 22, 2024
This reverts commit bd7a02c which
can trigger an unlikely existing bio alignment issue on Linux.
This change is good, but the underlying issue it exposes needs to
be resolved before this can be re-applied.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#15533
Issue openzfs#16631
mcmilk pushed a commit to mcmilk/zfs that referenced this issue Oct 22, 2024
Add a LUKS sanity test to trigger:
openzfs#16631

Signed-off-by: Tony Hutter <[email protected]>
@ochilan

ochilan commented Oct 22, 2024

@robn I would be curious whether it also does not trip with both zfs_vdev_disk_classic=0 and =1 with the above change; can you maybe give it a try?

@tonyhutter
Contributor

@amotin @robn thanks, I'll hold off on the revert and will test the new fix.

@robn
Member

robn commented Oct 22, 2024

@ochilan I already have, and it does not, at least so far (absence of evidence, etc.). I wouldn't expect it to on this load.

For avoidance of doubt, the revert in #16676 and @amotin's narrower patch won't fix all of these kinds of issues with LUKS; they just make them harder to hit. The 2.2 series since 2.2.1, and all of 2.0 and 2.1, never had the implicated allocator change, and this kind of problem has still been seen with LUKS there, just much harder to hit.

If I'm right and the heart of the problem is data blocks split across memory pages, then that's still possible to hit, just much, much harder in normal situations, i.e. back to where we were a few weeks ago, when life was quieter and less urgent. And now I've got a bit more information to go on :)

@amotin
Member

amotin commented Oct 22, 2024

Thinking again, I wonder if the same problem might also happen for 1.5KB and 3KB buffers on vdevs with ashift of 10 and 11 respectively, if such exist, since after compression the buffers may also end up crossing a page boundary. But banning those sizes sounds like an even bigger waste of memory efficiency.

@robn
Copy link
Member

robn commented Oct 22, 2024

@amotin yeah, maybe. Honestly though, who is running on ashift 10/11 and on LUKS? Maybe five people? 😆 I probably wouldn't bother, at least not yet.

@tonyhutter
Contributor

@amotin I tried your patch with the reproducer and was not able to trigger the errors 👍 Could you put together a PR for it?

Honestly though, who is running on ashift 10/11 and on LUKS?

I just tried the reproducer with ashift=10, and it gave me a bunch of errors and hung the pool:

[21871.301340] zio pool=tank vdev=/dev/mapper/tankfile error=5 type=2 offset=4204544 size=1024 flags=3162240
[21871.301348] zio pool=tank vdev=/dev/mapper/tankfile error=5 type=2 offset=4205568 size=1024 flags=3162240
[21871.301359] VERIFY3(*countp > 0) failed (0 > 0)
[21871.301624] PANIC at zio.c:808:zio_notify_parent()
[21871.301872] Showing stack for process 1165801
[21871.301874] CPU: 2 PID: 1165801 Comm: z_wr_iss Tainted: P           OE      6.10.3-200.fc40.x86_64 #1
[21871.301875] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+19570+14a90618 04/01/2014
[21871.301876] Call Trace:
[21871.301878]  <TASK>
[21871.301879]  dump_stack_lvl+0x5d/0x80
[21871.301885]  spl_panic+0xf4/0x10b [spl]
[21871.301895]  ? arc_write_ready+0x298/0xa00 [zfs]
[21871.302041]  ? _raw_spin_unlock+0xe/0x30
[21871.302043]  zio_ready+0x45a/0x790 [zfs]
[21871.302169]  zio_execute+0xdb/0x200 [zfs]
[21871.302286]  taskq_thread+0x279/0x5a0 [spl]
[21871.302293]  ? __pfx_default_wake_function+0x10/0x10
[21871.302296]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[21871.302408]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[21871.302415]  kthread+0xcf/0x100
[21871.302417]  ? __pfx_kthread+0x10/0x10
[21871.302418]  ret_from_fork+0x31/0x50
[21871.302419]  ? __pfx_kthread+0x10/0x10
[21871.302421]  ret_from_fork_asm+0x1a/0x30
[21871.302423]  </TASK>
[21871.302427] zio pool=tank vdev=/dev/mapper/tankfile error=5 type=2 offset=20982784 size=1024 flags=3162240
[21871.302427] zio pool=tank vdev=/dev/mapper/tankfile error=5 type=2 offset=4206592 size=1024 flags=3162240

But I agree that there are probably very few users running with that 🤷‍♂️

@amotin
Member

amotin commented Oct 22, 2024

@tonyhutter #16678. I'll think about whether anything can be done more cleanly on the compression code side, but not today.

behlendorf pushed a commit that referenced this issue Oct 25, 2024
Add a LUKS sanity test to trigger: #16631

Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #16681
@ochilan

ochilan commented Oct 25, 2024

Regarding the question of whether the alignment issues could also affect "actual hardware" vdevs: does any of you have access to 4K-native HDDs (without 512-byte emulation) to try whether the reproducer causes an issue in that case, even without LUKS in between? Just curious.
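
If nobody has a real 4Kn drive handy, a loop device with a 4K logical sector size might be a reasonable stand-in (a sketch, assuming a util-linux new enough to support --sector-size):

# Emulate a 4K-native block device with no dm-crypt in the I/O path.
truncate -s 2G /tmp/4kn.img
LOOP=$(sudo losetup --find --show --sector-size 4096 /tmp/4kn.img)
sudo zpool create -o ashift=12 tank4k "$LOOP"
# ...run the same reproducer workload against tank4k and watch zpool status...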

behlendorf pushed a commit that referenced this issue Nov 1, 2024
It seems our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16687 #16631 #15646 #15533 #14533
@Rudd-O
Contributor Author

Rudd-O commented Dec 14, 2024

I would like to verify that these changes fixed the issue for me.

@behlendorf
Contributor

@Rudd-O that would be great. The easiest way to confirm they resolve the issue would be to use the 2.3.0-rc4 release.
