Today's update to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a (almost identical to current 2.3.0 RC) caused permanent pool corruption #16631
I see you listed the zpool status output for the pool. Next time it happens, it'd be good to see any errors in the kernel log as well, that is, the …
Good catch, let me fix the comment above. In a minute.
Updated.
Any chance you have a test system available which could be used to help us bisect where this was introduced? I'd suggest starting by testing with ca0141f and the kernel-6.8.11-300.fc40.x86_64 kernel.
Hm! I have a laptop with ZFS atop LUKS and I think it has mirroring set up. It was one of the systems I was ready to upgrade, although that kernel might not boot on Xen (which is what the laptop boots first). Let me get back to you soonish with more info.
Tried rc1 with the proposed rc2 patchset on root-on-ZFS-on-LUKS with kernel 6.10. Edit: after rebooting with zfs-2.2.6 and running a scrub, all errors disappeared, so I guess there was at least one working copy of all metadata. Edit 2: after another restart, even 2.2.6 generates write errors, the scrub ends in 2 seconds, and there's a permanent error in kc3000/ROOT/gentoo:<0xfffffffffffffffe>
I was able to reproduce the write errors pretty easily on Arch Linux with kernels 6.6.56 (LTS) and 6.11.3 with zfs 2.3.0 rc1 (commit 3a9fca9) even on a single, non-mirrored LUKS device. I wasn't able to reproduce it without LUKS so far. I used default settings for everything (cryptsetup, zpool) with a freshly created pool; the LUKS device automatically decided to use 4096 byte blocks. I then rsynced my home directory to the pool and observed "zpool status" where write errors started to appear after a few seconds.
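For anyone trying to tell which case they're in, the sector size cryptsetup picked can be checked directly before testing; a small sketch (device names are placeholders):

# Check what sector size cryptsetup chose for a LUKS2 volume.
sudo cryptsetup luksDump /dev/sdb2 | grep -i sector     # LUKS2 prints e.g. "sector: 4096 [bytes]"
sudo blockdev --getss /dev/mapper/cryptdata             # logical block size exposed by the dm device
sudo blockdev --getpbsz /dev/mapper/cryptdata           # physical block size exposed by the dm device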
So, I tried to find a commit which worked at some point on the master branch, but I wasn't able to find a good commit so far. The 2.2 branch does work for me, but it has diverged a lot from master in both directions (ahead/behind). I have a question: Did the fixes that were performed on 2.2 for issues like #15533 or #15646 find their way back into master and thus into 2.3? If so, I would be interested in a pointer which commit to start from on master to try finding a future commit that breaks, relative to that point in time.
Yes, all of the work for these issues was originally done in master and then backported to the 2.2 releases. So both 2.2 and 2.3 include these changes. Where things do differ is that 2.2 releases default to using the classic disk IO code, while 2.3 defaults to the new method implemented in #15588. Would you mind testing both code paths, with the zfs_vdev_disk_classic module parameter set to 1 and to 0?
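For reference, a sketch of switching between the two code paths; my understanding is that zfs_vdev_disk_classic is read when the module loads, so it goes into a modprobe options file rather than being toggled at runtime:

# Select the classic (1) or new (0) BIO submission path, then reload the module.
echo "options zfs zfs_vdev_disk_classic=1" | sudo tee /etc/modprobe.d/zfs-vdev-disk.conf
sudo zpool export -a            # pools must be exported before the module can be unloaded
sudo modprobe -r zfs && sudo modprobe zfs
cat /sys/module/zfs/parameters/zfs_vdev_disk_classic   # confirm the active value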
Thank you for your reply. I tested some combinations with zfs 2.3.0 rc1, and zfs_vdev_disk_classic does indeed make a difference in the number of errors. Also, I was only able to get errors when using LUKS with 4K sectors, not with 512-byte sectors:
where "low" errors means single-digit write errors in about 30 seconds, "high" means thousands of errors. |
Yes, it's only with 4K LUKS formatted devices. I could not repro with 512 byte LUKS formatted devices. How curious. With newer LUKS tooling and OS installers defaulting to making LUKS volumes LUKS2, and making LUKS sector sizes the same as advanced format disks (minimum 4K), this is going to begin biting real hard on future ZFS users (as well as current ZFS users who buy new drives).
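To take the auto-detection out of the picture when testing, the sector size can be forced either way at format time; a sketch using scratch backing files (paths and passphrase are arbitrary):

# Create two scratch LUKS2 devices, one with 4K sectors and one with 512-byte sectors.
truncate -s 2G /tmp/luks-4k.img /tmp/luks-512.img
echo "testpass" | sudo cryptsetup luksFormat --batch-mode --type luks2 --sector-size 4096 /tmp/luks-4k.img
echo "testpass" | sudo cryptsetup luksFormat --batch-mode --type luks2 --sector-size 512 /tmp/luks-512.img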
I re-tested with zfs 2.2.6 quickly just to make sure, and I'm not able to reproduce the issue there with 4K sectors (on a freshly created pool, as with all tests I did), neither with zfs_vdev_disk_classic=1 nor with =0.
I tested 2.3.0-rc2 with kernel 6.11.3 today on EndeavourOS and got data corruption too. All devices in my pools are LUKS2 devices (sector: 4096 [bytes]). After a few rsync and zfs send/receive operations I got this message from zed:
And in the journal I see plenty of this:
Also affected is my external backup pool, which is a RAID1 hosted in a QNAP external USB enclosure.
I did another test today and could reproduce the data corruption immediately. Interesting: it only happened on the RAID10 pool "zstore", while my tests with the single-drive pool "ztank" finished without error. As in my previous post, the other affected pool mentioned before ('zsea') is also RAID1. I have captured zevents before reboot. The outcome was:
The journal is full of zio errors:
Could you reproduce it on a single-disk pool, or is it all mirrors on your side?
That's helpful to know and actually a bit surprising. Could someone use …
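A sketch of the kind of diagnostics that would help here (an assumption on my part; the exact command suggested above is truncated):

# Verbose ZFS event records include per-zio offset/size fields for each failed I/O.
sudo zpool events -v
# Kernel-side messages from the same time window.
sudo dmesg | grep -iE 'zio|blk_|dm-crypt'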
I've been able to reproduce 100% on Fedora 40 with master (38a04f0) using this:

#!/bin/bash
# We assume this script is being run from inside a ZFS source dir
VDEV=/tmp/file
truncate -s 2G $VDEV
PASS="fdsjfosdijfsdkjsldfjdlk"
echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile
sudo ./zpool create tank /dev/mapper/tankfile
# Copy ZFS source to /tank to generate writes
sudo cp -a . /tank
sudo ./zpool status -v
sudo ./zpool destroy tank
sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV

~/zfs$ sudo ./reproduce.sh
setting up LUKS
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
tankfile ONLINE 0 8 0
errors: Permanent errors have been detected in the following files:
/tank/module/icp/api/kcf_cipher.c
/tank/man/man8/zfs-snapshot.8
/tank/module/Makefile.in
/tank/module/icp/asm-x86_64/aes/aes_aesni.o
/tank/module/icp/asm-x86_64/aes/libicp_la-aes_aesni.o
/tank/lib/libzpool/Makefile.am
/tank/include/os/linux/spl/sys/cred.h

NOTE: I was able to reproduce these errors with … Next step is to bisect it down.
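Since the script reproduces reliably, the bisect can be automated with git bisect run; a sketch, assuming a known-good commit on master can be identified (which, per the comments above, is the hard part):

# Automate the bisect using the reproducer above as the test.
git bisect start
git bisect bad master                 # current master reproduces the corruption
git bisect good <known-good-commit>   # placeholder for a commit that does not
git bisect run sh -c '
  ./autogen.sh && ./configure && make -s -j"$(nproc)" || exit 125  # 125 = skip unbuildable commits
  sudo ./scripts/zfs.sh                                            # load the freshly built modules
  ! sudo ./reproduce.sh 2>&1 | grep -q "Permanent errors"          # exit 0 = good commit
'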
I just tried with c6be6ce and get pretty much the same behavior. With vdev_disk_classic=0, the error counts appear to be even slightly higher compared to 2.3.0rc1 for me, but that could also be randomness. I don't know if I'm doing something fundamentally wrong; it would be nice if someone else could verify. Having said that: in the discussion around issue #15533, it was mentioned that reverting commit bd7a02c would be a possible solution, or having the "rewrite of the BIO filling machinery". Taking a look at zio.c, do I see it correctly that the mentioned commit was and is actually reverted in the 2.2 branch, but it is not reverted in master? I have not tried reverting that commit in master or 2.3.0rc yet, but maybe that would be interesting. Edit: As a matter of fact, @Rudd-O stated exactly what I just said in the linked issue in January already regarding the revert not being in master.
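For completeness, the manual revert attempt described above would look roughly like this; resolving the conflicts in zio.c and friends is the part that takes the work:

# Revert the suspect commit on top of master, resolving conflicts by hand.
git checkout master
git revert bd7a02c
# on conflict: edit the affected files, then
git add -u
git revert --continue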
I just tried my reproducer on AlmaLinux 9 and was able to hit the errors, but it took a few more tries than normal.
I was reading issue #15533 mentioned by @ochilan. It seems to be very much related to this one. Interestingly, it mentions:
This is exactly what I am experiencing. rsync does not reproduce the issue for me, but receiving a snapshot triggers the errors almost immediately.
Since I wasn't able to quickly and automatically revert bd7a02c in master due to the huge changes since then (it shouldn't be too difficult, but my brain is almost in shutdown mode for today), I at least tried commit c6be6ce plus the revert, and I don't get errors so far. Even though the new vdev_disk stuff improves the situation a lot, it seems like the issues with that commit are still not completely mitigated.
@tonyhutter Do you see anything wrong with that patch, other than that it makes something else misbehave? If you have a reproduction, then we should try to collect maximum details about those failed ZIOs, looking for whether something is misaligned there. Can we dump maximum details, including page offsets, lengths, etc.? It also looks suspicious to me that the problem is reproducible only with LUKS, but not on any actual hardware. If, as was said, it happens only on LUKS with 4KB sectors, does ZFS properly set ashift=12 there?
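On the ashift question, the value actually in use can be read back from the pool rather than assumed; a small sketch (pool name is a placeholder):

# Pool-level ashift property (0 means it was auto-detected at vdev creation).
zpool get ashift tank
# Per-vdev ashift as recorded in the pool configuration.
sudo zdb -C tank | grep ashift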
I updated my reproducer to vary the write size and the write offset. I also manually set ashift=12:

#!/bin/bash
# We assume this script is being run from inside a ZFS source dir
VDEV=/tmp/file
truncate -s 100M $VDEV
PASS="fdsjfosdijfsdkjsldfjdlk"
echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile
sudo ./zpool create -o ashift=12 tank /dev/mapper/tankfile
CPUS=$(nproc)
for SIZE in {1..100} ; do
for OFF in {1..100} ; do
for i in $(seq 1 $CPUS) ; do  # brace expansion does not expand variables; use seq so each CPU gets its own dd
dd if=/dev/urandom of=/tank/file$i-bs$SIZE-off$OFF seek=$OFF bs=$SIZE count=1 &>/dev/null &
done
wait
errors=$(sudo ./zpool status -j --json-int --json-flat-vdevs | jq '.pools[].vdevs.tankfile.write_errors')
if [ "$errors" != "$old_errors" ] && [ "$errors" -gt 0 ] ; then
echo "ERROR: errors=$errors, Wrote $SIZE bytes to offset $(($OFF * $SIZE))"
fi
old_errors=$errors
done
sudo rm -fr /tank/*
done
sudo ./zpool status -v
sudo ./zpool destroy tank
sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV

It seems some write sizes and offsets will strongly trigger the errors and others will not. Examples:
I'm sorry, Tony, but how do you expect to see a correlation between user file writes and ZFS disk writes, considering the latter are issued asynchronously by the sync threads? If nothing else, I would insert …
@amotin I'm just reporting what I'm seeing. I found it interesting the reproducer script would generate a lot of write errors in some size/offset ranges, and none in other size/offset ranges. I don't have any insights beyond that.
I did indeed spend the afternoon on it, and the evening, and now it's 2am and I really must go to sleep. Tomorrow (today 😬) I travel home from my vacation, which is not going to be amazing when I'm this groggy. Good news though: @tonyhutter's repro script is working nicely, and I made it a lot more aggressive for my own testing, and now it starts raining errors pretty much as soon as the loop starts. Better news though: I think I know what's going on. Short version: disk sectors split across memory pages. Again. Background: Linux expects to be able to "split" a BIO by picking some point in the segment list, cutting the list in two, and giving one side of it to another BIO. (Ok, not quite like that, but that's the principle.) So there is an expectation that, when split, a BIO will retain the correct alignment for the target device's block size. OpenZFS used to quite aggressively split memory pages across BIO segments, and also had a tendency to overfill BIOs during heavy write loads, requiring the kernel to split the BIO to bring it under its fill limit. For 512-byte sectors, any split is fine, as we never loaded less than 512 bytes into a single BIO segment. For 4K though, an arbitrary split can cut on a non-4K boundary. This misaligned BIO will be rejected by many drivers. #15588 and its follow-ups tried to address this in a few ways:
There's some amount of belt-and-suspenders here: we try to make sure we create BIOs that are always safe to split, while at the same time trying to never put the kernel in a position where it has to split. This has worked well so far. That brings us to dm-crypt. Inside, …
I threw a bunch of logging on this, and sure enough, the segment it was looking at was < 4K. I believe this to be the entire issue. There is one place where we don't load full pages into a BIO: when the data is smaller than the logical block size (say, 4K) and it's already split across two pages of (say) 2K each. This seemed reasonable to me at the time, because it's impossible for a BIO's maximum segment count to be less than 16 (IIRC), so it seemed there would never be a reason for the kernel to have to split a two-segment BIO. Turns out dm-crypt throws a spanner in this. It has pretty exacting requirements about the shape of the data it receives, mostly because size and offset are part of the math. So it will split or remap a BIO to get things into the shape it wants. Except, it seems, it never expects an LBS worth of data to span two pages, so it rejects it. The big "fix" is probably to always allocate IO memory out of whole LBS-sized chunks, never the smaller packed allocations that we sometimes use. The more targeted one, I believe, is going to be to detect when we're about to load less than an LBS-sized chunk into a BIO, and instead get an LBS-sized allocation directly from the page cache, copy the data in, copy the rest from the next page, and submit that page to the BIO instead. Sort of like covering visible seams with a patch. Theory untested, but it feels right. I'll sleep on it and take a swing at it tomorrow on the train.
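To make the condition concrete: the bad case is simply an LBS-sized chunk whose bytes land in two different physical pages. A toy illustration in shell arithmetic (4K page and 4K logical block, the common x86 + 4K-LUKS combination; this is only a sketch of the arithmetic, not the kernel code path):

PAGE=4096   # order-0 page size
LBS=4096    # logical block size of the dm-crypt device

# Does an LBS-sized chunk starting at byte offset $1 span two pages?
crosses_page() {
    [ $(( $1 / PAGE )) -ne $(( ($1 + LBS - 1) / PAGE )) ]
}

crosses_page 0    && echo "offset 0: spans pages"    || echo "offset 0: fits in one page"
crosses_page 2048 && echo "offset 2048: spans pages" || echo "offset 2048: fits in one page"
# offset 2048 is the 2K+2K case described above, the shape dm-crypt refuses to handle.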
Allowing @robn to focus on the Linux-specific side, I've focused my investigation on a possible source of misaligned writes. Generally … But I think I've found one place where it triggers a problem. If a user writes a 6KB file, ZFS allocates a 6KB buffer with the mentioned 2KB alignment. But if compression of the buffer produces <= 4KB, … While I still think the proper solution should include a fix for Linux's vdev_disk.c, we may be able to work around the problem (at least in this part) by banning buffers bigger than PAGE_SIZE that are not a multiple of PAGE_SIZE (at this time, for x86, there is only one such size: 6KB) with this patch:
@tonyhutter Rather than reverting the whole of bd7a02c, let's try the above patch, if you think it must be worked around ASAP. I haven't tested it on Linux, but after applying it I can no longer detect misaligned writes with dtrace on FreeBSD while writing the ZFS sources onto a pool with ashift=12, a few of which I detected each time before.
With @amotin's patch I haven't been able to trip this yet. Granted, it's only a handful of runs, but still.
This reverts commit bd7a02c which can trigger an unlikely existing bio alignment issue on Linux. This change is good, but the underlying issue it exposes needs to be resolved before this can be re-applied. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#15533 Issue openzfs#16631
Add a LUKS sanity test to trigger: openzfs#16631 Signed-off-by: Tony Hutter <[email protected]>
@robn I would be curious whether it does not trip with both zfs_vdev_disk_classic=0 and =1 with the above change; can you maybe give it a try?
@ochilan already has, and it does not, at least so far (absence of evidence, etc). I wouldn't expect it to on this load. For the avoidance of doubt, the revert in #16676 and @amotin's narrower patch won't fix all of these kinds of issues with LUKS; it's just that they're easier to hit without them. The 2.2 series since 2.2.1, and all of 2.0 and 2.1, never had the implicated allocator change, and this kind of problem has been seen with LUKS there too, just much harder to hit. If I'm right and the heart of the problem is data blocks split across memory pages, then that's still possible to get, just much, much harder in normal situations, i.e., back to where we were a few weeks ago, when life was quieter and less urgent. And now I've got a bit more information to go on :)
Thinking again, I wonder if the same problem might also happen for 1.5KB and 3KB buffers on vdevs with ashift of 10 and 11 respectively, if such exist, since after compression the buffers may also end up crossing a page boundary. But banning those sizes also sounds like an even bigger waste of memory efficiency.
@amotin yeah, maybe. Honestly though, who is running on ashift 10/11 and on LUKS? Maybe five people? 😆 I probably wouldn't bother, at least not yet.
@amotin I tried your patch with the reproducer and was not able to trigger the errors 👍 Could you put together a PR for it?
I just tried the reproducer with ashift=10, and it gave me a bunch of errors and hung the pool:
But I agree that there are probably very few users running with that 🤷♂️
@tonyhutter #16678. I'll think about whether anything can be done more cleanly from the compression code side, but not today.
Add a LUKS sanity test to trigger: #16631 Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16681
Regarding the question of whether the alignment issues could also affect "actual hardware" vdevs: do any of you have access to 4K-native HDDs (without 512-byte emulation) to try whether the reproducer causes an issue in that case, even without LUKS in between? Just curious.
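For anyone wanting to try that, whether a drive is 4K-native (as opposed to 512e) is visible from the block layer; a sketch (device name is a placeholder):

# A 4Kn drive reports 4096 for both values; a 512e drive reports LOG-SEC=512, PHY-SEC=4096.
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda
cat /sys/block/sda/queue/logical_block_size /sys/block/sda/queue/physical_block_size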
It seems our notion of "properly" aligned IO was incomplete. In particular, dm-crypt does its own splitting, and assumes that a logical block will never cross an order-0 page boundary (ie, the physical page size, not compound size). This effectively means that it needs to be possible to split a BIO at any page or block size boundary and have it work correctly. This updates the alignment check function to enforce these rules (to the extent possible). Our response to misaligned data is to make some new allocation that is properly aligned, and copy the data into it. It turns out that linearising (via abd_borrow_buf()) is not enough, because we allocate eg 4K blocks from a general purpose slab, and so may receive (or already have) a 4K block that crosses pages. So instead, we allocate a new ABD, which is guaranteed to be aligned properly to block sizes, and then copy everything into it, and back out on the way back. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16687 #16631 #15646 #15533 #14533
I would like to verify that these changes fixed the issue for me.
@Rudd-O that would be great. The easiest way to confirm they resolve the issue would be to use the 2.3.0-rc4 release.
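A sketch of such a verification pass, assuming the reproducer script posted earlier in this thread is still at hand:

# Confirm both userland and the kernel module are on the fixed release, then re-run the reproducer.
zfs version          # should report zfs-2.3.0-rc4 (or later) and a matching zfs-kmod
sudo ./reproduce.sh  # the LUKS reproducer from above; expect zero write errors and no permanent errors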
One of my machines (FC40) recently received a kernel update and an update of ZFS on the master lineage, from commit 02c5aa9 to ca0141f, simultaneously. This took place earlier today. The pool was healthy, in use, and recently scrubbed multiple times. There were no errors anywhere, neither in the kernel log nor in the journal.
Mere minutes after I rebooted to the new kernel and ZFS, my Prometheus setup alerted me to 30 checksum errors, several write errors, and 4 data errors. Upon inspection:
The kernel also didn't show any hardware errors in the kernel ring buffer.
I rebooted to the older kernel and ZFS module, and started a scrub. It's still ongoing, but it has not found any problem nor produced any WRITE or CKSUM errors.
Interestingly, neither the cache nor the ZIL devices had that error. The boot drive also seemed unaffected.
This, to me, indicates a software issue, probably related to the LUKS write path (we've had these before) or mirroring (only the pool with the mirrored drives was hit — the single boot drive was not hit by the problem despite being otherwise the same).
The affected LUKS2 devices are all whole-disk formatted with 4K sector size, and my pool is ashift 12. (The unaffected root pool, cache device and ZIL device for the affected pool are not formatted with LUKS 4K sector size).
In the interest of seeing if it makes a difference, the affected LUKS devices are tuned with the following persistent flags:
The pool has the following features:
This is where my pool currently sits:
Update: got good news! After reverting to the prior kernel + commit mentioned above, I am very happy to report that the scrub found no errors, and the data errors listed previously simply disappeared. So not a single bit of data loss!
The less good news: this does indicate, strongly, that under these use cases, there is a software defect in OpenZFS.