
permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload #11688

Open
aerusso opened this issue Mar 4, 2021 · 86 comments
Labels
Component: Encryption "native encryption" feature Component: Send/Recv "zfs send/recv" feature Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@aerusso
Contributor

aerusso commented Mar 4, 2021

System information

Type Version/Name
Distribution Name Debian
Distribution Version testing (bullseye)
Linux Kernel 5.10.19
Architecture amd64
ZFS Version 2.0.3-1

Describe the problem you're observing

After upgrading to zfs 2.0.3 and Linux 5.10.19 (from 0.8.6), a well-tested syncoid workload causes "Permanent errors have been detected in the following files:" reports for a pool/dataset@snapshot:<0x0> (no file given).

Removing the snapshot and running a scrub causes the error to go away.

This is on a single-disk NVMe SSD---which never experienced any problems before upgrading---and it has happened twice, once after each reboot and re-run of the syncoid workload. I have since booted back into 0.8.6, run the same workload, and have not experienced the error report.

Describe how to reproduce the problem

Beyond the above, I do not have a mechanism to reproduce this. I'd rather not blindly do it again!

Include any warning/errors/backtraces from the system logs

See also @jstenback's report of very similar symptoms: 1 and 2, which appear distinct from the symptoms of the bug report they are in. Additionally, compare to @rbrewer123's reports 1 and 2, which come with a kernel panic---I do not experience this.

My setup is very similar: I run a snapshot workload periodically, and transfer the snapshots every day to another machine. I also transfer snapshots much more frequently to another pool on the same machine.

If valuable, I have zpool history output that I can provide. Roughly, the workload looks like many snapshot, send -I, and destroy operations (on one pool) and receive operations (on the same machine, but another pool).

@aerusso aerusso added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Mar 4, 2021
@aerusso
Contributor Author

aerusso commented Mar 7, 2021

@behlendorf I'm trying to dig into this a little bit further. I want to rule out in-flight corruption of snapshot data, so I'd like to be able to get access to zb_blkid and zb_level in the zbookmark_phys associated with this error---and then actually read out the corrupted data. Is there a better way than just putting a printk into spa_log_error (after avl_insert(tree, new, where))? That should get called exactly once per affected block, right? I see that ZED will print this out. My only reproducer is my main desktop, so I want to be very careful.

EDIT: Can I just use zdb -R pool 0:$blkid to dump the contents of this block? (There's only one vdev) I.e., is the "offset" in zdb -R the same as the zb_blkid of zbookmark_phys?

Also, is this approach reasonable? I would think it would be helpful to know if the affected block is the meta_dnode, or a root block, etc., right? Or am I embarking on a wild goose chase?

@ahrens
Member

ahrens commented Mar 15, 2021

I'd like to be able to get access to zb_blkid and zb_level in the zbookmark_phys associated with this error

zpool events -v should also have this

Can I just use zdb -R pool 0:$blkid to dump the contents of this block? (There's only one vdev) I.e., is the "offset" in zdb -R the same as the zb_blkid of zbookmark_phys?

No, zb_blkid is the logical block ID, which is unrelated to the location on disk. You need the DVA's offset, which is also included in the zpool events -v output.
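As an illustration of that workflow, a minimal sketch only -- the pool name, vdev index, offset, and size below are made-up placeholders, not values from this report:

# dump recent events with the full nvlist for each
zpool events -v
# an ereport carries the logical bookmark (zio_objset, zio_object, zio_level,
# zio_blkid) and, when the event includes them, the physical location
# (vdev and zio_offset/zio_size). zdb -R reads a block by its physical
# location on a vdev, not by zb_blkid:
#   zdb -R <pool> <vdev>:<offset>:<size>
zdb -R tank 0:0x1a2b3c000:0x20000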

@jstenback
Contributor

For the record, I'm still seeing the same behavior that I reported in the issue @aerusso linked above. And, for the first time since this started (right after updating to 2.0), I just saw a kernel panic that left my ZFS filesystems unresponsive. Here's what I found in dmesg at that time:

[380231.914245] #PF: supervisor write access in kernel mode
[380231.914310] #PF: error_code(0x0002) - not-present page
[380231.914373] PGD 0 P4D 0 
[380231.914414] Oops: 0002 [#1] SMP PTI
[380231.914462] CPU: 3 PID: 1741442 Comm: zpool Tainted: P           OE     5.10.23-1-lts #1
[380231.914569] Hardware name: Supermicro Super Server/X10DRL-i, BIOS 2.0 12/18/2015
[380231.914664] RIP: 0010:sa_setup+0xc7/0x5f0 [zfs]
[380231.914687] Code: 7b 01 00 4d 89 8f 10 05 00 00 48 85 ff 0f 84 b4 00 00 00 4c 89 0c 24 e8 27 c5 55 c6 49 8b 87 20 05 00 00 4c 8b 0c 24 4c 89 e7 <4c> 89 48 28 49 c7 87 10 05 00 00 00 00 00 00 e8 c5 eb 55 c6 48 89
[380231.914762] RSP: 0018:ffffac37c9963bf8 EFLAGS: 00010296
[380231.914786] RAX: 0000000000000000 RBX: ffff8b12fb31d4e8 RCX: ffff8b08050e8001
[380231.914818] RDX: ffff8b05a847f910 RSI: 0000000000000003 RDI: ffff8b12fb31d508
[380231.914848] RBP: ffffac37c9963df8 R08: 0000000000000000 R09: ffff8b08050e8000
[380231.914879] R10: 0000000000000e80 R11: 0000000000000000 R12: ffff8b12fb31d508
[380231.914910] R13: 0000000000000002 R14: ffffac37c9963c38 R15: ffff8b12fb31d000
[380231.914942] FS:  00007fcadcf9a7c0(0000) GS:ffff8b10ffac0000(0000) knlGS:0000000000000000
[380231.914976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380231.915002] CR2: 0000000000000028 CR3: 00000002cb2b8003 CR4: 00000000003706e0
[380231.915032] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[380231.915063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[380231.915094] Call Trace:
[380231.915161]  zfs_sa_setup.constprop.0+0x6a/0x90 [zfs]
[380231.915233]  zfs_obj_to_path+0x50/0xe0 [zfs]
[380231.915291]  ? dmu_objset_hold_flags+0x95/0xe0 [zfs]
[380231.915361]  zfs_ioc_obj_to_path+0x86/0xf0 [zfs]
[380231.915387]  ? strlcpy+0x2d/0x40
[380231.915452]  zfsdev_ioctl_common+0x71c/0x880 [zfs]
[380231.915525]  zfsdev_ioctl+0x53/0xe0 [zfs]
[380231.915550]  __x64_sys_ioctl+0x83/0xb0
[380231.915567]  do_syscall_64+0x33/0x40
[380231.915581]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[380231.915598] RIP: 0033:0x7fcadd568e6b
[380231.915612] Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 af 0c 00 f7 d8 64 89 01 48
[380231.915660] RSP: 002b:00007ffcdb0e43a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[380231.915682] RAX: ffffffffffffffda RBX: 000055e10285a560 RCX: 00007fcadd568e6b
[380231.915702] RDX: 00007ffcdb0e43e0 RSI: 0000000000005a25 RDI: 0000000000000003
[380231.915723] RBP: 00007ffcdb0e7ad0 R08: 0000000000000000 R09: 0000000000000000
[380231.915742] R10: 00007fcadd5e6ac0 R11: 0000000000000246 R12: 00007ffcdb0e7990
[380231.915762] R13: 00007ffcdb0e43e0 R14: 000055e102873440 R15: 0000000000002000
[380231.915783] Modules linked in: rpcsec_gss_krb5 xt_nat veth xt_MASQUERADE br_netfilter bridge stp llc target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod xt_recent ipt_REJECT nf_reject_ipv4 xt_multiport xt_comment xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c vboxnetflt(OE) vboxnetadp(OE) iptable_filter vboxdrv(OE) joydev mousedev usbhid intel_rapl_msr ipmi_ssif intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp
[380231.915824]  coretemp iTCO_wdt kvm_intel intel_pmc_bxt iTCO_vendor_support kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate intel_uncore pcspkr ixgbe igb mdio_devres i2c_i801 libphy mei_me i2c_smbus ast mdio mei i2c_algo_bit ioatdma lpc_ich dca wmi mac_hid acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) vboxvideo drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec syscopyarea sysfillrect sysimgblt fb_sys_fops vboxsf vboxguest crypto_user fuse drm agpgart nfsd auth_rpcgss nfs_acl lockd grace sunrpc nfs_ssc bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 mpt3sas crc32c_intel raid_class scsi_transport_sas xhci_pci xhci_pci_renesas [last unloaded: vboxdrv]
[380231.924269] CR2: 0000000000000028
[380231.925010] ---[ end trace 67d499cf5193f033 ]---
[380231.984744] RIP: 0010:sa_setup+0xc7/0x5f0 [zfs]
[380231.986002] Code: 7b 01 00 4d 89 8f 10 05 00 00 48 85 ff 0f 84 b4 00 00 00 4c 89 0c 24 e8 27 c5 55 c6 49 8b 87 20 05 00 00 4c 8b 0c 24 4c 89 e7 <4c> 89 48 28 49 c7 87 10 05 00 00 00 00 00 00 e8 c5 eb 55 c6 48 89
[380231.988429] RSP: 0018:ffffac37c9963bf8 EFLAGS: 00010296
[380231.989326] RAX: 0000000000000000 RBX: ffff8b12fb31d4e8 RCX: ffff8b08050e8001
[380231.990116] RDX: ffff8b05a847f910 RSI: 0000000000000003 RDI: ffff8b12fb31d508
[380231.990890] RBP: ffffac37c9963df8 R08: 0000000000000000 R09: ffff8b08050e8000
[380231.991657] R10: 0000000000000e80 R11: 0000000000000000 R12: ffff8b12fb31d508
[380231.992419] R13: 0000000000000002 R14: ffffac37c9963c38 R15: ffff8b12fb31d000
[380231.993182] FS:  00007fcadcf9a7c0(0000) GS:ffff8b10ffac0000(0000) knlGS:0000000000000000
[380231.993959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380231.994731] CR2: 0000000000000028 CR3: 00000002cb2b8003 CR4: 00000000003706e0
[380231.995509] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[380231.996285] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Unfortunately I'm not running a debug build here, so that stack is of limited value, but I wanted to share it nonetheless in case it provides any insight into this issue.

@jstenback
Contributor

For the record, I just got another kernel crash dump with the exact same behavior and stack trace in dmesg as reported in my previous comment. The dmesg this time (maybe last time too, though that didn't make it into my earlier report) states that it's a kernel NULL pointer dereference.

@ahrens
Member

ahrens commented Mar 26, 2021

@jstenback I don't see how the problem you're describing is related to this issue. @aerusso is experiencing unexpected checksum errors, and you have a null pointer dereference. Unless I'm missing something, please file a separate issue report for this.

@jstenback
Contributor

@ahrens I'm experiencing the exact same symptoms that @aerusso is experiencing, in addition to the two null pointer dereferences I've seen over the past week or so. I figured there's a chance they're related, but that may of course not be the case at all.

@aerusso
Contributor Author

aerusso commented Mar 26, 2021

For reference, I experienced the corruption in this report after minutes of running 2.0.3. The total time I used 2.0.3 was probably less than 2 hours. I'm guessing that @jstenback has been running the kernel for hundreds of hours. It might be that I just had not yet experienced that symptom. (Of course, it's also possible it's an unrelated bug).

@jstenback
Contributor

That is correct, my uptime during both of the crashes I mentioned was on the order of a hundred hours. And I typically start seeing the corruption after about the same amount of uptime.

@IvanVolosyuk
Contributor

@aerusso Can you also mention the last 'good' version of ZFS where you didn't experience the issue? That could help narrow down the search.

@aerusso
Contributor Author

aerusso commented Mar 26, 2021

@IvanVolosyuk Unfortunately, my last known good configuration is ZFS 0.8.6 and Linux 5.9.15 (and it's stable as a rock back here). I was also unsuccessful in reproducing the bug in a VM (using a byte-for-byte copy of the whole 1 TB nvme).

My current plan (once I can find a free weekend) is to try to bisect on the actual workstation exhibiting the bug. To complicate things, I'll have to do the bisection back here with Linux 5.9.15, since support for 5.10 wasn't added until very late in the release cycle.

@glueckself

glueckself commented May 9, 2021

As I've noted in #12014, I've been running 2.0.2 (with Ubuntu 21.04, Linux 5.11.0) since 30th April and haven't experienced any issues yet.

On my server with Debian Buster, Linux 5.10 and ZFS 2.0.3 (from backports), I've experienced the issue on 4 datasets, though not all at the same time (EDIT: two of them at the same time):

            zvm/secure/locker@autosnap_2021-05-05_15:00:30_frequently:<0x0>
            zvm/secure/plappermaul@autosnap_2021-04-30_20:00:30_hourly:<0x0>
            zvm/secure/sunshine-db@autosnap_2021-05-08_00:45:30_frequently:<0x0>
            zvm/secure/dovecote-root@autosnap_2021-05-08_00:45:30_frequently:<0x0>

What I've also noted in the other issue is that after reverting back to 0.8.4, everything seemed OK. I also managed to destroy the affected snapshots, and multiple scrubs didn't detect any issues.

@aerusso
Contributor Author

aerusso commented Jun 5, 2021

I added 3f81aba on top of Debian's 2.0.3-8, and am tentatively reporting that I cannot reproduce this bug. I've been running for about 45 minutes now, without the permanent error (I used to experience this bug immediately upon running my sanoid workload, which at this point has run three times).

I would suggest that anyone already running 2.x consider applying that patch.

@aerusso
Contributor Author

aerusso commented Jun 5, 2021

Unfortunately my optimism was premature. After another two and a half hours, I did indeed experience another corrupted snapshot.

@aerusso
Contributor Author

aerusso commented Jun 12, 2021

After about 3.5 hours of uptime under Linux 5.9.15-1 (making sure this can be reproduced on a kernel supporting the known-good 0.8.6) with ZFS 3c1a2a9 (candidate 2.0.5 with another suspect patch reverted):

zpool events -v reveals

class = "ereport.fs.zfs.authentication"
ena = 0xb91dbe5624302801
detector = (embedded nvlist)
        version = 0x0
        scheme = "zfs"
        pool = 0xe9128a59c39360a
(end detector)
pool = "REDACTED"
pool_guid = 0xe9128a59c39360a
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
zio_objset = 0x83c4
zio_object = 0x0
zio_level = 0x0
zio_blkid = 0x0
time = 0x60c52c8d 0x7a6bfee 
eid = 0x116e

I failed to capture this information in my previous reports. I can reproduce this by trying to send the offending snapshot.

This dataset has encryption set to aes-256-gcm.

Also, am I correct that there is some kind of MAC that is calculated before the on-disk checksum? My pool shows no READ/WRITE/CKSUM errors---does that mean that the data and/or the MAC was wrong before being written? Should I try changing any encryption settings?

# cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] avx generic pclmulqdq

@aerusso aerusso changed the title permanent errors reported after/during snapshot/send workload (syncoid) permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload Jun 12, 2021
@aerusso
Contributor Author

aerusso commented Jun 12, 2021

More fun: I can zfs send the "corrupt" snapshot after rebooting the machine (both in 0.8.6 and the 2.0.5-ish that caused the problem; hopefully it doesn't matter, but I tested in that order). Is it possible that some cache is somehow being corrupted?

@AttilaFueloep
Contributor

Also, am I correct that there is some kind of MAC that is calculated before the on-disk checksum?

Yes, both AES-CCM and AES-GCM are authenticated encryption algorithms, which protect against tampering with the ciphertext.
Encrypting a block creates a MAC, and the same MAC is computed while decrypting. If they don't match, decryption fails, generating an I/O error. Half of the block pointer checksum of an encrypted block is this MAC; the other half is the checksum of the encrypted data.

My pool shows no READ/WRITE/CKSUM errors---does that mean that the data and/or the MAC was wrong before being written?

There were some problems with metadata (<0x0>) MACs related to ZFS user/group/project accounting used on datasets created with certain master ZFS versions, but I'd need to look up the details.

Should I try changing any encryption settings?

You could try changing icp_gcm_impl to generic to make sure the AVX implementation didn't break something. Since it went into 0.8.4, I don't think this is likely, though.
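For reference, a minimal sketch of switching the implementation at runtime (assuming the module parameter is writable on your build; the active implementation is the one shown in brackets):

# list the available GCM implementations; the selected one is bracketed
cat /sys/module/icp/parameters/icp_gcm_impl
# force the generic (non-AVX) implementation
echo generic > /sys/module/icp/parameters/icp_gcm_impl
# switch back to automatic selection afterwards
echo fastest > /sys/module/icp/parameters/icp_gcm_impl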

@aerusso aerusso mentioned this issue Jul 16, 2021
@glueckself

glueckself commented Aug 2, 2021

@aerusso, to answer your questions from r/zfs (hope this is the right issue):

Are your affected datasets encrypted?

Yes. All of them are direct children of an encrypted parent and inherit its encryption.

Does this error coincide with a failed zfs send? Is there also a bunch of snapshots being taken at the same time?

Sanoid is taking snapshots of all datasets every 5 minutes. I can't find any log entry about sanoid failing to send it; however, manually running zfs send zvm/secure/sunshine-db@autosnap_2021-08-02_07:00:02_hourly > /dev/null says cannot open 'pool/secure/sunshine-db@autosnap_2021-08-02_07:00:02_hourly': I/O error.
Also, trying syncoid with the entire dataset results in cannot iterate filesystems: I/O error

(Here's a real hopeful question) Do you have a specific workload that consistently reproduces the bug? (I have to wait ~hours to reproduce it, which will make bisecting this tragically hard).

Not really. I changed the syncoid cronjob yesterday to 5 minutes and then it happened. Have you tried reproducing it with really frequent snapshots and sends, something like while true; do uuid > /mnt/pool/dataset/testfile; zfs snapshot pool/dataset@$(uuid); done & while true; do sanoid ...; done? (The testfile write is so that the snapshots have some content.)

Can you grab a zpool events -v output relatively quickly after any such permanent errors occur (you can also enable the all-debug.sh zedlet)? Any such output might be very useful.

If possible, please try rebooting, without deleting the snapshot, and accessing it (i.e., zfs send). Does the snapshot send without issue? If so, try scrubbing the pool twice (the pool doesn't forget the errors until there are two clean scrubs). This worked for me, by the way. This makes me suspect that there may be no on-disk corruption at all.

To answer the last two: no, after rebooting I get invalid argument when I try to zfs send one of the datasets.
Too bad I read too late that this seems to be an issue with zfs send writing to /dev/null (see #11445), so the send probably would have worked. Scrubbing twice resolved the corruptions, thanks!
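For anyone following along, the recovery sequence discussed above amounts to roughly this (a sketch with placeholder pool/dataset names; zpool scrub -w waits for completion on recent releases, otherwise let the first scrub finish before starting the second):

# after a reboot, check whether the "corrupt" snapshot can actually be read again
zfs send pool/dataset@snap | cat > /dev/null
# the error log is only cleared after two consecutive clean scrubs
zpool scrub -w pool
zpool scrub -w pool
zpool status -v pool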

~~I'm moving everything away from that server right now, afterwards I'll reboot, test that, try to clean it up, then start a zpool events -vf > somelogfile and see if my really-frequent-snapshots/sends method can reproduce it.~~
~~I'm setting up a VM to play around, I don't want to break my host's pool...~~
~~I'll let you know when I have something more.~~

OK, so I've had the following stuff running for the last few hours:
while true; do for i in $(seq 1 10); do zfs snapshot pool1/secure/test${i}@$(uuid); sleep 1; done; done
while true; do for i in $(seq 1 10); do dd if=/dev/urandom of=/test/${i}/blafile bs=1M count=5 & done; sleep 1; done
while true; do syncoid -r pool1/secure [email protected]:pool2/secure/t1-test-2; sleep 1; done
without any corruption so far. This was running inside a VM, which used a SATA SSD, instead of the NVMe SSDs my normal VMs use. I can waste that SSD, so if you have any suggestions to stress test snapshots & sending, let me know.

@glueckself

glueckself commented Aug 8, 2021

I have one more occurrence, however this time no sends were involved. When I woke up today, my backup server was offline and all I/O seemed to hang (i.e., after I typed root + password it would hang, and my services were offline).
After a hard reboot, every active dataset had snapshot corruptions, which went away after scrubbing twice.

Also, in dmesg I got the following:

Aug 08 00:29:01 host.example.com kernel: VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
Aug 08 00:29:01 host.example.com kernel: PANIC at arc.c:3790:arc_buf_destroy()
Aug 08 00:29:01 host.example.com kernel: Showing stack for process 482
Aug 08 00:29:01 host.example.com kernel: CPU: 0 PID: 482 Comm: z_rd_int Tainted: P          IO      5.11.0-25-generic #27-Ubuntu
Aug 08 00:29:01 host.example.com kernel: Hardware name: <MANUFACTURER> <PRODUCT>, BIOS <VERSION> <DATE>
Aug 08 00:29:01 host.example.com kernel: Call Trace:
Aug 08 00:29:01 host.example.com kernel:  show_stack+0x52/0x58
Aug 08 00:29:01 host.example.com kernel:  dump_stack+0x70/0x8b
Aug 08 00:29:01 host.example.com kernel:  spl_dumpstack+0x29/0x2b [spl]
Aug 08 00:29:01 host.example.com kernel:  spl_panic+0xd4/0xfc [spl]
Aug 08 00:29:01 host.example.com kernel:  ? do_raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? __raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? zfs_zevent_post+0x183/0x1c0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? cityhash4+0x8d/0xa0 [zcommon]
Aug 08 00:29:01 host.example.com kernel:  ? abd_verify+0x15/0x70 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? abd_to_buf+0x12/0x20 [zfs]
Aug 08 00:29:01 host.example.com kernel:  arc_buf_destroy+0xe8/0xf0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  arc_read_done+0x213/0x4a0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  zio_done+0x39d/0xdc0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  zio_execute+0x92/0xe0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  taskq_thread+0x236/0x420 [spl]
Aug 08 00:29:01 host.example.com kernel:  ? wake_up_q+0xa0/0xa0
Aug 08 00:29:01 host.example.com kernel:  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  kthread+0x12f/0x150
Aug 08 00:29:01 host.example.com kernel:  ? param_set_taskq_kick+0xf0/0xf0 [spl]
Aug 08 00:29:01 host.example.com kernel:  ? __kthread_bind_mask+0x70/0x70
Aug 08 00:29:01 host.example.com kernel:  ret_from_fork+0x22/0x30

One minute afterwards the snapshotting starts, and all the ZFS-related tasks start hanging:

Aug 08 00:31:54 host.example.com kernel: INFO: task z_rd_int:482 blocked for more than 120 seconds.
Aug 08 00:31:54 host.example.com kernel:       Tainted: P          IO      5.11.0-25-generic #27-Ubuntu
Aug 08 00:31:54 host.example.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message

(with the call stack below, if you think it's relevant let me know and I can post it)

Affected tasks were also dp_sync_taskq and txg_sync, which explains why the I/O was hanging (if I may guess, z_rd_int is the read interrupt handler and txg_sync writes the transaction group to the disks).

I don't have the pool events, sorry for that.

EDIT: two more things to note. The system average load is about 0.5 (an old laptop running ~10 VMs) and it probably gets high memory pressure on the ARC. I had 8GB huge pages reserved for the VMs and 4GB zfs_arc_max, with 12GB RAM total - so ARC is going to have to fight with the host system (which is not much, the Linux kernel, libvirt and SSH server - I'd guess 100-200MB). I've now reduced the VM huge pages to 7GB, which should reduce the memory pressure.

@aerusso
Contributor Author

aerusso commented Aug 8, 2021

I have never had one of these kernel panics, so it may be better to put this in #12014 (which I see you've already posted in -- I'm also subscribed to that bug). The output of zpool status -v and zpool events immediately after a failed zfs send should give us a lot of diagnostic information.

It's reassuring that merely rebooting and scrubbing makes the invalid reads go away, but you may want to set up the ZED, enable the all-debug.sh zedlet, and set its output directory to something that is permanent. Then, in the event of a crash, you can sift through the old zpool events.
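A sketch of that ZED setup (the paths are assumptions and vary by distro -- Debian-based systems ship zedlets under /usr/lib/zfs-linux/zed.d, others under /usr/libexec/zfs/zed.d; enabled zedlets are symlinks in /etc/zfs/zed.d):

# enable the all-debug.sh zedlet
ln -s /usr/lib/zfs-linux/zed.d/all-debug.sh /etc/zfs/zed.d/all-debug.sh
# send its output somewhere persistent instead of the default /tmp/zed.debug.log
echo 'ZED_DEBUG_LOG="/var/log/zed.debug.log"' >> /etc/zfs/zed.d/zed.rc
# restart the ZED so it picks up the change (or use your init system's equivalent)
systemctl restart zfs-zed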

@aerusso
Contributor Author

aerusso commented Aug 26, 2021

I can still reproduce this while running #12346.

@aerusso
Contributor Author

aerusso commented Aug 27, 2021

I'm going to try to bisect this bug over the course of the next ~months, hopefully. @behlendorf, are there any particularly dangerous commits/bugs I should be aware of lurking between e9353bc and 78fac8d? Any suggestions on doing this monster bisect (each failed test is going to take about ~3 hours, each successful test probably needs ~6 hours to make sure I'm not just "getting lucky" and not hitting the bug)?

@wohali

wohali commented Sep 1, 2021

Hey there, we're seeing this as well -- not sure if I should put it here or in #12014.

System specs

  • TrueNAS 12.0-U5 (FreeBSD 12.2 + OpenZFS 2.0.5)
  • AMD Epyc 7F52 in an AsRockRack ROMED8-2T motherboard
  • 512GB RAM ECC
  • Chelsio T580 LP-CR with 2x 40G
  • 4x LSI 9305-16i controllers
  • 60x SAS-12G SSD; all drives enterprise 8TB and 16TB (56 in use)
  • SuperMicro 4U, 72-bay chassis with redundant PS & 3x BPN-SAS3-216A-N4 backplanes
    • No NVMe devices populated

History

  • Pool was rebuilt in late July/early August from backup as 9 6-disk RAIDZ2 vdevs with encryption, AES-128-GCM, with AMD hardware acceleration.
  • Data was replicated from the backup server using zfs send/recv.
  • Server ran for ~3 weeks without issue before we moved it into production (acting primarily as an NFS server; no VM or DB loads).
  • Once it became active, we turned on our backup setup, which takes regular snapshots every 10 min using pyznap; these are then zfs sent/received to the backup server.
  • Within 24 hours, we started seeing the corruption described in this ticket, causing whole machine crashes.
    • No kernel panics were observed, but that could be an artefact of the BMC not displaying the crashes.
    • The BMC shows no hardware failures, no memory errors, etc.
    • The disks report no errors via SMART
    • The zpool status command shows no checksum errors
    • We did see these errors on zpool status -v:
errors: Permanent errors have been detected in the following files:
        <0x21c49>:<0x0>
        Speedy/[client]/Projects/foo@pyznap_2021-08-25_15:00:00_hourly:<0x1>
        Speedy/[client]/Projects/foo@pyznap_2021-08-25_15:00:00_hourly:<0x20>
        <0x50a5>:<0x0>
        <0xa7d5>:<0x0>
        <0xa4de>:<0x0>
        <0x191e5>:<0x0>
        <0x191e5>:<0x3>
  • A scrub, interrupted by reboots, eventually reported this without resolving the errors: scan: scrub repaired 0B in 08:15:27 with 0 errors on Tue Aug 24 08:31:40 2021
  • We weren't able to confirm at that time that this was an ereport.fs.zfs.authentication failure because we were too busy with DR protocols
  • Once we removed the load from the system:
    • We mounted the pool under Ubuntu 21.04 and saw the same output
    • Back under TrueNAS, zpool events -v confirmed:
Aug 31 2021 20:14:44.609825904 ereport.fs.zfs.authentication
        class = "ereport.fs.zfs.authentication"
        ena = 0xca1cc9e0cba06c01
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x6909b6729e67dcf9
        (end detector)
        pool = "Speedy"
        pool_guid = 0x6909b6729e67dcf9
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "continue"
        zio_objset = 0x65e7
        zio_object = 0x0
        zio_level = 0xffffffffffffffff
        zio_blkid = 0x0
        time = 0x612ec5f4 0x24593470
        eid = 0x136
  • After reading this ticket, we were then able to successfully complete a zpool scrub without a crash. The scrub / reboot / scrub process successfully eliminated the errors.
    • This didn't work earlier because crashes occurred so often, we could not complete a scrub process without a reboot in the middle.
  • As a result of reading this ticket we decided to rebuild the pool without encryption. Load has not yet been restored to the system in question.

@aerusso
Contributor Author

aerusso commented Sep 2, 2021

@wohali I'm assuming I just don't leave my machine running long enough to experience the crash. I have some questions:

  1. Can you describe the workload more precisely? I.e., approximately what kind of reads/writes/file creates are going on? Was the access primarily/exclusively via NFS? Can you describe what the clients were doing?
  2. When not "in production" was the server primarily just idle?
  3. Do you have any clean reproducer? (I know this is a stretch, but if you get me a reproducer I can do in a VM, I will fix this)

I think it's very curious that this only seems to happen for people with flash memory. I wonder if it's some race in the encryption code that just doesn't really happen on slower media? Or, is it possible that this is just being caught because of the authentication (encryption), and is otherwise silently passing through for unencrypted datasets?

@wohali

wohali commented Sep 2, 2021

  1. Can you describe the workload more precisely? I.e., approximately what kind of reads/writes/file creates are going on? Was the access primarily/exclusively via NFS? Can you describe what the clients were doing?

Yes, exclusively an NFS server. It's /home directories, so access is varied. Under load it sees a constant 40Gbps of traffic.

2. When not "in production" was the server primarily just idle?

Yes

  3. Do you have any clean reproducer? (I know this is a stretch, but if you get me a reproducer I can do in a VM, I will fix this)

No, sorry.

@dani

dani commented Sep 3, 2021

I also face this issue. My env is:

  • Debian Bullseye (Proxmox Backup Server)
  • Kernel 5.11.22-1-pve
  • ZFS 2.0.5
  • The pool is a single, simple vdev, built on top of a hardware RAID10 of 10 SATA HDDs (I know it's not the recommended setup, but I have reasons to do it this way)
  • Encryption is aes-256-gcm

The server is running Proxmox Backup Server and BackupPC to handle backups. The load varies (it can be very busy during the night, a bit less during the day), but is mainly random read access. Sanoid is used to manage snapshots, and syncoid to replicate the data every hour to a remote location. The corrupted snaps can be either those from sanoid or those from syncoid. I've never had any crash, though; the errors disappear after two scrubs (but as a scrub takes almost 3 days, most of the time a new corrupted snap appears during the previous scrub).

@Ryushin

Ryushin commented Nov 16, 2022

I seem to be experiencing this daily (or more than once each day). Switching to raw syncs does not seem to improve anything. This is from my Thinkpad P17 Gen2, with a Xeon CPU and 128GB of ECC RAM, so even though it is a laptop, I have all the boxes ticked for not having corruption. I have a mirrored vdev with two 2TB NVMe drives sending to my server, which does not use encryption.

I'm almost at the point of dumping encryption on my laptop until this is fixed. Is there any debugging I can provide since this is happening so often on my system?

Once I delete the bad snapshots, I have to run a scrub twice to fix the pool. Luckily it only takes a few minutes to run a scrub.

@mheubach

@Ryushin: Can you give more details? If you're sending raw encrypted to another server, the received data will be encrypted. You can't send raw encrypted data to a server and not have the received dataset be encrypted too. So I suspect you're doing something wrong.

@Ryushin

Ryushin commented Nov 17, 2022

@Ryushin: Can you give more details? If you're sending raw encrypted to another server, the received data will be encrypted. You can't send raw encrypted data to a server and not have the received dataset be encrypted too. So I suspect you're doing something wrong.

The raw sends are showing as encrypted on the destination. Since sending raw did not fix this problem, I've reverted back to not sending raw any longer (I destroyed the snaps on the destination and removed the -w option from syncoid).

This morning, and I'm currently trying to fix this as I'm typing here, I had 145 bad snapshots. I've cleared them out and I'm now running scrubs, which only take about five minutes. Before this happened, I saw all my CPU threads go to max for a few minutes; pigz-9 was the top CPU usage (I used compress=pigz-slow in syncoid). After the CPU calmed down, I had those 145 bad snapshots. It might be time to recreate my pool without encryption.

@isopix

isopix commented Nov 17, 2022

In the past I was told that sending raw snapshots is not affected by this bug. Isn't that the case?

@Ryushin

Ryushin commented Nov 17, 2022

In the past I was told that sending raw snapshots is not affected by this bug. Isn't that the case?

Yeah, I thought that was the case from reading the thread. Though I was still getting corrupted snapshots a few hours after changing to raw sends. I've reverted back to non-raw now, as I'd rather have the backup data on my local server unencrypted.

@kyle0r

kyle0r commented Nov 17, 2022

Based on the last couple of posts I thought I might point out/remind that raw and non-raw sends are not bi-directionally interoperable (at least for encrypted datasets).

man: https://openzfs.github.io/openzfs-docs/man/8/zfs-send.8.html#w

Note that if you do not use this flag (-w, --raw) for sending encrypted datasets, data will be sent unencrypted and may be re-encrypted with a different encryption key on the receiving system, which will disable the ability to do a raw send to that system for incrementals.

So @Ryushin, reading your posts, it sounds like you might have a bit of confusion to clear up on encrypted vs. unencrypted datasets, and the behaviour of raw and non-raw sends in relation to encrypted datasets. It would help if you could share your workflow and the commands being used in a detailed post, so folks can better visualise your setup and provide assistance.

@Blackclaws, you said the following in September:

We can't just migrate to encrypted sends now without breaking the complete history. So a fix for this would be much appreciated.

I was wondering if you could raw send a full copy of the datasets (with the history you want to maintain) to a temp dst, including the latest common snapshot from your src, and then try raw sending from the src to the temp dst to see if it would continue with raw replication? If yes, I think you know the suggestion I'm pointing towards?
OR would it be possible to re-seed the src from the dst and then start the replication again?

@Andreas-Marx

I am reading here because I was affected by issue #11294, for which a fix (PR #12981) ended up in OpenZFS 2.1.3. I still get "permanent" pool errors on <0x0> from time to time when I try to expire snaps, because I still have many encrypted snapshots that were at some point raw-received with OpenZFS < 2.1.3. But I am quite confident that newer snapshots are not affected; the above-mentioned flaw was obviously my problem.

Does anyone have a case where no incremental snap was ever received with OpenZFS < 2.1.3?

@atj

atj commented Nov 17, 2022

$ uname -a
Linux <redacted> 5.10.0-16-amd64 #1 SMP Debian 5.10.127-1 (2022-06-30) x86_64 GNU/Linux
$ zfs --version
zfs-2.1.5-1~bpo11+1
zfs-kmod-2.1.5-1~bpo11+1

$ zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:49 with 0 errors on Sun Nov 13 00:26:50 2022
config:

	NAME                                                  STATE     READ WRITE CKSUM
	rpool                                                 ONLINE       0     0     0
	  mirror-0                                            ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D965A6-part4  ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D9656A-part4  ONLINE       0     0     0

errors: No known data errors
$ zfs get encryption rpool
NAME   PROPERTY    VALUE        SOURCE
rpool  encryption  aes-256-gcm  -

$ zfs list -t snapshot -r rpool
no datasets available
$ zfs snapshot -r rpool@backup

$ zfs send -Rw rpool@backup | cat > /dev/null
warning: cannot send 'rpool/srv@backup': Invalid argument

$ zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:02:47 with 0 errors on Thu Nov 17 17:39:19 2022
config:

	NAME                                                  STATE     READ WRITE CKSUM
	rpool                                                 ONLINE       0     0     0
	  mirror-0                                            ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D965A6-part4  ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D9656A-part4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        rpool/srv@backup:<0x0>

$ zfs list -H -o name -t snapshot -r rpool |while read; do zfs destroy $REPLY; done
$ zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:02:47 with 0 errors on Thu Nov 17 17:39:19 2022
config:

	NAME                                                  STATE     READ WRITE CKSUM
	rpool                                                 ONLINE       0     0     0
	  mirror-0                                            ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D965A6-part4  ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D9656A-part4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x19433>:<0x0>

$ zpool scrub -w rpool
$ zpool status -v rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:48 with 0 errors on Thu Nov 17 17:47:41 2022
config:

	NAME                                                  STATE     READ WRITE CKSUM
	rpool                                                 ONLINE       0     0     0
	  mirror-0                                            ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D965A6-part4  ONLINE       0     0     0
	    ata-Micron_1100_MTFDDAK256TBN_171516D9656A-part4  ONLINE       0     0     0

errors: No known data errors

The dataset where the error occurs is not deterministic but it happens every time.

@Ryushin

Ryushin commented Nov 17, 2022

Based on the last couple of posts I thought I might point out/remind that raw and non-raw sends are not bi-directionally interoperable (at least for encrypted datasets).

man: https://openzfs.github.io/openzfs-docs/man/8/zfs-send.8.html#w

Note that if you do not use this flag (-w, --raw) for sending encrypted datasets, data will be sent unencrypted and may be re-encrypted with a different encryption key on the receiving system, which will disable the ability to do a raw send to that system for incrementals.

So @Ryushin, reading your posts, it sounds like you might have a bit of confusion to clear up on encrypted vs. unencrypted datasets, and the behaviour of raw and non-raw sends in relation to encrypted datasets. It would help if you could share your workflow and the commands being used in a detailed post, so folks can better visualise your setup and provide assistance.

So on the destination, I'm not mixing encrypted (raw) and non-encrypted snapshots in the same destination dataset. When I switched to raw (-w), I destroyed the destination datasets first.

My syncoid command:

/usr/bin/flock -n /run/syncoid-cron-to-myserver.lock -c "/usr/sbin/syncoid --recursive --skip-parent --compress=zstd-fast --no-sync-snap rpool root@myserver:netshares/zfs_backups/muaddib_backup" | /usr/bin/logger --tag=syncoid

My zfs list:

root@muaddib:~# zfs list
NAME                USED  AVAIL     REFER  MOUNTPOINT
rpool               516G  1.25T      192K  none
rpool/ROOT          374G  1.25T      280G  /
rpool/steam_games   142G  1.25T      142G  /home/chris/.steam

It is pretty much always the steam_games dataset that is seeing corrupted snapshots.

I'm going to create an unencrypted dataset to put the steam_games in, since there is nothing that needs to be secure there.

@Ryushin

Ryushin commented Nov 17, 2022

Actually, after the problem this morning it seems that manipulating any of the rpool/steam_games snapshots causes a problem. So this dataset probably has some underlying corruption in it now. Even deleting all snapshots, creating a new snapshot, and trying a local zfs send | zfs receive instantly gives an error:

warning: cannot send 'rpool/steam_games@migrate': Invalid argument

And the pool has a permanent error from that point.

I tar'd the dataset, made a new unencrypted dataset, and then untarred into that. Hopefully this fixes the problems I'm seeing... for a little while.

@Blackclaws

@Blackclaws, you said the following in September:

We can't just migrate to encrypted sends now without breaking the complete history. So a fix for this would be much appreciated.

I was wondering if you could raw send a full copy of the datasets (with the history you want to maintain) to a temp dst, including the latest common snapshot from your src, and then try raw sending from the src to the temp dst to see if it would continue with raw replication? If yes, I think you know the suggestion I'm pointing towards? OR would it be possible to re-seed the src from the dst and then start the replication again?

Raw replication will not work on previously non-raw replicated targets.

Our issue is that our history goes back much further than our current systems, as these are backups. While, yes, it would technically be possible to restore from backup to the live systems and then raw replicate to the backup systems, this would incur a rather large amount of downtime, which is currently not acceptable.

There are also other good reasons not to have an encrypted backup, or to have the backup encrypted with a different key than the source. Therefore, fixing the issues that still exist here should be preferred to just working around them.

@Ryushin

Ryushin commented Nov 18, 2022

Well, that did not last long. I got two bad snapshots for my rpool/ROOT dataset, which contains my main critical data. I'm going to have to recreate my pool without encryption this weekend and restore it from a snapshot from my server. I wanted to wait two ZFS versions after encryption was rolled out to let it mature, but this is a major bug that looks like it leads to data loss if it's allowed to keep happening.

@Blackclaws

Well, that did not last long. I got two bad snapshots for my rpool/ROOT dataset, which contains my main critical data. I'm going to have to recreate my pool without encryption this weekend and restore it from a snapshot from my server. I wanted to wait two ZFS versions after encryption was rolled out to let it mature, but this is a major bug that looks like it leads to data loss if it's allowed to keep happening.

You shouldn't actually have lost any data. The snapshots show as bad but aren't actually in any way corrupted. Reboot the system and all should be good. To get the error to vanish you have to run two scrubs though.

@kyle0r

kyle0r commented Nov 18, 2022

@Ryushin it would be good to see your workflow and exact commands to better understand your scenario, and also your ZFS versions and co.

@Ryushin

Ryushin commented Nov 18, 2022

Well, that did not last long. I got two bad snapshots for my rpool/ROOT dataset, which contains my main critical data. I'm going to have to recreate my pool without encryption this weekend and restore it from a snapshot from my server. I wanted to wait two ZFS versions after encryption was rolled out to let it mature, but this is a major bug that looks like it leads to data loss if it's allowed to keep happening.

You shouldn't actually have lost any data. The snapshots show as bad but aren't actually in any way corrupted. Reboot the system and all should be good. To get the error to vanish you have to run two scrubs though.

I have not lost any data as of yet. But not being able to access snapshots using local zfs send/receive is not good. Though I did not reboot. Also, having 145 previous snapshots go "bad" is a scary proposition. I do have ZFS send/receive (backups) to my server along with traditional file-level backups using Bareos every night. So technically I can recover from disaster.

@Ryushin it would be good to see your workflow and exact commands to better understand your scenario, and also your ZFS versions and co.

My workflow is probably very typical.

Source: Thinkpad P17 Gen2 with 128GB ECC RAM, Xeon mobile processor.
OS: Devuan Chimaera (Debian Bullseye minus SystemD)
Kernel: 5.18.0-0.deb11.4-amd64
ZFS: Version 2.1.5-1~bpo11+1 (provided by Debian)
Mirrored encrypted pool using two Samsung 970 Evo Plus NVME M.2 drives.
Snapshots taken with sanoid every 15 minutes. Snapshots sent to destination server using syncoid.
Syncoid command: /usr/sbin/syncoid --recursive --skip-parent --compress=zstd-fast --no-sync-snap rpool root@myserver:netshares/zfs_backups/muaddib_backup
Note: Raw snapshots are not sent, as I prefer the data on the destination to not be encrypted.

Destination Server: Supermicro 36 drive chassis with dual Xeon Processors and 128GB of ECC RAM.
OS: Devuan Chimaera (Debian Bullseye minus SystemD)
Kernel: 5.10.0-17-amd64
ZFS: Version 2.1.5-1~bpo11+1 (provided by Debian)
Three 10-drive-wide spinning-disk RAIDZ2 vdevs in one pool. One 6-drive-wide SATA SSD RAIDZ2 that is the root pool. Neither pool uses encryption.
The server also runs sanoid for snapshots and sends its root pool to offsite encrypted ZFS storage located at ZFS.rent.

So nothing really out of the ordinary.

Edit: I should mention that all my pools are using ZFS 2.0 features and I have not yet upgraded them to 2.1.

@siilike

siilike commented Feb 21, 2023

I am facing the same issue and previously I was complaining in #12014.

  1. zfs-2.1.9-1~bpo11+1 on Debian.
  2. Snapshots taken and pruned with a custom program, unencrypted replication with syncoid --bookmark --no-sync-snap.
  3. Two mirrored encrypted pools, both facing the same issue. One also has mirrored special devices.
  4. One pool worked fine for a while; now it happens to both again.
  5. There is a strong correlation with how much data is written to the dataset: after creating a third dataset and moving some write-heavy apps there the situation has slightly improved.
  6. Usually it goes bad after the 2nd snapshot -- after deleting some old snapshots it works fine for one round of replication, and once another snapshot is taken one of them is not possible to replicate. This also applies when replication is not done between the snapshots.
  7. No relation to taking snapshots while a send is in progress.
  8. The dataset is random, but it mostly happens to the same ones (write-heavy, or at least with some writes).
  9. The only remedy so far is to destroy the affected snapshots ASAP.
  10. I don't remember any panics, but the receiving side has been affected by zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl #14252 and ZFS hangs on kernel error: VERIFY3(0 == dmu_bonus_hold_by_dnode #12001.

Given a test script:

#!/bin/sh

for i in $(seq 1 100)
do
        echo $i
        zfs snapshot data2/test@$i
        zfs send data2/test@$i | cat > /dev/null
done

for i in $(seq 101 200)
do
        echo $i
        dd if=/dev/zero of=test bs=1 count=1 2>/dev/null
        zfs snapshot data2/test@$i
        zfs send data2/test@$i | cat > /dev/null
done

the output would be:

1
2
3
[...]
113
114
warning: cannot send 'data2/test@114': Invalid argument
115
[...]
148
warning: cannot send 'data2/test@148': Invalid argument
149
[...]
160
warning: cannot send 'data2/test@160': Invalid argument
161
[...]
181
warning: cannot send 'data2/test@181': Invalid argument
182
[...]
193
warning: cannot send 'data2/test@193': Invalid argument
194
[...]
200

So snapshots without any new data don't trigger the issue, but writing even one byte will.

@rjycnfynby

I'm experiencing a similar problem with ZFS native encryption, on version 2.2.2 and kernel 6.6.13, on all of my servers and zvol pools. Permanent errors start to appear after about three days of uptime.

This is an old thread and I don't see that any solution was found. Does it mean that ZFS native encryption is not production ready yet?

@mheubach

You can use it for production. It is stable and there is no data loss or data corruption problem.
The problem seems to occur within the metadata of the in-memory list of snapshots.
In my experience this happens only if you have:

  • statistically sufficient write activity within your encrypted datasets or zvols
  • decrypted zfs sends at high intervals
  • regular creations and deletions of snapshots whilst doing zfs send

There are workloads where you have to do unencrypted sends. For the time being I suggest you make sure you don't create or delete snapshots while an unencrypted send is running.

If you only do raw encrypted zfs sends, the problem does not occur.

@rjycnfynby

You can use it for production. It is stable and there is no data loss or data corruption problem. The problem seems to occur within the metadata of the in-memory list of snapshots. In my experience this happens only if you have:

You are correct. The data within the VMs looks good and the problem only affects snapshot consistency. Unfortunately there is no way of fixing the problem once the permanent error ZFS-8000-8A has happened. I can only create a new pool.

* statistically sufficient write activity within your encrypted datasets or zvols

What do you mean by "statistically sufficient write activity"? I'm running about a dozen VMs on each hypervisor -- can this trigger the issue?

* decrypted zfs sends at high intervals

I was sending incremental snapshots using the default syncoid settings, which I believe do an unencrypted zfs send because the encrypted datasets use different keys. Both servers are connected to the same switch with a 10G direct link. Not sure if I understand what "high intervals" means in this case. Can you elaborate, please?

* regular creations and deletions of snapshots whilst doing zfs send

My sanoid configuration does recursive snapshots on each dataset and zvol described in its configuration file. I don't think I can create a delay between each snapshot without modifying the script. Does it mean that it's not recommended to do recursive snapshots within an encrypted dataset?

There are workloads where you have to do unencrypted sends. For the time being I suggest you make sure you don't create or delete snapshots while an unencrypted send is running.

If you only do raw encrypted zfs sends, the problem does not occur.

I'm sure that I was using a lock file that prevents running two sanoid/syncoid scripts at the same time, so only one instance can run at a time.

If I understand you correctly, doing raw encrypted zfs sends might help avoid the issue with the inconsistent list of snapshots?

@siilike

siilike commented Jan 30, 2024

You're fine to use it in production if:
a) this issue does not occur for some reason, or
b) you are willing to restart your server every few days, especially if you are running Docker on ZFS, which quickly becomes unusable.

@rincebrain rincebrain added Component: Send/Recv "zfs send/recv" feature Component: Encryption "native encryption" feature labels Jan 30, 2024
@mheubach

What do you mean by "statistically sufficient write activity"? I'm running about a dozen VMs on each hypervisor -- can this trigger the issue?

That just means that you have a relevant amount of writes to trigger this error -- which you probably have.

I was sending incremental snapshots using the default syncoid settings, which I believe do an unencrypted zfs send because the encrypted datasets use different keys. Both servers are connected to the same switch with a 10G direct link. Not sure if I understand what "high intervals" means in this case. Can you elaborate, please?

That simply means that occasional sends are unlikely to trigger the error.

My sanoid configuration does recursive snapshots on each dataset and zvol described in its configuration file. I don't think I can create a delay between each snapshot without modifying the script. Does it mean that it's not recommended to do recursive snapshots within an encrypted dataset?
No, that's OK. The point is just that without creating or deleting snapshots, the error is not triggered (in my experience).

If I understand you correctly, doing raw encrypted zfs sends might help avoid the issue with the inconsistent list of snapshots?
That's correct. Do raw encrypted sends only and you will never see this error.
It's not possible to switch from unencrypted to raw encrypted incremental receives on your destination, so you have to do a full raw encrypted send before you can do incremental sends.
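A minimal sketch of that switch-over (dataset and host names are placeholders; -w keeps the stream raw, and the destination dataset must not already hold a non-raw copy):

# one-time full raw send to a fresh destination dataset
zfs snapshot pool/data@base
zfs send -w pool/data@base | ssh backuphost zfs receive -u backuppool/data
# from then on, incrementals must also be raw
zfs snapshot pool/data@next
zfs send -w -i @base pool/data@next | ssh backuphost zfs receive -u backuppool/data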

@wohali

wohali commented Jan 30, 2024

mheubach said:

In my experience this happens only if you have:

  • statistically sufficient write activity within your encrypted datasets or zvols
  • decrypted zfs sends at high intervals
  • regular creations and deletions of snapshots whilst doing zfs send

In other words, an active system. In our company's opinion, the functionality is wholly unfit for purpose outside of home labs and toy instances where you can withstand downtime.

siilike said:

You're fine to use it in production if:
...
b) you are willing to restart your server every few days, especially if you are running Docker on zfs which quickly becomes unusable.

On large enough servers, such as ours, with reboots taking upwards of 6 hours and a massive workload, that's unacceptable in production. Feel free to read our horror story from 2021. And one note: this occurred for us on both spinning media and solid-state storage.

So no, rjycnfynby, this isn't fixed, and there isn't even a 'good lead' as to where the problem resides. I suspect this is because generating sufficient load for reproduction isn't something the active devs can easily manage in their setups -- we certainly couldn't take our machine out of prod and give them unfettered access for weeks to diagnose.

My recommendation is to rely on SED (Self-Encrypting Drives) for encryption at rest, and move on.

@rbrewer123

I experienced the snapshot errors on my home desktop system for years. I don't even use that machine very much, so it was completely idle over 23h per day.

It was an AMD machine running NixOS with 2 7200 RPM consumer HDDs in a ZFS mirrored pair with ZFS native encryption. I had pyznap configured to take snapshots every 15 minutes. Once a day, pyznap would send a single daily snapshot to my backup pool, which was a second pair of mirrored HDDs with ZFS native encryption.

Despite the machine being idle all day long, it accumulated 1-2 errored snapshots per day on the main pool. The backup pool never got any errors. Destroying the offending snapshots followed by multiple rounds of scrubs would sometimes fix the problem, sometimes not. But the errored snapshots always caused the ZFS send to the backup pool to fail, which meant my daily backups were often not performed.

I replaced the main pool HDDs with a single NVMe drive several months ago and opted not to use ZFS native encryption on the new pool. pyznap still takes snapshots every 15 minutes and sends them to the ZFS-encrypted backup pool. I haven't experienced any snapshot errors since changing that main pool to not use encryption.

Seeing how this problem has remained for years, and considering the other recent data corruption bug, I have really started to consider whether the bells and whistles of ZFS are worth the risk.

@grahamperrin
Contributor

grahamperrin commented Feb 17, 2024

From #11688 (comment):

… this isn't fixed, and there isn't even a 'good lead' as to where the problem resides either. …

@wohali your spec in 2021 included:

TrueNAS 12.0-U5 (FreeBSD 12.2 + OpenZFS 2.0.5)

What now? (Since FreeBSD stable/12 is end of life.)

https://www.truenas.com/software-status/

@wohali

wohali commented Feb 17, 2024

We are always on the latest released TrueNAS Core. Right now that's FreeBSD 13.1, but with the next patch release it will be 13.2.

@muay-throwaway

muay-throwaway commented Feb 18, 2024

My recommendation is to rely on SED (Self-Encrypting Drives) for encryption at rest, and move on.

@wohali Prior research has found that hardware-based encrypted disks very often have serious vulnerabilities that allow the encryption to be bypassed (e.g., master passwords or incorrectly implemented cryptographic protocols) (1, 2, 3). While many of these may be fixed now, this is difficult to verify. Software-based encryption offers the advantage of being verifiable. For Linux, LUKS is a widely accepted choice and does not suffer from the same stability issues as ZFS native encryption.

@wohali

wohali commented Feb 19, 2024

@muay-throwaway Throwaway is right. I did not ask for your advice or approval, nor can you help resolve this specific issue. Further, all three of your references refer to the exact same 2 CVEs from 2018.

Kindly leave this issue to those who are directly impacted or directly trying to solve the problem, rather than sea lion in from nowhere. Thank you.
