permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload #11688
@behlendorf I'm trying to dig into this a little bit further. I want to rule out in-flight corruption of snapshot data, so I'd like to be able to get access to [...]. EDIT: Can I just use [...]? Also, is this approach reasonable? I would think it would be helpful to know whether the affected block is the meta_dnode, a root block, etc., right? Or am I embarking on a wild goose chase? |
No, |
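For anyone trying to pin down which object a `<0x0>` error refers to, a rough inspection sequence like the following is one option (a sketch, not from the thread; pool/dataset names are placeholders):

```sh
# Show recent error events, including the objset/object the authentication
# error was reported against.
zpool events -v | grep -B2 -A25 'ereport.fs.zfs.authentication'

# Dump the dnode of that object in the affected snapshot; object 0 is the
# meta dnode. For an encrypted dataset, zdb may not be able to decrypt
# everything, depending on the version.
zdb -ddddd pool/dataset@snapshot 0
```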
For the record, I'm still seeing the same behavior that I reported in the issue @aerusso linked above. And, for the first time since this started (right after updating to 2.0), I just saw a kernel panic that left my ZFS filesystems unresponsive. Here's what I found in dmesg at that time:
Unfortunately I'm not running a debug build here, so that stack is of limited value, but I wanted to share it nonetheless in case it provides any insight into this issue. |
For the record, I just got another kernel crash dump with the same exact behavior and stack trace in dmesg as reported in my previous comment. The dmesg this time (maybe last time too, though that didn't make it into my earlier report) states that it's a kernel NULL pointer dereference. |
@jstenback I don't see how the problem you're describing is related to this issue. @aerusso is experiencing unexpected checksum errors, and you have a null pointer dereference. Unless I'm missing something, please file a separate issue report for this. |
For reference, I experienced the corruption in this report after minutes of running 2.0.3. The total time I used 2.0.3 was probably less than 2 hours. I'm guessing that @jstenback has been running the kernel for hundreds of hours. It might be that I just had not yet experienced that symptom. (Of course, it's also possible it's an unrelated bug). |
That is correct; my uptime during both of the crashes I mentioned was on the order of a hundred hours. And I typically start seeing the corruption after about the same amount of uptime. |
@aerusso Can you also mention what was the last 'good' version of ZFS where you didn't experience the issue? That can be helpful to narrow down the search. |
@IvanVolosyuk Unfortunately, my last known good configuration is ZFS 0.8.6 and Linux 5.9.15 (and it's stable as a rock back here). I was also unsuccessful in reproducing the bug in a VM (using a byte-for-byte copy of the whole 1 TB nvme). My current plan (once I can find a free weekend) is to try to bisect on the actual workstation exhibiting the bug. To complicate things, I'll have to do the bisection back here with Linux 5.9.15, since support for 5.10 wasn't added until very late in the release cycle. |
As I've noted in #12014, I've been running 2.0.2 (with Ubuntu 21.04, Linux 5.11.0) since 30 April and I haven't experienced any issues yet. On my server with Debian Buster, Linux 5.10 and ZFS 2.0.3 (from backports), I've experienced the issue on 4 datasets,
What I've also noted in the other issue is that after reverting back to 0.8.4, everything seemed OK. I've also managed to destroy the affected snapshots, and multiple scrubs didn't detect any issues. |
I added 3f81aba on top of Debian's 2.0.3-8, and am tentatively reporting that I cannot reproduce this bug. I've been running for about 45 minutes now, without the permanent error (I used to experience this bug immediately upon running my sanoid workload, which at this point has run three times). I would suggest that anyone already running 2.x consider applying that patch. |
Unfortunately my optimism was premature. After another two and a half hours, I did indeed experience another corrupted snapshot. |
After about 3.5 hours of uptime under Linux 5.9.15-1 (making sure this can be reproduced on a kernel supporting the known-good 0.8.6) with ZFS 3c1a2a9 (candidate 2.0.5 with another suspect patch reverted):
I failed to capture this information in my previous reports. I can reproduce this by trying to send the offending snapshot. This dataset has encryption set to [...]. Also, am I correct that there is some kind of MAC that is calculated before the on-disk checksum? My pool shows no READ/WRITE/CKSUM errors---does that mean that the data and/or the MAC was wrong before being written? Should I try changing any encryption settings?
|
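For reference, a minimal way to exercise the reproduction path described above (a sketch; names are placeholders) is to send the offending snapshot to /dev/null and then check the pool:

```sh
# The key must be loaded for a non-raw send; the error surfaces while reading.
zfs send pool/dataset@offending > /dev/null

# The permanent error is then listed here, typically as <0x0> with no filename.
zpool status -v pool
```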
More fun: I can |
Yes, both AES-CCM and AES-GCM are authenticated encryption algorithms, which protect against tampering with the cipher text.
There were some problems with metadata(<0x0>) MACs related to
You could try to change |
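If the truncated suggestion above concerns the encryption algorithm: `encryption` is a create-time property, so switching algorithms means creating a new dataset and migrating the data. A sketch under that assumption (names are placeholders):

```sh
# Check what the affected dataset currently uses.
zfs get encryption,keyformat pool/dataset

# Datasets created under an aes-256-gcm parent inherit it, so one option is
# to migrate with a non-raw send/receive (which re-encrypts on write).
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase pool/gcm
zfs snapshot pool/dataset@migrate
zfs send pool/dataset@migrate | zfs receive pool/gcm/dataset
```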
@aerusso, to answer your questions from r/zfs (hope this is the right issue):
Yes. All of them are direct children of an encrypted parent and inherit its encryption.
Sanoid is taking snapshots of all datasets every 5 minutes. I can't find any log entry about sanoid failing to send it, however, manually running
Not really. I've changed the syncoid cronjob yesterday to 5 minutes and then it happened. Have you tried reproducing it with really frequent snapshots and sending, something like
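The example was lost here, but a tight snapshot/write/send loop along those lines might look like this (a sketch, not the original suggestion; dataset names and the mountpoint are placeholders):

```sh
#!/bin/sh
# Stress loop: write a little data, snapshot, and send incrementally as fast
# as possible. pool/stress must exist and be mounted; backup/stress must not exist.
zfs snapshot pool/stress@base
zfs send pool/stress@base | zfs receive backup/stress

prev=base
while true; do
    cur="s$(date +%s)"
    dd if=/dev/urandom of=/pool/stress/junk bs=1M count=4 status=none
    zfs snapshot "pool/stress@$cur"
    zfs send -i "@$prev" "pool/stress@$cur" | zfs receive backup/stress
    prev="$cur"
done
```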
~~I'm moving everything away from that server right now; afterwards I'll reboot, test that, try to clean it up, then start a [...]~~ Ok, so I've had the following running for the last few hours: |
I have one more occurrence; however, this time no sends were involved. When I woke up today, my backup server was offline and every I/O seemed to hang (i.e., after I typed root + password it would hang, and my services were offline). Also, in dmesg I got the following warning:
One minute afterwards the snapshotting starts, and all the ZFS-related tasks start hanging:
(With the call stack below; if you think it's relevant, let me know and I can post it.) Affected tasks also included dp_sync_taskq and txg_sync, which explains why the I/O was hanging (if I may guess, z_rd_int is the read interrupt handler and txg_sync writes the transaction group to the disks). I don't have the pool events, sorry for that.

EDIT: two more things to note. The system average load is about 0.5 (an old laptop running ~10 VMs) and it probably gets high memory pressure on the ARC. I had 8 GB of huge pages reserved for the VMs and a 4 GB zfs_arc_max, with 12 GB RAM total, so the ARC has to fight with the host system (which is not much: the Linux kernel, libvirt and an SSH server, I'd guess 100-200 MB). I've now reduced the VM huge pages to 7 GB, which should reduce the memory pressure. |
I have never had one of these kernel panics, so it may be better to put this in #12014 (which I see you've already posted in -- I'm also subscribed to that bug). The output of [...] It's reassuring that merely rebooting and scrubbing makes the invalid reads go away, but you may want to set up the ZED and enable [...] |
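If it helps anyone following along, a minimal ZED setup along those lines might be (a sketch; exact zed.rc keys can vary between versions):

```sh
# Enable and start the ZFS event daemon so error events get logged/notified.
systemctl enable --now zfs-zed

# In /etc/zfs/zed.d/zed.rc, something like the following turns on mail
# notifications (verify the key names for your version):
#   ZED_EMAIL_ADDR="root"
#   ZED_NOTIFY_VERBOSE=1

# Alternatively, watch events live while reproducing the workload.
zpool events -f
```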
I can still reproduce this while running #12346. |
I'm going to try to bisect this bug over the course of the next ~months, hopefully. @behlendorf, are there any particularly dangerous commits/bugs I should be aware of lurking between e9353bc and 78fac8d? Any suggestions on doing this monster bisect (each failed test is going to take about ~3 hours, each successful test probably needs ~6 hours to make sure I'm not just "getting lucky" and not hitting the bug)? |
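Not an answer to the question, but the mechanical part of such a bisect would look roughly like this (build and install steps abbreviated; the commit hashes are the ones mentioned above):

```sh
git clone https://github.com/openzfs/zfs.git
cd zfs
git bisect start
git bisect bad 78fac8d    # version that shows the corruption
git bisect good e9353bc   # last version believed good

# At each step: build, install, reboot into the test modules, run the syncoid
# workload for several hours, then record the outcome so git picks the next commit.
sh autogen.sh && ./configure && make -s -j"$(nproc)" && sudo make install
git bisect good           # or: git bisect bad
```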
Hey there, we're seeing this as well - not sure if I should put it here or in #12014.

System specs
History
|
@wohali I'm assuming I just don't leave my machine running long enough to experience the crash. I have some questions:
I think it's very curious that this only seems to happen for people with flash memory. I wonder if it's some race in the encryption code that just doesn't really happen on slower media? Or, is it possible that this is just being caught because of the authentication (encryption), and is otherwise silently passing through for unencrypted datasets? |
Yes, exclusively an NFS server. It's /home directories, so access is varied. Under load it sees a constant 40Gbps of traffic.
Yes
No, sorry. |
I also face this issue. My env is:
The server is running Proxmox Backup Server and BackupPC to handle backups. The load varies (very busy during the night, a bit less during the day) but is mainly random read access. Sanoid is used to manage snapshots, and syncoid replicates the data every hour to a remote location. The corrupted snaps can be either those from sanoid or those from syncoid. I've never had any crash, though; the errors disappear after two scrubs (but since a scrub takes almost 3 days, most of the time a new corrupted snap appears during the previous scrub). |
I seem to be experiencing this daily (or more than once each day). Switching to raw syncs does not seem to improve anything. This is from my Thinkpad P17 Gen2, with a Xeon CPU and 128GB of ECC RAM, so even though it is a laptop, I have all the boxes ticked for not having corruption. I have a mirrored VDEV with two 2TB NVMe drives sending to my server, which does not use encryption. I'm almost at the point of dumping encryption on my laptop until this is fixed. Is there any debugging I can provide, since this is happening so often on my system? Once I delete the bad snapshots, I have to run a scrub twice to fix the pool. Luckily it only takes a few minutes to run a scrub. |
@Ryushin: Can you give more details? If you're sending raw encrypted data to another server, the received data will be encrypted. You can't send a raw encrypted stream to a server and not have the received dataset be encrypted too. So I suspect you're doing something wrong. |
The raw sends are showing encrypted on the destination. Since sending raw did not fix this problem, I've reverted back to not sending raw any longer (I destroyed the snaps on the destination and removed the -w option from syncoid). This morning, and I'm currently trying to fix this as I'm typing here, I had 145 bad snapshots. I've cleared them out and I'm now running scrubs, which only take about five minutes. Before this happened, I saw all my CPU threads go to max for a few minutes; pigz-9 was the top CPU consumer (I use compress=pigz-slow in syncoid). After the CPU calmed down, I had those 145 bad snapshots. It might be time to recreate my pool without encryption. |
In the past I was told that sending raw snapshots is not affected by this bug. Isn't that the case? |
Yeah, I thought that was the case from reading the thread, though I was still getting corrupted snapshots a few hours after changing to raw sends. I've reverted back to non-raw now, as I'd rather have the backup data on my local server unencrypted. |
Based on the last couple of posts I thought I might point out/remind that raw and non-raw sends are not bi-directionally interoperable (at least for encrypted datasets). man: https://openzfs.github.io/openzfs-docs/man/8/zfs-send.8.html#w
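To make that concrete, the difference is roughly the following (a sketch; host and dataset names are placeholders):

```sh
# Raw send (-w): ciphertext blocks are sent as-is, so the destination dataset
# stays encrypted with the source key and no key is needed to receive.
zfs send -w pool/enc@snap | ssh backup zfs receive tank/enc

# Non-raw send: data is decrypted on the sender; the received dataset is only
# encrypted if it inherits encryption on the destination (with its own key).
zfs send pool/enc@snap | ssh backup zfs receive tank/plain

# Once a target has been built one way, an existing incremental chain cannot
# be switched to the other mode.
```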
So @Ryushin, reading your posts, it sounds like you might have a bit of confusion to clear up about encrypted vs. unencrypted datasets, and the behaviour of raw and non-raw sends in relation to encrypted datasets. It would help if you could share your workflow and the commands being used in a detailed post, so folks can better visualise your setup and provide assistance. @Blackclaws, you said the following in September:
I was wondering if you could raw send a full copy of the datasets (with the history you want to maintain) to a temp dst, including the latest common snapshot from your src, and then try raw sending from the src to the temp dst to see if it would continue with raw replication. If yes, I think you know the suggestion I'm pointing towards? |
I am reading here because I was affected by issue #11294, for which a fix (PR #12981) ended up in OpenZFS 2.1.3. I still get "permanent" pool errors on <0x0> from time to time when I try to expire snaps, because I still have many encrypted snapshots that were at some point raw-received with OpenZFS < 2.1.3. But I am quite confident that newer snapshots are not affected; the above-mentioned flaw was obviously my problem. Does anyone have a case where no incremental snap was ever received with OpenZFS < 2.1.3? |
Which dataset the error occurs on is not deterministic, but the error happens every time. |
So on the destination, I'm not mixing encrypted (raw) and non-encrypted snapshots in the same destination dataset. When I switched to raw (-w), I destroyed the destination datasets first. My syncoid command:
My zfs list:
It is pretty much always the steam_games dataset that is seeing corrupted snapshots. I'm going to create an unencrypted dataset to put the steam_games data in, since there is nothing there that needs to be secure. |
Actually, after the problem this morning, it seems that manipulating any of the rpool/steam_games snapshots results in an error. So this dataset probably has some underlying corruption in it now. Even after deleting all snapshots, creating a new snapshot and trying a local zfs send | zfs receive instantly gives an error:
And the pool has a permanent error from that point. I tarred the dataset, made a new unencrypted dataset, and then untarred into that. Hopefully this fixes the problems I'm seeing... for a little while. |
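For what it's worth, the kind of local round-trip check referred to above can be scripted like this (a sketch; dataset names are placeholders):

```sh
# Snapshot the suspect dataset and copy it to a scratch dataset on the same
# pool; the authentication error, if present, is reported during the send.
zfs snapshot rpool/steam_games@check
zfs send rpool/steam_games@check | zfs receive -u rpool/steam_games_check
zpool status -v rpool

# Clean up the scratch copy afterwards.
zfs destroy -r rpool/steam_games_check
```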
Raw replication will not work on previously non-raw replicated targets. Our issue is that our history goes back much further than our current systems, as these are backups. While, yes, it would technically be possible to restore from backup to the live systems and then raw replicate to the backup systems, this would incur a rather large amount of downtime, which is currently not acceptable. There are also other good reasons not to have an encrypted backup, or to have the backup encrypted with a different key than the source. Therefore, fixing the issues that still exist here should be preferred to just working around them. |
Well, that did not last long. I got two bad snapshots for my rpool/ROOT dataset, which contains my main critical data. I'm going to have to recreate my pool without encryption this weekend and restore it from a snapshot from my server. I wanted to wait two ZFS versions after encryption was rolled out to let it mature, but this is a major bug that looks like it leads to data loss if it's allowed to keep happening. |
You shouldn't actually have lost any data. The snapshots show as bad but aren't actually in any way corrupted. Reboot the system and all should be good. To get the error to vanish you have to run two scrubs though. |
@Ryushin it would be good to see your workflow and exact commands to better understand your scenario, and also your ZFS versions and so on. |
I have not lost any data as of yet, but not being able to access snapshots using local zfs send/receive is not good (though I did not reboot). Having 145 previous snapshots go "bad" is also a scary proposition. I do have ZFS send/receive backups to my server, along with traditional file-level backups using Bareos every night, so technically I can recover from disaster.
My workflow is probably very typical. Source: Thinkpad P17 Gen2 with 128GB ECC RAM, Xeon mobile processor. Destination server: Supermicro 36-drive chassis with dual Xeon processors and 128GB of ECC RAM. So nothing really out of the ordinary. Edit: I should mention that all my pools are using ZFS 2.0 features and I have not yet upgraded them to 2.1. |
I am facing the same issue and previously I was complaining in #12014.
Given a test script:
the output would be:
So snapshots without any new data don't trigger the issue, but writing even one byte will. |
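The script and its output did not survive here, but based on that description ("snapshots without any new data don't trigger the issue, but writing even one byte will"), the shape of the test was presumably something like this sketch (tank/enc stands in for an encrypted, mounted dataset):

```sh
#!/bin/sh
zfs snapshot tank/enc@a
zfs send tank/enc@a > /dev/null           # full send: no error

zfs snapshot tank/enc@b                   # nothing written between @a and @b
zfs send -i @a tank/enc@b > /dev/null     # incremental send: no error

echo x >> /tank/enc/file                  # write a single byte
zfs snapshot tank/enc@c
zfs send -i @b tank/enc@c > /dev/null     # incremental send: error may appear
zpool status -v tank                      # ...as a permanent error on tank/enc@c
```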
I'm experiencing a similar problem with ZFS native encryption on version 2.2.2 and kernel 6.6.13 on all of my servers and zvol pools. Permanent errors start to appear after about three days of uptime. This is an old thread and I don't see that any solution was found. Does that mean ZFS native encryption is not production ready yet? |
You can use it for production. It is stable and there is no data loss or data corruption problem.
There are workloads where you have to do unencrypted sends. For the time being, I suggest you make sure you don't create or delete snapshots while an unencrypted send is running. If you only do raw encrypted zfs sends, the problem does not occur. |
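For syncoid users, raw replication means passing the raw flag through to the underlying zfs send; the option below is how I understand syncoid exposes that, so treat it as an assumption and check your syncoid version:

```sh
# --sendoptions forwards flags to `zfs send`; "w" requests a raw (encrypted)
# stream, so the destination stays encrypted with the source key.
syncoid --sendoptions=w pool/enc backuphost:tank/enc
```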
You are correct. The data within the VMs looks good and the problem only affects snapshot consistency. Unfortunately there is no way of fixing the problem once the permanent error ZFS-8000-8A has happened. I can only create a new pool.
What do you mean by "statistically sufficient write activity"? I'm running about a dozen VMs on each hypervisor; can this trigger the issue?
I was sending incremental snapshots using the default syncoid settings, which I believe does an unencrypted zfs send because the encrypted datasets use different keys. Both servers are connected to the same switch with a 10G direct link. I'm not sure I understand what "high intervals" means in this case. Can you elaborate, please?
My sanoid configuration does recursive snapshots on each dataset and zvol described in its configuration file. I don't think I can create a delay between each snapshot without modifying the script. Does that mean it's not recommended to do recursive snapshots within an encrypted dataset?
I'm sure that I was using a lock file that prevents running two sanoid/syncoid scripts at the same time; I believe that only one instance of the sanoid/syncoid script can run at a time. If I understand you correctly, sending raw encrypted zfs streams might help avoid the issue with an inconsistent list of snapshots? |
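On the recursive-snapshot point: `zfs snapshot -r` takes the snapshots of a dataset and all of its descendants atomically, so there is no per-dataset delay to insert at that level; whether sanoid issues a single atomic `-r` snapshot or iterates over datasets depends on its configuration. The underlying primitive, for reference (dataset name is a placeholder):

```sh
# One atomic, recursive snapshot of an encrypted dataset and all descendants;
# there is no time window between the child snapshots.
zfs snapshot -r "tank/enc@$(date +%Y-%m-%d_%H:%M)"
```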
You're fine to use it in production if: |
That just means that you have a relevant amount of writes to trigger this error, which you probably do.
That simply means that occasional sends are unlikely to trigger the error.
|
mheubach said:
In other words, an active system. In our company's opinion, the functionality is wholly unfit for purpose outside of home labs and toy instances where you can withstand downtime.

siilike said:
On large enough servers, such as ours, with reboots taking upwards of 6 hours and a massive workload, that's unacceptable in production. Feel free to read our horror story from 2021. And one note: this occurred for us on both spinning media and solid-state storage. So no, rjycnfynby, this isn't fixed, and there isn't even a 'good lead' as to where the problem resides. I suspect this is because generating sufficient load for reproduction isn't something the active devs can easily manage in their setups -- we certainly couldn't take our machine out of prod and give them unfettered access for weeks to diagnose. My recommendation is to rely on SED (Self-Encrypting Drives) for encryption at rest, and move on. |
I experienced the snapshot errors on my home desktop system for years. I don't even use that machine very much, so it was completely idle over 23 hours per day. It was an AMD machine running NixOS with two 7200 RPM consumer HDDs in a ZFS mirrored pair with ZFS native encryption. I had pyznap configured to take snapshots every 15 minutes. Once a day, pyznap would send a single daily snapshot to my backup pool, which was a second pair of mirrored HDDs with ZFS native encryption. Despite the machine being idle all day long, it accumulated 1-2 errored snapshots per day on the main pool. The backup pool never got any errors. Destroying the offending snapshots followed by multiple rounds of scrubs would sometimes fix the problem, sometimes not. But the errored snapshots always caused the ZFS send to the backup pool to fail, which meant my daily backups were often not performed. I replaced the main pool HDDs with a single NVMe drive several months ago and opted not to use ZFS native encryption on the new pool. pyznap still takes snapshots every 15 minutes and sends them to the ZFS-encrypted backup pool. I haven't experienced any snapshot errors since changing that main pool to not use encryption. Seeing how this problem has persisted for years, combined with the other recent data corruption bug, has caused me to really consider whether the bells and whistles of ZFS are worth the risk. |
From #11688 (comment):
@wohali your spec in 2021 included:
What now? (Since FreeBSD stable/12 is end of life.) |
We are always on the latest released TrueNAS Core. Right now that's FreeBSD 13.1, but with the next patch release it will be 13.2. |
@wohali Prior research has found that hardware-based encrypted disks very widely have serious vulnerabilities that allow the encryption to be bypassed (e.g., master passwords or incorrectly implemented cryptographic protocols) (1, 2, 3). While many of these may be fixed now, this is difficult to verify. Software-based encryption offers the advantage of being verifiable. For Linux, LUKS is a widely accepted choice and does not suffer from the same stability issues as ZFS native encryption. |
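For completeness, the LUKS alternative mentioned here is just a dm-crypt layer beneath the pool, with ZFS itself left unencrypted; a minimal sketch (device and pool names are placeholders):

```sh
# Format the partition with LUKS and open it as a mapped device.
cryptsetup luksFormat /dev/nvme0n1p2
cryptsetup open /dev/nvme0n1p2 crypt-tank

# Build the pool (unencrypted as far as ZFS is concerned) on top of it.
zpool create tank /dev/mapper/crypt-tank
```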
@muay-throwaway Throwaway is right. I did not ask for your advice or approval, nor can you help resolve this specific issue. Further, all three of your references refer to the exact same 2 CVEs from 2018. Kindly leave this issue to those who are directly impacted or directly trying to solve the problem, rather than sea lion in from nowhere. Thank you. |
System information
Describe the problem you're observing
After upgrading to zfs 2.0.3 and Linux 5.10.19 (from 0.8.6), a well-tested syncoid workload causes "Permanent errors have been detected in the following files:" reports for a `pool/dataset@snapshot:<0x0>` (no file given). Removing the snapshot and running a scrub causes the error to go away.
This is on a single-disk NVMe SSD that never experienced any problems before upgrading, and it has happened twice, once after each reboot/re-run of the syncoid workload. I have since booted back into 0.8.6, run the same workload, and not experienced the error report.
Describe how to reproduce the problem
Beyond the above, I do not have a mechanism to reproduce this. I'd rather not blindly do it again!
Include any warning/errors/backtraces from the system logs
See also @jstenback's reports of very similar symptoms: 1 and 2, which appear distinct from the symptoms of the bug report they are in. Additionally, compare to @rbrewer123's reports 1 and 2, which come with a kernel panic (I do not experience this).
My setup is very similar: I run a snapshot workload periodically, and transfer the snapshots every day to another machine. I also transfer snapshots much more frequently to another pool on the same machine.
If valuable, I have `zpool history` output that I can provide. Roughly, the workload looks like many `snapshot`, `send -I`, and `destroy` operations (on one pool) and `receive` (on the same machine, but another pool).
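A rough sketch of that workload shape (dataset and snapshot names are placeholders, not taken from the actual zpool history):

```sh
# Periodic cycle: snapshot, incrementally send to a second pool on the same
# machine, then prune the old snapshot on the source.
zfs snapshot pool/data@2021-03-06
zfs send -I pool/data@2021-03-05 pool/data@2021-03-06 | zfs receive backup/data
zfs destroy pool/data@2021-03-05
```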