Unlistable and disappearing files #7401
Comments
I can confirm the same behavior on a minimal CentOS 7.4 installation (running inside VirtualBox) and the latest ZFS 0.7.7. Please note that it does not happen when copying somewhat bigger files (e.g. kernel source), so it seems like a race condition...
The problem does NOT appear on ZoL 0.7.6:
Maybe it can help. Here is the output of
... and
We are also seeing similar behavior since installing 0.7.7.
I have a hand-built ZoL 0.7.7 on a stock Ubuntu 16.04 server (currently with Ubuntu kernel version '4.4.0-109-generic') and I can't reproduce this problem on it, following the reproduction steps here and some variants (e.g. using 'seq -w' to make all of the filenames the same size). The pool I'm testing against has a single mirrored vdev.
One more data point, with the hope that it helps narrow down the issue. I cannot reproduce the issue on the few machines I have here, neither with 10k files, nor with 100k or even 1M. They all have very similar configurations. They use a single 2-drive mirrored vdev. The drives are Samsung SSD 950 PRO 512GB (NVMe, quite fast).
I get a worse situation on the latest CentOS 7 with kmod:

```
[root@zirconia test]# mkdir SRC
[root@zirconia test]# zpool status
```
@rblank Did you use empty files? Please try the following:
Thanks.
I used the exact commands from the OP (which create non-empty files), only changing 10000 to 100000 and 1000000. But for completeness, I tried yours as well.
The few data points above weakly hint at raidz, since no one was able to reproduce on mirrors so far.
On one of my datasets this works fine; on another it exhibits the problem. Both datasets belong to the same pool.

```
bash-4.2$ mkdir SRC
```

On beast/engineering the above commands run without issue. On beast/dataio they fail.
I think the issue is related to primarycache=all. If I set a pool to have primarycache=metadata there are no errors.
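For reference, a minimal sketch of that workaround (the dataset name `tank/test` is an assumption; primarycache is a per-dataset property, so pick the dataset you are testing against):

```
# inspect, then change, the primarycache property (dataset name assumed)
zfs get primarycache tank/test
zfs set primarycache=metadata tank/test
```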
@rblank I replicated the issue with a simple, single-vdev pool. I'll try with a mirror and report back, anyway. @alatteri What pool/vdev layout do you use? Can you show
Same machine, different datasets on the same pool.
@vbrik what's the HW config of this system - how much RAM, what model of x86_64 CPU?
I can confirm this bug on a mirrored zpool. It is a production system so I didn't do much testing before downgrading to 0.7.6:
I have attempted to reproduce the bug on 0.7.6 without success. Here is an excerpt of one of the processor feature levels:
I still get it with primarycache=metadata, on the first attempt to cp:
For those that have upgraded to the 0.7.7 branch - is it advisable to downgrade back to 0.7.6 until this regression is resolved?
What is the procedure to downgrade ZFS on CentOS 7.4?
For reverts, I usually do:
Note that with dkms installs, after reverts, I usually find I need to:
To make sure all modules are actually happy and loadable on reboot.
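Since the commenter's exact commands are elided above, here is a hedged sketch of what such a downgrade on CentOS 7.4 with DKMS packages might look like (the package set and version strings are assumptions, not the commenter's commands):

```
# roll the ZoL packages back to 0.7.6 (assumed package set)
yum downgrade zfs-0.7.6 zfs-dkms-0.7.6 libzfs2-0.7.6 libzpool2-0.7.6 \
              spl-0.7.6 spl-dkms-0.7.6

# with DKMS installs, verify the 0.7.6 modules rebuilt and load cleanly
dkms status
dkms autoinstall
modprobe zfs && modinfo zfs | grep ^version
```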
Is this seen with rsync instead of cp?
I'm not able to reproduce this, and I have several machines (Debian unstable; 0.7.7, Linux 4.15). Can people also include
Ok, I've done some more tests.
On an Ubuntu Server 16.04 LTS with compiled 0.7.7 spl+zfs (so not using the repository version), I cannot reproduce the error. As a side note, compiling on Ubuntu does not give any warning. So, the problem seems confined to CentOS/RHEL territory. To me, it seems a timing/race problem (possibly related to the ARC): anything which increases copy time lowers the error probability/frequency. Some examples of actions which lower the fail rate:
[1] compilation gives the following warning:
Is anyone experiencing this issue with "recent" mainline kernels (like 4.x)?
Greetings,
The output of my yum install:
I am using rsnapshot to do backups. It is when it runs the equivalent of the commands below that issues come up.
There's plenty of space
For those that want to know my hardware, the system is an AMD X2 255 processor with 8GB of memory (so far more than enough for my home backup system). I can revert today, or I can help test if someone needs me to try something. Just let me know. Thanks!
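The commenter's exact rsnapshot command is not preserved above. For context, rsnapshot's copy step is typically a hard-link copy of the previous backup followed by an rsync into it, roughly along these lines (paths are assumptions); creating many directory entries in quick succession is exactly the pattern that triggers this bug:

```
# hard-link the previous backup tree, then sync changes into the newest copy
cp -al /backup/daily.0 /backup/daily.1
rsync -a --delete /home/ /backup/daily.0/
```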
Can someone who can repro this try bisecting the changes between 0.7.6 and 0.7.7 so we can see which commit breaks people?
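A hedged sketch of how such a bisect could be run against the ZoL git tree (the release tags follow the project's `zfs-x.y.z` naming; the build/test steps are only outlined, not spelled out):

```
git clone https://github.com/openzfs/zfs.git
cd zfs
git bisect start
git bisect bad zfs-0.7.7      # known-bad release tag
git bisect good zfs-0.7.6     # known-good release tag
# at each step: build and install the modules, run the cp reproducer,
# then mark the result with `git bisect good` or `git bisect bad`
```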
Most likely cc63068; it seems to be a race condition in the mzap->fzap upgrade phase.
@loli10K this, uh, seems horrendous enough that unless someone volunteers a fix for the race Real Fast, a revert and cutting a point release for this alone seems like it would be merited, to me at least.
@rincebrain I can try later today. I'm meeting some friends for lunch and will be gone for a few hours but I'm happy to help how I can when I get back.
@cstackpole if you do, it's probably worth trying with and without the commit @loli10K pointed to, rather than letting the bisect naturally find it.
From what we have seen so far it certainly seems to only affect older (by which I mean lower-versioned) kernels. I have not been able to reproduce the issue on Linux 4.15 (Fedora).
Issue openzfs#7401 identified data loss when many small files are being copied. Add a test to check for this condition. Signed-off-by: Antonio Russo <[email protected]>
Our analysis is not finished. I am reopening this pending its completion.
Right, I didn't mean to suggest this issue should be closed or that reverting the change was all that was needed. There's still clearly careful investigation to be done, which we can now focus on. @ryao when possible, rolling back to a snapshot would be the cleanest way to recover these files. However, since that won't always be an option, let's investigate implementing a generic orphan recovery mechanism. Adding this functionality initially to
Remove 0.7.7 links due to a regression: openzfs/zfs#7401 Signed-off-by: Tony Hutter <[email protected]>
This reverts commit cc63068. Under certain circumstances this change can result in an ENOSPC error when adding new files to a directory. See #7401 for full details. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Issue #7401 Closes #7416
This reverts commit 8c3cb02. See issues NixOS#38666 and openzfs/zfs#7401.
Given the improved understanding of the cause of this regression, can anything be said about the behaviour of rsync? If it reports no errors, are the data fine? What about mv? And what if mv is from one dataset to another, on the same pool?
@darrenfreeman The mailing list or IRC chatroom would probably be a better place to ask, but
Also, one final caveat:
rsync always sorts files, so it should be fine. And as long as you don't receive errors, you should be fine.
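In other words, copying with something like the following should avoid the failure mode seen with `cp -r` (a hedged example; the SRC/DST names mirror the reproducer above, and the trailing slashes matter to rsync):

```
# archive-mode copy; rsync walks a sorted file list rather than readdir order
rsync -a SRC/ DST/
```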
Reproducibility: yes
Reproduced using:
Furthermore, this didn't look good:
The pool was freshly created as:
I am trying to install the debug symbols for
Update: can't reproduce the segfault on
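For anyone wanting to test against a freshly created throwaway pool without touching real disks, a minimal sketch using a file-backed vdev (the file path, pool name, and dataset name are assumptions):

```
# create a small file-backed pool purely for reproducing the bug
truncate -s 1G /var/tmp/zpool-test.img
zpool create testpool /var/tmp/zpool-test.img
zfs create testpool/scratch
```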
Thanks for the solutions and the quick efforts to fix this.
Given this bug has now been listed on The Register (https://www.theregister.co.uk/2018/04/10/zfs_on_linux_data_loss_fixed/), it might be wise to have an FAQ article on the wiki page (with a link in this ticket). The FAQ article should clearly state which versions of ZoL are affected and which distros/kernel versions (similar to the birthhole bug). This would hopefully limit any panic concerns about the reliability of ZoL as a storage layer.
From that article (emphasis mine): Ouch. I agree with @markdesouza that there should be an FAQ article for that, so we ZFS apologizers can point anyone who questions us about it to that article. I would also like to suggest that the ZFS sign-off procedure be reviewed to avoid (or at least make far less likely) such a "cruddy commit" making it into a ZFS stable release, and that notice of this review also be added to that same FAQ article.
Due to upstream data loss bug: openzfs/zfs#7401
In #7411, the |
Answering my earlier question. Debian 9.3 as above.
So anyone who prefers
Something similar to
FYI - you probably all saw it already, but we released zfs-0.7.8 with the reverted patch last night.
@ort163 We do not have a one-liner yet. People are continuing to analyze the issue and we will have a proper fix in the near future. That will include a way to detect+correct the wrong directory sizes, list affected snapshots, and place the orphaned files in some kind of lost+found directory. I am leaning toward extending scrub to do it.
@markdesouza I have spent a fair amount of time explaining things to end users on Hacker News, Reddit and Phoronix. I do not think that our understanding is sufficient to post a final FAQ yet, but we could post an interim FAQ.

I think the interim FAQ entry should advise users to upgrade ASAP to avoid having to possibly deal with orphaned files if nothing has happened yet, or more orphaned files if something has already happened; and not to change how they do things after upgrading (unless they deem it necessary) until we finish our analysis, make a proper fix, and issue proper instructions in the release notes on how to repair the damage. I do not think there is any harm to pools if datasets have incorrect directory sizes and orphaned files while people wait for us to release a proper fix with instructions on how to completely address the issue, so telling them to wait after upgrading should be fine. The orphaned files should stay around and persist through send/recv unless a snapshot rollback is done or the dataset is destroyed.

Until that is up, you could point users to my Hacker News post: https://news.ycombinator.com/item?id=16797932

Specifically, we need to nail down whether existing files' directory entries could be lost, what if any other side effects happen when this is triggered on new file creation, what course of events leads to directory entries disappearing after ENOSPC, how system administrators could detect it, and how system administrators will repair it. Then we should be able to make a proper FAQ entry.

Edit: The first 3 questions are answered satisfactorily in #7421.
Commit cc63068 caused an ENOSPC error when copying a large number of files between two directories. The reason is that the patch limits zap leaf expansion to 2 retries, and returns ENOSPC when it fails.

The intent of limiting retries is to prevent pointlessly growing the table to max size when adding a block full of entries with the same name in different case in mixed mode. However, it turns out we cannot use any limit on the retry. When we copy files from one directory in readdir order, we are copying in hash order, one leaf block at a time. This means that if the leaf block in the source directory has expanded 6 times, and you copy those entries in that block, by the time you need to expand the leaf in the destination directory, you need to expand it 6 times in one go. So any limit on the retry will result in an error where there shouldn't be one. Note that while we do use a different salt for different directories, it seems that the salt/hash function doesn't provide enough randomization of the hash distance to prevent this from happening.

Since cc63068 has already been reverted, this patch adds it back and removes the retry limit.

Also, as it turns out, failing in zap_add() has a serious side effect for mzap_upgrade(). When upgrading from micro zap to fat zap, it calls zap_add() to transfer entries one at a time. If it hits any error halfway through, the remaining entries will be lost, causing those files to become orphans. This patch adds a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Albert Lee <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes openzfs#7401 Closes openzfs#7421
System information
Describe the problem you're observing
Data loss when copying a directory with a large-ish number of files. For example,
cp -r SRC DST
with 10000 files in SRC is likely to result in a couple of "cp: cannot create regular file `DST/XXX': No space left on device" error messages, and a few thousand files missing from the listing of the DST directory. (Needless to say, the filesystem being full is not the problem.)

The missing files are missing in the sense that they don't appear in the directory listing, but can be accessed using their name (except for the couple of files for which `cp` generated the "No space left on device" error). For example, the content of DST/FOO is accessible by path (e.g. `cat DST/FOO` works) and is the same as SRC/FOO. If caches are dropped (`echo 3 > /proc/sys/vm/drop_caches`) or the machine is rebooted, opening FOO directly by path fails.

`ls -ld DST` reports N fewer hard links than SRC, where N is the number of files for which `cp` reported the "No space left on device" error.

Names of missing files are mostly predictable if SRC is small.
Scrub does not find any errors.
I think the problem appeared in 0.7.7, but I am not sure.
Describe how to reproduce the problem
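The reproduction commands themselves are not preserved in this copy of the issue; based on the description above, a hedged reconstruction looks roughly like this (the file count, file contents, and names are assumptions):

```
# populate a directory with many small, non-empty files
mkdir SRC
for i in $(seq 1 10000); do echo test > SRC/$i; done

# copy; on affected systems a few "No space left on device" errors appear
cp -r SRC DST

# compare entry counts and directory link counts
ls SRC | wc -l
ls DST | wc -l
ls -ld SRC DST

# files missing from the DST listing may still open by path until caches drop
echo 3 > /proc/sys/vm/drop_caches
```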
Include any warning/errors/backtraces from the system logs