Fix ENOSPC in "Handle zap_add() failures in mixed ..." #7421

Merged
merged 1 commit into openzfs:master from the i7401 branch
Apr 18, 2018

Conversation

tuxoko
Contributor

@tuxoko tuxoko commented Apr 10, 2018

Commit cc63068 caused an ENOSPC error when copying a large number of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries and returns ENOSPC once that limit is hit.

The intent of limiting retries was to prevent pointlessly growing the table
to its maximum size when adding a block full of entries with the same name
in different case in mixed mode. However, it turns out we cannot place any
limit on the retries. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. That means
that if a leaf block in the source directory has expanded 6 times, and you
copy the entries in that block, then by the time you need to expand the
leaf in the destination directory, you need to expand it 6 times in one go.
So any limit on the retries will produce an error where there shouldn't
be one.

Note that while we do use a different salt for each directory, it seems
that the salt/hash function doesn't randomize the hash distribution enough
to prevent this from happening.

Since cc63068 has already been reverted, this patch adds it back and
removes the retry limit.

Also, as it turns out, failing in zap_add() has a serious side effect on
mzap_upgrade(). When upgrading from a micro zap to a fat zap, it calls
zap_add() to transfer entries one at a time. If it hits any error halfway
through, the remaining entries are lost, causing those files to become
orphans. This patch adds a VERIFY to catch that.

Signed-off-by: Chunwei Chen [email protected]
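
The upgrade hazard described in the last paragraph is easy to picture with a small, self-contained toy program. This is only a sketch with invented names (toy_fat_add, TOY_FAT_CAPACITY), not the actual zap_micro.c code: entries move one at a time into a deliberately undersized destination, and asserting success is what turns a partial transfer into a visible failure instead of silently orphaned entries.

/*
 * Toy model of the mzap_upgrade() hazard, not the actual zap_micro.c code;
 * all names here are invented for illustration.  Entries are moved one at
 * a time from the old (micro) form into the new (fat) form.  If an add can
 * fail partway through, every entry not yet transferred is lost, because
 * the old form has already been discarded.  Asserting that each add
 * succeeds turns that silent loss into an immediate, visible failure.
 */
#include <assert.h>
#include <stdio.h>

#define	TOY_FAT_CAPACITY	4	/* artificially small to force a failure */

static int toy_fat_count;

static int
toy_fat_add(const char *name)
{
	if (toy_fat_count >= TOY_FAT_CAPACITY)
		return (-1);		/* stands in for ENOSPC from zap_add() */
	toy_fat_count++;
	printf("transferred %s\n", name);
	return (0);
}

int
main(void)
{
	const char *micro[] = { "a", "b", "c", "d", "e", "f" };
	int n = sizeof (micro) / sizeof (micro[0]);

	for (int i = 0; i < n; i++) {
		int err = toy_fat_add(micro[i]);
		/*
		 * Returning err here would orphan micro[i..n-1].  With the
		 * retry limit removed the add cannot legitimately fail, so
		 * the upgrade asserts success instead.
		 */
		assert(err == 0);
	}
	return (0);
}

Run as-is, the toy aborts on the fifth entry rather than quietly dropping it, which is the behaviour the new VERIFY gives the real upgrade path once the retry limit (and with it the legitimate failure mode) is gone.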

Description

Motivation and Context

#7401

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@tuxoko
Contributor Author

tuxoko commented Apr 10, 2018

@behlendorf behlendorf requested review from ahrens and behlendorf April 10, 2018 17:33
@tuxoko tuxoko force-pushed the i7401 branch 3 times, most recently from eb6d663 to cc83da9 on April 10, 2018 17:50
@trisk
Contributor

trisk commented Apr 10, 2018

If mze_canfit_fzap_leaf is canonical (and not overly conservative), can we make it a condition for all paths in zap_add, not just microzap to fatzap upgrades? That way fzap_add is guaranteed to succeed.

@ahrens
Member

ahrens commented Apr 10, 2018

Thanks for your analysis, and for adding a test case!

Are we seeing that 2 ZAPs end up with the same salt, or is it that even with (slightly?) different salts, the low bits of the hash value are preserved? Either way, perhaps we should consider initializing the salt with random_get_pseudo_bytes() rather than the current hacky way.

@tuxoko
Contributor Author

tuxoko commented Apr 10, 2018

@ahrens
The current salt on Linux is different per directory, but pretty limited: because it XORs two pointers, the upper bits are all zeros. However, I did try using random_get_pseudo_bytes() and still got the ENOSPC error.
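
A quick standalone illustration of that point, with made-up addresses rather than the actual ZFS salt code: two pointers allocated in the same region share their high bits, so XOR-ing them cancels those bits and the salt varies only in its low bits.

/*
 * Toy demonstration of why XOR-ing two pointers makes a weak salt.  The
 * addresses are invented examples of kernel virtual addresses; this is not
 * the ZFS implementation.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t ptr_a = 0xffff8800deadbe00ULL;	/* hypothetical address */
	uint64_t ptr_b = 0xffff8800c0ffee40ULL;	/* hypothetical address */
	uint64_t salt = ptr_a ^ ptr_b;

	/* The shared high bits cancel, leaving only low-bit variation. */
	printf("salt = %016llx\n", (unsigned long long)salt);
	return (0);
}

The printed salt has an all-zero upper half. Even so, as noted above, a fully random salt did not avoid the ENOSPC either, which points at the hash function itself rather than just the salt.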

@trisk
The point of mze_canfit_fzap_leaf is to make sure you can fit the entries with the same hash value (mixed mode) when you upgrade, so you won't fail during the upgrade and end up losing entries. But during a normal zap_add, we need to let it fail in order for mixed mode to work.

@ryao
Contributor

ryao commented Apr 10, 2018

@tuxoko Good work. This explains things nicely.

@trisk
Contributor

trisk commented Apr 10, 2018

@tuxoko Thanks, makes sense that we'd need the failure to allow expansions on an existing fatzap.
Changes look good!

@ahrens
Member

ahrens commented Apr 10, 2018

@tuxoko I tested with salt = random_get_pseudo_bytes() and I also saw that the directory entries have similar order when sorted by hash value. This is most easily noticed by observing that the inode (object) numbers returned from getdents() (or in zdb) of the new directory are nearly sequential, indicating that the new sorting (dirent order) is similar to the old sorting (object ID, indicating creation order). Example output below.

I would guess that this is a property of the zap hash function. We should consider replacing it with something better, e.g. cityhash https://github.com/google/cityhash/. But that would require a ZPL on-disk version bump, and obviously your change should go forward without that.

illumos$ truss -v getdents ls -U /test/fs3/fs
getdents64(3, 0xFEDE4000, 8192)                 = 8192
    ino=37169 off=269237847 rlen=32  "100763-8531"
    ino=37172 off=269243910 rlen=32  "100742-7627"
    ino=37171 off=269244466 rlen=32  "100721-8797"
    ino=37175 off=269246886 rlen=32  "100728-5858"
    ino=37174 off=269247203 rlen=32  "100735-7141"
    ino=37173 off=269247978 rlen=32  "100749-2542"
    ino=37177 off=269251869 rlen=32  "100742-3581"
    ino=37176 off=269252650 rlen=32  "100777-1775"
    ino=37179 off=269255038 rlen=32  "100728-3104"
    ino=37178 off=269257892 rlen=32  "100721-2524"
    ino=37181 off=269260142 rlen=32  "100735-9054"
    ino=37180 off=269260397 rlen=32  "100763-4310"
    ino=37185 off=269263243 rlen=32  "100742-9732"
    ino=37184 off=269263496 rlen=32  "100714-4476"
    ino=37183 off=269264877 rlen=32  "100770-4050"
    ino=37182 off=269266879 rlen=32  "100721-6682"
    ino=37186 off=269269057 rlen=32  "100756-2242"
    ino=37187 off=269273562 rlen=32  "100763-6424"
    ino=58243 off=269275058 rlen=32  "100749-870"
    ino=37189 off=269278921 rlen=32  "100728-1630"
    ino=58241 off=269279797 rlen=32  "100742-116"
    ino=37190 off=269281985 rlen=32  "100763-2782"
    ino=37193 off=269285466 rlen=32  "100770-6764"
    ino=37191 off=269286719 rlen=32  "100714-6342"
    ino=37192 off=269286866 rlen=32  "100728-5596"
    ino=37196 off=269288173 rlen=32  "100777-7273"
    ino=37195 off=269291482 rlen=32  "100742-5087"
    ino=37198 off=269293604 rlen=32  "100735-1447"
    ino=37197 off=269295405 rlen=32  "100749-4044"
    ino=37199 off=269298881 rlen=32  "100742-1321"
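
For anyone who wants to put a number on that observation on Linux, here is a hypothetical little checker (not part of this PR or the test suite) that walks a directory in readdir order and counts how often consecutive entries have increasing inode numbers:

/*
 * Hypothetical helper, not part of the PR: walk a directory in readdir()
 * order and report how often consecutive entries have increasing inode
 * numbers.  A figure far above ~50% means dirent (hash) order still tracks
 * creation order, as in the truss output above.
 */
#include <dirent.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return (1);
	}

	DIR *dir = opendir(argv[1]);
	if (dir == NULL) {
		perror("opendir");
		return (1);
	}

	struct dirent *de;
	unsigned long prev = 0, pairs = 0, ascending = 0;
	int have_prev = 0;

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		if (have_prev) {
			pairs++;
			if ((unsigned long)de->d_ino > prev)
				ascending++;
		}
		prev = (unsigned long)de->d_ino;
		have_prev = 1;
	}
	(void) closedir(dir);

	if (pairs != 0)
		printf("%lu of %lu consecutive entries ascend by inode (%.1f%%)\n",
		    ascending, pairs, 100.0 * ascending / pairs);
	return (0);
}

If dirent order were unrelated to creation order the ratio should hover around 50%; on a directory like the listing above it will be much higher.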

@ryao
Contributor

ryao commented Apr 10, 2018

@ahrens It is not likely to change your conclusions, but I should point out that the implementations of random_get_pseudo_bytes() differ between Illumos and Linux. Illumos uses a "FIPS 186-2 algorithm" while we use a 128-bit Xorshift Pseudo Random Number Generator:

https://github.com/zfsonlinux/spl/blob/master/module/spl/spl-generic.c#L64

That implies that the tests that the two of you did were using different sources of randomness, but came to the same result. You might have expected that, but I thought I would state it explicitly.

done

log_must test $NR_FILES -eq $(ls -U $TESTDIR/src | wc -l)

Contributor

Can you add a quick comment here explaining why you use cp_files instead of just cp?

Contributor Author

Added

@tonyhutter
Contributor

I separated out your test case code and tried it on master with/without the bad patch (cc63068). With the bad patch, it correctly failed 18/20 times. When I bumped NR_FILES from 30,000 to 60,000, it correctly failed 20/20 times, while only increasing the test time from 1min -> 1min30s. Without the bad patch, it correctly passed 10/10 times. I also tested your full patch in this PR and saw it correctly passed 10/10 times.

@tuxoko tuxoko force-pushed the i7401 branch 2 times, most recently from 2ddec57 to 2ac5646 on April 11, 2018 21:14
@tuxoko
Contributor Author

tuxoko commented Apr 11, 2018

@tonyhutter
I bumped NR_FILES to 60000, and also changed how the files are created to speed up the test case.

Contributor

@behlendorf behlendorf left a comment

@tuxoko thanks for getting to the bottom of this.


WD=$(pwd)
cd $TESTDIR/src
# create NR_FILES in BATCH at a time to prevent overlowing argument buffer
Contributor

overflowing?


log_must test $NR_FILES -eq $(ls -U $TESTDIR/src | wc -l)

# copy files from src to dst, use cp_files to make sure readdir order
Contributor

nit: s/make sure/ensure

@tuxoko
Contributor Author

tuxoko commented Apr 17, 2018

Fix the comment.

@codecov

codecov bot commented Apr 17, 2018

Codecov Report

Merging #7421 into master will decrease coverage by 0.02%.
The diff coverage is 81.39%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7421      +/-   ##
==========================================
- Coverage   76.53%    76.5%   -0.03%     
==========================================
  Files         335      331       -4     
  Lines      107100   104299    -2801     
==========================================
- Hits        81965    79792    -2173     
+ Misses      25135    24507     -628
Flag Coverage Δ
#kernel 76.64% <96.42%> (-0.41%) ⬇️
#user 65.65% <67.39%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d68ac65...983d87d. Read the comment docs.

@sanjeevbagewadi
Contributor

Thank you @tuxoko for handling this while I was away. I agree with the changes.

The optimisation of bailing out early was meant to avoid growing the table unnecessarily. But, as seen in this bug, we cannot deterministically decide when it is safe to bail out. Hence, we will need to live with the extra indirection.

@behlendorf behlendorf merged commit 599b864 into openzfs:master Apr 18, 2018
@lnicola
Contributor

lnicola commented Apr 24, 2018

This isn't the best place to ask, but is there a way to detect whether you've been affected by this issue yet? If not, will the FAQ be updated when there is one?

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 15, 2018
Reviewed-by: Sanjeev Bagewadi <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Albert Lee <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes openzfs#7401 
Closes openzfs#7421
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 23, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 5, 2018
@mailinglists35

mailinglists35 commented Sep 30, 2018

@ryao apologies for interrupting you with my tagging. I know this is probably very low priority, but is there any way for a user who cannot read the ZFS source code to verify whether they have been impacted by this issue?

I read on the remedy pull request that this is triggered by copying a "large amount of files", but it does not mention the time distribution of those files.

Is there a rate to compare against that would guarantee the issue was (or was not) triggered, like N files per second or per CPU cycle?

Thank you! :)

amotin added a commit to amotin/zfs that referenced this pull request Apr 26, 2024
amotin added a commit to amotin/zfs that referenced this pull request Apr 26, 2024
amotin added a commit to amotin/zfs that referenced this pull request Apr 26, 2024
amotin added a commit to amotin/zfs that referenced this pull request Apr 27, 2024
behlendorf pushed a commit that referenced this pull request May 17, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in #7421, but it
was never fully ported to FreeBSD.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #13215
Closes #16138
amotin added a commit to amotin/zfs that referenced this pull request May 23, 2024
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 23, 2024
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 23, 2024
behlendorf pushed a commit that referenced this pull request May 29, 2024
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024