Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue-4023: partial overwrite of metadata holes results in loss of hole birth info #4024

Closed
wants to merge 5 commits into from

Conversation

bprotopopov
Copy link
Contributor

When reading Ln metadata holes with non-zero birth epochs, instead of initializing the dbuf data with zeros, initialize with Ln-1 hole block pointers of same dmu type and birth epoch as Ln hole, and of
proper level and logical size, in order to make sure the new (Ln-1) holes inherit correct hole info.

This is important when generating incremental zfs send streams with hole_birth feature active, where
the proper hole birth is required in order to generate proper FREE records for the new holes.

Regression tested with ztest and verified proper assignment of hole birth epochs with zdb and by parsing the resulting send stream.

@behlendorf
Copy link
Contributor

I haven't had a chance to look closely at the patch but I did notice it causes a failure in xfstests 013. Stack below.

[ 1542.828564] VERIFY(buf[i] == 0) failed
[ 1542.831199] PANIC at dbuf.c:550:dbuf_verify()
[ 1542.833949] Showing stack for process 627
[ 1542.836451] CPU: 1 PID: 627 Comm: fsstress Tainted: PF          O--------------   3.10.0-229.el7.x86_64 #1
[ 1542.842295] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015
[ 1542.845819]  ffffffffa04710d4 000000009d5dea6d ffff88034c7e7728 ffffffff81604b0a
[ 1542.850366]  ffff88034c7e7738 ffffffffa025e3f4 ffff88034c7e78c0 ffffffffa025e4bf
[ 1542.854827]  0000000000000000 ffff880300000028 ffff88034c7e78d0 ffff88034c7e7870
[ 1542.858896] Call Trace:
[ 1542.860416]  [<ffffffff81604b0a>] dump_stack+0x19/0x1b
[ 1542.863445]  [<ffffffffa025e3f4>] spl_dumpstack+0x44/0x50 [spl]
[ 1542.866381]  [<ffffffffa025e4bf>] spl_panic+0xbf/0xf0 [spl]
[ 1542.869398]  [<ffffffff816165f1>] ? ftrace_call+0x5/0x2f
[ 1542.872030]  [<ffffffff816165f1>] ? ftrace_call+0x5/0x2f
[ 1542.874992]  [<ffffffffa036671e>] dbuf_verify+0x59e/0x750 [zfs]
[ 1542.878330]  [<ffffffffa0366a5d>] dbuf_rele_and_unlock+0x4d/0x4a0 [zfs]
[ 1542.881528]  [<ffffffff816165f1>] ? ftrace_call+0x5/0x2f
[ 1542.884659]  [<ffffffffa03671cb>] dbuf_rele+0x4b/0x80 [zfs]
[ 1542.887780]  [<ffffffffa03699c7>] __dbuf_hold_impl+0x277/0x900 [zfs]
[ 1542.890856]  [<ffffffffa036a0c3>] dbuf_hold_impl+0x73/0xa0 [zfs]
[ 1542.893868]  [<ffffffffa036a5ce>] dbuf_hold_level+0x2e/0x60 [zfs]
[ 1542.897595]  [<ffffffffa0387779>] dmu_tx_check_ioerr+0x39/0xf0 [zfs]
[ 1542.901255]  [<ffffffffa0387e9e>] dmu_tx_count_write+0x2de/0x760 [zfs]
[ 1542.905182]  [<ffffffff816165f1>] ? ftrace_call+0x5/0x2f
[ 1542.908501]  [<ffffffffa0256c20>] ? spl_kmem_zalloc+0xf0/0x1a0 [spl]
[ 1542.912240]  [<ffffffffa0256c20>] ? spl_kmem_zalloc+0xf0/0x1a0 [spl]
[ 1542.915966]  [<ffffffffa03886da>] dmu_tx_hold_write+0x5a/0x100 [zfs]
[ 1542.919837]  [<ffffffffa042c18b>] zfs_write+0x3eb/0xe80 [zfs]
[ 1542.923451]  [<ffffffff811d3ff1>] ? link_path_walk+0x81/0x8b0
[ 1542.926938]  [<ffffffff816165f1>] ? ftrace_call+0x5/0x2f
[ 1542.930272]  [<ffffffffa03c1bd6>] ? refcount_remove+0x16/0x20 [zfs]
[ 1542.934140]  [<ffffffffa044b3c0>] zpl_write_common_iovec.constprop.8+0x90/0x130 [zfs]
[ 1542.938926]  [<ffffffffa044b56e>] zpl_write+0x7e/0xb0 [zfs]
[ 1542.942113]  [<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0
[ 1542.945229]  [<ffffffff811c76b8>] SyS_write+0x58/0xb0
[ 1542.948320]  [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b

@bprotopopov
Copy link
Contributor Author

@behlendorf, I will take a look, thanks for the pointer.

@bprotopopov
Copy link
Contributor Author

Yes, the commit has clearly changed some invariants that are being checked in debug mode; let me test some more with ZFS_DEBUG on.

@bprotopopov bprotopopov force-pushed the issue-4023 branch 2 times, most recently from 799624d to 3688423 Compare December 17, 2015 22:38
@bprotopopov
Copy link
Contributor Author

I am running ztest and with default stack size, I see intermittent asserts, whereas with 64k stack size, I do not see asserts. Is this normal ?

Also, I looked at the implementation of the range lock in ztest, and I am not sure I understand why it is supposed to work. Perhaps for small object numbers and offsets with large alignments, the range locks become object-scope locks, and they "work". But for unaligned offsets, they seem to easily break.

The reason I ask is that the test I added (ztest_dmu_free_long_range()) requires working range locks. I noticed that the ztest_dmu_prealloc() that also uses range locks and dmu_free_long_range() is disabled both in zfsonlinux and in openzfs code since a very long time ago.

I am not certain how to proceed - should I also disable my test ?

@behlendorf
Copy link
Contributor

@bprotopopov on what distribution and release? We do still do see occasional segfaults on CentOS 6 which may be related to stack size. Although these default stack size is 256K for ztest unless set otherwise so it would be odd that making it smaller helps. Unless we're just failing to create new threads due to memory being exhausted?

As for the range locks I've actually never looked very carefully at them in ztest, I see it's using it's own version. But if they're broken we should fix them first so you can add your test case (and we can reenable the other tests). Actually it would be great if ztest shared the same range lock code so it could also be validated but that might take a fair bit of refactoring.

@bprotopopov
Copy link
Contributor Author

@behlendorf, yes I am testing on CentOS 6.7. I am fully rebased to the tip of the master.

I get intermittent asserts getting NULL out of the thread-specific var in zk_thread_current() that seems to correlate very well with running with default stack. I guess I'll get more run stats.

Yes, I was planning to look into switching to the kernel range locks implementation bug did not have a chance to do so.

@behlendorf
Copy link
Contributor

@bprotopopov it's not clear to me how that's possible with the current code. It has be wondering if we're hitting some grey area in pthreads because I'm fairly certainly we don't see this on CentOS 7 or other more modern Linux distributions with newer versions of pthreads. I'd be very interested to hear whatever you determine!

As for the range lock it looks like you could use them if you rig up a partial znode_t to pass in.

@bprotopopov
Copy link
Contributor Author

Rebased against the master. Addressed cstyle issues. Tested with ztest, no issues with standard stack size. It looks like I was slightly out of sync between the zfs and spl code on the test machine. The new MUTEX_NOLOCKDEP macro in dbuf.c forced rebase to master in spl.

@bprotopopov
Copy link
Contributor Author

I looked through the test failures and found many seemingly infrastructure related failures, e.g. 'bad gateway', 'backend read error', slave nodes going away unexpectedly, or a disabled test (ziltest) returning status 3. I checked the 'slave going away' failure mode by rerunning filebench test on my Centos 6.7 VM, and it passes cleanly.

I wonder if the tests could be tried again, in order to filter out the infrastructure related issues.

@kernelOfTruth
Copy link
Contributor

@bprotopopov just amend some comments changes to the commit and force-push

that should do it ;)

It might still fail due to infrastructure issues though

@behlendorf
Copy link
Contributor

Where does this stand? Was a proper solution agreed upon?

@bprotopopov you can always retrigger a test run by rebasing on master and force updating the branch. There are definitely still some infrastructure failures, the Github outage yesterday didn't help any.

@bprotopopov
Copy link
Contributor Author

@behlendorf, I think this is a working solution. I have developed this concurrent with the solution presented in openzfs/openzfs#46. I believe the latter is a bit off in a few places, I have made my comments to that effect. I think I will summarize those comments again and point to this pull request asking for a review from @pcd1193182 and @ahrens.

Another issue was the ztest runs asserting intermittently in the txg_sync() thread as it's stack was getting corrupted. I have rebased both SPL and ZFS to the tip of the master, and I do not see those any more.

Finally, there was an issue with added test case fore dmu_free_long_range() that I felt needed a better implementation for the range locks. I have added such implementation to ztest.

@behlendorf
Copy link
Contributor

@bprotopopov OK, it sounds like we need to let this settle a bit so we're sure we end up with the same fix as illumos.

The stack issue was addressed by 89666a8, based on your previous observation the stack was slightly increased in user space.

@bprotopopov
Copy link
Contributor Author

@behlendorf, sounds good. This pull request is a superset of the one in openzfs, as it includes three additional fixes: one somewhat cosmetic (properly setting the size of L0 holes), and two of a more functional nature - one that adjusts the dn_maxblikid so that no blkids for the new meta-level holes point beyond it in dbuf_write_ready(), and one making sure we are not forgetting the new holes in dnode_increase_indirection(). I have asked the guys (@pcd1193182 and @ahrens) for a review.

*/
mutex_enter(&dn->dn_mtx);
if (db->db_blkid >
dn->dn_phys->dn_maxblkid >> (db->db_level * epbs))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continuation lines should be indented by 4 spaces

@pcd1193182
Copy link
Contributor

@bprotopopov We've got an updated version of the openzfs patch that addresses most of your comments internally. It differs from yours in a few ways.
We copy all the blkptrs when we do dnode_increase_indirection, not just the non-holes and the non-birth time-zero ones; there isn't a significant cost, and it makes the code cleaner.
We set the LSIZE slightly differently, but the result is the same.
Your change to dnode_sync() seems fine. Doesn't change anything, but it's certainly cleaner. Not sure why it wasn't that way in the first place!
I'm not sure about the ztest changes; are those ZoL specific?
I left a couple line comments in the code; I'm specifically a little concerned abotu the changes to dbuf_write_ready()

Update: The new version of the openzfs patch has been posted on github.

@bprotopopov
Copy link
Contributor Author

Hi, @pcd1193182

no, the changes to the ztest are intended to test the new code that deals with holes; the dmu_free_long_range() is a generic mechanism to punch arbitrary holes in objects, including ones that span Ln (n >= 1) block ranges.

In addition, I also implemented range locks based on zfs range locks for ztest; it did not look to me that the old user-space implementation of the range locks in ztest was sufficient for my goals.

These aren't specific to ZoL at all, to my knowledge.

@bprotopopov
Copy link
Contributor Author

Hi, @pcd1193182

here is the assertion in ztest (with my added test case for dmu_free_lokc_fange()) I hit if I remove the fixup of the maxblkid discussed above:

ztest: ../../module/zfs/dbuf.c:3270: Assertion `db->db_blkid <= dn->dn_phys->dn_maxblkid >> (db->db_level * epbs) (0xa3bd0 <= 0x3e8d6)' failed.
/sbin/ztest[0x409093]
/lib64/libpthread.so.0[0x3e6b20f790]
/lib64/libc.so.6(gsignal+0x35)[0x3e6aa32625]
/lib64/libc.so.6(abort+0x175)[0x3e6aa33e05]
/lib64/libc.so.6[0x3e6aa2b74e]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x3e6aa2b810]
/lib64/libc.so.6[0x3e6aa2b87b]
/lib64/libzpool.so.2(+0x4a52a)[0x7fda9977f52a]
/lib64/libzpool.so.2(+0x38ffe)[0x7fda9976dffe]
/lib64/libzpool.so.2(+0xf9836)[0x7fda9982e836]
/lib64/libzpool.so.2(+0xfa112)[0x7fda9982f112]
/lib64/libzpool.so.2(+0xfa112)[0x7fda9982f112]
/lib64/libzpool.so.2(+0xfa112)[0x7fda9982f112]
/lib64/libzpool.so.2(zio_execute+0xe0)[0x7fda998278f0]
/lib64/libzpool.so.2(+0x2e912)[0x7fda99763912]
/lib64/libzpool.so.2(zk_thread_helper+0x22c)[0x7fda9975fd1c]
/lib64/libpthread.so.0[0x3e6b207a51]
/lib64/libc.so.6(clone+0x6d)[0x3e6aae893d]
child died with signal 6

uint64_t end_size;
#else
ASSERT3U(zp->z_is_zvol, ==, B_TRUE);
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear it me it's worth the bother to add the z_is_zvol member if it's only ever used for this assertion. Also wouldn't it make more sense to set z_is_zvol = 0 and then initialize z_size and z_blksz to reasonable values so the if (!zp->z_is_zvol) { } wouldn't need to be disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe z_is_zvol is used by the zfs_range_lock_writer() to decide if this is a 'simple' (for zvols) or 'compicated' (for real znodes - files) implementation. This member simplifies things quite a bit in terms of using 'user space' znode structure.

@behlendorf
Copy link
Contributor

@bprotopopov I see. Yes, in the end we'll want to to adopt the same patch which is applied to openZFS and apply your test cases. If upstream is working on this already with you I'm happy to wait until it's finalized.

struct zvol_state contains a dummy znode, which is around 1KB on x64, only for
zfs_range_lock. But in reality, other than z_range_lock and z_range_avl,
zfs_range_lock only need znode on regular file, which means we add 1KB on a
structure and gain nothing.

In this patch, we remove the dummy znode for zvol_state. In order to do that,
we also need to refactor zfs_range_lock a bit. We move z_range_lock and
z_range_avl pair out of znode_t to form zfs_rlock_t. This new struct replaces
znode_t as the main handle inside the range lock functions. Since regular
files still need znode for RL_WRITER and RL_APPEND, we make znode an optional
argument in zfs_range_lock_impl.

To reduce possible merge conflict, we retain the prototype of zfs_range_lock.
And zvol now should call zvol_range_lock instead.

Signed-off-by: Chunwei Chen <[email protected]>
bprotopopov and others added 3 commits April 13, 2016 13:24
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>

dnode_next_offset is used in a variety of places to iterate over the
holes or allocated blocks in a dnode. It operates under the premise that
it can iterate over the blockpointers of a dnode in open context while
holding only the dn_struct_rwlock as reader. Unfortunately, this premise
does not hold.

When we create the zio for a dbuf, we pass in the actual block pointer
in the indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the
bp is now inconsistent from the perspective of dnode_next_offset: the bp
will appear to be a hole until zio_dva_allocate finally finishes filling
it in. In the meantime, dnode_next_offset can detect a hole in the dnode
when none exists.

I was able to experimentally demonstrate this behavior with the
following setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the
first L0 block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the
middle of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent
hole.

The fix is to ensure that it is valid to iterate over the indirect
blocks in a dnode while holding the dn_struct_rwlock by passing the zio
a copy of the BP and updating the actual BP in dbuf_write_ready while
holding the lock.

Upstream bugs: DLPX-35372

Ported-by: Boris Protopopov <[email protected]>
@bprotopopov
Copy link
Contributor Author

@tuxoko, here is a rebase against your branch zrl. For some reason, the commits above are in a little different order as compared to my branch, but the one you are interested in is c4ebefe.

The 78013e0 and 66127cb deal with the original issue, whereas the c7bc5db is a ztest unit test that punches large holes in objects (designed to test for side effects of the 78013e0 and 66127cb).

#define zfs_range_lock(zp, off, len, type) \
zfs_range_lock_impl(&(zp)->z_range_lock, off, len, type, zp)
#define zvol_range_lock(zrl, off, len, type) \
zfs_range_lock_impl(zrl, off, len, type, NULL)
#else
#define zfs_range_lock(zrl, off, len, type) \
zfs_range_lock_impl(zrl, off, len, type, NULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should just use zfs_range_lock_impl and don't bother with znode at all. Having different prototype of the "same" function for kernel and user is really ugly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tuxoko, do you have a specific suggestion ?

@tuxoko
Copy link
Contributor

tuxoko commented Apr 13, 2016

@bprotopopov
I thought the ztest stuff is independent to others, so what I meant is only rebase ztest patch on to my patch in a separate branch.

@bprotopopov
Copy link
Contributor Author

@tuxoko yes the ztest commit is independent to my knowledge.

@bprotopopov
Copy link
Contributor Author

Just ran 10 iterations of the ztest command that failed on Centos 6.7 - I cannot reproduce the crash.

@behlendorf
Copy link
Contributor

@bprotopopov the CentOS 6.7 failure is very likely #4267 which is unrelated. It's the only ztest failure I see these days and it's fairly reproducible if you let zloop run long enough (a few hours).

@bprotopopov
Copy link
Contributor Author

bprotopopov commented Apr 15, 2016

@behlendorf, OK, thanks for the info. Also, I can look into addressing @tuxoko's comments above. What is the desired timeline for completing this work?

@behlendorf
Copy link
Contributor

@bprotopopov to my knowledge it's not blocking anything and is mainly a nice bit of cleanup and code improvement. It's the kind of thing which can be accepted once all the patches are in good shape and have been signed off on.

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request May 17, 2016
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest.  This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#4023
Issue openzfs#4024
@behlendorf
Copy link
Contributor

@bprotopopov can you please refresh this PR, or better yet open a new one and close this one, with the remaining fixes we need to apply. Several of these patches are now in ZoL master and the 6513 fix has been merged in to the OpenZFS repo.

@bprotopopov
Copy link
Contributor Author

Sure, will look into it


From: Brian Behlendorf [email protected]
Sent: Tuesday, May 17, 2016 1:52:31 PM
To: zfsonlinux/zfs
Cc: Boris Protopopov; Mention
Subject: Re: [zfsonlinux/zfs] issue-4023: partial overwrite of metadata holes results in loss of hole birth info (#4024)

@bprotopopovhttps://github.com/bprotopopov can you please refresh this PR, or better yet open a new one and close this one, with the remaining fixes we need to apply. Several of these patches are now in ZoL master and the 6513 fix has been merged in to the OpenZFS repo.

You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHubhttps://github.com//pull/4024#issuecomment-219799273

@bprotopopov
Copy link
Contributor Author

bprotopopov commented Jun 1, 2016

@behlendorf, I can't find a fix for 6513 on the master of 0.6.5 branches; perhaps I am looking in the wrong places.

@bprotopopov
Copy link
Contributor Author

@behlendorf, OK, I will open a separate pull request to port 6513 from the openzfs repo.

@bprotopopov bprotopopov closed this Jun 1, 2016
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Jun 7, 2016
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest.  This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#4023
Issue openzfs#4024
nedbass pushed a commit to nedbass/zfs that referenced this pull request Aug 26, 2016
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest.  This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#4023
Issue openzfs#4024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants