-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
issue-4023: partial overwrite of metadata holes results in loss of hole birth info #4024
Conversation
I haven't had a chance to look closely at the patch but I did notice it causes a failure in xfstests 013. Stack below.
|
@behlendorf, I will take a look, thanks for the pointer. |
Yes, the commit has clearly changed some invariants that are being checked in debug mode; let me test some more with ZFS_DEBUG on. |
799624d
to
3688423
Compare
I am running ztest and with default stack size, I see intermittent asserts, whereas with 64k stack size, I do not see asserts. Is this normal ? Also, I looked at the implementation of the range lock in ztest, and I am not sure I understand why it is supposed to work. Perhaps for small object numbers and offsets with large alignments, the range locks become object-scope locks, and they "work". But for unaligned offsets, they seem to easily break. The reason I ask is that the test I added (ztest_dmu_free_long_range()) requires working range locks. I noticed that the ztest_dmu_prealloc() that also uses range locks and dmu_free_long_range() is disabled both in zfsonlinux and in openzfs code since a very long time ago. I am not certain how to proceed - should I also disable my test ? |
3688423
to
5f34974
Compare
@bprotopopov on what distribution and release? We do still do see occasional segfaults on CentOS 6 which may be related to stack size. Although these default stack size is 256K for ztest unless set otherwise so it would be odd that making it smaller helps. Unless we're just failing to create new threads due to memory being exhausted? As for the range locks I've actually never looked very carefully at them in ztest, I see it's using it's own version. But if they're broken we should fix them first so you can add your test case (and we can reenable the other tests). Actually it would be great if ztest shared the same range lock code so it could also be validated but that might take a fair bit of refactoring. |
@behlendorf, yes I am testing on CentOS 6.7. I am fully rebased to the tip of the master. I get intermittent asserts getting NULL out of the thread-specific var in zk_thread_current() that seems to correlate very well with running with default stack. I guess I'll get more run stats. Yes, I was planning to look into switching to the kernel range locks implementation bug did not have a chance to do so. |
@bprotopopov it's not clear to me how that's possible with the current code. It has be wondering if we're hitting some grey area in pthreads because I'm fairly certainly we don't see this on CentOS 7 or other more modern Linux distributions with newer versions of pthreads. I'd be very interested to hear whatever you determine! As for the range lock it looks like you could use them if you rig up a partial |
5f34974
to
59ae27d
Compare
59ae27d
to
d891e16
Compare
7d4b75a
to
d54d22b
Compare
Rebased against the master. Addressed cstyle issues. Tested with ztest, no issues with standard stack size. It looks like I was slightly out of sync between the zfs and spl code on the test machine. The new MUTEX_NOLOCKDEP macro in dbuf.c forced rebase to master in spl. |
I looked through the test failures and found many seemingly infrastructure related failures, e.g. 'bad gateway', 'backend read error', slave nodes going away unexpectedly, or a disabled test (ziltest) returning status 3. I checked the 'slave going away' failure mode by rerunning filebench test on my Centos 6.7 VM, and it passes cleanly. I wonder if the tests could be tried again, in order to filter out the infrastructure related issues. |
@bprotopopov just amend some comments changes to the commit and force-push that should do it ;) It might still fail due to infrastructure issues though |
d54d22b
to
5a3619b
Compare
Where does this stand? Was a proper solution agreed upon? @bprotopopov you can always retrigger a test run by rebasing on master and force updating the branch. There are definitely still some infrastructure failures, the Github outage yesterday didn't help any. |
@behlendorf, I think this is a working solution. I have developed this concurrent with the solution presented in openzfs/openzfs#46. I believe the latter is a bit off in a few places, I have made my comments to that effect. I think I will summarize those comments again and point to this pull request asking for a review from @pcd1193182 and @ahrens. Another issue was the ztest runs asserting intermittently in the txg_sync() thread as it's stack was getting corrupted. I have rebased both SPL and ZFS to the tip of the master, and I do not see those any more. Finally, there was an issue with added test case fore dmu_free_long_range() that I felt needed a better implementation for the range locks. I have added such implementation to ztest. |
@bprotopopov OK, it sounds like we need to let this settle a bit so we're sure we end up with the same fix as illumos. The stack issue was addressed by 89666a8, based on your previous observation the stack was slightly increased in user space. |
@behlendorf, sounds good. This pull request is a superset of the one in openzfs, as it includes three additional fixes: one somewhat cosmetic (properly setting the size of L0 holes), and two of a more functional nature - one that adjusts the dn_maxblikid so that no blkids for the new meta-level holes point beyond it in dbuf_write_ready(), and one making sure we are not forgetting the new holes in dnode_increase_indirection(). I have asked the guys (@pcd1193182 and @ahrens) for a review. |
*/ | ||
mutex_enter(&dn->dn_mtx); | ||
if (db->db_blkid > | ||
dn->dn_phys->dn_maxblkid >> (db->db_level * epbs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continuation lines should be indented by 4 spaces
@bprotopopov We've got an updated version of the openzfs patch that addresses most of your comments internally. It differs from yours in a few ways. Update: The new version of the openzfs patch has been posted on github. |
Hi, @pcd1193182 no, the changes to the ztest are intended to test the new code that deals with holes; the dmu_free_long_range() is a generic mechanism to punch arbitrary holes in objects, including ones that span Ln (n >= 1) block ranges. In addition, I also implemented range locks based on zfs range locks for ztest; it did not look to me that the old user-space implementation of the range locks in ztest was sufficient for my goals. These aren't specific to ZoL at all, to my knowledge. |
Hi, @pcd1193182 here is the assertion in ztest (with my added test case for dmu_free_lokc_fange()) I hit if I remove the fixup of the maxblkid discussed above:
|
5a3619b
to
692a5cf
Compare
uint64_t end_size; | ||
#else | ||
ASSERT3U(zp->z_is_zvol, ==, B_TRUE); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear it me it's worth the bother to add the z_is_zvol
member if it's only ever used for this assertion. Also wouldn't it make more sense to set z_is_zvol = 0
and then initialize z_size
and z_blksz
to reasonable values so the if (!zp->z_is_zvol) { }
wouldn't need to be disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe z_is_zvol is used by the zfs_range_lock_writer() to decide if this is a 'simple' (for zvols) or 'compicated' (for real znodes - files) implementation. This member simplifies things quite a bit in terms of using 'user space' znode structure.
@bprotopopov I see. Yes, in the end we'll want to to adopt the same patch which is applied to openZFS and apply your test cases. If upstream is working on this already with you I'm happy to wait until it's finalized. |
struct zvol_state contains a dummy znode, which is around 1KB on x64, only for zfs_range_lock. But in reality, other than z_range_lock and z_range_avl, zfs_range_lock only need znode on regular file, which means we add 1KB on a structure and gain nothing. In this patch, we remove the dummy znode for zvol_state. In order to do that, we also need to refactor zfs_range_lock a bit. We move z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t. This new struct replaces znode_t as the main handle inside the range lock functions. Since regular files still need znode for RL_WRITER and RL_APPEND, we make znode an optional argument in zfs_range_lock_impl. To reduce possible merge conflict, we retain the prototype of zfs_range_lock. And zvol now should call zvol_range_lock instead. Signed-off-by: Chunwei Chen <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> dnode_next_offset is used in a variety of places to iterate over the holes or allocated blocks in a dnode. It operates under the premise that it can iterate over the blockpointers of a dnode in open context while holding only the dn_struct_rwlock as reader. Unfortunately, this premise does not hold. When we create the zio for a dbuf, we pass in the actual block pointer in the indirect block above that dbuf. When we later zero the bp in zio_write_compress, we are directly modifying the bp. The state of the bp is now inconsistent from the perspective of dnode_next_offset: the bp will appear to be a hole until zio_dva_allocate finally finishes filling it in. In the meantime, dnode_next_offset can detect a hole in the dnode when none exists. I was able to experimentally demonstrate this behavior with the following setup: 1. Create a file with 1 million dbufs. 2. Create a thread that randomly dirties L2 blocks by writing to the first L0 block under them. 3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle of a file. 4. Do dnode_next_offset in a loop until we skip over such a non-existent hole. The fix is to ensure that it is valid to iterate over the indirect blocks in a dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP and updating the actual BP in dbuf_write_ready while holding the lock. Upstream bugs: DLPX-35372 Ported-by: Boris Protopopov <[email protected]>
@tuxoko, here is a rebase against your branch zrl. For some reason, the commits above are in a little different order as compared to my branch, but the one you are interested in is c4ebefe. The 78013e0 and 66127cb deal with the original issue, whereas the c7bc5db is a ztest unit test that punches large holes in objects (designed to test for side effects of the 78013e0 and 66127cb). |
#define zfs_range_lock(zp, off, len, type) \ | ||
zfs_range_lock_impl(&(zp)->z_range_lock, off, len, type, zp) | ||
#define zvol_range_lock(zrl, off, len, type) \ | ||
zfs_range_lock_impl(zrl, off, len, type, NULL) | ||
#else | ||
#define zfs_range_lock(zrl, off, len, type) \ | ||
zfs_range_lock_impl(zrl, off, len, type, NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should just use zfs_range_lock_impl and don't bother with znode at all. Having different prototype of the "same" function for kernel and user is really ugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tuxoko, do you have a specific suggestion ?
@bprotopopov |
@tuxoko yes the ztest commit is independent to my knowledge. |
Just ran 10 iterations of the ztest command that failed on Centos 6.7 - I cannot reproduce the crash. |
@bprotopopov the CentOS 6.7 failure is very likely #4267 which is unrelated. It's the only ztest failure I see these days and it's fairly reproducible if you let zloop run long enough (a few hours). |
@behlendorf, OK, thanks for the info. Also, I can look into addressing @tuxoko's comments above. What is the desired timeline for completing this work? |
@bprotopopov to my knowledge it's not blocking anything and is mainly a nice bit of cleanup and code improvement. It's the kind of thing which can be accepted once all the patches are in good shape and have been signed off on. |
The zfs range lock interface no longer tightly depends on a znode_t and therefore can be used in ztest. This allows the previous ztest specific implementation to be removed, and for additional test coverage of the shared version. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#4023 Issue openzfs#4024
@bprotopopov can you please refresh this PR, or better yet open a new one and close this one, with the remaining fixes we need to apply. Several of these patches are now in ZoL master and the 6513 fix has been merged in to the OpenZFS repo. |
Sure, will look into it From: Brian Behlendorf [email protected] @bprotopopovhttps://github.com/bprotopopov can you please refresh this PR, or better yet open a new one and close this one, with the remaining fixes we need to apply. Several of these patches are now in ZoL master and the 6513 fix has been merged in to the OpenZFS repo. You are receiving this because you were mentioned. |
@behlendorf, I can't find a fix for 6513 on the master of 0.6.5 branches; perhaps I am looking in the wrong places. |
@behlendorf, OK, I will open a separate pull request to port 6513 from the openzfs repo. |
The zfs range lock interface no longer tightly depends on a znode_t and therefore can be used in ztest. This allows the previous ztest specific implementation to be removed, and for additional test coverage of the shared version. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#4023 Issue openzfs#4024
The zfs range lock interface no longer tightly depends on a znode_t and therefore can be used in ztest. This allows the previous ztest specific implementation to be removed, and for additional test coverage of the shared version. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#4023 Issue openzfs#4024
When reading Ln metadata holes with non-zero birth epochs, instead of initializing the dbuf data with zeros, initialize with Ln-1 hole block pointers of same dmu type and birth epoch as Ln hole, and of
proper level and logical size, in order to make sure the new (Ln-1) holes inherit correct hole info.
This is important when generating incremental zfs send streams with hole_birth feature active, where
the proper hole birth is required in order to generate proper FREE records for the new holes.
Regression tested with ztest and verified proper assignment of hole birth epochs with zdb and by parsing the resulting send stream.