-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenZFS 7614 - zfs device evacuation/removal #6900
Conversation
I'm posting this to give the bots a chance to confirm the few test suite failures I'm seeing as well as a general overall health-check. The few test suite failures seem to be related to the inability of zdb to work properly when a removal operation is in progress and/ore related issues. I've not looked into this too deeply yet but it may very well be some sort of latent linux-ism. There are also a number of areas I came across during the ports which need looking into. That all said, this does seem to work in most of the tests and during my manual testing. Also, I'll note that it hasn't been merged upstream yet but I'm hoping this port will give some wider coverage and also allow me and, hopefully others, to review the upstream patch. [EDIT] There may still be some large kmem allocations. [EDIT 2] And at least one lingering style issue. |
Ugh, that first round of tests was rather uninspiring! I'll get some of the basic stuff fixed up soon and re-push. |
Since it's likely some of the issues uncovered here will have applicability in OpenZFS in general, anyone interested in following the progress of this should also follow the upstream PR in openzfs/openzfs#482. |
3d88404
to
2f3473c
Compare
Fixed a bunch of things, rebased to current master and synced with a couple of other new changes in the upstream PR. The |
Re-submitting shortly to give the ZoL buildbots another crack it this. My local tests are still failing on |
9f45b17
to
74d652c
Compare
Status update: There are 3 main issues the buildbots are running into:
Addressing these in order: The two types of ztest problems seen by the buildbots seem to be hangs and segfaults in The segfault is, as best as I can tell, in this line:
which would likely indicate a bogus vdev; presumably one without Finally, a bunch of the new tests for device removal assume that file systems can be cleanly unmounted while files are open and actively being written to. I added an |
Another update: I've been able to locally reproduce both of the ztest issues seen by the buildbots. |
This just landed in illumos. Looking forward to seeing it in ZoL too! illumos/illumos-gate@5cabbc6 |
We've typically addressed this issue by slightly reworking to test cases to kill any active process in the mount point prior to issuing the unmount. |
|
@behlendorf and @megari Thanks, I'll look into getting the processes killed that are preventing the unmount (actually zpool destroy, etc.). Also, another update: I've fixed the segfault in ztest; it was caused by a new dnode-iterating loop which was not large dnode aware. The other ztest issue is a deadlock which, unfortunately, is a pain to diagnose with gdb. The deadlock seems to always involve a call to Once the unmounting issue is handled, I think all the test suite should be good. |
Sounds good!
This is currently a preexisting known unrelated issue in master so it shouldn't hold up this PR. Let me go ahead and open a new issue for with the debugging I have and we can resolve it in a separate PR |
de0eb14
to
bb19db2
Compare
I'm rather surprised at this type of failure on the "Amazon 2" buildbot:
I realize that it's probably a good idea to do "rm -f $DISKS", however, I've never seen any version of the "rm" program which fails on a nonexistent argument when run as "rm -f". [EDIT] This seems to happen on other buildbot instance types. |
bb19db2
to
35d0e29
Compare
openzfs/openzfs#251 matches this PR's description in that removal of mirror devices is not supported. This PR, however, refers to openzfs/openzfs#482 which omits mention of mirror devices in the PR description. Is it intended that this PR #6900 supports removal of mirror devices? Thanks :-) |
Also, @dweeezil: the failure seems to be:
It's not that an argument is missing, but that it's present as a directory. |
I just pushed a refresh of this patch stack. The previous deadlocks were caused by the problem fixed in #7234. The issue of copying corrupted data from one leg of a mirror remains so the patch stack now contains a hack to ztest and zloop to not allow fault injection when device evacuation is enabled. @ahrens, the problem fixed in openzfs/openzfs#561 has been ported in this patch stack. Could you please take a peek at a6de073, particularly In any case, the last commit it this patch stack is intended to work around the issue in the context of ztest to see how everything checks out by the buildbots. |
Given all the newly-appearing failures here (https://github.com/dweeezil/zfs/blob/illumos-7614/module/zfs/vdev_mirror.c#L628), I suspect the merge of openzfs/openzfs#561 may have been bad. At least I've been able to reproduce it locally and can investigate further. Also, I suspect that most of the removal-related test suite failures are device busy issues when trying to export pools. All in all, a rather uninspiring round of buildbot tests :| |
It's possible for the following assertion to be tripped when running ztest: assertion failed for thread 0xf09fca40, thread-id 549: spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9), file ../../../uts/common/fs/zfs/vdev_removal.c, line 965 > $c libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e) libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000) libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0) libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74, ebddef68, 8992dc0, ebe10a00, fef073c0) libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a) libc.so.1`_thrp_setup+0x8c(f09fca40) libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0) > ::spa -v ADDR STATE NAME 08723000 ACTIVE ztest ADDR STATE AUX DESCRIPTION 087466c0 HEALTHY - root 087450c0 HEALTHY - /rpool/tmp/ztest.0a 08745640 HEALTHY - indirect 08745bc0 HEALTHY - /rpool/tmp/ztest.2a 08746140 HEALTHY - /rpool/tmp/ztest.3a - - - spares 08744b40 HEALTHY - /rpool/tmp/ztest.spares.0 Authored by: Prakash Surya <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Reviewed by: George Wilson <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-Issue: https://www.illumos.org/issues/9084 OpenZFS-Commit: openzfs/openzfs@18acba7 Issue openzfs#6900
Authored by: Toomas Soome <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-Issue: https://illumos.org/issues/9036 OpenZFS-Commit: openzfs/openzfs@f02c28e434 Issue openzfs#6900
Codecov Report
@@ Coverage Diff @@
## master #6900 +/- ##
==========================================
+ Coverage 76.17% 76.51% +0.34%
==========================================
Files 330 335 +5
Lines 104316 107018 +2702
==========================================
+ Hits 79458 81886 +2428
- Misses 24858 25132 +274
Continue to review full report at Codecov.
|
Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. Note that in no case is incorrect data returned, but we might get a checksum error when we should have been able to find the right data. There are two underlying problems: 1. When we remove a mirror device, we only read one side of the mirror. Since we can't verify the checksum, this side may be silently bad, but the good data is on the other side of the mirror (which we didn't read). This can cause the removal to "bake in" the busted data – all copies of the data in the new location are the same, busted version, while we left the good version behind. The fix for this is to read and copy both sides of the mirror. If the old and new vdevs are mirrors, we will read both sides of the old mirror, and write each copy to the corresponding side of the new mirror. (If the old and new vdevs have a different number of children, we will do this as best as possible.) Even though we aren't verifying checksums, this ensures that as long as there's a good copy of the data, we'll have a good copy after the removal, even if there's silent damage to one side of the mirror. If we're removing a mirror that has some silent damage, we'll have exactly the same damage in the new location (assuming that the new location is also a mirror). 2. When we read from an indirect vdev that points to a mirror vdev, we only consider one copy of the data. This can lead to reduced effective redundancy, because we might read a bad copy of the data from one side of the mirror, and not retry the other, good side of the mirror. Note that the problem is not with the removal process, but rather after the removal has completed (having copied correct data to both sides of the mirror), if one side of the new mirror is silently damaged, we encounter the problem when reading the relocated data via the indirect vdev. Also note that the problem doesn't occur when ZFS knows that one side of the mirror is bad, e.g. when a disk entirely fails or is offlined. The impact is that reads (from indirect vdevs that point to mirrors) may return a checksum error even though the good data exists on one side of the mirror, and scrub doesn't repair all data on the mirror (if some of it is pointed to via an indirect vdev). The fix for this is complicated by "split blocks" - one logical block may be split into two (or more) pieces with each piece moved to a different new location. In this case we need to read all versions of each split (one from each side of the mirror), and figure out which combination of versions results in the correct checksum, and then repair the incorrect versions. This ensures that we supply the same redundancy whether you use device removal or not. For example, if a mirror has small silent errors on all of its children, we can still reconstruct the correct data, as long as those errors are at sufficiently-separated offsets (specifically, separated by the largest block size - default of 128KB, but up to 16MB). Porting notes: * A new indirect vdev check was moved from dsl_scan_needs_resilver_cb() to dsl_scan_needs_resilver(), which was added to ZoL as part of the sequential scrub work. * Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t parameter. The extra parameter is unique to ZoL. * When posting indirect checksum errors the ABD can be passed directly, zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS. Authored by: Matthew Ahrens <[email protected]> Reviewed by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-issue: https://illumos.org/issues/9290 OpenZFS-commit: openzfs/openzfs#591 Closes #6900
Remove duplicate segment copies to minimize the possible search space for reconstruction. Once reduced an accurate assessment can be made regarding the difficulty in reconstructing the block. Also, ztest will now run zdb with zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt to avoid checksum errors. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6900
…d for indirect vdevs The timeline of the race condition is the following: [1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning). The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically tights condensing to scrubing when it comes to pausing and resuming those actions during spa export. This commit introduces the ZTHR infrastructure, which is basically threads created during spa_load()/spa_create() and exist until we export or destroy the pool. ZTHRs sleep the majority of the time, until they are notified to wake up and do some predefined type of work. In the context of the current bug, a zthr to does the condensing of indirect mappings replacing the older code that used bare kthreads. When a pool is created, the condensing zthr is spawned but sleeps right away, until it is awaken by a signal from spa_sync(). If an existing pool is loaded, the condensing zthr looks if there is anything to condense before going to sleep, in case we were condensing mappings in the pool before it got exported. The benefits of this solution are the following: - The current bug is fixed - spa_condensing_indirect is the sole indicator of whether we are currently condensing or not - condensing is more decoupled from the spa_async_thread related functionality. As a final note, this commit also sets up the path on upstreaming other features that use the ZTHR code like zpool checkpoint and fast clone deletion. Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Approved by: Hans Rosenfeld <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-issue: https://illumos.org/issues/9079 OpenZFS-commit: openzfs/openzfs@3dc606ee Closes #6900
…rect_remap() Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Hans Rosenfeld <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9080 OpenZFS-commit: openzfs/openzfs@bdfded42e6 Closes #6900
It's possible for the following assertion to be tripped when running ztest: assertion failed for thread 0xf09fca40, thread-id 549: spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9), file ../../../uts/common/fs/zfs/vdev_removal.c, line 965 > $c libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e) libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000) libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0) libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74, ebddef68, 8992dc0, ebe10a00, fef073c0) libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a) libc.so.1`_thrp_setup+0x8c(f09fca40) libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0) > ::spa -v ADDR STATE NAME 08723000 ACTIVE ztest ADDR STATE AUX DESCRIPTION 087466c0 HEALTHY - root 087450c0 HEALTHY - /rpool/tmp/ztest.0a 08745640 HEALTHY - indirect 08745bc0 HEALTHY - /rpool/tmp/ztest.2a 08746140 HEALTHY - /rpool/tmp/ztest.3a - - - spares 08744b40 HEALTHY - /rpool/tmp/ztest.spares.0 Authored by: Prakash Surya <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9084 OpenZFS-commit: openzfs/openzfs@18acba7 Closes #6900
Authored by: Toomas Soome <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9036 OpenZFS-commit: openzfs/openzfs@f02c28e434 Closes #6900
I'm a little lost, did this get accepted into the main release or is it waiting for something? |
This has been merged in to the master branch and will be available on the next major release 0.8.0 |
Due to a flaw in 4589f3a the number of unique combinations could be calculated incorrectly. This could result in the random combinations reconstruction being used when it would have been possible to check all combinations. This change fixes the unique combinations calculation and simplifies the reconstruction logic by maintaining a per- segment list of unique copies. The vdev_indirect_splits_damage() function was introduced to validate both the enumeration and random reconstruction logic with ztest. It is implemented such it will never make a known recoverable block unrecoverable. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #6900 Closes #7934
This patch fixes a race condition where the end of vdev_remove_replace_with_indirect(), which holds svr_lock, would race against spa_vdev_removal_destroy(), which destroys the same lock and is called asynchronously via dsl_sync_task_nowait(). Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Issue #6900 Closes #8083
* Detect IO errors during device removal While device removal cannot verify the checksums of individual blocks during device removal, it can reasonably detect hard IO errors from the leaf vdevs. Failure to perform this error checking can result in device removal completing successfully, but moving no data which will permanently corrupt the pool. Situation 1: faulted/degraded vdevs In the configuration shown below, the removal of mirror-0 will permanently corrupt the pool. Device removal will preferentially copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'. Which in this case will result in nothing being copied since one vdev in each of those groups in unavailable. However, device removal will complete successfully since all IO errors are ignored. tank DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 /var/tmp/vdev1 FAULTED 0 0 0 external fault /var/tmp/vdev2 ONLINE 0 0 0 mirror-1 DEGRADED 0 0 0 /var/tmp/vdev3 ONLINE 0 0 0 /var/tmp/vdev4 FAULTED 0 0 0 external fault This issue is resolved by updating the source child selection logic to exclude unreadable leaf vdevs. Additionally, unwritable destination child vdevs which can never succeed are skipped to prevent generating a large number of write IO errors. Situation 2: individual hard IO errors During removal if an unexpected hard IO error is encountered when either reading or writing the child vdev the entire removal operation is cancelled. While it may be possible to reconstruct the data after removal that cannot be guaranteed. The only strictly safe thing to do is to cancel the removal. As a future improvement we may want to instead suspend the removal process and allow the damaged region to be retried. But that work is left for another time, hard IO errors during the removal process are expected to be exceptionally rare. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #6900 Closes #8161
Due to a flaw in 4589f3a the number of unique combinations could be calculated incorrectly. This could result in the random combinations reconstruction being used when it would have been possible to check all combinations. This change fixes the unique combinations calculation and simplifies the reconstruction logic by maintaining a per- segment list of unique copies. The vdev_indirect_splits_damage() function was introduced to validate both the enumeration and random reconstruction logic with ztest. It is implemented such it will never make a known recoverable block unrecoverable. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#6900 Closes openzfs#7934
This patch fixes a race condition where the end of vdev_remove_replace_with_indirect(), which holds svr_lock, would race against spa_vdev_removal_destroy(), which destroys the same lock and is called asynchronously via dsl_sync_task_nowait(). Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Issue openzfs#6900 Closes openzfs#8083
* Detect IO errors during device removal While device removal cannot verify the checksums of individual blocks during device removal, it can reasonably detect hard IO errors from the leaf vdevs. Failure to perform this error checking can result in device removal completing successfully, but moving no data which will permanently corrupt the pool. Situation 1: faulted/degraded vdevs In the configuration shown below, the removal of mirror-0 will permanently corrupt the pool. Device removal will preferentially copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'. Which in this case will result in nothing being copied since one vdev in each of those groups in unavailable. However, device removal will complete successfully since all IO errors are ignored. tank DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 /var/tmp/vdev1 FAULTED 0 0 0 external fault /var/tmp/vdev2 ONLINE 0 0 0 mirror-1 DEGRADED 0 0 0 /var/tmp/vdev3 ONLINE 0 0 0 /var/tmp/vdev4 FAULTED 0 0 0 external fault This issue is resolved by updating the source child selection logic to exclude unreadable leaf vdevs. Additionally, unwritable destination child vdevs which can never succeed are skipped to prevent generating a large number of write IO errors. Situation 2: individual hard IO errors During removal if an unexpected hard IO error is encountered when either reading or writing the child vdev the entire removal operation is cancelled. While it may be possible to reconstruct the data after removal that cannot be guaranteed. The only strictly safe thing to do is to cancel the removal. As a future improvement we may want to instead suspend the removal process and allow the damaged region to be retried. But that work is left for another time, hard IO errors during the removal process are expected to be exceptionally rare. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#6900 Closes openzfs#8161
I'm looking into using ZFS and am a bit confused by the wording of the pull request. Specifically, this part:
This seems to contradict the 1st sentence:
Unless the differentiating factor between the 2 statements is the use of "device" vs. "vdev." Assuming this is the case, let's say I have a pool, Pool A: Pool A: Per how I read the above, I'd be able to remove any of the vdevs and have Pool A's size be reduced accordingly, but I would be unable to remove any of the individual disks from the pool except vdev4. Is that correct? Following from the above, if I wanted to remove an individual disk from vdev1, vdev2, or vdev3, I'd have to remove the vdev containing it and then remove it from that vdev, but doing so would "destroy" that vdev as far as ZFS is concerned. Is that correct? Just want to ensure I'm understanding this. |
@jdrch the original text for this PR included some awkward wording and was incorrect in a few places. Let me refer you to the finalized wording in the man page which should help clarify the restrictions surrounding device removal.
The key restriction worth emphasizing for device removal, is that no top-level data device may be removed if there exists a top-level raidz vdev in the pool. Only mirror and non-redundant vdevs can be added to the pool if you intend to use device removal. If you have a specific scenario in mind, I'd encourage you to create a pool using sparse file vdevs and use them to run through the process. |
@behlendorf Super! Thanks! |
Solaris ZFS got the ability to remove top-level vdevs in 2018: I was recently bitten by ZOL's limitation not to be able to do this. Someone else was trying to add a hot spare to a zpool, but they wrongly added it as a top-level device (using |
Description
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing
operations on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use it
are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be
"remapped" to their new (concrete) locations if possible. This process
can be accelerated by using the "zfs remap" command to proactively
rewrite all indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of the
data that is copied. This makes the process much faster, but if it were
used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror. Therefore, mirror and raidz devices can
not be removed.
Motivation and Context
See above.
How Has This Been Tested?
Additions to the test suite in functional/removal.
Types of changes
Checklist:
Signed-off-by
.