fix prefetching of indirect blocks while destroying #14603

Merged 2 commits into openzfs:master on Mar 24, 2023

Conversation

@ahrens (Member) commented Mar 10, 2023

Motivation and Context

When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or `zfs send`), we prefetch the indirect blocks that will be needed, in `traverse_prefetch_metadata()`. In the case of `zfs destroy <fs>`, we do a little traversing each txg, and resume the traversal the next txg. So the indirect blocks that will be needed, and thus are candidates for prefetching, do not include blocks that are before the resume point.

The problem is that the logic for determining whether the indirect blocks are before the resume point is incorrect, causing the (up to 1024) L1 indirect blocks that are inside the first L2 to not be prefetched. In practice, if we are able to read many more than 1024 blocks per txg, then this will be inconsequential. But if I/O latency is more than a few milliseconds, almost no L1's will be prefetched, so they will be read serially, and thus the destroy will be very slow. This can be observed as `zpool get freeing` decreasing very slowly.
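
(As an aside on where the "up to 1024" figure comes from: assuming the common 128 KiB indirect block size and 128-byte block pointers, each indirect block holds 128 KiB / 128 B = 1024 child pointers, so a single L2 can cover up to 1024 L1 blocks.)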

Specifically: when we first examine the L2 that contains the block we'll be resuming from, we have not yet resumed, so `td_resume` is nonzero. At this point, all calls to `traverse_prefetch_metadata()` will fail, even if the L1 in question is after the resume point. It isn't until the callback is issued for the resume point that we zero out `td_resume`, but by then we've already attempted and failed to prefetch everything under this L2 indirect block.
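
For illustration, here is a minimal sketch of the shape of the pre-fix gate. The field and function names come from the PR discussion, but the bodies are not the verbatim OpenZFS source, and `issue_prefetch()` is a hypothetical stand-in for the actual `arc_read()`-based prefetch:

```c
/*
 * Simplified sketch of the buggy prefetch gate. Types and helpers
 * (traverse_data_t, blkptr_t, zbookmark_phys_t, ZB_IS_ZERO) are
 * assumed from the surrounding OpenZFS code.
 */
static void
traverse_prefetch_metadata(traverse_data_t *td,
    const blkptr_t *bp, const zbookmark_phys_t *zb)
{
	if (!(td->td_flags & TRAVERSE_PREFETCH_METADATA))
		return;
	/*
	 * BUG: td_resume is not zeroed until the callback for the
	 * resume point itself is issued, so while we are still
	 * descending the first L2 this test rejects *every* prefetch,
	 * including L1s that lie strictly after the resume point and
	 * will in fact be needed this txg.
	 */
	if (td->td_resume != NULL && !ZB_IS_ZERO(td->td_resume))
		return;

	issue_prefetch(bp);	/* hypothetical helper */
}
```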

Description

This commit addresses the issue by reusing the existing `resume_skip_check()` to determine whether the L1's bookmark is before or after the resume point. To make this possible, that function is made non-mutating (the caller now zeros `td_resume`).
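
A sketch of the fixed shape, under the same caveats (signatures simplified, `issue_prefetch()` still a hypothetical stand-in): the prefetch path now asks `resume_skip_check()` whether this bookmark falls before the resume point, and the check itself no longer mutates `td_resume`:

```c
/*
 * Sketch: resume_skip_check() as a pure predicate. The real function
 * returns a richer resume_skip_t and also consults the dnode; it is
 * reduced to a boolean here. zbookmark_subtree_completed() is the
 * existing OpenZFS bookmark comparator (signature simplified).
 */
static boolean_t
resume_skip_check(const traverse_data_t *td, const zbookmark_phys_t *zb)
{
	if (td->td_resume != NULL && !ZB_IS_ZERO(td->td_resume))
		return (zbookmark_subtree_completed(zb, td->td_resume));
	return (B_FALSE);
}

static void
traverse_prefetch_metadata(traverse_data_t *td,
    const blkptr_t *bp, const zbookmark_phys_t *zb)
{
	if (!(td->td_flags & TRAVERSE_PREFETCH_METADATA))
		return;
	/*
	 * Fixed: compare this L1's bookmark against the resume point
	 * instead of refusing whenever a resume is merely pending.
	 * Zeroing td_resume now happens in the traversal caller once
	 * the resume point is actually reached.
	 */
	if (resume_skip_check(td, zb))
		return;

	issue_prefetch(bp);	/* hypothetical helper */
}
```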

Note, this bug likely predates (was not introduced by) #11803.

How Has This Been Tested?

With high-latency storage, I saw a >10x improvement in space reclamation performance following a `zfs destroy <fs>`.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@ahrens ahrens requested review from grwilson and mmaybee March 10, 2023 04:12
Review comment on module/zfs/dmu_traverse.c (outdated, resolved)
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Mar 10, 2023
@ryao (Contributor) commented Mar 10, 2023

The zloop test failed because this PR's branch was not rebased on the most recent master, so it is missing the recently merged fix from #14583. The FreeBSD test failure is a pre-existing issue.

ahrens added 2 commits March 10, 2023 21:46
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, do not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if I/O latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Signed-off-by: Matthew Ahrens <[email protected]>
@amotin (Member) left a comment

I suppose the new code will prefetch again the indirect blocks beyond the resume point that were already prefetched during the previous iteration. Though that is probably not as bad as the opposite.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 16, 2023
@behlendorf behlendorf merged commit d2d4f85 into openzfs:master Mar 24, 2023
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023
behlendorf added a commit that referenced this pull request Oct 13, 2023
New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613,
  #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <[email protected]>