Prevent unnecessary resilver restarts #9588

jwpoduska · 2019-11-14T21:04:07Z

Motivation and Context

ZFS is a little trigger-happy about restarting resilvers. There are a number of issues around this, including #9155, #9378, and #9551.

Possibly even: #840

Description

If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise(), where the DTL can only be excised if its max txg is in the resilvered range.

The fix itself is really just the 'if' statement in dsl_scan_assess_vdev() to prevent requesting a resilver or deferred resilver when the max txg in the vdev's DTL is already participating in the current resilver.
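The decision described above can be sketched as a small standalone function. This is a simplified model, not the code from the patch: the type `assess_t`, the function `assess_vdev()`, and its parameters are hypothetical names chosen for illustration, and the `<=` comparison against the scan's max txg is an assumption about what "covered by the current scan range" means.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes; these names are illustrative, not ZFS identifiers. */
typedef enum {
	ASSESS_NONE,	/* nothing to request */
	ASSESS_START,	/* no resilver is running, so start one */
	ASSESS_DEFER,	/* queue a resilver to run after the current one */
	ASSESS_RESTART	/* restart the resilver that is running now */
} assess_t;

/*
 * Decide what to request for a leaf vdev.
 *
 * dtl_nonempty:  the vdev has missing data recorded in its DTL
 * dtl_max_txg:   highest txg recorded in the vdev's DTL
 * resilvering:   a resilver is currently in progress
 * scan_max_txg:  highest txg covered by the in-progress scan (assumed bound)
 * defer_enabled: the resilver_defer feature is enabled
 */
static assess_t
assess_vdev(bool dtl_nonempty, uint64_t dtl_max_txg, bool resilvering,
    uint64_t scan_max_txg, bool defer_enabled)
{
	if (!dtl_nonempty)
		return (ASSESS_NONE);

	if (!resilvering)
		return (ASSESS_START);

	/* The new check: the running resilver already covers this DTL. */
	if (dtl_max_txg <= scan_max_txg)
		return (ASSESS_NONE);

	/* The DTL extends past the scan range, so more work is needed. */
	return (defer_enabled ? ASSESS_DEFER : ASSESS_RESTART);
}
```

In this model, a reopen or probe of a device that is already being resilvered returns `ASSESS_NONE`, while a device whose DTL gained newer txgs (for example after an offline/online cycle) still yields a deferred resilver or a restart.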

This change includes some cleanup as well.

How Has This Been Tested?

Tested that incorrect re-resilvering was prevented by kicking off a resilver, then running 'zpool reopen', 'zpool clear', export/import, zinject I/O errors, etc., and verifying that the resilver didn't restart.

Also tested that a resilver still happens when it should: kicked off a resilver, offlined and onlined a vdev to introduce a DTL with a max txg that isn't covered by the current scan range, and verified that the deferred resilver occurs after the current one finishes, or that the resilver restarts if the resilver_defer feature is not enabled.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.

codecov · 2019-11-15T05:57:04Z

Codecov Report

Merging #9588 into master will decrease coverage by 0.07%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #9588      +/-   ##
==========================================
- Coverage   79.19%   79.12%   -0.08%     
==========================================
  Files         418      418              
  Lines      123531   123535       +4     
==========================================
- Hits        97828    97743      -85     
- Misses      25703    25792      +89

Flag	Coverage Δ
#kernel	`79.92% <98.07%> (+0.07%)`	⬆️
#user	`66.5% <50.94%> (-0.41%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0c46813...83b5d97.

PrivatePuffin · 2019-11-15T10:36:27Z

@jwpoduska
This test failure might actually be related: FAIL fault/auto_spare_002_pos (expected PASS)
From your "How Has This Been Tested?" section, it seems you haven't tested it with hot spares, am I right?

jwpoduska · 2019-11-20T22:15:59Z

@Ornias1993 right you are, thanks. vdev_clear() still needs to remove spares even if the resilver has finished, so I added the poorly named async task SPA_ASYNC_RESILVER_DONE to do that when a resilver isn't in progress or scheduled. Previously, it would always kick off a new resilver, or even a deferred resilver after the current one, resulting in a resilver scan in which nothing got updated.
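A minimal sketch of the condition described in this comment, assuming the only inputs are whether a resilver is running or scheduled. The enum `clear_task_t` and the function `task_after_clear()` are hypothetical names for illustration; only SPA_ASYNC_RESILVER_DONE is named in the comment, and this is not the code from the patch.

```c
#include <stdbool.h>

/* Hypothetical task identifiers used only in this sketch. */
typedef enum {
	TASK_NONE,		/* request nothing extra */
	TASK_RESILVER_DONE	/* stands in for SPA_ASYNC_RESILVER_DONE */
} clear_task_t;

/*
 * Pick the follow-up task after clearing a vdev. If a resilver is
 * running or scheduled, no extra task is requested here (assumption:
 * spare removal then happens when that resilver completes). Otherwise
 * request the "resilver done" task so spares are still removed without
 * starting a scan that would update nothing.
 */
static clear_task_t
task_after_clear(bool resilver_in_progress, bool resilver_scheduled)
{
	if (resilver_in_progress || resilver_scheduled)
		return (TASK_NONE);
	return (TASK_RESILVER_DONE);
}
```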

behlendorf

Nice! This looks great, and works as intended in my local testing. My only suggestion would be to add a test which specifically tests the scenarios you manually tested. This particular issue has caused a fair bit of trouble and we want to make sure it stays fixed!

PrivatePuffin

Besides the earlier mentioned tests, it looks good to me!
Great job on getting those fixes for hotspares done so quick 👍

jgallag88

I confirmed that this fixes the issue I saw in #9155. The code looks good to me too.

jgallag88 · 2019-11-22T04:17:15Z

module/zfs/vdev.c

-	 * If a leaf vdev has a DTL, and seems healthy, then kick off a
-	 * resilver.  But don't do this if we are doing a reopen for a scrub,
-	 * since this would just restart the scrub we are already doing.
+	 * If this is a leaf vdev, assesss whether a resilver is needed.


[nit] One too many 's's in assesss

jwpoduska · 2019-11-26

Thanks, fixed

If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise(), where the DTL can only be excised if its max txg is in the resilvered range.

Signed-off-by: John Poduska <[email protected]>

jwpoduska · 2019-11-26T12:11:21Z

I added a test to check both with and without deferred resilvers. Both fail without this fix and succeed with it.

behlendorf · 2019-11-27T18:15:35Z

@jwpoduska thanks for adding the test cases and getting to the bottom of this issue. Merged!

The resilver restart test was reported as failing about 2% of the time. Two issues were found:
- The event log wasn't large enough, so resilver events were missing
- One 'zpool sync' wasn't enough for resilver to start after zinject

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue #9588
Closes #9677
Closes #9703

If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise(), where the DTL can only be excised if its max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue openzfs#840
Closes openzfs#9155
Closes openzfs#9378
Closes openzfs#9551
Closes openzfs#9588

behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 15, 2019

jwpoduska force-pushed the reresilver branch from 8202186 to 53fff3f November 20, 2019 22:03

behlendorf requested review from behlendorf and jgallag88 November 20, 2019 23:19

behlendorf approved these changes Nov 21, 2019

PrivatePuffin approved these changes Nov 21, 2019

jgallag88 approved these changes Nov 22, 2019

jwpoduska force-pushed the reresilver branch from 53fff3f to 83b5d97 November 26, 2019 11:55

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 27, 2019

behlendorf merged commit 3c819a2 into openzfs:master Nov 27, 2019

behlendorf mentioned this pull request Dec 4, 2019

resilver_restart_001 intermittently fails #9677

Closed

This was referenced Dec 9, 2019

Fixes for spurious failures of resilver_restart_001 test datto/zfs#1

Closed

Fixes for spurious failures of resilver_restart_001 test #9703

Merged
