Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent unnecessary resilver restarts #9588

Merged
merged 1 commit into from
Nov 27, 2019
Merged

Conversation

jwpoduska
Copy link
Contributor

@jwpoduska jwpoduska commented Nov 14, 2019

Motivation and Context

ZFS is a little trigger happy about restarting resilvers. There are a number of issues around this, including:

9155, 9378 & 9551

Possibly even: 840

Description

If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if it's max txg is in the resilvered range.

The fix itself is really just the 'if' statement in dsl_scan_assess_vdev() to prevent requesting a resilver or deferred resilver when the max txg in the vdev's DTL is already participating in the current resilver.

This change includes some cleanup as well.

How Has This Been Tested?

Tested that incorrect re-resilvering was prevented by kicking off a resilver, and doing 'zpool reopen', 'zpool clean', export/import, zinject io errors, etc, and verifying that the resilver didn't restart.

Also, tested that resilver correctly happens by kicking off a resilver, offlining and onlining a vdev to introduce a DTL with a max that isn't covered by the current scan range, and verifying that the deferred resilver occurs after the current one finishes. Or, that a restart of resilver happens if the resilver_defer feature is not enabled.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@codecov
Copy link

codecov bot commented Nov 15, 2019

Codecov Report

Merging #9588 into master will decrease coverage by 0.07%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9588      +/-   ##
==========================================
- Coverage   79.19%   79.12%   -0.08%     
==========================================
  Files         418      418              
  Lines      123531   123535       +4     
==========================================
- Hits        97828    97743      -85     
- Misses      25703    25792      +89
Flag Coverage Δ
#kernel 79.92% <98.07%> (+0.07%) ⬆️
#user 66.5% <50.94%> (-0.41%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c46813...83b5d97. Read the comment docs.

@PrivatePuffin
Copy link
Contributor

@jwpoduska
this test failure might actually be related: FAIL fault/auto_spare_002_pos (expected PASS)
It seems from your "How Has This Been Tested?" you haven't tested it with hotspares, am I right?

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 15, 2019
@jwpoduska
Copy link
Contributor Author

@Ornias1993 right you are, thanks. vdev_clear() still needs to remove spares, even if the resilver has finished, so I added the poorly named async task SPA_ASYNC_RESILVER_DONE to do it, if a resilver isn't in progress or scheduled. Previously, it would always kick off a new resilver or even a deferred resilver after the current one, which would do a resilver scan where nothing got updated.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This looks great, and works as intended in my local testing. My only suggestion would be to add a test which specifically tests the scenarios you manually tested. This particular issue has caused a fair bit of trouble and we want to make sure it stays fixed!

Copy link
Contributor

@PrivatePuffin PrivatePuffin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the earlier mentioned tests, it looks good to me!
Great job on getting those fixes for hotspares done so quick 👍

Copy link
Contributor

@jgallag88 jgallag88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that this fixes the issue I saw in #9155. The code looks good to me too.

* If a leaf vdev has a DTL, and seems healthy, then kick off a
* resilver. But don't do this if we are doing a reopen for a scrub,
* since this would just restart the scrub we are already doing.
* If this is a leaf vdev, assesss whether a resilver is needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] One too many 's's in assesss

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed

If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.

Signed-off-by: John Poduska <[email protected]>
@jwpoduska
Copy link
Contributor Author

I added a test to check both with and without deferred resilvers. Both fail without this fix and succeed with it.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 27, 2019
@behlendorf behlendorf merged commit 3c819a2 into openzfs:master Nov 27, 2019
@behlendorf
Copy link
Contributor

@jwpoduska thanks for adding the test cases and getting to the bottom of this issue. Merged!

behlendorf pushed a commit that referenced this pull request Dec 10, 2019
The resilver restart test was reported as failing about 2% of the
time. Two issues were found:

- The event log wasn't large enough, so resilver events were missing
- One 'zpool sync' wasn't enough for resilver to start after zinject

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue #9588 
Closes #9677 
Closes #9703
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 22, 2020
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue openzfs#840
Closes openzfs#9155
Closes openzfs#9378
Closes openzfs#9551
Closes openzfs#9588
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 22, 2020
The resilver restart test was reported as failing about 2% of the
time. Two issues were found:

- The event log wasn't large enough, so resilver events were missing
- One 'zpool sync' wasn't enough for resilver to start after zinject

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue openzfs#9588
Closes openzfs#9677
Closes openzfs#9703
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue #840
Closes #9155
Closes #9378
Closes #9551
Closes #9588
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020
The resilver restart test was reported as failing about 2% of the
time. Two issues were found:

- The event log wasn't large enough, so resilver events were missing
- One 'zpool sync' wasn't enough for resilver to start after zinject

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue #9588
Closes #9677
Closes #9703
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants