Prevent unnecessary resilver restarts #9588
Codecov Report
@@            Coverage Diff             @@
##           master    #9588      +/-   ##
==========================================
- Coverage   79.19%   79.12%   -0.08%
==========================================
  Files         418      418
  Lines      123531   123535       +4
==========================================
- Hits        97828    97743      -85
- Misses      25703    25792      +89
@jwpoduska: 8202186 to 53fff3f
@Ornias1993 right you are, thanks. vdev_clear() still needs to remove spares, even if the resilver has finished, so I added the poorly named async task SPA_ASYNC_RESILVER_DONE to do it, if a resilver isn't in progress or scheduled. Previously, it would always kick off a new resilver or even a deferred resilver after the current one, which would do a resilver scan where nothing got updated.
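As a rough illustration of the rule described in this comment, the choice of async task can be modeled as below. This is a simplified sketch, not the code from the patch: `clear_task_t`, `clear_async_task()` and both boolean parameters are hypothetical names invented here, and only the condition ("a resilver isn't in progress or scheduled") comes from the comment itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the async task that a clear would request. */
typedef enum {
	CLEAR_TASK_NONE,		/* request nothing extra */
	CLEAR_TASK_RESILVER_DONE	/* models SPA_ASYNC_RESILVER_DONE */
} clear_task_t;

/*
 * Model of the decision described above: only request the
 * "resilver done" task (which removes spares) when no resilver is
 * running and none is scheduled. Parameter names are assumptions.
 */
clear_task_t
clear_async_task(bool resilver_in_progress, bool resilver_scheduled)
{
	if (resilver_in_progress || resilver_scheduled)
		return (CLEAR_TASK_NONE);
	return (CLEAR_TASK_RESILVER_DONE);
}
```

The point of the model is that a clear no longer requests a fresh resilver by default; it only asks for the spare cleanup when nothing else is pending.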
Nice! This looks great, and works as intended in my local testing. My only suggestion would be to add a test which specifically tests the scenarios you manually tested. This particular issue has caused a fair bit of trouble and we want to make sure it stays fixed!
Besides the earlier mentioned tests, it looks good to me!
Great job on getting those fixes for hotspares done so quick 👍
I confirmed that this fixes the issue I saw in #9155. The code looks good to me too.
module/zfs/vdev.c
Outdated
* If a leaf vdev has a DTL, and seems healthy, then kick off a
* resilver. But don't do this if we are doing a reopen for a scrub,
* since this would just restart the scrub we are already doing.
* If this is a leaf vdev, assesss whether a resilver is needed.
[nit] One too many 's's in assesss
Thanks, fixed
If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if its max txg is in the resilvered range.

Signed-off-by: John Poduska <[email protected]>
53fff3f to 83b5d97
I added a test to check both with and without deferred resilvers. Both fail without this fix and succeed with it.
@jwpoduska thanks for adding the test cases and getting to the bottom of this issue. Merged!
The resilver restart test was reported as failing about 2% of the time. Two issues were found:
- The event log wasn't large enough, so resilver events were missing
- One 'zpool sync' wasn't enough for resilver to start after zinject

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue #9588
Closes #9677
Closes #9703
If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if its max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue openzfs#840
Closes openzfs#9155
Closes openzfs#9378
Closes openzfs#9551
Closes openzfs#9588
Motivation and Context
ZFS is a little trigger happy about restarting resilvers. There are a number of issues around this, including:
#9155, #9378 & #9551
Possibly even: #840
Description
If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if its max txg is in the resilvered range.
The fix itself is really just the 'if' statement in dsl_scan_assess_vdev() to prevent requesting a resilver or deferred resilver when the max txg in the vdev's DTL is already participating in the current resilver.
This change includes some cleanup as well.
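The decision described above can be sketched as a small standalone model. This is not the body of dsl_scan_assess_vdev(): the types `scan_state_t`, `vdev_state_t`, `resilver_action_t` and the function `assess_vdev()` are hypothetical stand-ins, and treating "covered by the current scan range" as `dtl_max_txg <= scan_max_txg` (inclusive) is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes of assessing one leaf vdev. */
typedef enum {
	RESILVER_NONE,		/* nothing needs to be requested */
	RESILVER_START,		/* no resilver running: start one */
	RESILVER_RESTART,	/* restart the running resilver */
	RESILVER_DEFER		/* resilver again after the current one */
} resilver_action_t;

/* Simplified stand-in for the pool's scan state. */
typedef struct {
	bool		resilvering;	/* a resilver is in progress */
	uint64_t	scan_max_txg;	/* upper end of the scan range */
	bool		defer_enabled;	/* resilver_defer feature is on */
} scan_state_t;

/* Simplified stand-in for a leaf vdev's DTL. */
typedef struct {
	bool		dtl_empty;	/* no missing txgs at all */
	uint64_t	dtl_max_txg;	/* highest txg in the DTL */
} vdev_state_t;

resilver_action_t
assess_vdev(const scan_state_t *scn, const vdev_state_t *vd)
{
	/* An empty DTL means there is nothing to resilver. */
	if (vd->dtl_empty)
		return (RESILVER_NONE);

	/* No resilver is running, so one has to be started. */
	if (!scn->resilvering)
		return (RESILVER_START);

	/*
	 * The check the fix adds, as modeled here: the current
	 * resilver already covers this DTL, so do not request a
	 * restart or a deferred resilver. (Assumed inclusive bound.)
	 */
	if (vd->dtl_max_txg <= scn->scan_max_txg)
		return (RESILVER_NONE);

	/* The DTL extends past the scan range: more work is needed. */
	return (scn->defer_enabled ? RESILVER_DEFER : RESILVER_RESTART);
}
```

In this model, a reopen or probe during a resilver hits the third branch and returns `RESILVER_NONE`, while an offline/online cycle that adds newer txgs to the DTL falls through to a deferred resilver or a restart.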
How Has This Been Tested?
Tested that incorrect re-resilvering was prevented by kicking off a resilver, then running 'zpool reopen', 'zpool clear', export/import, zinject I/O errors, etc., and verifying that the resilver didn't restart.
Also tested that a resilver still happens when it should: kick off a resilver, then offline and online a vdev to introduce a DTL whose max txg isn't covered by the current scan range, and verify that the deferred resilver occurs after the current one finishes or, if the resilver_defer feature is not enabled, that the resilver restarts.