Implement sequential (two-phase) resilvering #3625

Closed
Deewiant opened this issue Jul 24, 2015 · 23 comments

@Deewiant

https://blogs.oracle.com/roch/entry/sequential_resilvering describes a two-phase resilvering process that avoids random I/O and can dramatically speed up resilvering, especially on HDDs.

As far as I know ZoL doesn't do anything like this, so I created this issue to keep track of the situation.
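For anyone skimming, the shape of the approach described in that post is roughly the following. This is only an illustrative sketch (Python, with block pointers reduced to bare device offsets; walk_metadata and issue_read are made-up stand-ins), not ZFS code:

```python
# Illustrative sketch only: phase 1 walks the metadata and records where the
# data blocks live, phase 2 sorts those locations and issues the repair reads
# in on-disk order rather than in logical (bookmark) order.
def two_phase_resilver(walk_metadata, issue_read):
    # Phase 1: metadata traversal -- still random I/O, but it touches only
    # metadata, which is a small fraction of the pool.
    to_repair = list(walk_metadata())

    # Phase 2: issue the data I/O sorted by offset, turning what would have
    # been random reads into a mostly sequential sweep of each disk.
    to_repair.sort()
    for offset in to_repair:
        issue_read(offset)
```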

@behlendorf added the Type: Feature label Jul 24, 2015
@behlendorf
Contributor

@Deewiant thanks for filing this. Yes, this is something we've talked about implementing for some time. I think it would be great to implement when someone has the time.

@adilger
Contributor

adilger commented Dec 3, 2015

Resilvering would also benefit greatly from the metadata allocation classes of issue #3779, which allow metadata to be kept on a separate SSD device.

Another option that was discussed in the past for mirror devices (not sure if this was ever implemented) is to do a full linear "dd"-style copy of the working device to the replacement, and then fall back to a scrub to verify that the data was written correctly. That gets the data redundancy back quickly, using nice large streaming I/O requests to the disks, and the scrub can then be done at a lower priority. The source device may still have latent sector errors, so the failing drive shouldn't be taken offline until after the scrub; that way it can still be used to read any blocks that turn out to have bad checksums, if possible.
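Purely as an illustration of that shape (not ZFS code; the paths and chunk size are made up, and the byte-for-byte comparison stands in for the real checksum verification a scrub would do):

```python
# Sketch of the mirror-only idea above: a linear "dd"-style copy to restore
# redundancy quickly, followed by a verification pass standing in for the
# lower-priority scrub.
CHUNK = 16 * 1024 * 1024  # large streaming I/O requests, as suggested above

def linear_copy(src_path, dst_path):
    """Phase 1: sequential whole-device copy with big reads and writes."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        while True:
            buf = src.read(CHUNK)
            if not buf:
                break
            dst.write(buf)

def verify(src_path, dst_path):
    """Phase 2: low-priority pass confirming the copy landed intact."""
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        offset = 0
        while True:
            a, b = src.read(CHUNK), dst.read(CHUNK)
            if not a and not b:
                return True
            if a != b:
                print("mismatch near offset", offset)
                return False
            offset += CHUNK
```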

@jumbi77
Contributor

jumbi77 commented Feb 26, 2016

I guess another approach to speed up resilvering is the parity-declustered RAIDz/mirror of #3497?!

Another idea which was not mentioned yet is the "RAID-Z/mirror hybrid allocator" from Oracle. I'm not sure whether it also accelerates resilvering, but I guess it is a nice performance boost in general. As far as I understand, metadata is then mirrored within the raidz. Is it planned to implement this, or is it obsolete because of #3779?

@thegreatgazoo

@jumbi77 Parity declustered RAID is a new type of VDEV, called dRAID, which will offer scalable rebuild performance. It will not affect how RAIDz resilver works.

@jumbi77
Contributor

jumbi77 commented May 15, 2016

Is anybody working on this feature, or are there plans to implement it in the future? Just curious.

@nwf
Contributor

nwf commented Jul 6, 2016

I'd like to suggest that there be a RAM-only queue mode for sequential resilvering, along the lines of rsync's asynchronous recursor. Rather than traverse all the metadata blocks at once and sort all the data blocks, which is likely to be an enormous collection which must itself be serialized to disk, it might be nice for the system to use a standard in-RAM producer/consumer queue with the producer (metadata recursor) stalling if the queue fills. The queue would of course be sorted by address on device (with multiple VDEVs intermixed) so that it acted as an enormous elevator queue. While no longer strictly sequential -- the recursor would find blocks out of order and have a limited ability to sort while blocks were in queue -- the collection of data block pointers no longer needs to be persisted to disk and there should be plenty of opportunities for streaming reads.
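To make the queueing discipline concrete, a minimal sketch (Python; the ElevatorQueue name, the record contents, and the threading model are all made up for illustration):

```python
# Bounded producer/consumer queue that hands out entries in ascending
# on-device offset order -- an elevator over whatever is currently queued.
import heapq
import itertools
import threading

class ElevatorQueue:
    def __init__(self, max_entries):
        self._heap = []                # (offset, seq, record) min-heap
        self._seq = itertools.count()  # tie-breaker for equal offsets
        self._max = max_entries
        self._cv = threading.Condition()

    def put(self, offset, record):
        """Producer side: the metadata recursor stalls while the queue is full."""
        with self._cv:
            while len(self._heap) >= self._max:
                self._cv.wait()
            heapq.heappush(self._heap, (offset, next(self._seq), record))
            self._cv.notify_all()

    def get(self):
        """Consumer side: pop the lowest offset currently queued, so the reads
        issued approximate a sequential sweep even though discovery is random."""
        with self._cv:
            while not self._heap:
                self._cv.wait()
            offset, _, record = heapq.heappop(self._heap)
            self._cv.notify_all()
            return offset, record
```

The producer would be the metadata recursor calling put(), and a consumer per top-level vdev would call get() and issue the read.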

I suppose the other downside to such a thing is that it seems difficult to persist enough data to allow scrubs to pick up where they left off across exports or reboots, but I am not convinced that that is all that useful?

@ironMann
Contributor

ironMann commented Jul 6, 2016

I've just started looking into this. Initially I had the same idea of a RAM-only solution (and it might be my first prototype), but I don't think it will be enough for larger pools. As the design document in #1277 suggests, there are a few benefits to persisting the resilver log. I'm thinking about a solution in line with async_destroy, but I still have a lot to learn about zfs internals.

If somebody would like to provide mentorship for this project, feel free to contact me.

@nwf
Contributor

nwf commented Jul 6, 2016

I think a strictly-read-only scrub might be worthwhile, too. Maybe make persistence optional (treat it as a queue without bound so that it never blocks)?

Alternatively, doing multiple metadata scans and selecting the next consecutive chunk of DVAs, again without persistence, might be simpler. In this design, one would walk the metadata in full and collect the lowest e.g. 16M data DVAs (in sorted order, so that it can be traversed with big streaming reads). By remembering what the 16Mth DVA was, the next walk of the metadata could collect the next 16M DVAs. This is an easily resumable (just remember which bin of DVAs was being scrubbed) and bounded-memory algorithm that should be easy to implement. (Credit, I think, is due to HAMMER2.)
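Roughly, in sketch form (Python; DVAs are reduced to plain integer offsets, and walk_block_pointers / issue_read are hypothetical stand-ins):

```python
# Multi-pass, bounded-memory scan: each pass walks all the metadata but keeps
# only the chunk_size lowest offsets beyond a resume cursor, so the cursor is
# the only state that needs to be remembered to resume.
import heapq

def scan_in_chunks(walk_block_pointers, issue_read, chunk_size=16_000_000):
    cursor = -1  # highest offset already scrubbed
    while True:
        # One full metadata walk, keeping the smallest offsets past the cursor.
        batch = heapq.nsmallest(
            chunk_size,
            (dva for dva in walk_block_pointers() if dva > cursor),
        )
        if not batch:
            return               # nothing left beyond the cursor: done
        for dva in batch:        # already sorted: big streaming reads
            issue_read(dva)
        cursor = batch[-1]       # remember where this bin ended
```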

@thewacokid

thewacokid commented Jul 18, 2016

Perhaps this is not the correct way to push this, but SMR drives would absolutely love even a slightly sequential workload for resilvers. The current code degrades to 1-5 IOPS with SMR drives over time, which makes rebuilds take an eternity, especially with the queue depth stuck at 1 on the drive being replaced (is that a bug, or expected? I haven't had time to dig into the source).

Just to clarify, this is the SMR drive being resilvered to after a few hours (filtering out idle drives):
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 1.00 0.00 0.00 99.00

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdgt 0.00 0.00 0.00 6.00 0.00 568.00 189.33 1.00 165.67 166.67 100.00

@ironMann
Contributor

@thewacokid try boosting scrub I/O with the parameters suggested in #4825 (comment).
I've started a discussion about this feature on the openzfs-developer mailing list, and it seems work on it has already started.

@thewacokid

thewacokid commented Jul 19, 2016

@ironMann Those parameters help massively with normal drives; however, SMR drives eventually (within an hour or so) degrade to a handful of IOPS as they shuffle data out to the shingles. Perhaps higher queue depths would help, or async rebuild I/O, or something easier than a full sequential-resilver patch? I'm unsure why there's only ever one pending I/O to the target drive.

@scineram

@thewacokid There will be a talk on this next month to watch out for, probably the work @ironMann mentioned.
http://open-zfs.org/wiki/Scrub/Resilver_Performance

@mailinglists35

mailinglists35 commented Sep 27, 2016

From the OpenZFS conference recording I understand Nexenta may be able to do it. Can't wait to see this in ZoL!
http://livestream.com/accounts/15501788/events/6340478/videos/137014181
Scroll to minute 15.

@mailinglists35

What is the status of this? The link I pasted above is no longer working.

@mailinglists35

Hm, PR #5153 mentions a new PR, #5841, which intends to solve #3497 and appears to provide faster resilvering.

@nwf
Contributor

nwf commented Mar 9, 2017

@mailinglists35: the dRAID stuff is different, though it happens to have similar effects.

@skiselkov has written all the code to do this; it's in review at skiselkov/illumos-gate#2 and https://github.com/skiselkov/illumos-gate/commits/better_resilver_illumos. I have ported the code over to ZoL and have been testing it with delightful success (it was very straightforward, doubtless in part because ZoL strives to minimize divergence from upstream). I assume a pull request to ZoL will be forthcoming once the review is done and the code gets put back to Illumos.

ETA: @skiselkov's implementation is purely in-RAM and achieves persistence by periodically draining the reorder buffer, thereby bringing the metadata recursor's state and the set of blocks actually scrubbed into sync. This is a really neat design and keeps the on-disk persistence structure fully backwards compatible with the existing records. He deserves immense praise for the work. :)
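My reading of that design, as a very rough sketch (Python; traverse_metadata, issue_scrub_io, and save_bookmark are hypothetical stand-ins, not the actual interfaces):

```python
# Accumulate discovered blocks in an in-RAM reorder buffer; when it fills,
# issue everything in on-disk order and only then persist the recursor's
# bookmark, so on-disk state stays the same kind of record the existing
# sequential scan keeps.
def scan_with_periodic_drain(traverse_metadata, issue_scrub_io,
                             save_bookmark, drain_threshold=100_000):
    pending = []  # in-RAM reorder buffer of (offset, block) pairs
    for bookmark, offset, block in traverse_metadata():
        pending.append((offset, block))
        if len(pending) >= drain_threshold:
            # Drain: issue everything gathered so far in offset order...
            for off, blk in sorted(pending, key=lambda e: e[0]):
                issue_scrub_io(off, blk)
            pending.clear()
            # ...and only then record the recursor's position: everything up
            # to this bookmark really has been scrubbed.
            save_bookmark(bookmark)
    # Final drain for whatever was still buffered when the traversal ended.
    for off, blk in sorted(pending, key=lambda e: e[0]):
        issue_scrub_io(off, blk)
```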

@skiselkov
Contributor

@nwf Just an FYI, the resilver work isn't quite complete yet. I have a number of changes queued up that incorporate some more suggestions from Matt Ahrens from the design/early-review phase. Notably, a lot of the range_tree code is going to change, as well as some of the vdev queue taskq handling. Nothing too dramatic; I just don't want you to put a lot of work into porting only to have it blown away by a round of changes.

@nwf
Contributor

nwf commented Mar 9, 2017

@skiselkov: No worries! I'm happy to follow along and start over if needed. :)

@thegreatgazoo

Just to clarify:

  • Resilver, and any optimization of it, works with any type of vdev, including the new dRAID vdev
  • Rebuild, a completely new mechanism added by dRAID, works only with dRAID and mirror.

@mailinglists35

thank you all!
@behlendorf could you add a milestone for this?

@behlendorf added this to the 0.8.0 milestone Mar 9, 2017
@jumbi77
Contributor

jumbi77 commented Jun 26, 2017

Referencing #6256.

@interduo

interduo commented Nov 17, 2017

Thanks for this. This was a big problem for me in one location.

Will this come in the 0.7.4 release?

@behlendorf
Contributor

You're welcome, this feature will be part of 0.8.

Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this issue Jan 29, 2018
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <[email protected]>
Authored-by: Saso Kiselkov <[email protected]>
Authored-by: Alek Pinchuk <[email protected]>
Authored-by: Tom Caputi <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#3625
Closes openzfs#6256
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this issue Feb 13, 2018
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 28, 2019