Implement sequential (two-phase) resilvering #3625
@Deewiant thanks for filing this. Yes, this is something we've talked about implementing for some time. I think it would be great to implement when someone has the time.
Resilvering will also benefit greatly from the Metadata allocation class of issue #3779 to separate the metadata onto a separate SSD device. Another option that was discussed in the past for mirror devices (not sure if this was ever implemented) is to do a full linear "dd" style copy of the working device to the replacement, and then fall back to a scrub to verify the data was written correctly. That gets the data redundancy back quickly, using nice large streaming IO requests to the disks, and then the scrub can be done at a lower priority. There still exists the possibility that the source device has latent sector errors, so the failing drive shouldn't be taken offline until after the scrub, so that it could be used to read any blocks with bad checksums, if possible.
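A rough sketch of that dd-style idea in C, for illustration only (the device paths, chunk size, and error handling are placeholders, and this is of course not how ZFS itself would implement it): phase one restores redundancy with large, strictly sequential reads and writes, and phase two is an ordinary low-priority scrub that verifies checksums afterwards.

```c
/* Sequential "dd-style" copy of one mirror member onto its replacement.
 * Illustrative sketch only; not actual ZFS resilver code.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (16 * 1024 * 1024)	/* large streaming I/O size */

int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <source-dev> <replacement-dev>\n",
		    argv[0]);
		return (1);
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_WRONLY);
	char *buf = malloc(CHUNK);

	if (src < 0 || dst < 0 || buf == NULL) {
		perror("setup");
		return (1);
	}

	/* Phase 1: restore redundancy with purely sequential I/O. */
	ssize_t n;
	while ((n = read(src, buf, CHUNK)) > 0) {
		if (write(dst, buf, n) != n) {
			perror("write");
			return (1);
		}
	}

	fsync(dst);
	free(buf);
	close(src);
	close(dst);

	/* Phase 2 (not shown): a low-priority scrub verifies checksums and
	 * uses the old, possibly failing device to repair any blocks the
	 * new copy got wrong. */
	return (0);
}
```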
I guess another approach to speed up resilvering is the parity-declustered RAIDz/mirror of #3497?! Another idea which was not mentioned yet is the "RAID-Z/mirror hybrid allocator" from Oracle. I'm not sure if this also accelerates resilvering, but it is a nice performance boost in general, I guess. As far as I understand, the metadata is then mirrored within the raidz. Is it planned to implement this, or is it obsolete because of #3779?
@jumbi77 Parity declustered RAID is a new type of VDEV, called dRAID, which will offer scalable rebuild performance. It will not affect how RAIDz resilver works.
Is anybody working on this feature, or planning to implement it in the future? Just curious.
I'd like to suggest that there be a RAM-only queue mode for sequential resilvering, along the lines of rsync's asynchronous recursor. Rather than traverse all the metadata blocks at once and sort all the data blocks, which is likely to be an enormous collection which must itself be serialized to disk, it might be nice for the system to use a standard in-RAM producer/consumer queue with the producer (metadata recursor) stalling if the queue fills. The queue would of course be sorted by address on device (with multiple VDEVs intermixed) so that it acted as an enormous elevator queue. While no longer strictly sequential -- the recursor would find blocks out of order and have a limited ability to sort while blocks were in queue -- the collection of data block pointers no longer needs to be persisted to disk and there should be plenty of opportunities for streaming reads. I suppose the other downside to such a thing is that it seems difficult to persist enough data to allow scrubs to pick up where they left off across exports or reboots, but I am not convinced that that is all that useful?
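As a rough illustration of that bounded elevator-queue idea (all names here are made up, and real code would key an AVL tree on DVA offset inside the existing scan threads rather than use a sorted array): the producer stalls when the queue is full, and the consumer always issues the lowest outstanding offset.

```c
/* Bounded, offset-sorted producer/consumer queue ("elevator queue").
 * Hypothetical sketch; initialize lock/condvars with
 * PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER before use.
 */
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_CAP 4096

typedef struct scan_ent {
	uint64_t offset;	/* location on the leaf vdev */
	uint64_t size;
} scan_ent_t;

typedef struct scan_queue {
	scan_ent_t	ents[QUEUE_CAP];	/* kept sorted by offset */
	int		count;
	pthread_mutex_t	lock;
	pthread_cond_t	not_full;
	pthread_cond_t	not_empty;
} scan_queue_t;

/* Producer: the metadata recursor calls this and stalls when full. */
void
scan_queue_put(scan_queue_t *q, scan_ent_t ent)
{
	pthread_mutex_lock(&q->lock);
	while (q->count == QUEUE_CAP)
		pthread_cond_wait(&q->not_full, &q->lock);

	/* Insertion sort by offset keeps the queue in elevator order. */
	int i = q->count;
	while (i > 0 && q->ents[i - 1].offset > ent.offset) {
		q->ents[i] = q->ents[i - 1];
		i--;
	}
	q->ents[i] = ent;
	q->count++;

	pthread_cond_signal(&q->not_empty);
	pthread_mutex_unlock(&q->lock);
}

/* Consumer: the I/O issuer always takes the lowest outstanding offset,
 * so reads go out in roughly ascending disk order. */
scan_ent_t
scan_queue_get(scan_queue_t *q)
{
	pthread_mutex_lock(&q->lock);
	while (q->count == 0)
		pthread_cond_wait(&q->not_empty, &q->lock);

	scan_ent_t ent = q->ents[0];
	memmove(&q->ents[0], &q->ents[1],
	    (q->count - 1) * sizeof (scan_ent_t));
	q->count--;

	pthread_cond_signal(&q->not_full);
	pthread_mutex_unlock(&q->lock);
	return (ent);
}
```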
I've just started looking into this. Initially I had the same idea of a RAM-only solution (and it might be my first prototype), but I don't think it will be enough for larger pools. As the design document in #1277 suggests, there are a few benefits to persisting the resilver log, and I'm thinking about a solution in line with that design. If somebody would like to provide mentorship for this project, feel free to contact me.
I think a strictly-read-only scrub might be worthwhile, too. Maybe make persistence optional (treat it as a queue without bound so that it never blocks)? Alternatively, doing multiple metadata scans and selecting the next consecutive chunk of DVAs, again without persistence, might be simpler. In this design, one would walk the metadata in full and collect the lowest e.g. 16M data DVAs (in sorted order, so that it can be traversed with big streaming reads). By remembering what the 16Mth DVA was, the next walk of the metadata could collect the next 16M DVAs. This is an easily resumable (just remember which bin of DVAs was being scrubbed) and bounded-memory algorithm that should be easy to implement. (Credit, I think, is due to HAMMER2.)
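For what it's worth, a sketch of that multi-pass binning scheme (walk_metadata() and issue_scrub_read() are hypothetical stand-ins for the real block-pointer traversal and scrub I/O path, and BIN_SIZE is tiny here purely for illustration): each pass keeps only the lowest offsets above a cursor, so memory stays bounded and the cursor is the only state that must survive a reboot.

```c
#include <stdint.h>

#define BIN_SIZE 16	/* e.g. millions of DVAs in practice; tiny here */

typedef struct bin_ctx {
	uint64_t cursor;		/* highest offset already scrubbed */
	uint64_t offs[BIN_SIZE];	/* lowest offsets above the cursor */
	int	 count;
} bin_ctx_t;

/* Hypothetical hooks for the real metadata walk and scrub reads. */
extern void walk_metadata(void (*cb)(uint64_t off, void *arg), void *arg);
extern void issue_scrub_read(uint64_t off);

/* Visit one data DVA: keep only the BIN_SIZE lowest offsets > cursor. */
static void
bin_visit(uint64_t off, void *arg)
{
	bin_ctx_t *b = arg;

	if (off <= b->cursor)
		return;		/* already handled by an earlier pass */
	if (b->count == BIN_SIZE && off >= b->offs[BIN_SIZE - 1])
		return;		/* not among the lowest seen this pass */

	/* Sorted insert, dropping the current maximum if the bin is full. */
	int i = (b->count < BIN_SIZE) ? b->count++ : BIN_SIZE - 1;
	while (i > 0 && b->offs[i - 1] > off) {
		b->offs[i] = b->offs[i - 1];
		i--;
	}
	b->offs[i] = off;
}

/* Repeat full metadata walks, scrubbing one sorted chunk per pass.
 * Only 'cursor' needs to be persisted to make the scan resumable. */
void
binned_scrub(void)
{
	bin_ctx_t b = { .cursor = 0 };

	for (;;) {
		b.count = 0;
		walk_metadata(bin_visit, &b);
		if (b.count == 0)
			break;				/* nothing left above cursor */

		for (int i = 0; i < b.count; i++)
			issue_scrub_read(b.offs[i]);	/* ascending offsets */

		b.cursor = b.offs[b.count - 1];		/* resume point */
	}
}
```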
Perhaps this is not the correct way to push this, but SMR drives would absolutely love even a slightly sequential workload for resilvers. Current code degrades to 1-5 IOPS with SMR drives over time, which makes rebuilds take an eternity, especially with the queue depth stuck at 1 on the drive being replaced (is that a bug, or expected? I haven't had time to dig into the source). Just to clarify, this is the SMR drive being resilvered to after a few hours (filtering out idle drives):

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
@thewacokid try boosting scrub I/O with the parameters suggested in #4825 (comment)
@ironMann Those parameters help massively with normal drives; however, SMR drives eventually (within an hour or so) degrade to a handful of IOPS as they shuffle data out to the shingles. Perhaps higher queue depths would help, or async rebuild IO, or something easier than a full sequential resilver patch? I'm unsure why there's only ever one pending IO to the target drive.
@thewacokid There will be a talk next month to watch out for on this issue. Probably the work @ironMann mentioned.
From the OpenZFS conference recording I understand Nexenta may be able to do it. Can't wait to see this in ZoL!
What is the status of this? The link I've pasted above is no longer working.
@mailinglists35: the dRAID stuff is different, though it happens to have similar effects. @skiselkov has written all the code to do this; it's in review at skiselkov/illumos-gate#2 and https://github.com/skiselkov/illumos-gate/commits/better_resilver_illumos. I have ported the code over to ZoL and been testing it with delightful success (it was very straightforward, doubtless in part because ZoL strives to minimize divergence against upstream). I assume a pull request to ZoL is forthcoming once the review is done and the code gets put back to Illumos.

ETA: @skiselkov's implementation is purely in-RAM and achieves persistence by periodically draining the reorder buffer, thereby bringing the metadata recursor's state and the set of blocks actually scrubbed into sync. This is a really neat design and keeps the on-disk persistence structure fully backwards compatible with the existing records. He deserves immense praise for the work. :)
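My (possibly imperfect) reading of that persistence trick, sketched with invented names rather than the actual illumos interfaces: the sorted in-RAM buffer is drained completely before the recursor's bookmark is written out, so the on-disk record never claims more progress than has really been scrubbed, and the record format itself stays unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real scan machinery. */
typedef struct bookmark { uint64_t objset, object, level, blkid; } bookmark_t;
extern bool	recursor_next(bookmark_t *bm, uint64_t *dva_off);  /* metadata walk */
extern void	buffer_insert(uint64_t dva_off);	/* sorted in-RAM buffer */
extern bool	buffer_is_full(void);
extern void	buffer_drain_and_issue(void);		/* issue all buffered I/O */
extern void	persist_bookmark(const bookmark_t *bm);	/* existing on-disk record */

/*
 * Drain the reorder buffer periodically so the persisted bookmark and the
 * set of blocks actually scrubbed stay in sync.
 */
void
scan_with_checkpoints(void)
{
	bookmark_t bm = { 0 };
	uint64_t off;

	while (recursor_next(&bm, &off)) {
		buffer_insert(off);

		if (buffer_is_full()) {
			buffer_drain_and_issue();  /* sorted, mostly sequential I/O */
			persist_bookmark(&bm);	   /* everything before bm is done */
		}
	}
	buffer_drain_and_issue();
	persist_bookmark(&bm);
}
```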
@nwf Just an FYI, the resilver work isn't quite complete yet. I have a number of changes queued up that build in some more suggestions from Matt Ahrens from the design/early review phase. Notably a lot of the range_tree code is gonna change, as well as some of the vdev queue taskq handling. Nothing too dramatic, I just don't want you to put in a lot of work on porting and to have it then blown out by changing it a lot.
@skiselkov: No worries! I'm happy to follow along and start over if needed. :)
Just to clarify:
Thank you all!
Referencing #6256.
You're welcome; this feature will be part of 0.8.
Currently, scrubs and resilvers can take an extremely long time to complete. This is largely due to the fact that zfs scans process pools in logical order, as determined by each block's bookmark. This makes sense from a simplicity perspective, but blocks in zfs are often scattered randomly across disks, particularly due to zfs's copy-on-write mechanisms. This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance. This patch also updates and cleans up some of the scan code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <[email protected]>
Authored-by: Saso Kiselkov <[email protected]>
Authored-by: Alek Pinchuk <[email protected]>
Authored-by: Tom Caputi <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#3625
Closes openzfs#6256
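As an aside, one reason the sorted issue phase pays off is that adjacent extents can be coalesced into large sequential reads. A hypothetical sketch of that coalescing step, separate from the actual patch (issue_scrub_read() is a made-up hook):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct extent {
	uint64_t start;		/* offset on the leaf vdev */
	uint64_t size;
} extent_t;

static int
extent_cmp(const void *a, const void *b)
{
	const extent_t *ea = a, *eb = b;
	if (ea->start < eb->start)
		return (-1);
	return (ea->start > eb->start);
}

/* Hypothetical scrub-read hook. */
extern void issue_scrub_read(uint64_t start, uint64_t size);

/*
 * Issue phase: sort the gathered extents by offset and merge runs that
 * touch, so many small random reads become a few large sequential ones.
 */
void
issue_sorted(extent_t *exts, int n)
{
	if (n == 0)
		return;

	qsort(exts, n, sizeof (extent_t), extent_cmp);

	uint64_t start = exts[0].start;
	uint64_t end = exts[0].start + exts[0].size;

	for (int i = 1; i < n; i++) {
		if (exts[i].start <= end) {	/* contiguous or overlapping */
			if (exts[i].start + exts[i].size > end)
				end = exts[i].start + exts[i].size;
		} else {
			issue_scrub_read(start, end - start);
			start = exts[i].start;
			end = exts[i].start + exts[i].size;
		}
	}
	issue_scrub_read(start, end - start);
}
```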
https://blogs.oracle.com/roch/entry/sequential_resilvering describes a two-phase resilvering process which avoids random I/O, potentially dramatically speeding up resilvering especially on HDDs.
As far as I know ZoL doesn't do anything like this, so I created this issue to keep track of the situation.