dRAID vdev driver #7078
Conversation
Codecov Report

|          | master | #7078  | +/-     |
|----------|--------|--------|---------|
| Coverage | 77.08% | 42.73% | -34.35% |
| Files    | 336    | 282    | -54     |
| Lines    | 107132 | 94446  | -12686  |
| Hits     | 82579  | 40360  | -42219  |
| Misses   | 24553  | 54086  | +29533  |

Continue to review the full report at Codecov.
I have a question. I've been playing with dRAID in a VM. It seems like dRAID wants fairly large vdevs in order to spread the spare capacity across drives. I was wondering whether there is anything in dRAID that keeps read speed from being limited to the slowest drive in the vdev. In my testing (again, in a VM with a 16-drive pool backed by 5 spinning-rust disks), sequential reads do appear to be limited to approximately the bandwidth of one drive. Thanks.
@naclosagc Spare blocks are always evenly distributed regardless of the number of child drives in a dRAID vdev, even in a small vdev such as 11 drives (8D+2P with 1 distributed spare). As to read speed, a single ZFS block is stored on D+P drives, which isn't necessarily equal to the total number of child drives; e.g. in an 81-drive draid2 configured as 8D+2P with 1 spare, a ZFS block is stored on 10 drives, not 81. So reading a single ZFS block is limited by the slowest of those D+P drives rather than by all children. In a real-world workload where many blocks are read at a time, the I/Os should be spread evenly across all drives. In your configuration, the 16 drives in the VM were actually backed by 5 physical drives, so any sequential read over the 16-drive pool will look more or less random to the 5 physical drives. I'm not surprised the throughput was low.
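To make the parallelism point concrete, here is a standalone sketch (not dRAID code; the 81-drive 8D+2P geometry is taken from the example above):

```c
/*
 * Standalone illustration, not dRAID source code: a single ZFS block in a
 * dRAID vdev only touches d + p of the n children, so one stream reading
 * block-by-block is bounded by the slowest of those d + p disks, not by n.
 */
#include <stdio.h>

int
main(void)
{
	int n = 81, d = 8, p = 2;	/* 81-drive draid2 with 8D+2P groups */
	int per_block = d + p;		/* drives touched by one ZFS block */

	printf("one block touches %d of %d drives (%.0f%%)\n",
	    per_block, n, 100.0 * per_block / n);
	return (0);
}
```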
I am about to integrate the draidcfg command into zpool create so that there is no separate step to create a configuration file for dRAID. The following parameters are needed for dRAID vdev creation: the parity level p, the number of data drives per stripe d, the number of distributed spares s, and the total number of child drives n.
The n can be derived from the number of child drives given to zpool create, so there is no need to specify it explicitly. Currently it's required that (n - s) % (p + d) == 0, but we plan to remove this restriction; d will then not be the same for all stripes. Therefore, instead of d, g (the number of stripes, i.e. raid groups) should be specified. In summary, the dRAID vdev specification should contain p, g, and s. I'd suggest using draid<p>:<g>:<s>, e.g. zpool create draid2:4:2 sda sdb ... (42 drives) to create a double-parity dRAID vdev of 42 drives with 2 drives of spare capacity and 4 stripes (in this case 8 data drives per stripe). Another change is the special character that prefixes dRAID spare vdev names. Currently it's $, e.g. $draid1-0-s0. Being a special character in the shell, $ can make it tricky for shell scripts to handle. I'll start writing code once we agree on the naming formats. Please comment.
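A quick sketch of how the pieces of the proposed draid<p>:<g>:<s> spec relate to each other (illustrative only, not zpool code; it assumes (n - s) divides evenly by g, as in the example above):

```c
/*
 * Illustration of the proposed draid<p>:<g>:<s> spec: with n children given
 * on the command line, the per-group data width follows from
 * d = (n - s) / g - p when (n - s) divides evenly by g.
 */
#include <stdio.h>

int
main(void)
{
	int n = 42, p = 2, g = 4, s = 2;	/* zpool create draid2:4:2 <42 drives> */
	int group_width = (n - s) / g;		/* drives per redundancy group */
	int d = group_width - p;		/* data drives per group */

	printf("%d groups of %dD+%dP, %d distributed spare(s)\n", g, d, p, s);
	return (0);
}
```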
@thegreatgazoo That proposal sounds good. dRAID is great for the huge-number-of-disks use case that you've designed it for. I think it will also be useful for other configurations that are "degenerate" in some respect. For example, it could be used as a replacement for a single raidz group, with Groups=1 and Spares=0 or 1. The advantages over RAIDZ would be a simpler data layout, (truly) sequential resilver, and better performance for small blocks via the mirrored region. Considering that these degenerate use cases may also be (somewhat) popular, what would you think about making some of the quantities (e.g. g and s) optional? Did you have a different special character in mind, instead of $?
I was also going to suggest using % for the spare prefix. One question I have is whether it is more natural to specify "G = number of parity groups" or "D = number of data devices" for the pool. IMHO, it is more natural to specify the number of data, parity, and spare drives for RAID devices, like draid<P>:<D>:<S> (e.g. draid2:8:1), as this is what everyone has been using for years, rather than "2 parity drives and 4 groups" as proposed here. One drawback is that this puts a (very small) burden on the admin to split the drives into a multiple of D+P units, but this is also partly true if one specifies G. The second question (somewhat independent of the above) is whether there should be an upper limit on the number of drives in a single RAID group when none is specified. While it might be more difficult to document this, I think one of the main design goals of ZFS is ease of use (not ease of coding), and giving users (especially those who don't know the right answer) a reasonable default configuration is worth a bit more initial development effort.
@thegreatgazoo your proposed interface makes good sense to me, along with @ahrens's proposed tweaks.
@adilger I agree that we want ease of use, and to make it hard to shoot yourself in the foot. However, I also don't want to artificially limit the use cases. When we implemented RAIDZ we definitely talked about how it would be nuts to have >10 disks in a RAIDZ group... but that is now commonplace even among folks who understand the performance tradeoffs. As a compromise, we could require an explicit opt-in for unusually wide groups. I think that allowing different-width stripes is great because it makes RAIDZ "just work" in more scenarios, without having to do any exact math (is the number of drives evenly divisible by D+P?).
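A small sketch of why specifying the group count avoids the exact-math problem (the 43-drive geometry and the round-robin split are my own assumptions; dRAID's real layout is permutation-based and more involved):

```c
/*
 * Hypothetical sketch: when the group count g is given instead of a fixed
 * data width d, the n - s non-spare drives can be split into groups whose
 * widths differ by at most one, so "awkward" drive counts still work.
 */
#include <stdio.h>

int
main(void)
{
	int n = 43, p = 2, g = 4, s = 1;	/* assumed example geometry */
	int total = n - s;

	for (int i = 0; i < g; i++) {
		/* the first (total % g) groups absorb the remainder */
		int width = total / g + (i < total % g ? 1 : 0);

		printf("group %d: %dD+%dP (%d drives)\n",
		    i, width - p, p, width);
	}
	return (0);
}
```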
Do you mean that the logical stripe would be 2 parity + 8 data disks (which I think is @thegreatgazoo's notation for 8D+2P)?
@adilger this was my initial inclination as well, likely since it's what I'm used to. What convinced me otherwise is @ahrens's and @thegreatgazoo's insight that by specifying G instead of D, each group can have a different number of disks internally. This lets us always satisfy the requested layout regardless of the exact number of drives.
My feeling on this is that we shouldn't strictly impose an arbitrary limit here. Often there are good reasons to want to create a pool with an exotic configuration, and that should be allowed. Instead, how about leveraging the existing -f (force) option for such configurations?
I'm concerned that folks will become even more accustomed to always specifying -f.
I think it may be a good balance between "shooting yourself in the foot" and allowing exotic configurations.
👍
👍
@ahrens, my understanding is that the original proposal is draid<p>:<g>:<s>. I do see the benefit described by @behlendorf above, that specifying the number of groups lets each group have a different number of disks. I'm definitely not a fan of overloading -f for this.
That said, since this is irreversible once the vdev is added to a pool, a warning (or required confirmation) at creation time seems worthwhile. My other concern relates to @ahrens's proposal for the behavior in the absence of the g and/or s values: what default geometry should be chosen? Computing this should be relatively straightforward: iterate over possible geometries and try to find one that fits evenly into the given number of drives (with a reasonable number of hot spares, if unspecified), or that minimizes the imbalance between RAID groups.
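A rough sketch of such a search (purely illustrative; the candidate data widths, the spare range, and the leftover-based scoring are assumptions, not a worked-out policy):

```c
/*
 * Illustrative geometry search, not zpool code: given n drives and parity p,
 * try a range of group data widths and spare counts, and pick the layout
 * whose groups fit the non-spare drives with the smallest leftover.
 */
#include <stdio.h>

int
main(void)
{
	int n = 53, p = 2;			/* assumed inputs */
	int best_d = 0, best_s = 0, best_left = n;

	for (int d = 4; d <= 16; d++) {		/* candidate data widths */
		for (int s = p; s <= 2 * p; s++) {	/* reasonable spare counts */
			int leftover = (n - s) % (d + p);

			if (leftover < best_left) {
				best_left = leftover;
				best_d = d;
				best_s = s;
			}
		}
	}

	printf("suggest %dD+%dP groups with %d spare(s), %d drive(s) left over\n",
	    best_d, p, best_s, best_left);
	return (0);
}
```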
AIUI, making D not a power of 2 can lead to space wastage, especially when the physical block size > 512 (ashift > 9). For large D this wastage can be substantial. This will confuse folks, so at a minimum a warning should be emitted if D is not a power of 2. Similarly, for large D and ashift > 9, the default volblocksize=8K can be too small. For example, if D=8 and the physical block size is 4K (ashift=12), which is likely a common configuration, the optimal volblocksize is 32K, not 8K. Using an 8K volblocksize results in up to 75% space wastage.
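A back-of-the-envelope check of those numbers (my own arithmetic, assuming each allocation is padded up to a whole multiple of D data sectors; parity is ignored since it is overhead in either case):

```c
/*
 * Rough waste estimate for a given block size on a dRAID group with D data
 * drives and 4K sectors, under the assumption that allocations are rounded
 * up to full rows of D data sectors.
 */
#include <stdio.h>

static void
waste(int blocksize, int sector, int d)
{
	int data_sectors = blocksize / sector;
	int rows = (data_sectors + d - 1) / d;	/* round up to full rows */
	int alloc_sectors = rows * d;

	printf("blocksize %3dK on %dD: %d/%d data sectors used, %.0f%% wasted\n",
	    blocksize / 1024, d, data_sectors, alloc_sectors,
	    100.0 * (alloc_sectors - data_sectors) / alloc_sectors);
}

int
main(void)
{
	waste(8 * 1024, 4096, 8);	/* default volblocksize=8K: 2 of 8 used */
	waste(32 * 1024, 4096, 8);	/* volblocksize=32K: 8 of 8 used */
	return (0);
}
```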
@thegreatgazoo does this PR include the bug fixes from your internal dev repo (e.g. thegreatgazoo#7)?
@gmelikov We don't have an internal repo any more; this is and will always be the latest code. The ticket you mentioned is not fixed in this PR yet, because it depends on the metadata allocation class work.
Thanks for all the comments. Sorry for the late response; I was on vacation last week. It looks like we agreed to use % for spare naming, but we still can't all agree on G vs. D, or on the default behavior when G/D and/or S is omitted. While I agree with @adilger that people are more used to specifying D, I think it can be a bit confusing when they end up with some groups having fewer than D data drives. @ahrens I'm not sure how useful it'd be to allow S=0, which limits rebuild to the write throughput of a single replacement drive, given that raidz resilver is getting faster with the new scan-sort-fix implementation (and resilver has the additional advantage of checksum verification). I thought S, when not specified, should equal P, i.e. enough spare space to handle the maximum concurrent failures the parity level can tolerate. Next I'll change the spare prefix to % and rebase the code onto the latest ZoL master. Meanwhile we can continue to discuss the interface details.
My point is that most ZFS users are more familiar with specifying D.
I don't think that's true if you use compression. In that case the size to allocate will be randomized, so no one minimum allocation size will be substantially better than another.
I agree, we should issue a warning when creating a volume with a volblocksize that results in substantial waste (or a filesystem with such a recordsize). This also applies to RAIDZ. We could also issue a warning even on mirrors and plain disks when compression is enabled and volblocksize (or recordsize) is equal to (or perhaps not much larger than) 1<<ashift. In your example, if compression is on (which I imagine is the majority of the time), volblocksize=128K would be much better than 32K.
We discussed offline a new way of specifying the dRAID geometry, which can specify either the number of data drives per group or the number of groups. If one of them is omitted, a reasonable default is chosen. We'd like to allow any strange configurations (e.g. super-wide) without needing to add new flags.
Actually, I think the space wastage will be substantially the same regardless of whether D is a power of 2, for (almost[*]) all reasonable use cases. If you're storing already-compressed data, then you wouldn't use ZFS compression; but such data is accessed sequentially, so you should use a large block size. With recordsize=1M, sector size=4K (ashift=12), and 9 data disks per group, the maximum waste is 3%. I think we should consider changing the default recordsize to 1MB for dRAID. If your data is not already compressed, then you should use ZFS compression. In that case the size to allocate will be effectively randomized, so no one minimum allocation size will be substantially better than another. If you are using record-structured data like a typical database or zvol, the consumer's access size is likely 8K or less. Matching the ZFS recordsize or volblocksize to this small an access size with dRAID (or RAIDZ) is almost always going to waste lots of space, regardless of the number of disks per group. ([*] Caveat: with ashift=9 and compression=off, D=4, 8, or 16 would be reasonable and an improvement over non-powers-of-two. But ashift=9 drives are increasingly rare and small compared to ashift=12 drives.)
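A quick check of the 3% figure, under the same padding assumption as the earlier sketch (allocations rounded up to a multiple of D data sectors, worst case being D-1 padding sectors):

```c
/*
 * Worst-case padding for a 1M record on 9 data disks with 4K sectors:
 * at most D-1 extra sectors on top of the record's 256 data sectors.
 */
#include <stdio.h>

int
main(void)
{
	int recordsize = 1024 * 1024, sector = 4096, d = 9;
	int data_sectors = recordsize / sector;		/* 256 */
	int worst_pad = d - 1;				/* 8 */

	printf("worst-case waste: %d/%d sectors = %.1f%%\n",
	    worst_pad, data_sectors + worst_pad,
	    100.0 * worst_pad / (data_sectors + worst_pad));
	return (0);
}
```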
I like the convergence on the dRAID config format. Regarding compression and recordsize/volblocksize, there is also the threshold of usefulness for compression: ZFS only stores a block compressed if compression actually saves enough space.
Signed-off-by: Isaac Huang <[email protected]>
Changed dRAID spare vdev prefix from '$' to '%'. Fixed a few build and style warnings. Fixed rebuild status report (/issues/10). Signed-off-by: Isaac Huang <[email protected]>
Signed-off-by: Isaac Huang <[email protected]>
My next actions:
copies = mirror ?
    vd->vdev_nparity + 1 : vd->vdev_nparity + cfg->dcf_data;
groups_per_perm = (vd->vdev_children - cfg->dcf_spare + copies - 1)
If the most likely case is not draid-mirror, then this calculation is better relocated below the assert, just prior to its use.
@thegreatgazoo - it looks like the Metadata Allocation Class PR #5182 just landed. Any chance you will have time to rebase this patch?
@don-brady will be opening a new PR with a rebased version.
Refreshed version in #8016
This patch implements the dRAID vdev driver and a new rebuild mechanism (#3497). This is still work in progress: the user interface may change, and the on-disk format may change as well.
I've added a dRAID howto. It contains only the basics for now, but I'll continue to update the document. It may also help to watch the dRAID talk from the 2017 OpenZFS Developer Summit.
Please report bugs to the dRAID project.
Comments, testing, fixes, and porting are greatly appreciated!
Code structure:
The goal is to change existing code in such a way that, when dRAID is not in use, there is effectively no change in behavior.
Todo: