
dRAID vdev driver #7078

Closed
wants to merge 3 commits into from

Conversation

@thegreatgazoo commented Jan 24, 2018

This patch implements the dRAID vdev driver and a new rebuild mechanism (#3497). This is still a work in progress: the user interface may change, and the on-disk format may change as well.

I've added a dRAID howto. It contains only the basics for now, but I'll continue to update the document. It may also help to watch the dRAID talk from the 2017 OpenZFS Developer Summit.

Please report bugs to the dRAID project.

Comments, testing, fixes, and porting are greatly appreciated!

Code structure:

  • New code
    • module/zfs/vdev_draid.c: vdev driver for draid and draid spare
    • module/zfs/vdev_scan.c: sequential rebuild, for both draid and mirror vdevs
    • cmd/draidcfg/*.[ch]: user space tools, mainly to create permutations
  • Changes to existing code
    • module/zfs/vdev_raidz.c: the parity functions need to include draid skip sectors for computation and reconstruction.
    • module/zfs/vdev_mirror.c: minor changes to support mirror_map_t allocated by draid vdev (for hybrid mirror support)
    • module/zfs/metaslab.c: to add support for draid hybrid mirror, also disallow block allocation during rebuild
    • Other changes:
      • Add knowledge about the new vdev types and the new rebuild mechanism
      • A draid spare pretends to be a leaf vdev but actually is not. Some code needs to be aware of that, e.g. code that handles labels on leaf vdevs.

The goal is to change existing code in such a way that when draid is not in use there is effectively no change in behavior.

Todo:

  • Merge the draidcfg utility into zpool command so that there is no additional step before creating a draid pool
  • Add a draid mirror vdev type to represent the mirrored metaslabs
  • Clean up the rebuild in-memory and persisted states so that they don't share any state with the resilver states. This will allow concurrent rebuild and resilver, e.g. a raidz vdev can be resilvering while a draid vdev is rebuilding.

codecov bot commented Jan 24, 2018

Codecov Report

Merging #7078 into master will decrease coverage by 34.34%.
The diff coverage is 12.29%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #7078       +/-   ##
===========================================
- Coverage   77.08%   42.73%   -34.35%     
===========================================
  Files         336      282       -54     
  Lines      107132    94446    -12686     
===========================================
- Hits        82579    40360    -42219     
- Misses      24553    54086    +29533
Flag Coverage Δ
#kernel 4.46% <2.05%> (-73.1%) ⬇️
#user 49.33% <12.33%> (-17.25%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2050753...b56a862. Read the comment docs.

@naclosagc

I have a question. I've been playing with draid in a VM. It seems like draid wants bigger vdevs in order to spread the spares across drives. I was wondering whether there is anything in draid that keeps read speed from being limited by the slowest drive in the vdev. In my testing - and again, it's in a VM with a 16-drive pool across 5 spinning-rust disks - sequential reads do in fact seem limited to approximately the bandwidth of one drive. Thanks.

@thegreatgazoo (Author)

@naclosagc Spare blocks are always evenly distributed regardless of the # of child drives in a draid vdev, even in a small draid vdev like 11 drives (8D+2P with 1 distributed spare). As to read speed, a single ZFS block is stored on D+P drives, which isn't necessarily equal to the total # of child drives; e.g. in an 81-drive draid2 configured as 8D+2P with 1 spare, a ZFS block is stored on 10 drives instead of 81. So reading a single ZFS block is limited by the slowest of those D+P drives rather than by all the child drives. In a real-world workload where many blocks are being read at a time, the IOs should be spread evenly across all drives.

In your configuration, the 16 drives in the VM were actually backed by 5 physical drives, so any sequential read over the 16-drive pool will appear more or less random to the 5 physical drives. I'm not surprised the throughput was low.

@thegreatgazoo (Author)

I am about to integrate the draidcfg command into zpool create so that there'd be no separate step to create a configuration file for dRAID. The following parameters are needed for dRAID vdev creation:

  • p: parity level; can be 1, 2, or 3
  • d: # of data blocks per stripe
  • s: # of distributed spares
  • n: total # of drives

The n can be derived from the number of child drives given to zpool create, so there is no need to specify it explicitly. Currently it's required that (n - s) % (p + d) == 0, but we plan to remove this restriction. Then d will not be the same for all stripes. Therefore, instead of d, g (the # of stripes or raid groups) should be specified. In summary, the dRAID vdev specification should contain: p, g, and s.

I'd suggest using draidp:g:s, e.g. zpool create draid2:4:2 sda sdb ...... (42 drives) to create a double-parity dRAID vdev of 42 drives with 2 drives of spare capacity and 4 stripes (in this case, 8 data drives per stripe).
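
To make that arithmetic concrete, here is a minimal sketch (hypothetical names, not the actual draidcfg or zpool code) of deriving d from the proposed p, g, s and the n children passed to zpool create:

#include <stdio.h>

/* Derive d and check the current (n - s) % (p + d) == 0 restriction. */
static int
draid_derive_data(int n, int p, int g, int s, int *d)
{
    int per_group;

    if (g <= 0 || n - s <= 0)
        return (-1);
    per_group = (n - s) / g;    /* drives in each redundancy group */
    if (per_group <= p)
        return (-1);            /* need at least one data drive */
    *d = per_group - p;
    return ((n - s) % (p + *d) == 0 ? 0 : -1);
}

int
main(void)
{
    int d;

    /* zpool create draid2:4:2 <42 drives>, as in the example above */
    if (draid_derive_data(42, 2, 4, 2, &d) == 0)
        printf("d = %d data drives per stripe\n", d);    /* prints d = 8 */
    return (0);
}

Once the even-division restriction is lifted, the final check would go away and groups could differ in width.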

Another change is the special character that prefixes dRAID spare vdev names. Currently it's $, e.g. $draid1-0-s0. Since $ is a special character in the shell, it can be tricky for shell scripts to handle.

I'll start to write code once we agree on the naming formats. Please comment.

@ahrens (Member) commented Mar 21, 2018

@thegreatgazoo That proposal sounds good.

Draid is great for the huge-number-of-disks use case that you've designed it for. I think it will also be useful for other configurations that are "degenerate" in some respect. For example, it could be used as a replacement for a single raidz group, with Groups=1 and Spares=0 or 1. The advantage over RAIDZ would be simpler data layout, (truly) sequential resilver, and better performance of small blocks via the mirrored region.

Considering that these degenerate use cases may also be (somewhat) popular, what would you think about making some of the quantities optional: draidP[:G[:S]]? If Spares is not specified, it would default to 0. If Groups and Spares are both unspecified, it would default to 1 group and 0 spares. We might also allow draidP:[G]:S, where Spares is specified but Groups is not (defaulting to 1). We'd want to always display the full draidP:G:S, but allow the user to input just zpool create poolname draidP disk1 disk2 ....

Did you have a different special character in mind, instead of $? Perhaps %, which we use for some special dataset names?

@adilger (Contributor) commented Mar 21, 2018

I was also going to suggest using % as the special character for dRAID spare devices, since this has a low chance of conflict with the shell.

One question I have is whether it is more natural to specify "G=number of parity groups" or "D=number of data devices" for the pool? IMHO, it is more natural to specify the number of data, parity, and spare drives for RAID devices like "draidP:D:S" (e.g. draid2:8:1), as this is what everyone has been using for years, rather than "2 parity drives and 4 groups" as proposed here. One drawback is that this puts a (very small) burden on the admin to split the drives into a multiple of "D+P" units, but this is also partly true if one specifies G badly for a VDEV (e.g. draid2:3:1 in a 10-drive pool resulting in 3x 1D+2P RAID groups).

The second question (somewhat independent of the above) is whether there should be an upper limit on the number of drives in a single RAID group if this is not specified (e.g. draid2::1 in an 88-drive pool). While it is easiest to document "default to 1 group", I don't think any sane user would actually want an 85+2 parity device. Should there be a reasonable default maximum number of drives (e.g. 12D+2P) unless otherwise specified, possibly working out a good number based on the actual number of drives (e.g. something in the range of 8D-12D that works out to an even multiple of drives per group, if possible)?

While it might be more difficult to document this, I think one of the main design goals of ZFS is ease of use (not ease of coding), and giving users (especially those that don't know the right answer) a reasonable default configuration is worth a bit more initial development effort.

@behlendorf (Contributor)

@thegreatgazoo your proposed interface makes good sense to me along with @ahrens's proposed tweaks.

@ahrens (Member) commented Mar 21, 2018

@adilger I agree that we want ease of use, and to make it hard to shoot yourself in the foot. However, I also don't want to artificially limit the use cases. When we implemented RAIDZ we definitely talked about how it would be nuts to have >10 disks in a RAIDZ group... but that is now commonplace even among folks who understand the performance tradeoffs. As a compromise, we could require --allow-wide-stripe (even though it's pretty gross).

I think that allowing different-width stripes is great because it makes draid "just work" in more scenarios, without having to do any exact math (is (N - S) % (D+P) zero?). Although I suppose if the typical use were draidP:D, and the remainder of the drives (Ndrives % Drives_per_stripe) were assumed to be spare, that would also satisfy this. But I think folks will more commonly want to specify the Spares, based on their expected failure rate, rebuild time, physical replacement time, and resilver time. I think using any "remainder" drives to increase capacity will be more valuable than increasing spares. Assuming ~100 total drives in ~10 groups of 10, we're talking about losing up to 9% of the total capacity for the "remainder" drives.

draid2:8:1 is what everyone has been using for years

Do you mean that the logical stripe would be 2 parity + 8 data disks (which I think is @thegreatgazoo's notation for draidP:D:S), or 2 parity with 8 disks per group (which is very similar to RAIDZ2 <8 disks> - you specify the amount of parity, and then give it all the disks, which are used for both data and parity)?

@behlendorf (Contributor)

One question I have is whether it is more natural to specify "G=number of parity groups" or "D=number of data devices" for the pool? IMHO, it is more natural to specify the number of data, parity, and spare drives for RAID devices.

@adilger this was my initial inclination as well, likely since it's what I'm used to. What convinced me otherwise is @ahrens and @thegreatgazoo's insight that by specifying G instead of D each group can have a different number of disks internally. This lets us always satisfy the (n - s) % (p + d) == 0 restriction regardless of how many total/parity/spare disks are in the requested dRAID configuration. It may not always be optimal, but it will always work, which is exactly what we want for maximum ease of use.

whether there should be an upper limit on the number of drives in a single RAID group

My feeling on this is we shouldn't strictly impose an arbitrary limit here. Often there are good reasons to want to create a pool with an exotic configuration and that should be allowed. Instead, how about leveraging the existing zpool create/add -f option to warn an admin that they're about to do something potentially unwise (like what happens today when creating a pool with non-uniform redundancy). The warning can clearly explain the consequences of a large group but they're welcome to do it if they so choose.

@ahrens (Member) commented Mar 21, 2018

Instead, how about leveraging the existing zpool create/add -f option to warn an admin that they're about to do something potentially unwise (like what happens today when creating a pool with non-uniform redundancy).

I'm concerned that folks will become even more accustomed to always specifying -f, which disables the (IMO much more dire) warning about redundancy. What would you think about adding a different option for this, instead of overloading -f? We could require the new flag (--allow-wide?) to create wide RAIDZ too - I'm sure folks have done zpool create pool raidz2 <40 drives> when they meant to do zpool create pool raidz2 <10 drives> raidz2 <10 drives> ...

@gmelikov (Member)

We could require the new flag (--allow-wide?)

I think it may be a good balance between "shooting yourself in the foot" and any exotic configurations.

Did you have a different special character in mind, instead of $? Perhaps %

👍

@adilger this was my initial inclination as well, likely since it's what I'm used to. What convinced me otherwise is @ahrens and @thegreatgazoo's insight that by specifying G instead of D each group can have a different number of disks internally. This lets us always satisfy the (n - s) % (p + d) == 0 restriction regardless of how many total/parity/spare disks are in the requested dRAID configuration. It may not always be optimal, but it will always work, which is exactly what we want for maximum ease of use.

👍

@adilger (Contributor) commented Mar 22, 2018

draid2:8:1 matches what everyone has been using for years

Do you mean that the logical stripe would be 2 parity + 8 data disks (which I think is @thegreatgazoo's notation for draidP:D:S), or 2 parity with 8 disks per group (which is very similar to RAIDZ2 <8 disks> - you specify the amount of parity, and then give it all the disks, which are used for both data and parity)?

@ahrens, my understanding is that the original proposal is draidP:G:S, which means specifying draid2:8:1 with 61 disks would result in 8 RAID groups of (61 - 1) / 8 = 7 drives plus 4 extra, maybe 4x 5d+2p and 4x 6d+2p. My proposal is to specify draidP:D:S, so draid2:8:1 specifies the 8d+2p directly, since most users are familiar with this. The remaining drives could still go to a smaller RAID group (with a warning) and/or spares.

I do see the benefit as described by @behlendorf above, that G is potentially easier to use (since it can allow more flexibility in configuring the number of data drives per group), but I suspect many users will get it wrong since RAID devices have always been described with D+P in the past. Using G still doesn't prevent the user from having to do the mental arithmetic of "how many drives are going to be in each RAID group", and I think users are more used to thinking in terms of "what is an acceptable ratio of data to parity going to be?" I guess if we go with G then we can hope that sanity checks will catch any gratuitously odd layouts.
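
For illustration only (a hypothetical helper, not code from this PR), splitting the non-spare drives of a draidP:G:S vdev into G groups whose widths differ by at most one could look roughly like this, reproducing the 61-disk example above:

#include <stdio.h>

static void
draid_group_widths(int n, int p, int g, int s)
{
    int drives = n - s;      /* non-spare drives to distribute */
    int base = drives / g;   /* width of the narrower groups */
    int wide = drives % g;   /* groups that get one extra drive */
    int i;

    for (i = 0; i < g; i++) {
        int width = base + (i < wide ? 1 : 0);
        printf("group %d: %dd+%dp\n", i, width - p, p);
    }
}

int
main(void)
{
    /* draid2:8:1 over 61 disks: prints 4x 6d+2p and 4x 5d+2p */
    draid_group_widths(61, 2, 8, 1);
    return (0);
}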

I'm definitely not a fan of overloading -f, as that is a source of confusion in e2fsprogs as well, and should be avoided. Having a named argument like --allow-wide would be better for wide geometries (say, more than 14D+P), and --allow-narrow for geometries where D <= P or similar. That avoids problems if someone specifies draid2:8:2 with 32 drives and gets 8x 1+2 instead of their expected 3x 8+2. IMHO, it would be good to print a message any time the math does not work out evenly, like:

dRAID: VDEV using 2 RAID groups with 8D+2P and 1 RAID group with 7D+2P, and 2 hot spares

That said, since this is irreversible once the VDEV is added to a pool, a proceed (y/n)? prompt is not unreasonable. In e2fsprogs we added a time-limited prompt for such issues (e.g. mke2fs over an existing filesystem), to require user input for interactive ttys, but time out after 10s for scripts and such - enough for an "oh $#@^!" <CTRL-C> interrupt but it doesn't cause scripts that are not expecting user input to fail.
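
As a rough sketch of that time-limited prompt (my illustration, not the e2fsprogs or zpool code), one could wait a few seconds for input on an interactive tty and fall through for scripts:

#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

/* Returns nonzero to proceed; a timeout counts as proceeding. */
static int
confirm_with_timeout(const char *msg, int seconds)
{
    fd_set fds;
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    char buf[8];

    if (!isatty(STDIN_FILENO))
        return (1);             /* non-interactive: don't block scripts */

    printf("%s proceed (y/n)? [continuing in %ds] ", msg, seconds);
    fflush(stdout);

    FD_ZERO(&fds);
    FD_SET(STDIN_FILENO, &fds);
    if (select(STDIN_FILENO + 1, &fds, NULL, NULL, &tv) <= 0)
        return (1);             /* timed out: proceed */
    if (fgets(buf, sizeof (buf), stdin) == NULL)
        return (1);
    return (buf[0] == 'y' || buf[0] == 'Y');
}

int
main(void)
{
    return (!confirm_with_timeout("dRAID: unusually wide group requested;", 10));
}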

My other concern relates to @ahrens's proposal for the absence of the G (or D) and S parameters: I don't think we should default to 1 RAID group for all the devices if there is a large number of devices. That just gives a bad RAID configuration, and there is no easy way to change ZFS VDEV geometry without a full backup/reformat/restore. I'd suggest that if someone specifies only draid2 or draid2::1 with a long list of drives, we select a reasonable middle-of-the-road RAID configuration (in the range 8+2 to 12+2, plus a few spares (< 10%?)), rather than something useless like 78+2 that will just generate an error.

Computing this should be relatively straightforward; something like the following iterates over possible geometries and tries to find one that fits evenly into the given number of drives (with a reasonable number of hot spares, if unspecified), or minimizes the imbalance between RAID groups:

/* inputs: num = total drives, parity = P, spares = requested S (0 if unspecified) */
if (spares) {
    smin = smax = spares;
} else {
    /* select a reasonable number of hot spares, about one per group */
    smin = max(1, num / 24);
    smax = max(2, num / 16);
}
for (sbest = data = 12; data >= 8; data--) {
    spares = num % (data + parity);     /* drives left over as spares */
    groups = num / (data + parity);
    if (spares < smin)
        continue;
    if (spares > smax) {
        /* too many leftover drives; remember the best layout so far */
        if (spares < sbest) {
            sbest = spares;
            dbest = data;
        }
        continue;
    }
    /* spare count falls in the acceptable range; use this layout */
    sbest = spares;
    dbest = data;
    break;
}

@richardelling (Contributor)

AIUI, making D not be a power of 2 can lead to space wastage, especially when physical block size > 512 or ashift > 9. For large D this wastage can be substantial. This will confuse folks, so at the minimum a warning should be emitted if D is not a power of 2.

Similarly, for large D and ashift > 9, the default volblocksize=8k can be too small. For example, if D=8 and physical block size=4k (ashift=12), a likely common configuration, optimal volblocksize = 32k, not 8k. Using 8k volblocksize results in up to 75% space wastage.
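
As a back-of-the-envelope sketch of that wastage (assuming, for illustration, that a draid allocation pads each block's data out to a full row of D sectors; the real vdev_draid.c accounting may differ):

#include <stdio.h>

static double
waste_pct(int blocksize, int sectorsize, int d)
{
    int data_sectors = blocksize / sectorsize;
    int rows = (data_sectors + d - 1) / d;   /* round up to full rows */
    int allocated = rows * d;

    return (100.0 * (allocated - data_sectors) / allocated);
}

int
main(void)
{
    /* volblocksize=8k, ashift=12, D=8: only 2 of 8 data sectors hold data */
    printf("%.0f%%\n", waste_pct(8192, 4096, 8));    /* prints 75% */
    return (0);
}

With D=8 and 4K sectors, an 8K volume block fills only 2 of the 8 data sectors in a row, which is where the "up to 75%" figure comes from.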

@gmelikov (Member)

@thegreatgazoo does this PR include bug fixes from your internal dev repo (e.g. thegreatgazoo#7)?

@thegreatgazoo (Author)

@gmelikov We don't have an internal repo anymore - this is, and will always be, the latest code. The ticket you mentioned is not fixed in this PR yet, because it'd require work from the metadata allocation class.

@thegreatgazoo (Author)

Thanks for all the comments. Sorry for the late response - I was on vacation last week.

It looks like we've agreed to use % for spare naming, but we still can't all agree on G vs D, or on the default behavior when G/D and/or S is omitted.

While I agree with @adilger that people are more used to specifying D, I think it can be a bit confusing when they end up with some groups having fewer than D data drives.

@ahrens I'm not sure how useful it'd be to allow S=0, which limits rebuild to the write throughput of a single replacement drive, given that raidz resilver is getting faster with the new scan-sort-fix implementation (and resilver has the additional advantage of checksum verification). I thought S, when not specified, should equal P, i.e. enough spare space to handle the maximum number of concurrent failures the parity level can handle.

Next I'll change spare prefix to % and rebase the code to latest ZoL master. Meanwhile we can continue to discuss the interface details.

@ahrens (Member) commented Apr 4, 2018

@adilger

My proposal is to specify draidP:D:S, so draid2:8:1 specifies the 8d+2p directly, since most users are familiar with this.

My point is that most ZFS users are more familiar with specifying raidz2 <10 disks>, whereas you are saying that users are familiar with specifying 8 + 2. So the semantics of D could be confusing, depending on your background (should it be 10 or 8 in this example?).

@ahrens (Member) commented Apr 4, 2018

@richardelling

AIUI, making D not be a power of 2 can lead to space wastage, especially when physical block size > 512 or ashift > 9. For large D this wastage can be substantial

I don't think that's true if you use compression. In that case the size to allocate will be randomized, so no one minimum allocation size will be substantially better than another.

Similarly, for large D and ashift > 9, the default volblocksize=8k can be too small. For example, if D=8 and physical block size=4k (ashift=12), a likely common configuration, optimal volblocksize = 32k, not 8k. Using 8k volblocksize results in up to 75% space wastage.

I agree, we should issue a warning when creating a volume with a volblocksize that results in substantial waste (or a filesystem with recordsize). This also applies to RAIDZ. We could also issue a warning even on mirror and plain disks if enabling compression and volblocksize (or recordsize) is equal to (or maybe not much larger than) 1<<ashift. In your example, if compression is on (which I imagine is the majority of the time), volblocksize=128K would be much better than 32K.

@ahrens (Member) commented Apr 4, 2018

We discussed offline a new way of specifying the draid geometry, which can specify either D (number of data disks) or G (number of groups):
draidPp+Dd:Ss, e.g. draid2p+8d:3s
or
draidPp:Gg:Ss, e.g. draid2p:3g:3s

If D is specified, then the number of specified disks must divide evenly, otherwise we'll give an error. If G is specified, then the number of data disks will be computed based on the number of disks provided, and the number of data disks per group may differ. In this case, any number of disks (at least (1+P)*G+S) may be provided. When displaying the draid vdev configuration back to the user (e.g. in zpool status), we'll print it the same way the user specified it. (However, the initial implementation may not support specifying G, depending on how much time @thegreatgazoo has to devote to this.)
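
A quick sketch of how the two spellings might be parsed (hypothetical, not the eventual zpool code):

#include <stdio.h>

static int
parse_draid_spec(const char *spec, unsigned *p, unsigned *d,
    unsigned *g, unsigned *s)
{
    *d = *g = 0;
    if (sscanf(spec, "draid%up+%ud:%us", p, d, s) == 3)
        return (0);                 /* e.g. draid2p+8d:3s */
    if (sscanf(spec, "draid%up:%ug:%us", p, g, s) == 3)
        return (0);                 /* e.g. draid2p:3g:3s */
    return (-1);                    /* trailing-text validation omitted */
}

int
main(void)
{
    unsigned p, d, g, s;

    if (parse_draid_spec("draid2p+8d:3s", &p, &d, &g, &s) == 0)
        printf("p=%u d=%u s=%u\n", p, d, s);
    return (0);
}

Whichever form the user gave would then be recorded so that zpool status can print it back the same way.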

We'd like to allow any strange configurations (e.g. super-wide), without needing to add new flags (e.g. --allow-wide or --allow-narrow). However, we'll print out a message describing what will happen, including a warning if the configuration is unusual (e.g. super-wide), and possibly sleep for a few seconds to allow the user to hit ^C. We think this is a reasonable compromise since draid will almost always be created with zpool create (rather than zpool add), so if an unintended configuration is created, the message will help them realize this immediately, so they can destroy and recreate the pool while it's still empty.

@ahrens (Member) commented Apr 4, 2018

AIUI, making D not be a power of 2 can lead to space wastage, especially when physical block size > 512 or ashift > 9. For large D this wastage can be substantial

I don't think that's true if you use compression. In that case the size to allocate will be randomized, so no one minimum allocation size will be substantially better than another.

Actually, I think the space wastage will be substantially the same regardless of whether D is a power of 2, for (almost[*]) all reasonable use cases.

If you're storing already-compressed data, then you wouldn't use ZFS compression. But such data is typically accessed sequentially, so you should use a large block size. With recordsize=1M, sector size=4K (ashift=12), and 9 data disks per group, the maximum waste is 3%. I think we should consider changing the default recordsize to 1MB for DRAID.
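
One way to arrive at that 3% figure (assuming the same pad-to-a-full-row-of-D-sectors model as above; the exact accounting may differ):

\[
\text{max waste} \approx \frac{D - 1}{\text{recordsize}/\text{sector size}} = \frac{9 - 1}{1\,\text{M} / 4\,\text{K}} = \frac{8}{256} \approx 3\%.
\]

That is, padding to a multiple of 9 data sectors costs at most 8 skip sectors against roughly 256 data sectors per 1M block.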

If your data is not already compressed, then you should use ZFS compression. In this case the size to allocate will be randomized, so no one minimum allocation size will be substantially better than another.

If you are using record-structured data like a typical database or zvol, the consumer's access size is likely 8K or less. Matching the ZFS recordsize or volblocksize to this small of an access size with DRAID (or RAIDZ) is almost always going to waste lots of space, regardless of the number of disks per group. ([*] Caveat: with ashift=9, compression=off, D=4, 8, or 16 would be reasonable and an improvement over non-powers-of-two. But ashift=9 drives are increasingly rare and small compared to ashift=12 drives.)

@richardelling (Contributor)

I like the convergence on draid config format.

Regarding compression and recordsize/volblocksize, there is also the threshold for usefulness. Currently, a short-term global setting is proposed in #7324, and this will likely work for most folks. It is not clear to me if this is worth the effort to make it a dataset property. In any case, it can be handled as a separate effort.

Signed-off-by: Isaac Huang <[email protected]>
Changed dRAID spare vdev prefix from '$' to '%'. Fixed a few build and style warnings. Fixed rebuild status report (/issues/10).

Signed-off-by: Isaac Huang <[email protected]>
@thegreatgazoo (Author)

My next actions:

  • Minor fixes: build errors, a few issues already reported
  • Add dRAID support to ztest
  • Clean up code duplication with vdev_removal.c
  • Integrate draidcfg with zpool using vdev format draidPp+Dd:Ss, e.g. draid2p+8d:3s

@behlendorf added the Status: Work in Progress (Not yet ready for general review) label May 16, 2018
@ahrens added the Type: Feature (Feature request or new feature) label Jun 7, 2018

copies = mirror ?
vd->vdev_nparity + 1 : vd->vdev_nparity + cfg->dcf_data;
groups_per_perm = (vd->vdev_children - cfg->dcf_spare + copies - 1)
Review comment (Contributor):

If the most likely case is not draid-mirror, then this calculation is better relocated below the assert, just prior to its use.

@adilger (Contributor) commented Sep 6, 2018

@thegreatgazoo - it looks like Metadata Allocation Class PR #5182 just landed. Any chance you will have time to rebase this patch?

@behlendorf added the Status: Revision Needed (Changes are required for the PR to be accepted) label and removed the Status: Work in Progress (Not yet ready for general review) label Sep 25, 2018
@behlendorf (Contributor)

@don-brady will be opening a new PR with a rebased version.

@behlendorf (Contributor)

Refreshed version in #8016

@behlendorf closed this Oct 12, 2018
@ofaaland mentioned this pull request Nov 6, 2019