Metadata Allocation Classes #5182
Conversation
mirror required? I tried
cmd/zdb/zdb.c (outdated)
if (vd->vdev_mg->mg_class[SPA_CLASS_DDT] != NULL) {
(void) strncat(bufptr, VDEV_CLASS_DDT",", buflen);
buflen -= sizeof (VDEV_CLASS_DDT);
This is decrementing correctly because sizeof(VDEV_CLASS_DDT) includes the trailing NUL, which makes up for the comma that you are not counting, right?
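For concreteness, a tiny standalone illustration of that accounting. The expansion of VDEV_CLASS_DDT below is hypothetical, chosen only to make the counts concrete; the point is that sizeof() on a string-literal macro counts the trailing NUL, which stands in for the appended comma.

#include <assert.h>
#include <string.h>

#define VDEV_CLASS_DDT "ddt"	/* hypothetical expansion, for illustration only */

int
main(void)
{
	/*
	 * sizeof (VDEV_CLASS_DDT)     == 4: 'd', 'd', 't', '\0'
	 * strlen (VDEV_CLASS_DDT ",") == 4: 'd', 'd', 't', ','
	 * so buflen -= sizeof (VDEV_CLASS_DDT) also accounts for the comma.
	 */
	assert(sizeof (VDEV_CLASS_DDT) == strlen(VDEV_CLASS_DDT ","));
	return (0);
}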
cmd/zdb/zdb.c (outdated)
if (vd->vdev_mg->mg_class[SPA_CLASS_DDT] != NULL)
(void) strcat(class, VDEV_CLASS_DDT",");

if (strlen(class) == 1)
Maybe this, for consistency with later code (and avoiding calling strlen() twice)? if (class[1] != '\0')
cmd/zdb/zdb.c (outdated)
@@ -745,6 +745,9 @@ dump_metaslab_stats(metaslab_t *msp)
dump_histogram(rt->rt_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
If you are going to add a copyright line in other places, maybe add one to zdb.c as well?
cmd/zpool/zpool_vdev.c (outdated)
{"OI COMSTAR ", 8192},
{"SUN COMSTAR ", 8192},
{"NETAPP LUN ", 8192},
{"OI COMSTAR ", 8192},
This looks wrong. If the GitHub diff view is showing this correctly, you're improperly replacing spaces with tabs here.
Capacity regression? When creating a pool, even without any allocation classes, about 4 % of available capacity is lost:
Plain ZFS (commit
This (commit
Free space accounting. Since I suppose there is some issue here, an idea that perhaps could work - but I've not tested it on this patch (and it is late night here...). The approach would be a minimalistic change from plain ZFS accounting:
One drawback might be the slight user confusion that the capacity reported (by e.g.
I have looked over the changes, and one thing worries me rather deeply: if I understand things correctly, the amount of space used for the different kinds of data (reg, log, mos, ddt, dmu) is recorded permanently in the space maps? While being nice for monitoring purposes, it means that the mapping of any kind of block would forever be tied to its currently assigned class? Otherwise, the accounting when freeing items will be done in the wrong class. In particular, would it not also cause problems when adding metadata classes to an existing pool (that has not had them before)?
I can see that it is nice to have some way of knowing how much space data of each kind is eating on each vdev. One way would be that e.g. scrub accumulates that information. It would not be exact, and generally a bit out of date, but it would give a good enough hint. Perhaps even a reduced scrub that does not check all file data, just walks the metadata structures; it would be fast if all of that is on solid-state media. This could be a later improvement then?
@don-brady I am thinking of testing some things with your patch, do you perhaps already have an update of the patch that solves some of the merge conflicts?
It's time to get this rebased over abd :-)
…ata. Also include blocks with level > 0 in metadata category (from openzfs#5182).
Example: zpool set "rotorvector=ssd<=meta:4;mixed<=64;123,hdd" <poolname>
pure ssd drive takes metadata <= 4 kB
mixed (mirror) takes data (or metadata) <= 64 kB
others (hdd) takes remainder
Example II: zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>
pure ssd drive takes metadata <= 128 kB and data <= 4 kB
mixed (mirror) takes data <= 64 kB [this metadata already taken by ssd]
others (hdd) takes remainder
Thanks @don-brady for the update! What is the reason for requiring dedicated VDEVs to be mirrors?
Attached are graphs of measurements for some different configurations:
(green) PAC patched ZFS, 10 GB HDD + 1 GB SSD dedicated metadata
The test program fills the filesystem on the pool in steps where it first deletes ~1 GB of files and then creates 1 GB (while trying to add 2 GB). Thus, after a few iterations it runs the pool full (vertical dashed line), and then 'ages' it by trying to cause more and more fragmentation. The files are of random size with an exponential distribution (many small ones), averaging 200 kB, and there are on average 20 files per directory in the varying tree, also favouring few. The pool is exported and imported between each particular test of each iteration. The segregated tests reach the filled state a bit earlier, as the available space is reduced by what is set aside for segregated blocks. Findings in these tests:
@don-brady I have looked through the code, and have some questions and comments:
man/man5/zpool-features.5 (outdated)
.sp
.ne 2
.na
\fB\fBallocation classes\fR\fR
Missing _, should be allocation_classes.
got it
man/man5/zpool-features.5 (outdated)
This feature becomes \fBactive\fR when a dedicated allocation class vdev
(dedup or special) is created with zpool create or zpool add. With device
removal, it can returned to the \fBenabled\fR state if all the top-level
s/can/can be/
got it
man/man8/zpool.8 (outdated)
Likewise, the indirect blocks for user data can be excluded from the special
class by setting the
.Sy zfs_user_indirect_is_special
zfs module parameter to false (0).
I'd suggest dropping this paragraph entirely from the zpool man page. This isn't something we expect most users of the feature to need to tune, just like the other module options, and it distracts from the more relevant special_small_blocks property discussion. It would be better to include a more detailed explanation of zfs_ddt_data_is_special and zfs_user_indirect_is_special in zfs-module-parameters.5 for advanced users wanting to tune these module parameters.
thanks, I'll drop it then.
Rebased today and switched ZTS alloc_class tests to not require partitions.
Merged with parallel allocation feature and rebased to latest master branch.
module/zfs/spa_misc.c (outdated)
/ 100;
uint64_t filedata = metaslab_class_get_filedata(special) + size;

if (filedata <= limit)
this conditional seems like a nop
Updated the branch to have the allocation classes feature without the class block stats (stats were only visible from zdb).
I'm a little late on this, sorry for that. This feature, in the end, boils down to just giving the allocator a hint about on which vdev(s) it should examine the metaslabs to get the free space needed for a new block of a certain data class.
The current inability to later add/remove the dedication to a special allocation class after zpool create/vdev addition is IMHO a huge problem and quite a limitation, given that we can neither shrink nor remove a vdev. Having the setting fixed (and eternal), carved into stone at zpool/vdev creation, would make this feature a nightmare to use (at least in deployment scenarios with changing requirements). Thus these hints to the allocator need IMHO to be runtime configurable, preferably through vdev properties that would enable something like
So should this, for the start, be only settable at zpool create/add: please build it in a way that it can be transparently upgraded later (with vdev properties, or however they'll be called) to be configurable, as flexibly as possible, at runtime. I would prefer if it were flexible from the start.
This feature should also not make the pool read-only on zfs versions without it.
@GregorKopka Thanks for your feedback. Let me try and address some of your concerns.
The VDEV bias (aka its dedication), as you noted, is essentially an allocation policy for that device. There is nothing in this design that would preclude a future change that would allow the user to add/change the allocation bias associated with a given VDEV. A follow-up PR could certainly accommodate a
I'm not sure I agree that this feature should not make a pool read-only for implementations that are allocation class unaware. For example, consider a pool provisioned to have a large generic raidz vdev and a much smaller special SSD to hold the metadata for the pool. Importing said pool as writeable under an allocation-class-unaware implementation would potentially fill up the SSD with file data and in essence undo the intended isolation of the metadata on that device.
A feature should not break existing functionality; this implementation will disable zpool split.
@@ -25,6 +25,7 @@
* Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
* Copyright 2013 Saso Kiselkov. All rights reserved.
* Copyright (c) 2017 Datto Inc.
* Copyright (c) 2017, Intel Corporation.
The copyright either contains a superfluous comma or aims to extend to the end of time.
All previous Intel commits used this format, as have those of other companies. I cannot say that it's right or wrong, but it has been used in past commits to our repo.
@@ -140,6 +141,14 @@ struct vdev_queue {
kmutex_t vq_lock;
};

typedef enum vdev_alloc_bias {
Having this as an enum allows only one option.
It would make more sense to implement this as a reasonably future-proof sized bitfield, allowing one to directly configure which kinds of metadata are to be allocated from that special vdev.
This would completely remove the need to determine whether DDT is metadata or not through a module parameter.
It would also ease later expansion of the feature to e.g. differentiate between different kinds of data blocks with finer granularity. Just one example: allocating only the 1st copy of a metadata block on a fast special device (for speed), while any additional copies allocate either from other special vdevs that are slower but optimized for small blocks, or from the normal vdevs (for extremely cheap redundancy).
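As a minimal illustration of this suggestion (names and categories are hypothetical, not code from this PR): each block category gets a bit, and a vdev's bias becomes a mask instead of a single enum value.

#include <stdint.h>

/* One bit per block category; a vdev can then accept any combination. */
typedef enum vdev_alloc_flags {
	VDEV_ALLOC_METADATA	= 1 << 0,	/* filesystem metadata */
	VDEV_ALLOC_DDT		= 1 << 1,	/* dedup table blocks */
	VDEV_ALLOC_SMALL_DATA	= 1 << 2,	/* small file data blocks */
	VDEV_ALLOC_DATA		= 1 << 3	/* regular file data */
} vdev_alloc_flags_t;

/* Example: a special vdev that accepts metadata and DDT blocks, extensible later. */
static const uint64_t example_vdev_bias = VDEV_ALLOC_METADATA | VDEV_ALLOC_DDT;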
The vdev bias is stored on-disk as a string. The enum is just for runtime use.
@don-brady Thanks for taking the time to answer - and for working on this in the first place.
Regarding the read-only: I have always seen the feature flags as a way to protect the on-disk state against implementations that would potentially destroy data by not understanding the new structure - as a solution to a technical problem. In case an implementation just tacks additional data onto existing structures that will safely be ignored by non-aware implementations, there should IMHO be no reason to apply this where it isn't actually needed.
You have a point with importing an allocation class aware pool on an unaware system leading to data being spread to places it wouldn't end up in otherwise. But I think that point is moot, as executing a vdev removal would do exactly that: spread the contents of the to-be-removed vdev over the pool to whatever places currently have big spans of free space - at least in the current incarnation, where just the device to be removed can be specified (with the rest of the pool being the target to search for free room), regardless of the problem with raidz vdevs. And regardless of device removal being available at the moment, executing it once on a special device would be orders of magnitude worse than mounting the pool on an allocation class unaware system (e.g. to make a snapshot and send stuff away, something that would most likely only happen should there be problems with the pool or the system it was originally running on), as removal would move at-rest data to new places while the mounting would only affect newly written data. Bottom line on this: please don't close a potential disaster recovery path should a technical reason for this not exist.
Regarding the expandability: I suggest structuring the on-disk control data in a way that makes extension easy. With that I mean allowing multiple classes to be specified per vdev in the first place - see the comment in my review regarding the enum. This would solve the issues I have seen (between the lines of some comments in this discussion) where people have different thoughts about what actual classes of data should count as 'metadata' (or combine with each other).
Allocation Classes adds the ability to have allocation classes in a pool that are dedicated to serving specific block categories, such as DDT data, metadata, and small file blocks. A pool can opt in to this feature by adding a 'special' or 'dedup' top-level VDEV.
Signed-off-by: Don Brady <[email protected]>
@don-brady thanks for refreshing this and incorporating all the feedback. This PR is ready to go. I put it through a final round of comprehensive testing and wasn't able to uncover any problems. @GregorKopka thanks for taking an interest and making the time to review and comment. Your core concerns have been addressed in the finalized version. See the comment above, but to summarize.
Future work:
Codecov Report
@@ Coverage Diff @@
## master #5182 +/- ##
==========================================
+ Coverage 78.51% 78.57% +0.06%
==========================================
Files 376 376
Lines 113561 114001 +440
==========================================
+ Hits 89159 89577 +418
- Misses 24402 24424 +22
Continue to review full report at Codecov.
If you have a special allocation class for metadata, are the L1/L2/...etc indirect blocks stored in the separate device separated for metadata?
Yes. From the output above I can see that you have a pool with a geometry roughly like:
(Obviously I'm guessing raidz, but I'm pretty confident about the mirror.) You can tell because the L2 and L1 blocks, and the dataset on the top row, begin the DVAs with a prefix of
Yes, you are pretty close; I have:
So the prefix will basically tell me which disk the data sits on? When you guessed 5 was "special", would dedup be 6, etc.? How do you get more precise information? @DeHackEd
This is going off topic into mailing list territory. I'm going to answer this specific question and then stop.
If I use the metadata allocation classes and run a zpool upgrade, will my old pools migrate to this new metadata allocation class, or do I need to remake all my pools to take advantage of this new feature?
The usual. Either
Is there any way to see per-vdev
ZFS Allocation Classes [WIP]
Allocation Classes
Allocation classes can be thought of as allocation tiers that are dedicated to specific block categories. Now, in addition to general data (normal class) and Intent Log data (log class), we introduce a special class (metadata, DDT, and optionally small file blocks) and a dedup class (dedup data only).
Feature Flag Encapsulation
The feature@allocation_classes becomes active when a special or dedup allocation class is instantiated for a VDEV. Activating this feature makes the pool read-only on builds that don't support allocation classes.
Allocation Class Granularity
The class allocation bias occurs at a top-level VDEV granularity. Each top-level VDEV now has an allocation bias.
Special VDEVs
All metaslabs in the VDEV are dedicated to the special allocation class category. A pool must always have at least one general (non-specified) VDEV when using special VDEVs. Opt in using the special VDEV class designation when creating the VDEV.
Dedup VDEVs
All metaslabs in the VDEV are dedicated to the dedup allocation class category. A pool must always have at least one general (non-specified) VDEV when using dedup VDEVs. Opt in using the dedup VDEV class designation when creating the VDEV.
Example Syntax:
zpool create demo raidz <disks> special mirror <disks>
zpool add demo dedup <disk>
New Ditto Block Policy
When there is only one VDEV available and more than one DVA is required (ditto copies > 1), the traditional ditto placement policy was to place the allocation a distance of 1/8th of the total vdev allocation space away from the other DVAs. This policy has been simplified to a guarantee that the other DVAs land in a different metaslab. To validate that the new policy is honored, a zdb(8) block audit will also report any DVAs that landed in the same metaslab (expecting none). If there is a policy failure it will be manifest as follows in zdb output:
Dittoed blocks in same metaslab: 21
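As a rough sketch of the check this policy implies (not the actual zdb code), assuming a DVA reduces to a (top-level vdev id, byte offset) pair and each vdev is carved into metaslabs of 1 << ms_shift bytes:

#include <stdint.h>

typedef struct dva_sketch {
	uint64_t dva_vdev;	/* top-level vdev id */
	uint64_t dva_offset;	/* byte offset within that vdev */
} dva_sketch_t;

/* Two DVAs violate the policy only if they share both the vdev and the metaslab. */
static int
dittos_share_metaslab(const dva_sketch_t *a, const dva_sketch_t *b, int ms_shift)
{
	if (a->dva_vdev != b->dva_vdev)
		return (0);
	/* metaslab index = offset / metaslab size */
	return ((a->dva_offset >> ms_shift) == (b->dva_offset >> ms_shift));
}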
VDEV Changes
Classes show up in run-time VDEV instance as an allocation bias:
This vdev allocation class bias is stored in the per-vdev zap object as a string value:
The bias is also passed internally in the pool config nvlist during a pool create and any internal pool config query. This is used by functions in the zpool(8) command.
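A rough sketch of that plumbing, assuming libnvpair; the "alloc_bias" key name is illustrative and may not match the one used by this PR:

#include <libnvpair.h>

/* Attach a bias string ("special" or "dedup") to a vdev's config nvlist. */
static int
set_vdev_bias(nvlist_t *vdev_nv, const char *bias)
{
	return (nvlist_add_string(vdev_nv, "alloc_bias", bias));
}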
Observing Allocation Classes
There are several ways to observe aspects of allocation classes using the zpool(8), zdb(8) and kstat
ZPOOL
ZDB
ZDB has been adapted to accommodate allocation class information:
zdb -b <pool> -- class summary and ditto policy (shows any same-vdev ditto blocks that landed in the same metaslab).
-- class summary and ditto policy (shows any same-vdev ditto blocks that landed in the same metaslab).ZTEST Coverage
Caveats and TBD