Rotor vector allocation (small records favour SSD) #4365
Conversation
I believe the metadata allocation classes are a more comprehensive approach. #3779
@richardelling I agree that metadata allocation classes are a more comprehensive approach than this 'hack'. I was not able to find any pointers to code, however, in the issue ticket or the slides? Another question would be whether the metadata allocation classes can also be told to store small files, as that would be very useful when e.g. running some ...
Updated rotor vector implementation:
(Since there were no comments on the earlier code as such, I replaced it with this new version, to avoid cluttering the history.)

The earlier patch did not behave well at all when the pool got full. This is now fixed by doing the mc_{alloc,deferred,space,dspace} accounting in metaslab_class by rotor vector category. When their values are requested from elsewhere, the first three behave as before and are simply summed up. When several categories exist, the total dspace value returned is adjusted such that when the pool is approaching full (> 75%), the value mimics the behaviour as if all categories had the same fill ratio as the most full category. This will make zfs report disk full to user applications in a controlled fashion, in the normal manner. When < 25%, the full capacity is reported, and in between a sliding average is used (a sketch of this adjustment is given below). This actually also makes dd(1) return sensible numbers. :)

I believe this is the correct way to go: if one has e.g. one small SSD for small allocations and large HDDs for the rest, one does not want the allocations to spill over onto the other kind of device as the pool becomes full. After the user has freed some space, the misdirected allocations would stay where they were written, thereby hampering performance for the entire future of the pool. Better then to report full a bit earlier, and retain performance.
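For concreteness, here is a minimal sketch of the dspace adjustment just described. All names are invented for illustration, and the exact placement of the 25%/75% thresholds is my reading of the description (applied here to the fill ratio of the most-full category), not the patch's actual code:

```c
#include <stdint.h>

/*
 * Sketch only: report a total dspace that shrinks as the most-full rotor
 * category fills up, so the pool looks full before any one category is
 * completely exhausted.
 */
static uint64_t
rotor_adjusted_dspace(const uint64_t *alloc, const uint64_t *dspace, int ncat)
{
	uint64_t total_alloc = 0, total_dspace = 0;
	double worst = 0.0;	/* fill ratio of the fullest category */

	for (int i = 0; i < ncat; i++) {
		total_alloc += alloc[i];
		total_dspace += dspace[i];
		if (dspace[i] != 0) {
			double fill = (double)alloc[i] / (double)dspace[i];
			if (fill > worst)
				worst = fill;
		}
	}
	if (ncat <= 1 || worst == 0.0)
		return (total_dspace);		/* behave as before */

	/* Capacity as if every category were as full as the fullest one. */
	uint64_t limited = (uint64_t)((double)total_alloc / worst);

	if (worst <= 0.25)	/* plenty of room: report full capacity */
		return (total_dspace);
	if (worst >= 0.75)	/* approaching full: report limited capacity */
		return (limited);

	/* Sliding blend between the two regimes for fill ratios 25%..75%. */
	double w = (worst - 0.25) / 0.50;
	return ((uint64_t)((1.0 - w) * (double)total_dspace +
	    w * (double)limited));
}
```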
Measurements: (green) patched ZFS, 10 GB HDD + 1 GB SSD partitions (10.1 GB when testing the HDD alone, as the full tests with the rotor vector in action used ~70 MB on the SSD in addition to 9.7 GB on the HDD).

The test program fills the filesystem on the pool in steps: it first deletes ~1 GB of files, and then creates 1 GB (while trying to add 2 GB). Thus, after a few iterations, it runs the pool full (vertical dashed line), and then 'ages' it by trying to cause more and more fragmentation. The files are of random size, with an exponential distribution (many small ones), averaging 200 kB. There are on average 20 files per directory in the varying tree, also favouring few. The pool is exported and imported between each particular test of each iteration.

In all the measured operations, the patch makes the addition of a small SSD to an HDD (array) perform much better than the HDD alone. Some measurements have several curves as I did several test runs.
Generalisation of the technique: It would be a quite simple expansion of the patch to include more categories of vdevs, from faster to cheaper: pure SSD mirror, ... Thus a user can choose a configuration that gives the best value with the available resources, and for larger systems even a performance-optimised multi-tiered storage within the same pool becomes possible. The above categories can be auto-identified by zfs without any hints. It would still be a user option (module parameters) to control above which record sizes to choose cheaper vdevs.

Todo:
For the performance increase possible, I think the patch is really small, so I am hoping to spur some feedback :-)
Updates:
Note: the assignment of vdevs to rotor categories is made on pool import. Except for the configuration pool property, there is no change to the storage format. The rotor categories can thus be easily changed, and take effect for any new allocations after an export/import cycle.
More measurements. Same procedure and devices used as above.
The plain HDD and SSD configurations are for comparison / setting the scale of the figures. It is nice to note that the green and blue curves, which are two different approaches but should be dividing the blocks in the same way between the devices, indeed show very similar behaviour. Since SSDs are expensive and the purpose of the method is to use them 'wisely', some variations on block assignment schemes are tried. Except for the (delete)+write graphs, they are normalised to the total amount of data stored. Still, it matters how much SSD is used per HDD; in these tests (numbers are in GB, from zpool iostat):
The numbers for metadata_classes_wip are a bit odd. It reports full for a smaller amount of data stored. Actually, the capacity of both the SSD and HDD partitions is reported ~5% smaller after pool creation than normally. It also seems to use more space for the metadata. The latter issue may have something to do with the pool creation not accepting a single metadata disk but requiring a mirror, where I then detached a placeholder partition to get a comparable setup. I am aware that it is a very fresh commit, so I will try to update this post when the issue is resolved / explained. The slightly smaller available space possibly also explains the earlier drop of the blue curve in the find graph. With a few exceptions, most curves are very similar. Outliers:
It is interesting to know the metadata overhead of files and directories in order to choose a suitable vdev (SSD) size. To get numbers for that, I created a number of dummy files, either directly in a zfs filesystem or within directories under that, and recorded the allocated capacity on the metadata vdev, as well as the remaining allocated capacity after the filesystem was purged (...). The resulting metadata size (as used on an SSD, 512 byte blocks) is then assumed to be a sum of five parameters (they were solved for using linear least squares fitting):
There seems to be a rather hefty penalty for long filenames; the measurements did not show any gradual change below 50 characters. For each file of non-zero size, there was one block written to the vdev holding data; this started at a file size of 1 byte. A test of copying my home filesystem to a meta+data pool gives 94.8M of metadata for 29.0G of data, for 263110 files and 29175 directories, or 360 bytes/file. In addition to the above comes the overhead of block pointers for each block of large files. This should be 128 bytes/block; with the default maximum of 128 kB/block, this is 1 ‰, or 1 GB per TB. The measurement data is included below. The column residual tells how well the measured value matches the estimate using the parameters above.
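As a quick check of the block-pointer overhead figure quoted above (assuming 128-byte block pointers and the default 128 KiB maximum block size):

```latex
\frac{128\ \mathrm{B}}{128\ \mathrm{KiB}} = \frac{128}{131072} \approx 0.98 \times 10^{-3} \approx 1\ \text{per mille}
\quad\Rightarrow\quad \approx 1\ \mathrm{GB\ of\ block\ pointers\ per\ TB\ of\ data}
```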
High-level questions (apologies if these are addressed already, feel free to point me elsewhere):

How is this behavior controlled? Is user intervention required?
What is the "rotorvector" pool property? Is it intentionally not documented in the zpool.8 manpage?
How do we determine if a disk is SSD vs HDD?
How do we determine which allocations to place on SSD?
What happens if one class gets full?
How is all of this communicated to the user (e.g. how can they tell which devices are SSD vs HDD, and how can they tell how full each type is)?
Does this provide any additional functionality beyond the "vdev allocation classes" project (#3779) that @don-brady is working on?
Only mirrors are mixed. If a pool consists of several vdevs, it is mixed if all vdevs are either mixed or fully nonrotational. Do not mark it as mixed when it is fully nonrotational.
Preparation for selecting metaslab depending on allocation size. Renaming of mc_rotor -> mc_rotorv and mc_aliquot -> mc_aliquotv just to make sure all references are found.
Todo: The guid list should not have a fixed length but be dynamic.
Format: [spec<=limit];spec with spec being a comma-separated list of vdev guids and generic type specifiers: ssd, ssd-raidz, mixed, hdd, and hdd-raidz. limit gives, in KiB, the largest allocations that are allowed within that rotor. The last rotor has no limit. (And vdevs which are not matched by the guids or the generic types are placed in the last rotor.)

Example: ssd<=4;mixed<=64;123,hdd

Here, allocations less than 4 kbytes are allocated on ssd-only vdev(s) (mirror or not). Allocations less than 64 kbytes end up on ssd/hdd mixed (mirrors, as such a raidz makes no sense). Other allocations end up on the remaining disks. 123 represents a vdev guid (placing an explicit vdev guid in the last rotor makes little sense though).

Possibly, the configuration should be split into multiple properties, one per rotor, and limits separate from types. The compact format does have advantages too...
In a pool that consists of e.g. a small but fast SSD-based mirror and a large but long-latency HDD-based RAIDZn, it is useful to have the metadata, as well as very small files, stored on the SSD. This is handled in this patch by selecting the storage based on the size of the allocation. This is done by using a vector of rotors, each of which is associated with the metaslab groups of one kind of storage. If the preferred group is full, attempts are made to fill slower groups. Better groups are not attempted - the rationale is that an almost full filesystem shall not spill large-size data into the expensive SSD vdev, since that will not be reclaimable without deleting the files. Better then to consider the filesystem full when the large-size storage is full. One can also have e.g. a 3-level storage: mirror SSD for really small records, mirror HDD for medium-size records and raidzn HDD for the bulk of data. Currently, 5 rotor levels can be set up.

** The remainder of the commit message is for an earlier incarnation. Numbers should be representative nonetheless. See the PR for more up-to-date measurements. **

Some performance numbers: Tested on three separate pools, each consisting of a 20 GB SSD partition and a 100 GB HDD partition, from the same disks. (The HDD is 2 TB in total.) SSD raw reads: 350 MB/s, HDD raw reads: 132 MB/s. The filesystems were filled to ~60 % with a random directory tree, each directory with random 0-6 subdirectories and 0-100 files, maximum depth 8. The file size was random 0-400 kB. The fill script was run with 10 instances in parallel, aborted at ~the same size. The performance variations below are much larger than the filesystem fill differences.

Setting 0 is the original 7a27ad0 commit. Settings 8000 and 16000 are the values for zfs_mixed_slowsize_threshold, i.e. the size below which data is stored using rotor[0] (nonrotating SSD), instead of rotor[1] (rotating HDD).

** Note: the current patch does not use zfs_mixed_slowsize_threshold or fixed rotor assignment per vdev type. **

                  Setting 8000   Setting 16000   Setting 0
                  ------------   -------------   ------------
  Total # files   305666         304439          308962
  Total size      75334 kB       75098 kB        75231 kB

As per 'zpool iostat -v':

  Total alloc     71.8 G         71.6 G          71.7 G
  SSD alloc       3.34 G         3.41 G          3.71 G
  HDD alloc       68.5 G         68.2 G          68.0 G

Time for 'find' and 'zpool scrub' after a fresh 'zpool import':

  find            5.6 s          5.5 s           42 s
  scrub           560 s          560 s           1510 s

Time for serial 'find | xargs -P 1 md5sum' and parallel 'find | xargs -P 4 -n 10 md5sum' (only first 10000 files each):

  -P 1 md5sum     129 s          122 s           168 s
  -P 4 md5sum     182 s          150 s           187 s
  (size summed)   2443 MB        2499 MB         2423 MB

---

** Some reminders about squashed fixes: **

Must decide on the rotor vector index earlier, in order to do space accounting per rotor category. Set the metaslab group rotor category at the end of vdev_open(). Then we do it before the group is activated. Can then get rid of metaslab_group_rotor_insert/remove also.

Fixups: Moving the metaslab_group_set_rotor_category() up. ztest did an (unreproducible) failure with a stack trace of spa_tryimport() -> ... vdev_load() -> ... vdev_metaslab_init() -> metaslab_init() -> ... metaslab_class_space_update() that failed its ASSERT. Inspection showed that vdev_metaslab_init() would soon call metaslab_group_activate(), i.e. we need to assign mg_nrot. (Hopefully, vdev_open was called earlier...?)

Assign mg_nrot no matter how vdev_open() fails. In dealing with yet another spa_tryimport failure.
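To make the selection logic above concrete, here is a hypothetical sketch (names and types are invented, not the patch's actual code) of choosing a rotor index from the allocation size and per-rotor size limits, spilling only towards slower categories, i.e. the stricter policy described in this commit (a later commit relaxes it):

```c
#include <stdint.h>

/*
 * Sketch only: limit[i] is the largest allocation (bytes) admitted by
 * rotor i; the last rotor has no limit.  free_space[i] is the space
 * still available in that rotor's metaslab groups.
 */
static int
rotor_select(const uint64_t *limit, const uint64_t *free_space,
    int nrotors, uint64_t alloc_size)
{
	/* Pick the fastest rotor whose size limit admits this allocation. */
	int preferred = nrotors - 1;
	for (int i = 0; i < nrotors - 1; i++) {
		if (alloc_size <= limit[i]) {
			preferred = i;
			break;
		}
	}

	/*
	 * If it is full, spill only towards slower rotors, never back to
	 * faster (more expensive) ones.
	 */
	for (int i = preferred; i < nrotors; i++) {
		if (free_space[i] >= alloc_size)
			return (i);
	}
	return (-1);	/* treat the pool as full (ENOSPC) */
}
```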
Instead of refusing to use better rotor categories than the selected one at all, try them when all others have failed. The implementation is not very pretty, and the better ones should be tried in reverse order. At least it gets a bit further through ztest.
Adjust the dspace reported from metaslab_class such that we look full when one rotor vector category (e.g. SSD or HDD) becomes full. This is because large content intended for the HDD should not spill into the SSD (quickly filling it up prematurely), and we also do not want to spill small content that should be on the SSD onto the HDD.
metadata. Also include blocks with level > 0 in the metadata category (from PR 5182).

Example: zpool set "rotorvector=123,ssd<=meta:4;mixed<=64;hdd" <poolname>

Pure ssd (and explicit vdev guid 123) drives take metadata <= 4 kB. Mixed (mirror ssd+hdd) takes data (or metadata) <= 64 kB. Others (hdd) take the remainder.

Example II: zpool set "rotorvector=ssd<=meta:128,4;mixed<=64;123,hdd" <poolname>

Pure ssd drives take metadata <= 128 kB and data <= 4 kB. Mixed (mirror) takes data <= 64 kB (this metadata is already taken by the ssd). Others (hdd, and explicit vdev guid 123) take the remainder.
The random assignment is a dirty hack. The reason for this is that it has to be set before the device open finishes, or the rotor vector index will be assigned based on whatever initial nonrot value the device has.
…e or mixed. Keep track of mixed nonrotational (ssd+hdd) devices. Only mirrors are mixed. If a pool consists of several vdevs, it is mixed if all vdevs are either mixed or ssd (fully nonrotational). Pass media type info to the zpool cmd (mainly whether devices are solid state or rotational). Info is passed in ZPOOL_CONFIG_VDEV_STATS_EX -> ZPOOL_CONFIG_VDEV_MEDIA_TYPE.
For the time being, abusing the -M flag (of the previous commit).
Thanks @ahrens for the quick good questions!
Thanks! Doing that sorted some explanations out; I added a commit. I have anyhow answered the questions below too, as that might be easier to discuss. (I called a class a "category" in an attempt not to confuse it with #3779/#5182 while discussing.)
It sets both which vdevs belong to each class (or category) and which blocks are eligible for each category. (I suppose it should eventually use a feature@ property?)
Completely through the "rotorvector" property. No other tunables. Removing/clearing this setting makes allocations behave as usual. (Currently, it is necessary to perform an export-import cycle to make a new/cleared setting take effect, as I have not figured out how to repopulate the metaslab / what locking is required.)
Yes, setting the property.
This information was already available from the kernel in zfsonlinux, so I used that. Explicit vdev guids can also be given in the rotorvector property. If possible, it would be good if one could optionally set a category (rotor vector) index associated with each vdev, which would take precedence over any generic assignment by the pool property.
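For reference, on Linux the rotational/non-rotational distinction is also visible in sysfs (the patch itself presumably reads the kernel's nonrot flag on the vdev; this user-space sketch with an invented helper name just shows where the same information lives):

```c
#include <stdio.h>

/* Returns 1 for rotational (HDD), 0 for non-rotational (SSD), -1 on error. */
static int
is_rotational(const char *blkdev)	/* e.g. "sda" */
{
	char path[256];
	int rot = -1;

	snprintf(path, sizeof (path),
	    "/sys/block/%s/queue/rotational", blkdev);

	FILE *f = fopen(path, "r");
	if (f == NULL)
		return (-1);
	if (fscanf(f, "%d", &rot) != 1)
		rot = -1;
	fclose(f);
	return (rot);
}

int
main(void)
{
	printf("sda rotational: %d\n", is_rotational("sda"));
	return (0);
}
```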
By the kind of the allocation, and its size. (It currently uses the compressed size, which I think is not very good; it may lead to vdev hopping for data of compressed files whose compression ratio makes the size straddle a cut-off point.)
This I think is the most important question for any approach to this kind of feature. With this approach, the pool is treated as full when one category is full (but not really exhausted, as usual). This is done by modifying the dspace value reported by the metaslab class. I have not penetrated ZFS' accounting enough to figure out whether this causes any problems...? (Applying this did however bring ztest and zfstest failures down to what, as far as I can see, is the current 'random' background for zfsonlinux.) For the user, reports from e.g. ... Internally, all free space can be used until exhausted.
An optional flag (currently -M) to the zpool command shows the media type of each vdev (solid state or rotational). For testing, this patch currently abuses this flag to also show which rotor vector index (i.e. category) each vdev has been assigned to in the zpool output.
Not implemented yet. One idea would be to e.g. report it under the ... Once, say, 10% of pool capacity is used (and one thus could assume that some usage pattern can be inferred), it would also make sense that the hints at the beginning of ... With the power and flexibility given to the user to direct data to different categories also comes a need to provide mechanisms to easily monitor the multi-dimensional free space. More suggestions?
It gives the ability to also store (small) data blocks on the fast devices, and has the flexibility to direct data (or metadata) to (multiple) different storage classes depending on block size, e.g. a hierarchy of vdevs: SSD, HDD mirror, HDD raidz... (Currently it does not distinguish deduplication tables within the metadata kind of allocations, but adding a separate limit for that should be easy within the selection routine.)
@inkdot7 Thanks for pointing me here, I just glanced at it. I never thought of mixing rotary and SSD storage in one pool except for l2arc, but the ideas are cool. Also, I think in the long run rotary storage will disappear. For lack of an m.2 port and free PCI slots, I am quite happy with a 3 vdev pool of 3 ssds as primary bulk storage for the time being. But for enterprise applications it might be interesting... Reading also across @don-brady's metadata issue: is that not like btrfs, where we can do e.g. -m raid1 -d raid0? To me it would be absolutely logical to have more control over metadata. I think for large setups there should be a way to define "vdev classes" and e.g. even specify that a given zfs, on creation or later, should prefer a certain "location" - also something btrfs somehow has with "rebalancing" and the ability to remove a disk from a pool. ZFS, as the (for me) more reliable and more performant system, would benefit from such features --- until ultimately flash will be as fast as ram and as cheap as hdd, that is.
@inkdot7 Thanks for taking the time to answer my questions. I think that @don-brady's project now includes support for separating out small user data blocks, so it includes a lot of the functionality of this project. In terms of space accounting, I think we need to be able to spill over from one type of device to another. Otherwise we may have lots of free space (on the larger type), then write a little bit of data (which happens to go to the smaller type), and then be out of space.
@don-brady can you confirm this?
@ahrens I suppose you mean from a user-visible point of view, as to when the filesystem should refuse further writes? (Internally, from the old release, I believe both approaches spill over as long as some category has free space.) The question would then be for how long ZFS should allow spilling? Until all space is used? But that would mean that the general performance for content written after the small/fast storage got full is that of the large/slow device types, in contradiction with the performance purpose of both approaches. Personally, I would want ZFS to here refuse further user writes until the situation is resolved (either by deleting files or providing more small/fast capacity). However, I can see that other use cases may accept the performance degradation and prefer ZFS to spill indefinitely, i.e. until the large/slow space is full. Fortunately, both policies should be able to easily co-exist - by adjusting how dspace is reported: in the latter case as the space of the large/slow devices (alternatively all space, if also spilling to faster devices is allowed); in the former case, scaled with the fill factor of the relatively most used storage category (as in this project). Naturally, the best is if the admin/user notices the filling-up situation long before it is critical and takes action. The most likely cause of the small/fast storage filling up early is a too generous cut-off for small user data blocks going to it. Thus it is also important to be able to modify the policy and cut-off of eligible blocks for existing vdevs, and not only allow setting the policy on initial vdev creation.
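A hypothetical sketch (all names invented) of how the two policies could co-exist purely through how dspace is reported, as suggested above:

```c
#include <stdint.h>

typedef enum {
	/* Report full early so data stays on its intended tier. */
	SPILL_REFUSE_WHEN_CATEGORY_FULL,
	/* Accept the degradation and spill until all space is used. */
	SPILL_UNTIL_ALL_SPACE_USED
} spill_policy_t;

static uint64_t
reported_dspace(spill_policy_t policy, uint64_t total_alloc,
    uint64_t total_dspace, double most_full_fraction)
{
	switch (policy) {
	case SPILL_REFUSE_WHEN_CATEGORY_FULL:
		/* Scale by the fill factor of the most used category. */
		return (most_full_fraction > 0.0 ?
		    (uint64_t)((double)total_alloc / most_full_fraction) :
		    total_dspace);
	case SPILL_UNTIL_ALL_SPACE_USED:
	default:
		/* Report all space; writes fail only when it is truly gone. */
		return (total_dspace);
	}
}
```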
@inkdot7 thanks for giving this problem so much thought. As @ahrens mentioned, the goal is that @don-brady's metadata allocation classes work should comprehensively address many of the same issues you've uncovered in this PR. If there are specific use cases which you're concerned aren't handled, let's discuss them so we can settle on a single implementation which works for the expected use cases. For example, as long as the code doesn't get too ugly I could see us adding a property to more finely control the spilling behavior on a per-dataset basis. Refusing writes when a class is fully consumed and allowing spilling both sound like reasonable policies. Are there other user configurable options we should be considering?

We definitely want to make sure we get things like the command line tools right. That means making them both easy to use and flexible enough to support any reasonable allocation class configuration. Both @ahrens and I also agree that this functionality needs to have good reporting features so it's possible to assess how your storage is actually getting used. We could even consider going as far as surfacing a warning in ...

@don-brady since we have multiple people keenly interested in this functionality it would be great if you could update the #5182 PR with your latest code. It looks like the version posted is about 4 months old and I know you've improved things since then! If you could add some example command output while you're at it, I'm sure we'd all find it helpful. Updating the PR would make it possible for @inkdot7, @tuxoko and myself to provide constructive feedback. Feel free to open a new PR if that makes more sense and close the original one.

@inkdot7 let's close this PR and focus on getting the work in #5182 into a form we're all happy with.
@inkdot7 - Thanks for your effort here and the collected data. As noted by others we added support for small blocks as well as a host of other recommended changes and some simplifications to the allocation class feature. |
@don-brady thanks, I think posting markdown in the comment section of the PR would be fine. Eventually, once the dust settles we'll of course need to update the man pages. |
@behlendorf Except for the spilling policy, and the data-block-to-metadata-vdev size cut-off, I have no ideas for user configuration options. But I would need to see the code and understand how it behaves first. (@don-brady To see the operating principles, a rebase to latest master would not be needed.)

The last few days another use case occurred to me: assume one wants to build a RAIDZ pool of hdds, and use an ssd to speed up at least metadata reads. As the ssd will be a separate vdev, it would have no redundancy, unless one invests in multiple ssds for a mirror. However, since ZFS can store blocks at multiple locations, it should be possible to store on the ssd only blocks that i) have multiple copies, and of those ii) only the first copy. Redundancy is then provided by the RAIDZ vdev. If it is ensured that any block stored on the ssd vdev also exists on another vdev, then the pool should be able to fully survive the complete loss of the ssd vdev? I have not tested how ZFS would currently handle the case of such a non-critical loss of an entire vdev, but I could see myself liking this use case a lot, so thought I'd mention it.
Proof-of-concept! This is not a pull request as such, but (Update: operational) for review / intended to spur discussion, and show some possible performance increases. These commits implement a simplistic strategy to make small records (e.g. metadata, but also file data) end up on faster non-rotating storage when a pool consists of two vdevs with different characteristics, e.g. an SSD (mirror) and an HDD (raidzn).
For a test case (see commit log) with many small files, times for 'find' were cut by a factor of 8, and scrub by a factor of 3. More details in the commit log.
Note: Although I have tried to search the web and read various articles etc., I have not really understood all the intricacies of space allocation, so I presume there are unhandled cases, and possibly other issues with the approach. Feel free to point them out.
The purpose is the same as #3779.