reading from mirror vdev limited by slowest device #1461
A starting point would be to take a look here: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_mirror.c#L296 and https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_mirror.c#L219. I believe that is the code that selects which mirror device to read from.
I think this would be a nice optimization. This could be changed such that, after excluding all vdevs which appear in the DTL, the vdev with the smallest number of outstanding IOs in its pending queue is selected.
When selecting which vdev mirror child to read from, redirect io's to child vdevs with the smallest pending queue lengths. During the vdev selection process if a readable vdev is found with a queue length of 0 then the vdev will be used and the above process will be cut short. It is hoped that this will cause all available vdevs to be utilised while ensuring the more capable devices take on more of the workload. closes openzfs#1461
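For readers who want a concrete picture of the rule described in this commit, here is a minimal standalone sketch in C. The types and names are invented for illustration and are not the actual vdev_mirror.c code: any idle, readable child is used immediately; otherwise the readable child with the fewest outstanding I/Os wins.

/*
 * Illustrative sketch only: "read from the mirror child with the shortest
 * pending queue".  Types and names are invented, not the real ZFS code.
 */
#include <limits.h>

typedef struct child {
    int pending;    /* outstanding I/Os queued to this device */
    int readable;   /* 0 if the device is unusable (e.g. in the DTL) */
} child_t;

/* Return the index of the child to read from, or -1 if none is readable. */
int pick_child(const child_t *child, int nchildren)
{
    int best = -1;
    int lowest = INT_MAX;

    for (int c = 0; c < nchildren; c++) {
        if (!child[c].readable)
            continue;
        if (child[c].pending == 0)
            return (c);         /* an idle device: use it right away */
        if (child[c].pending < lowest) {
            lowest = child[c].pending;
            best = c;
        }
    }
    return (best);
}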
@GregorKopka are you able to re-perform your tests with SHA: 9ab95ff applied on a non-production system?
After some fiddling I think I managed to compile it from your fork,
but it doesn't seem to work as intended; it looks like it favours the slow spinning disk (sda) instead of the SSD (sdb):
Interesting... Nice, gentoo?... So that was zfs-kmod you were compiling there? (rather than the zfs package, which is just the usermode part of ZFS and which my patch does not affect?)
It also favours the slower drive when I reorder the drives in the mirror:
Looking at an iostat -x 10 while the test is running:
Maybe look at %util (in case this number can be obtained easily) instead of pending requests for the device, or another metric which might do the trick?
Since the setup is booting from ZFS (USB dongle with the root pool, which I can dd onto a blank one to play with and mess up without consequences), I had to compile the kernel (this pulls in spl and zfs-kmod) and then zfs, and rebooted afterwards to have it take effect (I hope).
I've done a few more tests here and it seems to test out OK, but I will do some more testing. In the meantime, could you confirm your loaded zfs module is from my branch? I'm unsure how your setup works, but are your zfs modules loaded early out of an initramfs? If so, have the modules in there been updated (do you use genkernel?) by regenerating your initramfs? I have Gentoo with ZFS root but boot off a small md/ext2 /boot with a custom initramfs I hacked up back in the early days, so when I update ZFS I need to regenerate my initramfs... so I'm a bit unfamiliar with current Gentoo/ZFS-root practice to advise on this.
I used https://github.com/ryao/zfs-overlay/blob/master/zfs-install as a cookbook (without any overlay) and it works nicely. And yes, I used genkernel to regenerate the initrd. Linux for dummies: how do I check whether the module is built from the correct branch?
Also, you did not say if the emerge segment above was for zfs or zfs-kmod... Are you using /etc/portage/env/sys-fs/zfs-kmod to set your git URL and branch (not just the zfs file in the same directory; I like to symlink those so I only maintain one)? I'm not sure if you can get the modified date from a loaded module...
When you said compiling the kernel pulled in spl and zfs-kmod, was that because you did this on your genkernel command line like the guide?: --callback="emerge @module-rebuild"
On May 18, 2013, at 5:45 PM, GregorKopka wrote:
Modify your META file. There's a line there that starts with 'Release:'. Modify that to read something intelligent (such as '1.1461fix'). That way your RPM will contain a valid version entry, and you will also know with absolute certainty that it's the correct version.
Life sucks and then you die
@FransUrbo: I use Gentoo, no RPMs on my system.
Exactly. I also hacked both the zfs-9999 and zfs-kmod-9999 ebuilds
so the emerge triggered by module-rebuild should have pulled the right versions (the snippet I posted was from the output of the zfs-kmod emerge triggered through this mechanism). In case you want further testing I'll be happy to help out with that; sadly C isn't my language, so I can't help with the coding itself.
Thanks for the info @FransUrbo, that will help. @GregorKopka, if you're up for another go, I have added a commit to this branch that will allow us to confirm that you're running the correct module. So a general procedure would be:
P.S. |
Recompiled.
Further testing revealed that when primarycache is disabled the code behaves somewhat as intended.
But when primarycache is active (which I always tested with, hence the regular export/import) it doesn't, and instead favours (and is limited by) the spinning disk (sda).
Hope this helps debugging it.
Some further observations:
Either this isn't I/O bound, or my system is somehow slowing the SSD to the speed of a 5400 RPM spindle. Any ideas?
1.5k IOPS from an SSD over 10 seconds does seem slow... It's tricky to tell from here; dd might be better than cat in this regard. In regards to seeing more reads to sda, I'm wondering if most of your working...
This is why I recreated the pool with sda and sdb swapped (so the SSD is the first drive in the mirror). Tests with dd if=/test/dump/file of=/dev/null bs=128k with primarycache=all
with primarycache=metadata
with primarycache=none
Interesting though is that
I've done some more work on this; it took me a while to get some systemtap scripts together and then it failed me (anyone know why the variable I'm after is always "?" — I believe I have optimizations off etc...). Anyhow, I reverted to a heap of printk's and have a better idea of what is going on. @GregorKopka, if you're still set up for some testing, would you like to have another go with the latest commit (SHA: 4bb4110)? I am wondering how large your test file is (/test/dump/file)? How much RAM do you have? Unfortunately primarycache=all gives me no io at all; as this is a test box, my pool is only 1GB, so any test file I create fits in ARC. I have added a tunable, so you need to enable the balancing by running:
To turn off the balancing:
Here are my latest results: With primarycache=metadata:
With primarycache=none:
The only thing in my test pool is the dump filesystem, with files created by something along the lines of
RAM is 8GB, but since I issue
prior to each run, the ARC is empty at the start of the test (sans what is pulled in by the import). I'll run tests with the new commit shortly.
Well done, I was looking for a way to drop the ARC. This is with primarycache=all: (not sure how accurate, as it was difficult to catch)
After thinking about this for a while, I believe that balancing mirror reads off the pending io queue is a good way to get the more capable devices to do more work, and it should provide a benefit in general use cases. But in your particular use case, where one of the mirror devices has significantly better throughput, seek time and latency characteristics than the other mirror devices, balancing off the read queue will improve throughput etc. over the usual round robin; however, I don't think it will provide the throughput of sending all the io's to the much faster device. I believe that because the spinning disk is much slower, even when balancing, waiting for the io's from the slow disk will slow down the overall process of reading the data. I think that the algorithm you initially suggested has the most use to the community in general, but for your case I believe the best throughput would be obtained by always selecting the ssd for reads unless a read error or checksum failure occurred (or during scrub), in which case the slow device would be used. For example in my tests with 1 x disk and 1 x ram device:
I'm sure this would be achievable, but the tricky part would be identifying when this is the case. I'm guessing there might be some disk info that could be queried, e.g. a rotational flag somewhere; otherwise stats could be collected in order to identify the case. Once identified, the part that selects the preferred mirror at the start of an io could select the fast device rather than: mm->mm_preferred = spa_get_random(c); Anyhow, probably best to look at getting your currently proposed algorithm in first I suppose, and look at that later.
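As a rough sketch of that idea, assuming a per-child "non-rotational" flag is available: the flag, the helper and all names below are invented, and the real code would have to query the block layer and set mm->mm_preferred accordingly. The fallback keeps today's behaviour of picking a random readable child when no SSD is present.

/*
 * Illustrative sketch of "always prefer the SSD side of a hybrid mirror".
 * The 'nonrot' flag and the helper are invented; the real code would query
 * the device's rotational flag and set mm->mm_preferred differently.
 */
#include <stdlib.h>

typedef struct child {
    int readable;
    int nonrot;     /* hypothetical: derived from the device's rotational flag */
} child_t;

int pick_preferred(const child_t *child, int nchildren)
{
    if (nchildren <= 0)
        return (-1);

    /* Prefer a healthy non-rotational child if one exists. */
    for (int c = 0; c < nchildren; c++)
        if (child[c].readable && child[c].nonrot)
            return (c);

    /* Otherwise fall back to a random readable child, as today. */
    int start = rand() % nchildren;
    for (int i = 0; i < nchildren; i++) {
        int c = (start + i) % nchildren;
        if (child[c].readable)
            return (c);
    }
    return (-1);
}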
@b333z:
At first glance querying the rotational flag of a device sounds like a good idea, but after thinking a bit and looking around I found some code which looks to me like a better starting point (since non-rotating devices could well be slower than spinning ones): to test for timeouts, vdev_deadman (https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev.c#L3196) peeks at the timestamp of the first queued request of each vdev, creating a delta to the current time (at https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev.c#L3221). Thinking about it for a while, my feeling is that taking this information into consideration could help select the faster device. Pseudocode here: b333z#1 Notes: Please watch the comment on the pull request!
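A hypothetical sketch of how that timestamp delta could be folded into child selection; the weighting (one queued I/O counted as 10 ms of backlog) and all names are invented, and the pseudocode linked above (b333z#1) is what was actually proposed:

/*
 * Illustrative sketch: fold the age of the oldest queued I/O (the value the
 * deadman check looks at) into mirror child selection.  The weighting and
 * all names are invented.
 */
#include <limits.h>

typedef struct child {
    int       readable;
    int       pending;          /* queued I/Os */
    long long oldest_io_ms;     /* enqueue time of the oldest queued I/O, in ms */
} child_t;

/* Lower score = more attractive read target. */
static long long child_score(const child_t *c, long long now_ms)
{
    long long age = (c->pending > 0) ? now_ms - c->oldest_io_ms : 0;

    /* Hypothetical weighting: each queued I/O counts like 10 ms of backlog. */
    return ((long long)c->pending * 10 + age);
}

int pick_child_by_latency(const child_t *child, int nchildren, long long now_ms)
{
    int best = -1;
    long long lowest = LLONG_MAX;

    for (int c = 0; c < nchildren; c++) {
        if (!child[c].readable)
            continue;
        long long s = child_score(&child[c], now_ms);
        if (s < lowest) {
            lowest = s;
            best = c;
        }
    }
    return (best);
}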
@b333z "anyone know why the variable I'm after is always=?, I believe I have optimizations off etc..." Can you provide an example of what's not working? FYI, I dug this (previously) working example out of one of my old stp scripts:
@chrisrd here is my script; this seems to be something I want to do fairly often: just dump out the state of the locals for all lines in a function:
But since I added the printk's, the variables seem to be resolving now... So I'm thinking that now that the variables are being referenced more, the optimizer can't trash them?... I have tried to remove optimizations using -O0, so am compiling with:
I might repeat against vanilla zfs master and see if I can replicate the issue. Also printing pp() causes a massive line:
Do you know of any way to just output the file name and line?:
@b333z, your script looks good to me. Sorry, I don't have any better ideas for you, other than avoiding the optimisation as you say. I was just post-processing the output to remove the file name noise. Mind you, I'm no systemtap guru; I first used it in relation to #503. I did find the systemtap mailing list to be very helpful in working out my stap issue at that time.
The read bandwidth of an N-way mirror can be increased by 50%, and the IOPs by 10%, by more carefully selecting the preferred leaf vdev. The existing algorithm selects a preferred leaf vdev based on the offset of the zio request modulo the number of members in the mirror. It assumes the drives are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. Utilization can be improved by preferentially selecting the leaf vdev with the least pending IO. This prevents leaf vdevs from being starved and compensates for performance differences between disks in the mirror. Faster vdevs will be sent more work and the mirror performance will not be limited by the slowest drive. In the common case where all the pending queues are full and there is no single least busy leaf vdev, a batching strategy is employed. Of the N least busy vdevs one is selected with equal probability to be the preferred vdev for T milliseconds. Compared to randomly selecting a vdev to break the tie, batching the requests greatly improves the odds of merging the requests in the Linux elevator. The testing results show a significant performance improvement for all four workloads tested. The workloads were generated using the fio benchmark and are as follows.
1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).

               |        Pristine       |       With 1461        |
               |  Sequential   Random  |  Sequential   Random   |
               |   1MB   4KB  1MB  4KB |   1MB   4KB   1MB  4KB |
               |  MB/s  MB/s  IO/s IO/s|  MB/s  MB/s  IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped      |   226   243   11  304 |   222   255    11  299 |
2 2-Way Mirror |   302   324   16  534 |   433   448    23  571 |
2 3-Way Mirror |   429   458   24  714 |   648   648    41  808 |
2 4-Way Mirror |   562   601   36  849 |   816   828    82  926 |

Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#1461
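To illustrate the tie-breaking and batching part of the commit message above, here is a simplified standalone sketch; the 10 ms window, the per-mirror static state and all names are invented and do not correspond to the actual patch. Keeping one preferred child for a short window means consecutive reads land on the same disk long enough for the elevator to merge adjacent requests.

/*
 * Simplified sketch of the tie-break/batching described above: when several
 * readable children are equally busy, stick with one preferred child for a
 * short window so nearby requests can merge in the block-layer elevator.
 * Names, the window length and the static state are invented.
 */
#include <limits.h>
#include <stdlib.h>
#include <time.h>

#define BATCH_MS 10                     /* hypothetical "T" window */

typedef struct child {
    int pending;                        /* outstanding I/Os on this device */
    int readable;
} child_t;

static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ((long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000);
}

/* Per-mirror state for the batching window (one mirror only, for brevity). */
static int preferred = -1;
static long long preferred_until;

int pick_child_batched(const child_t *child, int nchildren)
{
    long long now = now_ms();
    int best = -1, lowest = INT_MAX, ties = 0;

    /* While the batching window is open, keep using the preferred child. */
    if (preferred >= 0 && preferred < nchildren &&
        now < preferred_until && child[preferred].readable)
        return (preferred);

    for (int c = 0; c < nchildren; c++) {
        if (!child[c].readable)
            continue;
        if (child[c].pending < lowest) {
            lowest = child[c].pending;
            best = c;
            ties = 1;
        } else if (child[c].pending == lowest && rand() % ++ties == 0) {
            best = c;                   /* uniform pick among the least busy */
        }
    }

    preferred = best;
    preferred_until = now + BATCH_MS;
    return (best);
}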
@b333z, @GregorKopka Your recent activity on this issue inspired me to take a closer look at the mirror code. For a long time I've had this feeling that we've been leaving a lot of performance on the table... it turns out we were. I started with your initial patch @b333z and improved things from there; I'd love it if you could take a look. I've been able to increase sequential read bandwidth, even when just using hard drives, by 50% and IOPs by 10%. The patch I've proposed improves the mirror vdev selection to make better decisions about which vdev to use. How this is done is described in the patch in detail, and I've posted my testing results below. The patch is ready for wider scale testing and I'd love to see how much it improves real world workloads. @GregorKopka This patch also addresses your use case of pairing an SSD with an HDD, but more importantly it handles the general case where the drives in your mirror are performing slightly differently, either due to age, defect, or just non-identical hardware.
Whoa! Impressive work, from all of you. It makes me wish I had a workload / dataset suited to mirroring! |
@behlendorf: We started with something similar (though not setting a preferred device to enable merges inside the kernel scheduler) by looking only at the queue length, but it turned out that this led to stalls on requests to the SSD side, since (I guess) ZFS didn't schedule new reads before the HDD side returned certain data. After adding the request time into the equation, the average response time for read requests went down, leading to way more requests being added to the SSD. I would be interested in how your patch compares to the one Andrew and I put together. Could you please run the tests you did against our code (https://github.com/b333z/zfs/tree/issue-1461), or send me the nice Excel sheet (so I don't have to recreate this from scratch) and then I'll recreate your test on my system?
@behlendorf: Getting a very nice increase in ops and throughput with your patch even using 2 drives of the same speed. Well done, great outcome for all! Giving timed bursts to the vdevs when queues are empty is a great idea... With Gregor's case I'm seeing an effect where the benefits of balancing across the devices are counteracted by having to wait for the vastly slower device's io's to complete. In the patch that Gregor mentions we've managed to get about half way to what you would get if you send io's only to the fast device (the best result), by skewing off the avg. io completion times, but it does make me wonder if for this particular case there could be a flag in the vdev tree or similar that says always read from this device unless there is a problem...
Me too, so here are the results from the same four workloads on the system which did the original testing, just for completeness. Note this pool configuration is all spinning disk; you'll notice there isn't any significant improvement over the pristine master source.
@GregorKopka you're welcome to the spreadsheet as well if you like; I would have attached it if GitHub allowed that sort of thing.
@b333z Thank you, but I think you may slightly misunderstand why it helps improve things. The real benefit of batching the requests to the vdev occurs when the queues are full, not empty. We expect the common case to be that all the pending queues are full with 10 outstanding requests. When this occurs it's optimal to batch incoming requests to a single vdev for a short period to maximize merging. Every merged IO is virtually a free IO, and requests which come in at close to the same time are the most likely to merge. Now as for @GregorKopka's case, the patch helps there because the SSD device in the mirror will consume its IOs fastest, causing its pending queue to more frequently drop below 10 requests. When this happens we prefer the SSD and give it a burst of requests to handle. An average request time could be used for this too, but relying on the pending tree size is nice and simple. This should be pretty much optimal from an aggregate bandwidth standpoint. But as you point out, the story is a little bit different if you are more concerned about latency. Since the hard drive is still contributing, you still see individual IOs with a longer worst case latency, although the total IO/s the pool can sustain will be much higher. There's nothing really which can be done about this short of adding a tuning which says always and only read from the lowest latency devices in the mirror. That's easy enough to add but not exactly a common case. Just for reference, here's the patch I've proposed running against an ssd+hdd pool. Because it sounds like you're most interested in IO/s, I've run fio using 16 threads and a 4K random read workload.
From the above results you can quickly see a few things.
Thanks for the clarification @behlendorf, that makes sense. Those results look good; the SSDs are breaking free nicely over the usual round robin. I've been experimenting with a number of testing tools but haven't come across fio in my travels, will have to give it a go!
@behlendorf: These numbers don't make sense to me, since the more disks there are, the fewer IO/s are reported:
Is something wrong with the numbers? Also, since you ran your test with 1M and these with 4M, they sadly are not directly comparable; could you please rerun your original test? What settings did you create the pools with (ashift, atime, primarycache)?
@GregorKopka Sorry, I messed up formatting that initial chart. Here's the fixed version. The numbers were for 1M not 4M I/Os and I just had the numbers accidentally out of order. Also be aware the 4K Random 2x4 number (529) isn't a typo, that's what the test turned in... but I don't believe it, and think it's a fluke. I just didn't get a chance to run the test multiple times and see. For commit ac0df7e
As for settings, the pool was created with all defaults. Note that if you test the patch with a large pool you're going to need to make sure to use multiple threads. One process probably won't be able to saturate the disk.
@b333z Fio is pretty nice because it's fairly flexible and you can usually easily generate the workload you're interested in. Plus it logs some useful statistics about the run. But there are certainly other good tools out there as well, and of course nothing tests like a real world workload.
@GregorKopka Based on your testing, does the patch in #1487 address your original concerns about the performance of a mirror being limited by the slowest device? Unless there are serious concerns with the patch, I'm going to merge it for 0.6.2.
@behlendorf While your approach isn't that optimal for my specific use case (since the window where the HDD will receive all requests leads to the SSD falling empty, at least with the default 10ms), I haven't seen ill side effects on standard HDD mirrors. So I'm OK with merging it; I'll fiddle with the tunable when I have some time and spare hardware (which I sadly both lack at the moment) to test some more. One question though: what is the point of the window (at least for random I/O) when the Linux elevator is effectively disabled for pool vdev members (since ZFS sets the noop scheduler)? Maybe there is something in ZFS merging requests so this makes sense, because otherwise I have no clue why your code is faster in the general case compared to what b333z and I came up with.
@GregorKopka Even in your case I'd expect it to bias quite heavily toward your local SSD. Maybe not enough to 100% saturate it, but still substantially better than the existing policy. I also agree that it would be best if the code could automatically set a good value for that tunable. As for the Linux elevator, setting noop doesn't entirely disable the elevator. It will still perform front and back merging with the existing requests in the queue. And because ZFS will attempt to issue its requests in LBA order, if those requests are sent to a single vdev there's a decent chance we'll see some merging occurring. For totally random IO that clearly breaks down. Longer term it would be desirable to allow ZFS to have a larger than 128k block size. This would allow us to easily merge larger requests for the disk. Unfortunately, that work depends on us being able to handle 1MB buffers, so the zio scatter gather work must be done first, and then 1MB block support added via a feature flag.
This is better than a cure for cancer. Every time I watch you guys improve ZFS, it's like magic. |
Hi guys, thanks for all the work on this. We've been evaluating the performance of this under FreeBSD and have noticed an issue. In the original, a good portion of sequential operations are directed to the same disk, which for HDs is important in minimising the latency introduced by seek times. Based on how FreeBSD's gmirror works when configured in load-balanced mode, I've made some changes which in my tests produce significantly higher throughput, so I would like your feedback.

== Setup ==
=== Prefetch Disabled ===
==== Stripe Balanced (default) ====
==== Load Balanced (zfsonlinux) ====
=== Prefetch Enabled ===
==== Stripe Balanced (default) ====
==== Load Balanced (zfsonlinux) ====

== Setup ==
=== Prefetch Disabled ===
==== Stripe Balanced (default) ====
==== Load Balanced (zfsonlinux) ====
=== Prefetch Enabled ===
==== Stripe Balanced (default) ====
==== Load Balanced (zfsonlinux) ====

In the above, zfsonlinux refers to the patch applied to the FreeBSD sources, not to a different OS, so we are comparing apples to apples ;-)
This is a huge improvement for several reasons:
This removes one lock cycle per disk per request, which is huge, since the lock only protects a single ulong_t read. Nice find! If you would add a tunable for the non-rotational bonus instead of using vfs.zfs.vdev.mirror_locality_bonus for it (and using a fraction of that for the actual locality bonus), tuning of hybrid mirrors would be possible to the point of all reads being served by the SSD side, so the full IOPS of the HDD would be available for writes... Is it possible for you to convert the patch into a pull request against current master?
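For readers unfamiliar with the locality idea being discussed, here is a rough sketch of how a locality bonus might be combined with queue-depth balancing. The weights, the 1MB "nearby" threshold and all names are invented; see the linked FreeBSD patch for the real implementation.

/*
 * Illustrative sketch of queue-depth load balancing with an I/O locality
 * bonus, in the spirit of the FreeBSD patch discussed above.  All weights,
 * thresholds and names here are invented.
 */
#include <limits.h>
#include <stdlib.h>

#define LOCALITY_BONUS  5               /* hypothetical bonus for a nearby offset */
#define NEARBY_BYTES    (1LL << 20)     /* "nearby" = within 1MB of the last I/O */

typedef struct child {
    int       readable;
    int       pending;                  /* queued I/Os */
    int       rotating;                 /* spinning media benefits most from locality */
    long long last_offset;              /* offset of the last I/O sent to this child */
} child_t;

int pick_child_locality(const child_t *child, int nchildren, long long offset)
{
    int best = -1;
    int best_load = INT_MAX;

    for (int c = 0; c < nchildren; c++) {
        if (!child[c].readable)
            continue;
        int load = child[c].pending;
        /*
         * Reward a rotating child when the request is close to its last I/O,
         * so sequential streams stay on one disk and avoid extra seeks.
         */
        if (child[c].rotating &&
            llabs(offset - child[c].last_offset) < NEARBY_BYTES)
            load -= LOCALITY_BONUS;
        if (load < best_load) {
            best_load = load;
            best = c;
        }
    }
    return (best);
}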
@stevenh I think adding locality into the calculation is a really nice improvement. It would be great to refresh a version of the patch against ZoL so we can get some numbers for Linux. You may have struck upon a better way to do this, but the only way to be sure is to test! You didn't say explicitly in your original comment what the workload was. Am I correct to assume it was fully sequential reads from N different files? That's exactly the workload I'd expect your changes to excel at, which is what we see, nicely done! But before declaring something better or worse we should test a variety of workloads. We don't necessarily want to improve one use case if it penalizes another. At a minimum I usually try to get data for N threads (1, 2, 4, etc.) running 4K/1M sequential and fully random workloads. As you know, the original patch, 556011d, was designed to take advantage of the merging performed by the Linux block device elevator. Getting better locality should improve that further. Correct me if I'm wrong, but I believe FreeBSD (like Illumos) doesn't perform any additional merging after issuing the request to the block layer, so I'm not too surprised about how the ZoL patch performed under FreeBSD. I'd love to test both patches under Linux and see! Thank you for continuing to work on this! I suspect there are still improvements which can be made here!
Thanks for the feedback @GregorKopka, I'm still playing with this patch based on comments on the FreeBSD zfs list, recent changes include:
Link updated with new version: http://blog.multiplay.co.uk/dropzone/freebsd/zfs-mirror-load.patch |
@behlendorf I've tested a number of workloads, from 1 -> 9 workers, on both a 2 x HDD and a 2 x HDD + 1 x SSD setup. As you thought, FreeBSD doesn't perform any merging after issuing the request to the block layer. I'm not familiar with the Linux block IO layer, so the knob that notifies ZFS of rotational / non-rotational media would need someone else to implement, but the core ZFS changes should be a direct port. As mentioned above, I've just updated my current in-progress patch, which includes a number of additional enhancements :)
I just committed my updated version of this patch, which adds I/O locality to FreeBSD and significantly improves performance. This can be found here:
Cool, thanks @behlendorf; if I can be of any help let me know.
The read bandwidth of an N-way mirror can be increased by 50%, and the IOPs by 10%, by more carefully selecting the preferred leaf vdev. The existing algorithm selects a preferred leaf vdev based on the offset of the zio request modulo the number of members in the mirror. It assumes the drives are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. Utilization can be improved by preferentially selecting the leaf vdev with the least pending IO. This prevents leaf vdevs from being starved and compensates for performance differences between disks in the mirror. Faster vdevs will be sent more work and the mirror performance will not be limited by the slowest drive. In the common case where all the pending queues are full and there is no single least busy leaf vdev, a batching strategy is employed. Of the N least busy vdevs one is selected with equal probability to be the preferred vdev for T microseconds. Compared to randomly selecting a vdev to break the tie, batching the requests greatly improves the odds of merging the requests in the Linux elevator. The testing results show a significant performance improvement for all four workloads tested. The workloads were generated using the fio benchmark and are as follows.
1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).

               |        Pristine       |       With 1461        |
               |  Sequential   Random  |  Sequential   Random   |
               |   1MB   4KB  1MB  4KB |   1MB   4KB   1MB  4KB |
               |  MB/s  MB/s  IO/s IO/s|  MB/s  MB/s  IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped      |   226   243   11  304 |   222   255    11  299 |
2 2-Way Mirror |   302   324   16  534 |   433   448    23  571 |
2 3-Way Mirror |   429   458   24  714 |   648   648    41  808 |
2 4-Way Mirror |   562   601   36  849 |   816   828    82  926 |

Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#1461
Problem
Performance issue: reading from a mirror vdev is limited by the slowest device of the mirror.
ZFS seems to issue reads to mirror vdev members in a round-robin fashion.
Background info
In need of cheap, high random-read IOPS on ~1TB of data (even directly after reboot, so L2ARC doesn't help that much for this scenario; plus an L2ARC of this size would be limited by the maximum 32GB of RAM the system can handle), I had the idea to build a pool with hybrid vdev mirrors: one side an SSD to get the IOPS, the other side(s) cheap spindle(s) to gain redundancy in case one of the SSDs bites the dust.
Sadly this doesn't work:
Test
sda: spindle, sdb: ssd
/test/dump/* ~6GB in 20 files
Solution?
Issue reads to non-busy vdev members, maybe depending on
(/sys/block/[dev]/device/queue_depth - /sys/block/[dev]/inflight)
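To make that metric concrete, here is a small userspace illustration that computes "free queue slots" per device from those two sysfs files. The device names are placeholders, and /sys/block/<dev>/inflight holds two counters (in-flight reads and writes).

/*
 * Userspace illustration of the suggested metric: free queue slots per
 * device, i.e. queue_depth minus in-flight requests, read from sysfs.
 * Device names are examples only.
 */
#include <stdio.h>

static int free_slots(const char *dev)
{
    char path[256];
    FILE *f;
    int depth = 0, in_reads = 0, in_writes = 0;

    snprintf(path, sizeof (path), "/sys/block/%s/device/queue_depth", dev);
    f = fopen(path, "r");
    if (f == NULL)
        return (-1);
    if (fscanf(f, "%d", &depth) != 1)
        depth = 0;
    fclose(f);

    /* inflight holds two counters: in-flight reads and in-flight writes. */
    snprintf(path, sizeof (path), "/sys/block/%s/inflight", dev);
    f = fopen(path, "r");
    if (f == NULL)
        return (-1);
    if (fscanf(f, "%d %d", &in_reads, &in_writes) != 2)
        in_reads = in_writes = 0;
    fclose(f);

    return (depth - (in_reads + in_writes));
}

int main(void)
{
    const char *devs[] = { "sda", "sdb" };      /* example device names */

    for (int i = 0; i < 2; i++)
        printf("%s: %d free queue slots\n", devs[i], free_slots(devs[i]));
    return (0);
}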