ZVOLs 3X slower than recordsets at same recordsize/volblocksize for sequential writes #4265
I've not done this exact type of performance testing, but I'd like to point out that within ZFS, at least, the data component of a ZVOL is no different from a file other than that it has no name in the filesystem namespace. The big difference, however, is that access to a ZVOL goes through the kernel's block IO system. Among other things, unless …
Tiotest is a quick-and-dirty approach; I'll run some fio tests with direct=1 at iodepth=128 and iodepth=1 to provide more results, but this is a consistent thing we see across real-world workloads: random IO on the iSCSI targets being fed by the ZVOLs is much better than linear.
My first thought is that the sequential workload might be triggering some unnecessary read-ahead, either in the Linux kernel's block layer or internally in ZFS. I'd try disabling all read-ahead in both the kernel and ZFS and see what impact that has. Also be aware that the ZFS prefetch logic was just significantly reworked in master, as we just pulled in Illumos 5987 7f60329.
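If it helps, the ZFS-side prefetch can be turned off at runtime through the module parameter; a minimal sketch (the parameter is module-wide, not per-dataset):

```sh
# Disable ZFS-level prefetch (zfetch) for the whole module at runtime.
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
# Verify the setting took effect.
cat /sys/module/zfs/parameters/zfs_prefetch_disable
```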
@behlendorf: could you provide any instructions for ensuring no read-ahead occurs in ZoL? Is hdparm -A 0 sufficient for the disk itself? I'll deploy an ELK logger to consume iostat data for the runs and get some visual representation going. By the way, during direct=1 testing through a libvirt-scsi disk (not lun) abstraction, I'm seeing SCSI errors from tcm_loop on zvols which aren't natively 4k. When I pass a 4k-volblocksize zvol through as a libvirt-scsi LUN, everything works just fine. Something shifty is going on with volblocksize and how other things at static block sizes stack on top, since this seems eerily similar to the SCST loopback problem I saw earlier on 8k zvols. The problem with using 4k is the bloody space loss on ashift=12 RAIDZ3 pools... I might need to go to a striped mirror for this, but that's also not the most space-efficient layout in the world.
@sempervictus set the … Regarding the SCSI errors, you might try the patch in #4203 to increase the logical block size. There seem to be issues here when the logical size is 512 and the physical size is larger, even though ZFS is correctly providing the right optimal size.
On 2016-01-25 13:20, Richard Yao wrote:
Richard, I may give this a try too.
@dswartz I am juggling enough things at once that while I thought my original comment was correct, I decided to delete it until I can double-check things. I will repost it afterward.
On 2016-01-25 13:33, Richard Yao wrote:
Whoops, okay :) I don't remember if I mentioned this in any earlier posts. I am currently …
On 2016-01-25 13:35, Richard Yao wrote:
No worries, thanks :)
@dswartz After double-checking, it turns out that my remarks were right the first time. As paradoxical as saying "we no longer have the IO queue" sounds when we have a …

@behlendorf We disable Linux's readahead on datasets, but not on zvols. I suspect this is the result of double prefetch overhead from interactions between the Linux readahead and zfetch rather than an … I wrote a patch to address this alongside the work to kill the zvol IO queue and decided to split it into its own pull request. The patch to disable Linux readahead was rejected when it turned out that Linux …

@sempervictus Try applying the patch from #4272 and rerunning your tests:
Is there a way to disable the kernel read-ahead without the patch, on a booted system with the ZFS modules loaded? I recently had time to play with ZFS again, and with the latest version I still had the ZVol issues that I reported in my bug report; I asked for that to be reopened. I was wondering if read-ahead would cause the issue I had there. Thanks.
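For the kernel side, the block layer's readahead can be dropped at runtime with the standard sysfs/blockdev knobs; a sketch, with the zvol device name as a placeholder:

```sh
# Per-device readahead in KiB; 0 disables it for this block device.
echo 0 > /sys/block/zd0/queue/read_ahead_kb
# Equivalent via blockdev, expressed in 512-byte sectors.
blockdev --setra 0 /dev/zd0
```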
I did some cheap and dirty testing with the Disable Linux Readahead patch, using CrystalDiskMark 5.1.1 x64 in a guest VM.

Before the Disable patch: …

After the Disable patch: …
EDIT: Pasted in the correct benchmarks. Sorry.
These are exactly the same numbers. Are you sure you copied and pasted them correctly?
@odoucet Oops, I pasted in the wrong ones for the after-patch stats; I've updated the post. Looking at the post again, it seems that the patch makes reads faster but writes slower. Strange. Is that an issue with the ZFS readahead?
@dracwyrm A problem when evaluating that patch in your setup is that it has triple prefetch, from zfetch, Linux readahead, and Windows readahead. The interactions between the three are not necessarily well behaved. I used to have a semi-similar setup where I ran Windows on physical hardware off an iSCSI-backed zvol, and I recall significant variations between runs of CrystalDiskMark in that setup. I suspect that might be the case for you too, which would mean that those numbers are probably not significant. There are also other potential things in play that can bottleneck or add variation to performance, such as a mismatched volblocksize and NTFS fragmentation. The latter should definitely cause performance to vary between runs unless you start the VM with the same on-zvol state each time. It might be possible to correct for all of those things to get representative numbers, but you would need to control everything very tightly, which is hard to do in comparison to the test by @sempervictus. His test also shows differences between datasets and zvols, which is what suggests there is additional performance that we could gain.
So I've applied #4272, and here are the test results: http://pastebin.com/1FHMKiSg.
I believe I've gotten to the bottom of the differing results when using tiotest as in @sempervictus' initial report. I set up a test rig using a slightly modified version of @sempervictus' script and also hacked tiotest to use O_DIRECT. The issue seems to be that, for reasons I haven't yet tracked down, the zvol case results in a lot of contention as dbufs are dirtied. I believe the issue to be that the updates to the L1 and L2 blkptrs are highly contended. I'll try to track this down a bit further, but for the moment I think this mainly shows that concurrent sequential writes to separate objects perform better than strided concurrent sequential writes to the same object.
After further testing and instrumentation, I've still not determined exactly what's causing the slowdown. Unfortunately, there's not a huge difference in performance. At the moment, the zvol test is running at 177 MB/s (with O_DIRECT enabled) and the file(s) test is running at 239 MB/s (tiotest 4K sequential writes). I'll note that the stock tiotest, without O_DIRECT, runs at about 115 MB/s, but I consider the O_DIRECT mode to be a more reasonable test because it's not dependent on the manner in which the page cache is flushed.
Thank you, sir; disturbing to hear you can't see the cause. I figured you …
@sempervictus I've not given up yet :) The problem seems to be that the zvol_request->dmu_write_uio->dmu_write_uio_dnode path is slower than the zfs_write->dmu_write_uio_dbuf->dmu_write_uio_dnode path (even though they both ultimately use dmu_write_uio_dnode). I'm following all the various call paths of …
@sempervictus I did a bit of a reset on my test regimen because I simply wasn't seeing anything like a 3:1 ratio for sequential writes between zvols and files. It turns out that adding …

I modified the original test script as in https://gist.github.com/dweeezil/3fb4ef7f0ce88c6b2e59 to test zvols with and without O_DIRECT and also to test only sequential writes; it also recreates the pool before each test. The pool I'm using is a set of 10 5G files on a tmpfs filesystem (size and number chosen to give good SSD-like performance). The 4-thread results are in https://gist.github.com/dweeezil/c444f55667216ead022f and show that …

So there seem to be two issues here. The first is that … The … As to the overhead involved with small-blocksize zvols insofar as this benchmark is concerned, I'm still looking into it.
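A file-backed pool of that shape can be assembled roughly as follows (directory, pool name, and options are assumptions; only the vdev count and size come from the description above):

```sh
# Ten sparse 5G files on tmpfs, striped into a throwaway test pool.
mkdir -p /dev/shm/zpool-files
for i in $(seq 0 9); do
    truncate -s 5G /dev/shm/zpool-files/vdev$i
done
zpool create testpool /dev/shm/zpool-files/vdev*
```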
@sempervictus I think @dweeezil's approach of using …

@tuxoko Your thought about the code paths being different was right. In the normal case, … A quick trace of the file write path shows one set of call stacks, while a trace of tiotest on a zvol shows another. That would be …
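For anyone wanting to reproduce a call-path capture like this, the kernel's function_graph tracer can be rooted at the common DMU write routine named above; the exact ftrace setup here is an assumption, not necessarily how these traces were taken:

```sh
cd /sys/kernel/debug/tracing
echo function_graph > current_tracer
# Only graph calls rooted at the shared DMU write routine.
echo dmu_write_uio_dnode > set_graph_function
echo 1 > tracing_on
# ... run the tiotest workload against the zvol or the dataset ...
echo 0 > tracing_on
head -n 100 trace
```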
@sempervictus It is not clear from your benchmarks if the patch to disable prefetch made a difference because the numbers do not look like they were done on the same pool configuration. That being said, #4316 has passed the buildbot and should help.
I noticed that similar treatment is needed for the read case, so I pushed a couple of additional commits. I also see what might be some additional opportunities for improvement on both zvols and regular files in the ZIL, and maybe mmap too, which I will try to implement next week.
I finally figured out where much of the file-version's advantage comes from. It involves a set of circumstances rather specific to @sempervictus' original testing script:
It turns out that a freshly-created filesystem does not have a ZIL until the first modification is made to it. The relevant code in the write-logging path,
in the case of this benchmark, and due to the conditions listed above, will generally never create the ZIL before the timed writes begin. If one simply primes the filesystem with a small write and a sync before running the first benchmark pass, the results change considerably; see the sketch below.
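A minimal version of that dd-and-sync step, assuming the dataset under test is mounted at /tank/fs (path and sizes are placeholders):

```sh
# Make one small modification to the fresh filesystem and flush it,
# so the ZIL exists before the timed benchmark passes start.
dd if=/dev/zero of=/tank/fs/prime bs=4k count=1
sync
```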
I am running with @ryao's latest zvol patches, but I'm pretty sure they don't make a big difference in the zvol results. After modifying the script to do the dd and sync as shown above, here are the results (on a completely different testing system than I was using before): https://gist.github.com/dweeezil/b23d372755a6ec7806a1.
For completeness, here's the testing script I used: https://gist.github.com/dweeezil/3f233823f5fc7922b276. I put the extra dd/sync in the zvol case, but it's not needed there since O_DIRECT causes the ZIL to be created right away.
@dweeezil I think you have satisfactorily explained the difference. It seems reasonable to think that the …
With the above patch merged in, the results are still pretty abysmal, and in some cases much worse than the original 3X comment. The actual use case here is a Cinder volume service (LVM/ZVOL until I or someone else writes a proper ZFS driver), which I've configured to assign volumes via libvirt to the VMs as direct pass-through via the virtio-scsi LUN configuration; basically no qemu/kvm middleware between the iSCSI target on the compute node and the VM.
When the volume is "fresh", the linear write speed is a bit better; once it gets a first pass of writes, though, sequential writes degrade to 1/7th the rate of random writes. Given that ZoL is now being included in a number of distributions, especially Ubuntu, it's only a matter of time before it becomes the de facto back end for OpenStack storage (both for Cinder volumes and Ceph backing), so this might become a more pressing concern. As far as using O_DIRECT goes, while I agree that it provides a more consistent baseline for benchmarks, there are plenty of applications which do not set the flag in real-world conditions, so it would likely make sense to test both variations.
@sempervictus Just wondering if there's been any headway on this? (Or if you've found a workaround for it.) Reading through the issues list this afternoon, this one looks somewhat related to #4512, which I've been banging my head on for a bit now.
blktrace seems to show that all ZVOL operations on one of the systems experiencing this problem are pegged to (almost) a single core.
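A per-CPU breakdown like that can be pulled out of blktrace with a pipeline along these lines (zd32 is the device from this report; the 30-second capture window is arbitrary):

```sh
# Column 2 of blkparse's default output is the CPU that handled each event.
blktrace -w 30 -d /dev/zd32 -o - | blkparse -i - | awk '{print $2}' | sort -n | uniq -c
```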
That trace is from a VM doing a filesystem repair (NTFS, Windows guest) atop an iSCSI export from zd32. This is with: …

However, I don't see the spl_taskq_thread_dynamic parameter in sysctl -a; searching for it there returns nothing.
All of the SPL knobs, AFAIK, were never under /proc/sys/; they were always under /sys.
@sempervictus Nearly all zvol operations being pegged to a single core suggests that IO is being diverted to that core by the kernel. The cause is likely the block queue's rq_affinity setting; try setting it to 1 and rerunning your tests. If things work well, we can patch the zvol code to make this the new default. Also, the dynamic taskqs work by spawning threads when there is work and killing them when there is none, so it is not surprising that you do not see anything.
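If it does help, the same setting can be applied to every zvol in one go through the standard sysfs layout; a sketch:

```sh
# Steer request completions back toward the submitting CPU group
# for every zvol block device.
for q in /sys/block/zd*/queue/rq_affinity; do
    echo 1 > "$q"
done
```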
@kernelOfTruth @ryao: thanks for the input. Dynamic taskqs are off, verified in /sys/module/spl/parameters/spl_taskq_thread_dynamic. As far as the blktrace results go, I took a few cycles of output before and after setting echo 1 > /sys/devices/virtual/block/zd80/queue/rq_affinity, and the output looks a bit better. However, while it's now spreading events across all cores, the first 4 have a 10:1 event ratio with the last 4.
So as of today's version of #5135 (3b234ed) this problem persists, with throughput completely unaffected by the addition of a PCIe SLOG device.
For comparison, here's a dataset in the same pool:
The pool itself is SAS SSD in a RAIDZ1, but neither it alone nor with a separate write path is it able to get near the dataset's linear write throughput, though interestingly enough, random writes are faster. Something synthetic is going on here.
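For reference, the separate write path referred to here is a SLOG vdev, attached along these lines (pool and device names are placeholders):

```sh
# Add a dedicated log (SLOG) device to an existing pool.
zpool add tank log /dev/nvme0n1
# It shows up under a separate "logs" section in the pool layout.
zpool status tank
```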
True, but these IO patterns happen on iSCSI targets feeding OpenStack all the time. We see zvols backing VMs at 100% utilization in iostat even when their vdevs are at 10-20%. Multiple zvols can peg in a pool while datasets are still writing fine. For IO-heavy workloads, zvols are almost useless because they back up so much IO that high-throughput DBs fail to commit. It's bad, and tiotest isn't the problem. The serialized access method is probably causing this, IMO, but it's above my pay grade for the time available right now to tackle. Whoever is working on zvols: please take a look at performance under async conditions. iSCSI or local KVM raw block devices, even a native zvol in an OS, all experience this sort of degradation.
-------- Original Message --------
From: kpande
for what it's worth, `tiotest` is terrible.
Yes, and we use it in production as needed; it works as advertised, though it doesn't help the performance issue much despite the fact that it allows for 8X fewer IO requests at the logical block tier. Once the write path is saturated, it kind of stays that way until pressure reduces.
-------- Original Message --------
From: kpande
@sempervictus did you ever try out the patch from #4203 which I just submitted as PR #5505?
I believe you are being bitten by kernel RMW reads; resolving those should drastically speed up writes. Check with zpool iostat -r to make sure that you have no, or very few, reads outstanding during write tests. It can help to raise the txg timeout and the dirty-data sync/max variables to make txg commits farther apart. Our experience is that large-block ZVOLs are the fastest for write-based workloads, since there is less overhead. ZFS only does RMW reads at txg commit time or with sub-blocksize indirect sync writes, and you can stretch that out to a longer interval with many workloads, which decreases RMW even more. We use 128k-1M ZVOLs for most applications; such ZVOLs never experience RMW except during txg commit because they never make sub-blocksize indirect sync writes.
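Those tunables are exposed as module parameters and can be adjusted at runtime; a sketch with illustrative values (note that zfs_dirty_data_sync was later superseded by a percentage-based tunable in newer releases):

```sh
# Space txg commits further apart so RMW during commit happens less often.
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout                               # seconds
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max   # bytes
echo $((1 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_sync  # bytes
# Watch request-size histograms during a write test; reads should stay near zero.
zpool iostat -r tank 5
```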
As we do more work on SSD-only pools, we're starting to see consistent slowdowns across the board on ZVOLs vs. normal datasets. Our use case is primarily iSCSI from ZVOLs, so this is a bit of a problem for us.
To prove this out, I wrote a simple test script for tiotest which creates zvols and datasets, benchmarks them, and destroys them:
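The script was roughly shaped like the following (pool name, sizes, and tiotest flags here are illustrative placeholders, not the original script):

```sh
#!/bin/sh
POOL=tank
BS=4096   # volblocksize / recordsize under test

# zvol: create, put a filesystem on it, benchmark, destroy.
zfs create -V 10G -o volblocksize=$BS $POOL/bench-zvol
mkfs.ext4 -q /dev/zvol/$POOL/bench-zvol
mkdir -p /mnt/bench-zvol
mount /dev/zvol/$POOL/bench-zvol /mnt/bench-zvol
tiotest -t 4 -f 1024 -b $BS -d /mnt/bench-zvol
umount /mnt/bench-zvol
zfs destroy $POOL/bench-zvol

# dataset: create with a matching recordsize, benchmark, destroy.
zfs create -o recordsize=$BS -o mountpoint=/mnt/bench-fs $POOL/bench-fs
tiotest -t 4 -f 1024 -b $BS -d /mnt/bench-fs
zfs destroy $POOL/bench-fs
```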
The results (http://pastebin.com/AfU5VYuG) are odd: a ~3X performance penalty on linear writes to ZVOLs, and somehow the inverse for random writes. In neither case does the underlying disk show full load in iostat. Thoughts?