
reading from mirror vdev limited by slowest device #1461

Closed
GregorKopka opened this issue May 16, 2013 · 45 comments
Labels: Type: Feature (feature request or new feature)

@GregorKopka
Contributor

Problem

Performance issue: reading from a mirror vdev is limited by the slowest device in the mirror.
ZFS seems to issue reads to mirror vdev members in a round-robin fashion.

Background info

In need of cheap, high random-read IOPS on ~1TB of data (even directly after reboot, so L2ARC doesn't help that much for this scenario - plus an L2ARC of this size would be limited by the maximum 32GB of RAM the system can handle), I had the idea to build a pool with hybrid mirror vdevs: one side an SSD to get the IOPS, the other side(s) cheap spindle(s) to gain redundancy in case one of the SSDs bites the dust.

Sadly this doesn't work:

Test

sda: spindle, sdb: ssd
/test/dump/* ~6GB in 20 files

~ $ zpool import test ; zpool status
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0

~ $ time cat /test/dump/* > /dev/null
real    0m35.928s
user    0m0.050s
sys     0m6.770s

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        6,08G   105G  1,43K      0   181M      0
  mirror    6,08G   105G  1,43K      0   181M      0
    sda         -      -    728      0  90,6M      0
    sdb         -      -    732      0  90,6M      0
----------  -----  -----  -----  -----  -----  -----

~ $ zpool offline test sdb
~ $ zpool export test ; zpool import test ; zpool status
        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     OFFLINE      0     0     0

~ $ time cat /test/dump/* > /dev/null
real    1m1.448s
user    0m0.040s
sys     0m6.340s

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        6,08G   105G    787      0  97,7M      0
  mirror    6,08G   105G    787      0  97,7M      0
    sda         -      -    787      0  97,7M      0
    sdb         -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

~ $ zpool online test sdb
~ $ zpool offline test sda
~ $ zpool export test ; zpool import test ; zpool status
        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     OFFLINE      0     0     0
            sdb     ONLINE       0     0     0

~ $ time cat /test/dump/* > /dev/null
real    0m12.332s
user    0m0.030s
sys     0m5.840s

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        6,08G   105G  3,98K      0   505M      0
  mirror    6,08G   105G  3,98K      0   505M      0
    sda         -      -      0      0      0      0
    sdb         -      -  3,98K      0   505M      0
----------  -----  -----  -----  -----  -----  -----

Solution?

Issue reads to non-busy vdev members, maybe depending on
(/sys/block/[dev]/device/queue_depth - /sys/block/[dev]/inflight)

@b333z
Contributor

b333z commented May 16, 2013

A starting point would be to take a look here:

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_mirror.c#L296 and https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_mirror.c#L219

I believe that is the code that selects which mirror device to read from.

@behlendorf
Contributor

I think this would be a nice optimization and vdev_mirror_child_select() is exactly the right function to change to support this. Currently it does a strict round robin after factoring in the DTL (dirty time log) which describes which blocks might not be fully replicated due to a previous failure.

This could be changed such that, after excluding all vdevs which appear in the DTL, the vdev with the smallest number of outstanding IOs in the vdev.vdev_queue->vq_pending_tree is selected. It should have the best service time, either because it's 1) just faster or 2) better behaved at the moment (perhaps because it's about to fail).
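For illustration only (this is not the actual patch), a minimal C sketch of that selection policy: after skipping unreadable children and children that appear in the DTL, pick the child with the fewest outstanding I/Os. The struct and function names are hypothetical; in the real code the pending count would come from the child's vdev_queue.

#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a mirror child.  In ZFS the pending count would
 * come from the child's vdev_queue and the skip conditions from
 * vdev_readable() and the DTL lookup.
 */
typedef struct child {
	int	readable;	/* 0 if the device cannot be read */
	int	in_dtl;		/* 1 if this copy may be stale or missing */
	unsigned pending;	/* outstanding I/Os queued to this child */
} child_t;

/* Return the index of the least-busy usable child, or -1 if none qualify. */
int
pick_least_pending(const child_t *children, size_t n)
{
	unsigned best_pending = UINT_MAX;
	int best = -1;

	for (size_t c = 0; c < n; c++) {
		if (!children[c].readable || children[c].in_dtl)
			continue;
		if (children[c].pending < best_pending) {
			best_pending = children[c].pending;
			best = (int)c;
		}
	}
	return (best);
}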

b333z added a commit to b333z/zfs that referenced this issue May 17, 2013
When selecting which vdev mirror child to read from,
redirect io's to child vdevs with the smallest pending
queue lengths.

During the vdev selection process if a readable vdev is found
with a queue length of 0 then the vdev will be used and the
above process will be cut short.

It is hoped that this will cause all available vdevs to be
utilised while ensuring the more capable devices take on
more of the workload.

closes openzfs#1461
@b333z
Contributor

b333z commented May 17, 2013

@GregorKopka are you able to re-perform your tests with SHA: 9ab95ff applied on a non-production system?

@GregorKopka
Contributor Author

After some fiddling I think I managed to compile it from your fork:

>>> Unpacking source...
GIT update -->
   repository:               git://github.com/b333z/zfs.git
   at the commit:            9ab95fffe1ff8c77b09ec74378c4254d2f36da3f
   commit:                   9ab95fffe1ff8c77b09ec74378c4254d2f36da3f
   branch:                   issue-1461
   storage directory:        "/usr/portage/distfiles/egit-src/zfs.git"
   checkout type:            bare repository

but it doesn't seem to work as intended; it looks like it favours the slow spinning disk (sda) instead of the SSD (sdb):

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G  1,21K      0   154M      0
  mirror    11,7G  99,3G  1,21K      0   154M      0
    sda         -      -    643      0  79,5M      0
    sdb         -      -    599      0  74,8M      0
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G  1,15K      0   146M      0
  mirror    11,7G  99,3G  1,15K      0   146M      0
    sda         -      -    648      0  80,2M      0
    sdb         -      -    528      0  65,7M      0
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G  1,25K      0   159M      0
  mirror    11,7G  99,3G  1,25K      0   159M      0
    sda         -      -    661      0  81,9M      0
    sdb         -      -    617      0  76,8M      0
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G  1,08K      0   138M      0
  mirror    11,7G  99,3G  1,08K      0   138M      0
    sda         -      -    640      0  79,4M      0
    sdb         -      -    469      0  58,3M      0
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G  1,27K      0   161M      0
  mirror    11,7G  99,3G  1,27K      0   161M      0
    sda         -      -    657      0  81,6M      0
    sdb         -      -    643      0  79,8M      0
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G  1,23K      0   157M      0
  mirror    11,7G  99,3G  1,23K      0   157M      0
    sda         -      -    667      0  82,9M      0
    sdb         -      -    595      0  73,7M      0
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G  1,18K      0   150M      0
  mirror    11,7G  99,3G  1,18K      0   150M      0
    sda         -      -    650      0  80,7M      0
    sdb         -      -    559      0  69,4M      0
----------  -----  -----  -----  -----  -----  -----
test        11,7G  99,3G    867      0   108M      0
  mirror    11,7G  99,3G    867      0   108M      0
    sda         -      -    498      0  61,5M      0
    sdb         -      -    369      0  46,1M      0
----------  -----  -----  -----  -----  -----  -----

@b333z
Contributor

b333z commented May 18, 2013

Interesting... Nice, gentoo?... So that was zfs-kmod you were compiling there? (rather than the zfs package, which is just the usermode part of zfs and which my patch does not affect?)

@GregorKopka
Contributor Author

It also favours the slower drive when I reorder the drives in the mirror:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,12K      0   143M      0
  mirror    10,5G   100G  1,12K      0   143M      0
    sdb         -      -    509      0  63,5M      0
    sda         -      -    643      0  79,6M      0
----------  -----  -----  -----  -----  -----  -----

Looking at iostat -x 10 while the test is running:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00  652,50    0,00 82943,60     0,00   254,23     8,26   12,65   12,65    0,00   1,50  98,20
sdb               0,00     0,00  530,20    0,00 67312,80     0,00   253,91     1,12    2,11    2,11    0,00   0,26  14,00

Maybe look at %util (in case that number can be obtained easily) instead of pending requests for the device, or another metric which might do the trick?

@GregorKopka
Contributor Author

Linux version 3.7.10-gentoo (root@backend) (gcc version 4.6.3 (Gentoo Hardened 4.6.3 p1.11, pie-0.5.2) ) #4 SMP Sat May 18 14:55:45 CEST 2013

Since the setup is booting from ZFS (usb dongle with root pool, which I can dd onto a blank one to play with and mess up without consequences) I had to compile the kernel (this pulls in spl and zfs-kmod) and then zfs, and rebooted afterwards to have it take effect (I hope).

@b333z
Contributor

b333z commented May 18, 2013

Done a few more tests here and it seems to test out ok, but I will do some more testing; in the meantime, can you confirm your loaded zfs module is from my branch?

Unsure of how your setup works, but are your zfs modules loaded early out of an initramfs? If so, have the modules in there been updated (do you use genkernel?) by regenerating your initramfs?

I have gentoo with zfs root but boot off a small md/ext2 /boot with a custom initramfs I hacked up back in the early days, so when I update ZFS I need to regen my initramfs... So I'm a bit unfamiliar with current gentoo/zfs/root practice to advise...

@GregorKopka
Contributor Author

Used https://github.com/ryao/zfs-overlay/blob/master/zfs-install as cookbook (without any overlay) and it works nicely.

And yes, I used genkernel to regenerate the initrd.

Linux for dummies: How to check if the module is built from the correct branch?

@b333z
Contributor

b333z commented May 18, 2013

Also you did not say if the emerge segment above was for zfs or zfs-kmod...

Are you using /etc/portage/env/sys-fs/zfs-kmod to set your git url and branch? (Not just the zfs file in the same directory - I like to symlink those so I just maintain one.)

I'm not sure if you can get the modified date from a loaded module...

@b333z
Contributor

b333z commented May 18, 2013

When you said compiling the kernel pulled in spl and zfs-kmod, was that because you did this as in your genkernel-like guide: --callback="emerge @module-rebuild"?
Anyhow, I'd better sleep. Don't go to too much trouble, it's probably an issue with the patch; I can drop some printf's in the patch tomorrow to ensure you're running the right module if we need...

@FransUrbo
Contributor

On May 18, 2013, at 5:45 PM, GregorKopka wrote:

Linux for dummies: How to check if the module is built from the correct branch?

I've been trying to push for this for quite some time now, but try this:

Modify your META file. There's a line there that starts with 'Release:'.

Modify that to read something intelligent (such as '1.1461fix'). Then
create the empty file '.nogitrelease' and rerun configure, make and
make rpm.

That way your rpm will contain a valid version entry, and also when
it's loaded, it will say 'version 0.6.1-1.1461fix', and then you'll know
for absolute certainty that it's the correct version.

Life sucks and then you die

@GregorKopka
Contributor Author

@FransUrbo: I use gentoo, no RPMs on my system

@b333z:

When you said compiling kernel pulled in spl and zfs-kmod was that because you did this on your genkernel like guide?: 
--callback="emerge @module-rebuild"

Exactly. I also hacked both the zfs-9999 and zfs-kmod-9999 ebuilds:

if [ ${PV} == "9999" ] ; then
        inherit git-2 linux-mod
        EGIT_BRANCH="issue-1461"
        EGIT_COMMIT="9ab95fffe1ff8c77b09ec74378c4254d2f36da3f"
        EGIT_REPO_URI="git://github.com/b333z/${PN}.git"
else

so the emerge triggered by module-rebuild should have pulled the right versions (the snippet I posted was from the output of the zfs-kmod emerge triggered through this mechanism).

In case you want further testing I'll be happy to help out with that; sadly C isn't my language, so I can't help with the coding itself.

@b333z
Contributor

b333z commented May 19, 2013

Thanks for the info @FransUrbo, that will help.

@GregorKopka if you're up for another go, I have added a commit to this branch that will allow us to confirm that you're running the correct module.

So a general procedure would be:

  1. Adjust the commit in zfs-kmod-9999.ebuild and zfs-9999.ebuild (you can leave the EGIT_COMMIT variable out and just use EGIT_BRANCH and it will get the latest commit, but if you're specifying it, it needs to move to the next commit):
    EGIT_COMMIT="cad5a75"
  2. Rebuild the zfs modules and the initramfs:
# emerge -v "=zfs-kmod-9999"
# genkernel --no-clean --no-mountboot --zfs --bootloader=grub2 initramfs 
  3. Reboot.
  4. Test that you are on the correct version. If you are running on the correct module you should have a line similar to this in your dmesg, with b333z-issue-1461 in it:
# dmesg | grep 1461
[   15.276594] ZFS: Loaded module v0.6.1-b333z-issue-1461  (DEBUG mode), ZFS pool version 5000, ZFS filesystem version 5
  5. If the above is ok, run your tests; otherwise let us know and I'm sure we can provide a set of steps to find where the update went wrong.

P.S.
I wrote a gist on a way to adjust source control values temporarily for your 9999 ebuilds, without changing the ebuilds: https://gist.github.com/b333z/5606656
I use this for my zfs test/dev box so this may be overkill for standard use (it's been a pleasantly long time since I jumped to any strange branches on any production machines), but it's a handy gentoo thing anyway...
There looks to be a newer way of doing this that I haven't cut over to yet; see: http://wiki.gentoo.org/wiki//etc/portage/env

@GregorKopka
Contributor Author

Recompiled.

dmesg | grep issue
[   12.147463] ZFS: Loaded module v0.6.1-b333z-issue-1461 , ZFS pool version 5000, ZFS filesystem version 5

Further testing revealed that when primarycache is disabled, the code behaves somewhat as intended.

~ $ zfs set primarycache=none /test/dump ; zpool export test ; zpool import test -d /dev
~ $ cat /test/dump/* > /dev/null & zpool iostat -v 10
[1] 6977
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,21K      0  54,7M      0
  mirror    10,5G   100G  1,21K      0  54,7M      0
    sdb         -      -    981      0  30,4M      0
    sda         -      -    259      0  26,2M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,22K      0  55,4M      0
  mirror    10,5G   100G  1,22K      0  55,4M      0
    sdb         -      -    650      0  29,5M      0
    sda         -      -    604      0  27,7M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,24K      0  55,9M      0
  mirror    10,5G   100G  1,24K      0  55,9M      0
    sdb         -      -   1004      0  30,9M      0
    sda         -      -    262      0  26,8M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,22K      0  55,0M      0
  mirror    10,5G   100G  1,22K      0  55,0M      0
    sdb         -      -    984      0  30,5M      0
    sda         -      -    261      0  26,4M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,22K      0  55,1M      0
  mirror    10,5G   100G  1,22K      0  55,1M      0
    sdb         -      -    831      0  29,5M      0
    sda         -      -    416      0  27,4M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,21K      0  54,7M      0
  mirror    10,5G   100G  1,21K      0  54,7M      0
    sdb         -      -    271      0  26,3M      0
    sda         -      -    968      0  30,2M      0
----------  -----  -----  -----  -----  -----  -----

But when primarycache is active (which I always tested with, hence the regular export/import) it doesn't, and favours (and is limited by) the spinning disk (sda).

~ $ zfs inherit primarycache /test/dump ; zpool export test ; zpool import test -d /dev
~ $ cat /test/dump/* > /dev/null & zpool iostat -v 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,20K      0   152M      0
  mirror    10,5G   100G  1,20K      0   152M      0
    sdb         -      -    605      0  74,9M      0
    sda         -      -    618      0  77,0M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,22K      0   155M      0
  mirror    10,5G   100G  1,22K      0   155M      0
    sdb         -      -    618      0  76,7M      0
    sda         -      -    633      0  78,7M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,09K      0   138M      0
  mirror    10,5G   100G  1,09K      0   138M      0
    sdb         -      -    529      0  65,9M      0
    sda         -      -    585      0  72,5M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,04K      0   132M      0
  mirror    10,5G   100G  1,04K      0   132M      0
    sdb         -      -    490      0  61,2M      0
    sda         -      -    576      0  71,2M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,13K      0   144M      0
  mirror    10,5G   100G  1,13K      0   144M      0
    sdb         -      -    556      0  69,0M      0
    sda         -      -    603      0  74,9M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,11K      0   141M      0
  mirror    10,5G   100G  1,11K      0   141M      0
    sdb         -      -    554      0  68,9M      0
    sda         -      -    579      0  71,8M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,09K      0   138M      0
  mirror    10,5G   100G  1,09K      0   138M      0
    sdb         -      -    466      0  57,9M      0
    sda         -      -    647      0  80,2M      0
----------  -----  -----  -----  -----  -----  -----

Hope this helps debugging it.

@GregorKopka
Contributor Author

some further observations:

~ $ zfs set primarycache=none test
~ $ zpool export test ; zpool import test -d /dev
~ $ cat /test/dump/* > /dev/null & zpool iostat -v 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,22K      0  55,1M      0
  mirror    10,5G   100G  1,22K      0  55,1M      0
    sdb         -      -    649      0  29,4M      0
    sda         -      -    599      0  27,6M      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,24K      0  56,0M      0
  mirror    10,5G   100G  1,24K      0  56,0M      0
    sdb         -      -   1006      0  31,0M      0
    sda         -      -    262      0  26,8M      0
----------  -----  -----  -----  -----  -----  -----

~ $ zpool offline test sda
~ $ zpool export test ; zpool import test -d /dev
~ $ cat /test/dump/* > /dev/null & zpool iostat -v 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,51K      0  68,2M      0
  mirror    10,5G   100G  1,51K      0  68,2M      0
    sdb         -      -  1,51K      0  70,4M      0
    sda         -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,49K      0  67,4M      0
  mirror    10,5G   100G  1,49K      0  67,4M      0
    sdb         -      -  1,49K      0  69,6M      0
    sda         -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

~ $ zpool online test sda; sleep 2; zpool offline test sdb
~ $ zpool export test ; zpool import test -d /dev
~ $ cat /test/dump/* > /dev/null & zpool iostat -v 10

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,20K      6  54,2M  10,9K
  mirror    10,5G   100G  1,20K      6  54,2M  10,9K
    sdb         -      -      0      0      0      0
    sda         -      -  1,20K      5  56,0M  30,4K
----------  -----  -----  -----  -----  -----  -----
test        10,5G   100G  1,23K      0  55,7M      0
  mirror    10,5G   100G  1,23K      0  55,7M      0
    sdb         -      -      0      0      0      0
    sda         -      -  1,23K      0  57,5M      0
----------  -----  -----  -----  -----  -----  -----

Either this isn't I/O bound, or my system is somehow slowing the SSD to the speed of a 5400RPM spindle.

Any ideas?

@b333z
Contributor

b333z commented May 19, 2013

1.5k iops from an ssd over 10 seconds does seem slow...

It's tricky to tell from here; dd might be better than cat in this regard,
as it will show how much data is being transferred in total, and I don't know
your working set size and available RAM. Also not sure if cat will transfer
single chars instead of blocks, causing your CPU to bottleneck... with dd you can
use bs=4k or similar... although for this sort of thing I suppose we need to
be scientific and use the right tools... Will probably need to do some
bonnie++ and/or iozone testing to really evaluate this patch's usefulness,
ensuring a working set that is large enough to bypass caches etc.

In regards to seeing more reads to sda: wondering if most of your working
set is cached in the ARC/Linux cache, causing it to only need to request the
odd block off disk. Then, when it's requesting the block and selecting the
mirror, the first dev sda has an empty request queue most of the time,
making it appear favoured (in the algorithm, if it finds a device with
nothing in its pending queue, it accepts that without checking the others),
where maybe it's flat out most of the time reading from ARC/RAM and/or hitting
CPU max because of cat...

@GregorKopka
Contributor Author

This is why I recreated the pool with sda and sdb swapped (so the SSD is the first drive in the mirror).
Results above: sda is still favoured with primarycache=all, while with =none the SSD is hit more.

Tests with dd if=/test/dump/file of=/dev/null bs=128k

with primarycache=all

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        19,0G  92,0G  1,26K      0   160M      0
  mirror    19,0G  92,0G  1,26K      0   160M      0
    sda         -      -    670      0  83,2M      0
    sdb         -      -    616      0  76,5M      0
----------  -----  -----  -----  -----  -----  -----
test        19,0G  92,0G  1,16K      0   148M      0
  mirror    19,0G  92,0G  1,16K      0   148M      0
    sda         -      -    639      0  79,1M      0
    sdb         -      -    553      0  68,9M      0
----------  -----  -----  -----  -----  -----  -----
test        19,0G  92,0G  1,37K      0   174M      0
  mirror    19,0G  92,0G  1,37K      0   174M      0
    sda         -      -    693      0  86,1M      0
    sdb         -      -    707      0  87,7M      0
----------  -----  -----  -----  -----  -----  -----
test        19,0G  92,0G  1,35K      0   172M      0
  mirror    19,0G  92,0G  1,35K      0   172M      0
    sda         -      -    680      0  84,5M      0
    sdb         -      -    707      0  87,7M      0

with primarycache=metadata

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        19,0G  92,0G    777      0  96,4M      0
  mirror    19,0G  92,0G    777      0  96,4M      0
    sda         -      -    386      0  48,3M      0
    sdb         -      -    391      0  48,2M      0
----------  -----  -----  -----  -----  -----  -----
test        19,0G  92,0G    755      0  93,7M      0
  mirror    19,0G  92,0G    755      0  93,7M      0
    sda         -      -    378      0  46,8M      0
    sdb         -      -    377      0  46,9M      0
----------  -----  -----  -----  -----  -----  -----
test        19,0G  92,0G    731      0  90,7M      0
  mirror    19,0G  92,0G    731      0  90,7M      0
    sda         -      -    368      0  45,4M      0
    sdb         -      -    363      0  45,4M      0
----------  -----  -----  -----  -----  -----  -----

with primarycache=none
the effect is visible, but strange things happen:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,27K      0  44,9M      0
    sda         -      -    450      0  22,6M      0
    sdb         -      -    852      0  24,5M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,28K      0  45,0M      0
    sda         -      -    490      0  23,1M      0
    sdb         -      -    816      0  24,2M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,27K      0  44,7M      0
    sda         -      -    485      0  22,7M      0
    sdb         -      -    811      0  24,1M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,28K      0  45,2M      0
    sda         -      -    262      0  21,3M      0
    sdb         -      -  1,02K      0  26,1M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,30K      0  45,8M      0
    sda         -      -    306      0  21,9M      0
    sdb         -      -   1022      0  26,2M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,30K      0  45,8M      0
    sda         -      -    499      0  23,4M      0
    sdb         -      -    832      0  24,8M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,30K      0  45,8M      0
    sda         -      -    575      0  24,0M      0
    sdb         -      -    755      0  24,1M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,29K      0  45,5M      0
    sda         -      -    813      0  25,8M      0
    sdb         -      -    507      0  22,0M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,25K      0  43,9M      0
    sda         -      -    784      0  24,8M      0
    sdb         -      -    491      0  21,3M      0
----------  -----  -----  -----  -----  -----  -----
  mirror    19,0G  92,0G  1,28K      0  45,3M      0
    sda         -      -    821      0  25,6M      0
    sdb         -      -    493      0  21,9M      0
----------  -----  -----  -----  -----  -----  -----

Interesting though is that:

  1. the setting of primarycache massively changes the outcome
  2. with primarycache=none, bandwidth is (edit:) nearly identical for both devices, even though IOPS differ massively

@b333z
Contributor

b333z commented May 24, 2013

I've done some more work on this, took me a while to get some systemtap scripts together and then it failed me ( anyone know why the variable I'm after is always=?, I believe I have optimizations off etc... )

Anyhow reverted to a heap of printk's and have a better idea what is going on.

@GregorKopka If you're still set up for some testing, would you like to have another go with the latest commit (SHA: 4bb4110)? I am wondering how large your test file is (/test/dump/file)? How much RAM do you have? Unfortunately primarycache=all gives me no io at all as, this being a test box, my pool is only 1GB, so any test file I create fits in ARC.

I have added a tunable so you need to enable the balancing by running:

echo 1 > /sys/module/zfs/parameters/zfs_vdev_mirror_pending_balance

To turn off the balancing:

echo 0 > /sys/module/zfs/parameters/zfs_vdev_mirror_pending_balance

Here are my latest results:

With primarycache=metadata:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1        301M   707M  2.30K      0   295M      0
  mirror     301M   707M  2.30K      0   295M      0
    vda1        -      -    694      0  86.8M      0
    ram0        -      -  1.62K      0   208M      0

With primarycache=none:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1        301M   707M  1.86K      0   192M      0
  mirror     301M   707M  1.86K      0   192M      0
    vda1        -      -    723      0  56.7M      0
    ram0        -      -  1.15K      0   135M      0

@GregorKopka
Contributor Author

The only thing in my test pool is the dump filesystem, with the file created by something along the lines of

dd if=/dev/urandom of=/test/dump/a bs=1M count=512
for i in {1..10} ; do cat /test/dump/a >> /test/dump/file ; done

RAM is 8GB, but since I issue

zpool export test; zpool import test -d /dev

prior to each run the ARC is empty at the start of the test (sans what is pulled in by the import).

I'll run tests with the new commit shortly.

@b333z
Contributor

b333z commented May 24, 2013

Well done - I was looking for a way to drop the ARC.

This is with primarycache=all: (not sure how accurate, as it was difficult to catch)

               capacity     operations    bandwidth  
pool        alloc   free   read  write   read  write 
----------  -----  -----  -----  -----  -----  ----- 
tank1        301M   707M  1.11K      0   140M      0 
  mirror     301M   707M  1.11K      0   140M      0 
    ram0        -      -    939      6   117M  13.5K 
    vdb1        -      -    194      0  23.8M      0 

After thinking about this for a while, I believe that balancing mirror reads from the pending io queue is a good way to get the more capable devices to do more work and should provide a benefit in general use cases.

But in your particular use case, where one of the mirror devices has significantly better throughput, seek time and latency characteristics than the other mirror devices, balancing off the read queue will improve throughput etc. over the usual round robin. However, I don't think it will provide the throughput of sending all the I/Os to the much faster device. I believe that because the spinning disk is much slower, even when balancing, waiting for the I/Os from the slow disk will slow down the overall process of reading the data.

I think that the algorithm you initially suggested is of the most use to the community in general, but for your case I believe the best throughput would be obtained by always selecting the SSD for reads unless a read error or checksum failure occurred (or during a scrub), in which case the slow device would be used.

For example in my tests with 1 x disk and 1 x ram device:

  1. Normal round robin: 80MB/s+80MB/s = 160MB/sec
  2. Balancing off pending: 80MB/s + 200MB/s = 280MB/s
  3. With slow vdevs offlined: 0MB/s + 400MB/s = 400MB/s

I'm sure this would be achievable, but the tricky part would be identifying when this is the case. I'm guessing there might be some disk info that could be queried, e.g. a rotational flag somewhere; otherwise stats could be collected in order to identify the case. Once identified, the part that selects the preferred mirror at the start of an I/O could select the fast device rather than: mm->mm_preferred = spa_get_random(c); Anyhow, it's probably best to look at getting your currently proposed algorithm in first, I suppose, and look at that later.
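As a rough sketch of that "always read from the fast device unless there is a problem" idea (assuming a per-child rotational flag is available, as the Linux block layer exposes for devices; the names below are hypothetical, not the ZFS API):

#include <stddef.h>

typedef struct mchild {
	int	readable;
	int	nonrot;		/* 1 for SSD / non-rotational media (assumed flag) */
	unsigned pending;	/* outstanding I/Os queued to this child */
} mchild_t;

/*
 * Prefer a readable non-rotational child for normal reads; fall back to the
 * least-busy remaining child when no SSD is usable or when every copy has to
 * be read anyway (e.g. scrub or a checksum retry).
 */
int
pick_read_child(const mchild_t *ch, size_t n, int must_read_all)
{
	unsigned best_pending = (unsigned)-1;
	int best = -1;

	for (size_t c = 0; c < n; c++) {
		if (!ch[c].readable)
			continue;
		if (!must_read_all && ch[c].nonrot)
			return ((int)c);		/* fast path: take the SSD */
		if (ch[c].pending < best_pending) {	/* otherwise least busy */
			best_pending = ch[c].pending;
			best = (int)c;
		}
	}
	return (best);
}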

@GregorKopka
Contributor Author

@b333z:
Sadly I had to put all the spare SSDs I had at hand to work to speed up a pool reshape, so I'll be unable to run the tests till they revert back to idle - I guess you can expect some results on Sunday.

I'm guessing there might be some disk info that could be queried eg a rotational flag somewhere,
otherwise stats could be collected in order to identify the case.

At first glance querying the rotational flag of a device sounds like a good idea, but after thinking a bit and looking around I found some code which looks to me to be a better starting point (since non-rotating devices could well be slower than spinning ones):

To test for timeouts, vdev_deadman() (https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev.c#L3196) peeks at the timestamp of the first queued request of each vdev, creating a delta to the current time (at https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev.c#L3221).

Thinking about it for a while, my feeling is that taking this information into consideration could help in selecting the faster device. Pseudocode here: b333z#1

Notes: Please watch the comment on the pull request!
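A rough sketch of that idea (hypothetical names; in ZFS the age would come from the timestamp of the oldest entry in the child's pending queue, the same value vdev_deadman() inspects): weight each child by its queue length plus how long its oldest request has been waiting, and read from the child with the lowest score.

#include <stdint.h>
#include <stddef.h>

typedef struct qchild {
	int	readable;
	unsigned pending;		/* queued I/Os */
	uint64_t oldest_issue_ns;	/* issue time of oldest pending I/O, 0 if idle */
} qchild_t;

/*
 * Score = queue length + age of the oldest outstanding request, scaled by a
 * tunable (age_weight_ns must be > 0).  A device whose oldest request has
 * been waiting a long time is treated as "loaded" even if its queue is
 * short, which is roughly what the deadman-style timestamp delta gives us.
 */
int
pick_by_queue_and_age(const qchild_t *ch, size_t n, uint64_t now_ns,
    uint64_t age_weight_ns)
{
	uint64_t best_score = UINT64_MAX;
	int best = -1;

	for (size_t c = 0; c < n; c++) {
		uint64_t age, score;

		if (!ch[c].readable)
			continue;
		age = (ch[c].oldest_issue_ns != 0) ?
		    now_ns - ch[c].oldest_issue_ns : 0;
		score = (uint64_t)ch[c].pending + age / age_weight_ns;
		if (score < best_score) {
			best_score = score;
			best = (int)c;
		}
	}
	return (best);
}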

@chrisrd
Contributor

chrisrd commented May 25, 2013

@b333z "anyone know why the variable I'm after is always=?, I believe I have optimizations off etc..."

Can you provide an example of what's not working? FYI, I dug this (previously) working example out of one of my old stp scripts:

function trace(entry_p, extra) {
  printf("%s%s%s %s\n",
         thread_indent (entry_p),
         (entry_p>0?"->":"<-"),
         probefunc (),
         extra)
}

probe module("znvpair").function("*").call    { trace(1, $$parms) }
probe module("znvpair").function("*").return  { trace(-1, $$return) }

probe module("znvpair").statement("*@nvpair/nvpair.c:3269") {
  println(pp());
  println("  ",$$vars);
  println("  *size=",kernel_long($size));
  println("  bytesrec=", $bytesrec$$);
  println("  xdr=", $xdr$$);
}

@b333z
Contributor

b333z commented May 25, 2013

@chrisrd here is my script; this seems to be something I want to do fairly often: just dump out the state of the locals for all lines in a function:

#! /usr/bin/env stap

probe module("zfs").statement("vdev_mirror_child_select@vdev_mirror.c:*") {
  printf("%s - %s\n",pp(), $$vars)
}

probe module("zfs").function("vdev_pending_queued").return {
  printf("vdev_pending_queued.return - %s\n", $$return)
}

probe module("zfs").function("vdev_mirror_child_select").return {
  printf("vdev_mirror_child_select.return - %s\n", $$return)
}

But since I added the printk's, the variables seem to be resolving now... So I'm thinking that now that the variables are being referenced more, the optimizer can't trash them?... I have tried to remove optimizations using -O0, so am compiling with:

CFLAGS="-march=amdfam10 -O0 -ggdb -pipe"
CXXFLAGS="-march=amdfam10 -O0 -ggdb -pipe"

I might repeat against vanilla zfs master and see if I can replicate the issue.

Also printing pp() causes a massive line:

module("zfs").statement("vdev_mirror_child_select@/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs/../../module/zfs/vdev_mirror.c:262")

Do you know of any way to just output the file name and line?:

vdev_mirror.c:262

@chrisrd
Contributor

chrisrd commented May 26, 2013

@b333z, Your script looks good to me. Sorry, I don't have any better ideas for you, other than avoiding the optimisation as you say. I was just post-processing the output to remove the file name noise. Mind you, I'm no systemtap guru, I first used it in relation to #503. I did find the systemtap mailing list to be very helpful in working out my stap issue at that time.

b333z added a commit to b333z/zfs that referenced this issue May 30, 2013
behlendorf added a commit to behlendorf/zfs that referenced this issue May 31, 2013
The read bandwidth of an N-way mirror can be increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.

The existing algorithm selects a preferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror.  It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them.  In practice this results in the
leaf vdevs being under utilized.

Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO.  This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror.  Faster vdevs will be sent more work and
the mirror performance will not be limited by the slowest drive.

In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching strategy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T milliseconds.  Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.

The testing results show a significant performance improvement
for all four workloads tested.  The workloads were generated
using the fio benchmark and are as follows.

1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).

               | Pristine              |  With 1461             |
               | Sequential  Random    |  Sequential  Random    |
               | 1MB  4KB    1MB  4KB  |  1MB  4KB    1MB  4KB  |
               | MB/s MB/s   IO/s IO/s |  MB/s MB/s   IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped      | 226  243     11  304  |  222  255     11  299  |
2 2-Way Mirror | 302  324     16  534  |  433  448     23  571  |
2 3-Way Mirror | 429  458     24  714  |  648  648     41  808  |
2 4-Way Mirror | 562  601     36  849  |  816  828     82  926  |

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1461
@behlendorf
Contributor

@b333z, @GregorKopka You guys' recent activity on this issue inspired me to take a closer look at the mirror code. For a long time I've had this feeling that we've been leaving a lot of performance on the table... it turns out we were. I started with your initial patch @b333z and improved things from there; I'd love it if you could take a look. I've been able to increase sequential read bandwidth, even when just using hard drives, by 50% and IOPs by 10%.

The patch I've proposed improves the mirror vdev selection to make better decisions about which vdev to use. How this is done is described in the patch in detail and I've posted my testing results below. The patch is ready for wider scale testing and I'd love to see how much it improves real world workloads.

[image: chart of the fio benchmark results for the proposed patch]

@GregorKopka This patch also addresses your use case of pairing an SSD with an HDD, but more importantly it handles the general case where the drives in your mirror are performing slightly differently, whether due to age, defects, or just non-identical hardware.
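Not the actual patch, but a minimal C sketch of the policy the commit message above describes: pick the child with the smallest pending queue, and when several children are tied, keep batching requests to one "preferred" child for a short window so adjacent requests land on the same disk and can be merged by the elevator. The names, window length, and tie-breaking below are simplified assumptions.

#include <stdint.h>
#include <stddef.h>

typedef struct bchild {
	int	readable;
	unsigned pending;	/* outstanding I/Os queued to this child */
} bchild_t;

typedef struct mirror_state {
	int	 preferred;	/* child we keep batching to while the window lasts */
	uint64_t window_end_ns;	/* end of the current batching window */
} mirror_state_t;

/* Roughly zfs_vdev_mirror_switch_ms worth of batching (10 ms here). */
#define SWITCH_WINDOW_NS	(10ULL * 1000 * 1000)

int
pick_batched(mirror_state_t *st, const bchild_t *ch, size_t n, uint64_t now_ns)
{
	unsigned best_pending = (unsigned)-1;
	int best = -1, ties = 0;

	for (size_t c = 0; c < n; c++) {
		if (!ch[c].readable)
			continue;
		if (ch[c].pending < best_pending) {
			best_pending = ch[c].pending;
			best = (int)c;
			ties = 1;
		} else if (ch[c].pending == best_pending) {
			ties++;
		}
	}

	/* A single clear winner: use it and restart the batching window. */
	if (best < 0 || ties == 1) {
		st->preferred = best;
		st->window_end_ns = now_ns + SWITCH_WINDOW_NS;
		return (best);
	}

	/*
	 * Several children are equally busy: keep sending requests to the
	 * current preferred child until the window expires, so nearby requests
	 * hit the same disk and can be merged.  (The real patch picks the new
	 * preferred child with equal probability among the tied set; taking
	 * the first tied child here is a simplification.)
	 */
	if (st->preferred < 0 || now_ns >= st->window_end_ns ||
	    !ch[st->preferred].readable) {
		st->preferred = best;
		st->window_end_ns = now_ns + SWITCH_WINDOW_NS;
	}
	return (st->preferred);
}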

@chrisrd
Contributor

chrisrd commented May 31, 2013

Whoa! Impressive work, from all of you. It makes me wish I had a workload / dataset suited to mirroring!

@GregorKopka
Contributor Author

@behlendorf: We started with something similar (though not setting a preferred device to enable merges inside the kernel scheduler) by looking only at the queue length, but it turned out that this led to stalls on requests to the SSD side, since (I guess) zfs didn't schedule new reads before the HDD side returned certain data. After adding the request time into the equation, the average response time for read requests went down, leading to way more requests being added to the SSD.

I would be interested in how your patch compares to the one Andrew and I put together. Could you please run the tests you did against our code (https://github.com/b333z/zfs/tree/issue-1461), or send me the nice Excel sheet (so I don't have to recreate it from scratch) and I'll recreate your test on my system?

@b333z
Contributor

b333z commented Jun 1, 2013

@behlendorf: Getting a very nice increase in ops and throughput with your patch even using 2 drives of the same speed - well done, a great outcome for all! Giving timed bursts to the vdevs when queues are empty is a great idea...

With Gregor's case I'm seeing an effect where the benefits of balancing across the devices are counteracted by having to wait for the vastly slower device's I/Os to complete.

In the patch that Gregor mentions we've managed to get half way to what you would get if you sent I/Os only to the fast device (the best result), by skewing off the avg. I/O completion times, but it does make me wonder if for this particular case there could be a flag in the vdev tree or similar that says always read from this device unless there is a problem...

@behlendorf
Contributor

I would be interested in how your patch compares to the one Andrew and I put together

Me too, so here are the results from the same four workloads on the system which did the original testing just for completeness. Note this pool configuration is all spinning disk, you'll notice there isn't any significant improvement over the pristine master source.

     Seq (MB/s)  Rand (IO/s)
     4M   4K     4M   4K
----------------------------
2x4  582  505     11  295
2x3  536  513     22  568
2x2  411  434     33  779

@GregorKopka you're welcome to the spreadsheet as well if you like; I would have attached it if GitHub allowed that sort of thing.

Giving timed bursts to the vdevs when queues are empty is a great idea.

@b333z Thank you, but I think you may slightly misunderstand why it helps improve things. The real benefit of batching the requests to the vdev occurs when the queues are full, not empty. We expect the common case to be that all the pending queues are full with 10 outstanding requests. When this occurs it's optimal to batch incoming requests to a single vdev for a short period to maximize merging. Every merged IO is virtually a free IO, and requests which come in at close to the same time are the most likely to merge.

Now as for @GregorKopka's case, the patch helps there because the SSD device in the mirror will consume its IOs fastest, causing its pending queue to more frequently drop below 10 requests. When this happens we prefer the SSD and give it a burst of requests to handle. An average request time could be used for this too, but relying on the pending tree size is nice and simple.

This should be pretty much optimal from an aggregate bandwidth standpoint. But as you point out, the story is a little bit different if you are more concerned about latency. Since the hard drive is still contributing, you still see individual IOs with a longer worst case latency, although the total IO/s the pool can sustain will be much higher. There's nothing really which can be done about this short of adding a tunable which says always and only read from the lowest latency devices in the mirror. That's easy enough to add, but not exactly a common case.

Just for reference, here's the patch I've proposed running against an ssd+hdd pool. Because it sounds like you're most interested in IO/s, I've run fio using 16 threads and a 4k random read workload.

$ zpool status -v
    NAME        STATE     READ WRITE CKSUM
    hybrid      ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        ssd-1   ONLINE       0     0     0
        hdd-1   ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        ssd-2   ONLINE       0     0     0
        hdd-2   ONLINE       0     0     0
      mirror-2  ONLINE       0     0     0
        ssd-3   ONLINE       0     0     0
        hdd-3   ONLINE       0     0     0
      mirror-3  ONLINE       0     0     0
        ssd-4   ONLINE       0     0     0
        hdd-4   ONLINE       0     0     0

$ iostat -mxz 5
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hdd-1               0.00     0.00  156.80    0.00    19.60     0.00   256.00     2.52   16.09   6.37  99.88
hdd-2               0.00     0.00  149.80    0.00    18.73     0.00   256.00     2.51   16.78   6.66  99.80
hdd-3               0.00     0.00  150.40    0.00    18.80     0.00   256.00     2.52   16.74   6.63  99.72
hdd-4               0.00     0.00  154.20    0.00    19.28     0.00   256.00     2.46   15.98   6.47  99.70
ssd-1               0.00     0.00 1250.80    0.00   156.35     0.00   256.00     1.32    1.06   0.62  77.70
ssd-2               0.00     0.00 1237.20    0.00   154.65     0.00   256.00     1.31    1.06   0.62  77.16
ssd-3               0.00     0.00 1243.40    0.00   155.42     0.00   256.00     1.33    1.07   0.62  77.68
ssd-4               0.00     0.00 1227.60    0.00   153.45     0.00   256.00     1.29    1.05   0.63  77.36

$ fio rnd-read-4k
read: (groupid=0, jobs=16): err= 0: pid=13087
  read : io=1324.3MB, bw=22594KB/s, iops=5648 , runt= 60018msec
    clat (usec): min=6 , max=276720 , avg=2762.90, stdev=1493.94
     lat (usec): min=6 , max=276720 , avg=2763.11, stdev=1493.94
    bw (KB/s) : min=  109, max= 2743, per=6.40%, avg=1446.40, stdev=67.99
  cpu          : usr=0.16%, sys=1.37%, ctx=336494, majf=0, minf=432
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,>=64=0.0%
     issued r/w/d: total=339009/0/0, short=0/0/0
     lat (usec): 10=0.53%, 20=2.81%, 50=0.02%, 100=0.01%, 750=0.01%
     lat (usec): 1000=33.99%
     lat (msec): 2=50.03%, 4=1.93%, 10=2.75%, 20=5.09%, 50=2.69%
     lat (msec): 100=0.15%, 250=0.01%, 500=0.01%

Run status group 0 (all jobs):
   READ: io=1324.3MB, aggrb=22593KB/s, minb=23136KB/s, maxb=23136KB/s,
mint=60018msec, maxt=60018msec

From the above results you can quickly see a few things.

  1. As expected the hard drives are nearly 100% utilized and limited at roughly 150 IO/s. The ssds are not quite maxed out but they are handling the vast majority of the IO/s at roughly 1200+ operations per second. Some additional tuning is probably possible to increase that utilization but that's not too shabby.

  2. Sustained IO/s for the pool is a respectable 5648, but as expected you have a bimodal distribution for the latency. Some I/Os are very fast when serviced by the SSD, others are slow when the HDD is used; for this case the worst case latency was a quite large 276ms, with an average of 2.7ms. We could probably do a bit more to bound the worst case latencies, but I think this is a good start.

@b333z
Contributor

b333z commented Jun 1, 2013

Thanks for the clarification @behlendorf that makes sense.

Those results look good; the SSDs are breaking free nicely over the usual round robin. I've been experimenting with a number of testing tools but haven't come across fio in my travels - will have to give it a go!

@GregorKopka
Contributor Author

@behlendorf: These numbers don't make sense to me, since the more disks there are, the fewer IO/s are reported:

     Seq (MB/s)  Rand (IO/s)
     4M   4K     4M   4K
----------------------------
2x4  582  505     11  295
2x3  536  513     22  568
2x2  411  434     33  779

Is something wrong with the numbers?

Also, since you ran your test with 1M and these with 4M, they sadly are not directly comparable; could you please rerun your original test?

What settings did you create the pools with (ashift, atime, primarycache)?

@behlendorf
Contributor

@GregorKopka Sorry, I messed up formatting that initial chart. Here's the fixed version. The numbers were for 1M not 4M I/Os and I just had the numbers accidentally out of order. Also be aware the 4K Random 2x4 number (529) isn't a typo, that's what the test turned in... but I don't believe it, and think it's a fluke. I just didn't get a chance to run the test multiple times and see.

For commit ac0df7e

     Seq (MB/s)  Rand (IO/s)
     1M   4K     1M   4K
----------------------------
2x1  221  270    11  295
2x2  411  434    22  568
2x3  536  513    33  779
2x4  582  505    49  529

As for settings, the pool was created with all defaults. Note that if you test the patch with a large pool you're going to need to make sure to use multiple threads. One process probably won't be able to saturate the disks.

  • ashift = 9
  • atime = on
  • primarycache = all
  • recordsize = 128k

@b333z Fio is pretty nice because it's fairly flexible and you can usually easily generate the workload you're interested in. Plus it logs some useful statistics about the run. But there are certainly other good tools out there as well, and of course nothing tests like a real world workload.

@behlendorf
Contributor

@GregorKopka Based on your testing, does the patch in #1487 address your original concerns about the performance of a mirror being limited by the slowest device? Unless there are serious concerns with the patch, I'm going to merge it for 0.6.2.

@GregorKopka
Contributor Author

@behlendorf While your approach isn't that optimal for my specific use case (since the window where the HDD receives all requests leads to the SSD falling idle, at least with the default 10ms), I haven't seen ill side effects on standard HDD mirrors.

So I'm ok with merging it; I'll fiddle with the tunable when I have some time and spare hardware (which I sadly both lack at the moment) to test some more.

One question though: what is the point of the window (at least for random I/O) when the Linux elevator is effectively disabled for pool vdev members (since zfs sets the noop scheduler)? Maybe there is something in ZFS merging requests so this makes sense, because otherwise I have no clue why your code is faster in the general case compared to what b333z and I came up with.

@behlendorf
Contributor

@GregorKopka Even in your case I'd expect it to bias quite heavily toward your local SSD. Maybe not enough to 100% saturate it, but still substantially better than the existing policy.

I also agree that it would be best if the code could automatically set a good value for zfs_vdev_mirror_switch_ms perhaps based on the observed latency of the devices. A 10ms interval is good for HDDs but a much smaller value would be preferable for SSDs. I suspect calculating a decaying average of the last observed latency over all the mirror members would work quite well. Adopting something similar to what you and @b333z were proposing would be helpful here. But for now the tunable allows people to set a good value for their pool, and there's nothing preventing us from making a follow up patch to make this more intelligent.
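A sketch of the kind of decaying average that could drive such an automatic setting; the update rule, weighting factor and names are assumptions for illustration, not anything in the proposed patch.

#include <stdint.h>

/*
 * Exponentially decaying average of observed read latency, in nanoseconds.
 * Each new sample contributes 1/weight of the average (weight must be > 0),
 * so old observations fade out without having to store a history.
 */
typedef struct lat_avg {
	uint64_t avg_ns;
	unsigned weight;	/* e.g. 8 */
} lat_avg_t;

void
lat_avg_update(lat_avg_t *la, uint64_t sample_ns)
{
	if (la->avg_ns == 0) {
		la->avg_ns = sample_ns;		/* first observation */
		return;
	}
	if (sample_ns >= la->avg_ns)
		la->avg_ns += (sample_ns - la->avg_ns) / la->weight;
	else
		la->avg_ns -= (la->avg_ns - sample_ns) / la->weight;
}

/*
 * One way the batching window could be derived automatically: track a
 * decaying average per mirror member and size the window from the slowest
 * member, instead of using a fixed zfs_vdev_mirror_switch_ms.
 */
uint64_t
suggested_switch_window_ns(const lat_avg_t *members, int n)
{
	uint64_t worst = 0;

	for (int i = 0; i < n; i++)
		if (members[i].avg_ns > worst)
			worst = members[i].avg_ns;
	return (worst);
}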

As for the Linux elevator, setting noop doesn't entirely disable it. It will still perform front and back merging with the existing requests in the queue. And because ZFS will attempt to issue its requests in LBA order, if those requests are sent to a single vdev there's a decent chance we'll see some merging occur. For totally random IO that clearly breaks down.

Longer term it would be desirable to allow ZFS to have a larger than 128k block size. This would allow us to easily merge larger requests for the disk. Unfortunately, that work depends on us being able to handle 1MB buffers, so the zio scatter-gather work must be done first, and then 1MB block support added via a feature flag.

@Rudd-O
Contributor

Rudd-O commented Jul 5, 2013

This is better than a cure for cancer. Every time I watch you guys improve ZFS, it's like magic.

@stevenh
Contributor

stevenh commented Aug 15, 2013

Hi guys, thanks for all the work on this. We've been evaluating the performance of this under FreeBSD and have noticed an issue: in the original, a good portion of sequential operations are directed to the same disk, which for HDs is important in minimising the latency introduced by seek times.

Based on how FreeBSD's gmirror works when configured in load-balanced mode, I've made some changes which in my tests produce significantly higher throughput, so I would like your feedback.

== Setup ==
3 Way Mirror with 2 x HD's and 1 x SSD

=== Prefetch Disabled ===
==== Load Balanced (locality) ====
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
Read 30720MB using bs: 1048576, readers: 6, took 77 seconds @ 398 MB/s
Read 46080MB using bs: 1048576, readers: 9, took 89 seconds @ 517 MB/s

==== Stripe Balanced (default) ====
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s

==== Load Balanced (zfsonlinux) ====
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s

=== Prefetch Enabled ===
==== Load Balanced (locality) ====
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

==== Stripe Balanced (default) ====
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s

==== Load Balanced (zfsonlinux) ====
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s

== Setup ==
2 Way Mirror with 2 x HD's

=== Prefetch Disabled ===
==== Load Balanced (locality) ====
Read 10240MB using bs: 1048576, readers: 2, took 131 seconds @ 78 MB/s

==== Stripe Balanced (default) ====
Read 10240MB using bs: 1048576, readers: 2, took 160 seconds @ 64 MB/s

==== Load Balanced (zfsonlinux) ====
Read 10240MB using bs: 1048576, readers: 2, took 207 seconds @ 49 MB/s

=== Prefetch Enabled ===
==== Load Balanced (locality) ====
Read 10240MB using bs: 1048576, readers: 2, took 85 seconds @ 120 MB/s

==== Stripe Balanced (default) ====
Read 10240MB using bs: 1048576, readers: 2, took 109 seconds @ 93 MB/s

==== Load Balanced (zfsonlinux) ====
Read 10240MB using bs: 1048576, readers: 2, took 94 seconds @ 108 MB/s

In the above, zfsonlinux refers to the patch applied to the FreeBSD sources, not to a different OS, so we are comparing apples to apples ;-)
The current version of the patch can be found here:
http://blog.multiplay.co.uk/dropzone/freebsd/zfs-mirror-load.patch

@GregorKopka
Contributor Author

This is a huge improvement for several reasons:

  • nice overall performance improvement which seems to scale better than linearly with more disks (since you group by locality, which works better the more disks there are, especially for multiple readers)
  • no performance degradation in 'Prefetch Disabled' tests compared to stock

@stevenh:

 As these three methods are only used for load calculations we're not precious
 if we get an incorrect value on 32bit platforms due to lack of vq_lock mutex
 uses here. Instead we prefer to keep it lock free for the performance. 

This removes one lock cycle per disk per request, which is huge - since the lock only protects a single ulong_t read. Nice find!

If you would add a tunable for the non-rotational bonus instead of using vfs.zfs.vdev.mirror_locality_bonus for it (and using a fraction of that for the actual locality bonus), tuning of hybrid mirrors would be possible to the point of all reads being served by the SSD side, so the full IOPS of the HDD would be available for writes...

Is it possible for you to convert the patch into a pull request against current master?

@behlendorf
Contributor

@stevenh I think adding locality into the calculation is a really nice improvement. It would be great to refresh a version of the patch against ZoL so we can get some numbers for Linux. You may have struck upon a better way to do this, but the only way to be sure is to test!

You didn't say explicitly in your original comment what the workload was. Am I correct to assume it was fully sequential reads from N different files? That's exactly the workload I'd expect your changes to excel at, which is what we see - nicely done!

But before declaring something better or worse we should test a variety of workloads. We don't necessarily want to improve one use case if it penalizes another. At a minimum I usually try to get data for N threads (1, 2, 4, etc.) running 4K/1M sequential and fully random workloads.

As you know, the original patch, 556011d, was designed to take advantage of the merging performed by the Linux block device elevator. Getting better locality should improve that further. Correct me if I'm wrong, but I believe FreeBSD (like Illumos) doesn't perform any additional merging after issuing the request to the block layer, so I'm not too surprised about how the ZoL patch performed under FreeBSD. I'd love to test both patches under Linux and see!

Thank you for continuing to work on this! I suspect there are still improvements which can be made here!

@stevenh
Contributor

stevenh commented Aug 15, 2013

Thanks for the feedback @GregorKopka, I'm still playing with this patch based on comments on the FreeBSD zfs list, recent changes include:

  • Moving to "seek increments" instead of a "locality bonus", as it makes it easier to understand (see the sketch below).
  • Adding back in an optimised version of the vdev selection randomisation for the case where multiple vdevs have the same minimum load.
  • Added the additional configuration options, which should allow you to tune for SSD-only reads and other combinations too.
  • Removed compatibility option to use the old algorithm.
  • Optimised loops

Link updated with new version: http://blog.multiplay.co.uk/dropzone/freebsd/zfs-mirror-load.patch
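For readers who cannot apply the FreeBSD patch directly, here is a rough, hypothetical C sketch of a "seek increment" style load calculation as described above: each child's load is its pending queue length plus a penalty that depends on whether it is rotational and how far the new offset is from the last I/O queued to it. The constants and names are illustrative, not the patch's actual tunables.

#include <stdint.h>
#include <stddef.h>

typedef struct lchild {
	int	readable;
	int	rotational;	/* 1 for spinning media */
	unsigned pending;	/* outstanding I/Os queued to this child */
	uint64_t last_offset;	/* offset of the last I/O queued to this child */
} lchild_t;

/* Illustrative constants; the real patch exposes its own sysctl tunables. */
#define NONROT_INC	0			/* SSDs don't care about locality  */
#define ROT_INC		0			/* sequential follow-on, no seek   */
#define ROT_SEEK_INC	5			/* penalty for a seek on a spindle */
#define ROT_SEEK_OFFSET	(1ULL << 20)		/* "near enough" window: 1 MiB     */

/* Load for issuing an I/O at `offset` to this child: queue length plus a
 * seek penalty based on the distance from the child's last queued I/O. */
static unsigned
child_load(const lchild_t *ch, uint64_t offset)
{
	unsigned load = ch->pending;
	uint64_t dist = (offset > ch->last_offset) ?
	    offset - ch->last_offset : ch->last_offset - offset;

	if (!ch->rotational)
		return (load + NONROT_INC);
	if (dist == 0)
		return (load + ROT_INC);		/* directly follows the last I/O */
	if (dist < ROT_SEEK_OFFSET)
		return (load + ROT_SEEK_INC / 2);	/* short seek */
	return (load + ROT_SEEK_INC);			/* full seek */
}

/* Pick the readable child with the lowest load for this offset. */
int
pick_by_locality(const lchild_t *ch, size_t n, uint64_t offset)
{
	unsigned best_load = (unsigned)-1;
	int best = -1;

	for (size_t c = 0; c < n; c++) {
		if (!ch[c].readable)
			continue;
		unsigned load = child_load(&ch[c], offset);
		if (load < best_load) {
			best_load = load;
			best = (int)c;
		}
	}
	return (best);
}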

@stevenh
Contributor

stevenh commented Aug 15, 2013

@behlendorf I've tested a number of workloads, from 1 -> 9 workers on both a 2 x HDD and 2 x HDD + 1 x SSD setup.

As you thought, FreeBSD doesn't perform any merging after issuing the request to the block layer.

I've not familiarised myself with the Linux block IO layer, so the knob that notifies ZFS of rotational / non-rotational media would need someone to implement, but the core ZFS changes should be a direct port.

As mentioned above, I've just updated my current in-progress patch, which includes a number of additional enhancements :)

@stevenh
Contributor

stevenh commented Oct 23, 2013

I just committed my updated version of this patch, which adds I/O locality to FreeBSD and significantly improves performance. It can be found here:
http://svnweb.freebsd.org/changeset/base/256956

@behlendorf
Contributor

@stevenh Since this issue was closed I've opened #1803 to track getting your improvements ported and benchmarked for ZoL.

@stevenh
Contributor

stevenh commented Oct 23, 2013

Cool, thanks @behlendorf - if I can be of any help, let me know.

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013