
reading from mirror vdev limited by slowest device, II. #1742

Open
jjYBdx4IL opened this issue Sep 20, 2013 · 22 comments
Labels
Type: Feature Feature request or new feature Type: Performance Performance improvement or performance problem

Comments

@jjYBdx4IL

Being very related to

#1461

I suggest a simpler solution:

mdadm (Linux's own software RAID implementation) has a -W ("write-mostly") option when creating a RAID1 setup, which marks a disk to be used primarily for writes, i.e. all reads are directed to the other disk when possible. It would be great if ZFS had such an option for the mirror vdev, too, so that one drive can go to sleep, reducing noise and power consumption in situations where there are mostly no writes at all (a mirror containing daily backups, etc.).
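For reference, the mdadm feature being described looks roughly like this (device names are examples only):

```shell
# Create a RAID1 array where /dev/sdb1 is marked write-mostly:
# it receives all writes, but is only read from during a rebuild
# (or if the other half fails). Normal reads go to /dev/sda1.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/sda1 --write-mostly /dev/sdb1
```

The suggestion is a comparable per-device hint for a ZFS mirror vdev.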

@GregorKopka
Contributor

Nice idea, but there would still be the ZFS heartbeat in the form of uberblock updates on every txg commit.

Also, the question comes to mind of how to decide 'when possible': what should that mean for a computer?

@behlendorf
Contributor

My understanding is that md RAID1 devices marked --write-mostly are simply not read from unless it's required for a rebuild. This functionality is mainly useful when the administrator knows one half of the mirror is much slower than the other. Writes must still go to both devices, so unless the pool is largely read-only, I'm not sure how much of an opportunity there will be to spin down half of the mirror.

@behlendorf
Contributor

This issue was addressed by 556011d.

@jjYBdx4IL
Author

Not really the same as 556011d, but nice work nonetheless. This issue instead suggests "soft blocking" some redundant devices for read operations, so the administrator can set up arrays holding seldom-used data (archival purposes) that is used almost exclusively for reading (anonymous FTP servers come to mind). When an access happens, only one disk (out of two or more) has to spin up, thereby saving energy.

@behlendorf
Contributor

@jjYBdx4IL Yes, I see what you're saying. If you'd like we can leave this open as a feature request. Perhaps someone will find the time to work on it.

@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 30, 2014
@GregorKopka
Contributor

@behlendorf while 556011d lessens the problem, it will still commit reads toward a slow device (e.g. the spinning-disk side of an SSD/HDD mirror), slowing I/O to the whole vdev.

I think it would be nice if the admin had the ability to make informed decisions about how ZFS should manage the bare-metal devices (in case he knows better than the default algorithm).

Please keep this open as a feature request, at least as long as the read scheduler doesn't know whether a device is spinning or not (the BSD port has some code to tune for such scenarios: http://svnweb.freebsd.org/base?view=revision&revision=256956)

@behlendorf behlendorf reopened this Nov 3, 2014
@gkkovacs

gkkovacs commented Jun 1, 2015

@behlendorf @GregorKopka Has there been any development regarding this issue? I have been testing 2-way mirror read behaviour recently in SSD+HDD pools, and there seems to be no strong preference for the SSD when both vdevs are online.

Benchmark: reading 6GB of random data in a 2GB RAM guest using a 2-way mirror, ZFS on Linux 0.6.4.1

SSD+HDD mirror zpool online: 20 sec
HDD only reads (SSD offline): 43 sec
SSD only reads (HDD offline): 9 sec

Will do more precise benchmarking in the near future.

@GregorKopka
Contributor

The problem is still that ZFS knows nothing about whether an individual drive is spinning or not:
Especially when reads for metadata end up on the slow side of the mirror, throughput will drop, since ZFS needs those blocks to know what to read next...

@behlendorf
Contributor

Under Linux at least it's relatively straightforward to tell if a vdev is rotational or non-rotational. It would be straightforward to use this information to always prefer the non-rotational disk. However, this is not always the right thing to do, since non-rotational is not 100% synonymous with fastest. There would need to be some kind of reasonable interface which could be used to control this behavior.
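For context, the Linux block layer exposes this flag via sysfs; a quick check looks like this (the device name is an example):

```shell
# 1 = rotational (HDD), 0 = non-rotational (SSD/NVMe)
cat /sys/block/sda/queue/rotational
```

This is the same source of truth the kernel's I/O schedulers consult, so ZFS could read it per leaf vdev.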

@GregorKopka
Contributor

@behlendorf If we could attach properties to vdevs and vdev members in the form of `zpool set property=value tank/vdev[/member]`, the administrator could be enabled to add tunings for this (and other things) on a per-vdev/member basis.

Additional use cases for something like that which come to mind:
- `zpool get ashift tank/vdev` (get the ashift value for a vdev)
- `zpool set readonly=on tank/vdev` (disable new writes to a vdev by skipping allocation on it)

It could also be used to expose values obtainable from `zpool status` (health) and `zpool iostat` (space).

Does this sound like a good idea?

@behlendorf
Contributor

@GregorKopka there was discussion of adding per-vdev property support by one of the illumos developers. There are quite a number of things it would be useful for. I have no idea how far that work got, but I liked the idea.

@behlendorf
Contributor

Mirror improvements were merged in 9f50093. Additional improvements could also be built on the per-vdev functionality which was recently merged d516761.

@mailinglists35

I just stumbled upon the need for an exact equivalent of the mdadm --write-mostly feature.

Even if the title of the original issue is slightly different and was solved by the mentioned commits, the body of the message shows that the --write-mostly feature is still missing from ZFS.

Can this be reopened as a feature request to be able to do exactly what mdadm does?

For example:

    zpool set writemostly=names,of,disks,which,should,never,attempt,reads poolname
    zpool set feature@writemostly=enabled poolname

@milj

milj commented May 24, 2020

Hello all. I'm considering migrating my volumes from mdadm to ZFS. Currently I'm doing a feasibility study.

For my /home I'm using an mdadm RAID1 HDD+SSD setup with --write-mostly on the HDD. It's a cost-effective balance of SSD speed and redundancy provided by the slow HDD. Has a similar feature been implemented in ZFS, or is it still a feature request?

@GregorKopka
Contributor

You can get something close to that, but the downside is that it is a system-wide tuning. The basic idea is to tune the cost for reads from rotational media to absurd levels, in the hope that ZFS won't queue reads to those devices.

In case you want to try this: read up on https://openzfs.github.io/openzfs-docs/Performance%20and%20tuning/ZFS%20on%20Linux%20Module%20Parameters.html#zfs-vdev-mirror-rotating-inc and the following module parameters, then experiment with them.
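A sketch of that tuning (the values here are arbitrary illustrations, not recommendations; this affects every mirror on the system):

```shell
# Make rotational mirror members look very expensive to the
# ZFS mirror read scheduler, so reads prefer the SSD side.
echo 1000 > /sys/module/zfs/parameters/zfs_vdev_mirror_rotating_inc
echo 1000 > /sys/module/zfs/parameters/zfs_vdev_mirror_rotating_seek_inc
# Keep the non-rotational side cheap so it stays preferred.
echo 0 > /sys/module/zfs/parameters/zfs_vdev_mirror_non_rotating_inc
```

Note this is a heuristic, not a guarantee: once the fast side's queue fills, reads can still spill over to the rotational device.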

@mailinglists35

This issue should still be open, @behlendorf.

Despite all the commits mentioned, this is the real-world sequential read speed with and without the slow leg: 300MB/s vs 100MB/s. Ubuntu 18.04 LTS, ZFS 0.8.3 (I doubt there is any change related to this from 0.8.3 to 0.8.4).

config:

        NAME        STATE     READ WRITE CKSUM
        pool123     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb8    ONLINE       0     0     0
            dm-0    ONLINE       0     0     0

Let's read from this highly compressed file to ensure there is little or nothing left to decompress at the ZFS level:

# ls -lhsta /srv/pool123/volatile/etc/win2k8r2-oem-neactivat.qcow2.lz4
12G -rw-r--r-- 1 root root 12G Sep 30  2016 /srv/pool123/volatile/etc/win2k8r2-oem-neactivat.qcow2.lz4

root@pool123:~# zfs get all pool123/volatile |grep -v default|grep -v '\-$'
NAME              PROPERTY                         VALUE                            SOURCE
pool123/volatile  mountpoint                       /srv/pool123/volatile            inherited from pool123
pool123/volatile  compression                      on                               inherited from pool123
pool123/volatile  atime                            off                              inherited from pool123
pool123/volatile  canmount                         on                               local
pool123/volatile  primarycache                     metadata                         local
pool123/volatile  sync                             disabled                         local
pool123/volatile  com.sun:auto-snapshot            false                            local
pool123/volatile  org.complete.simplesnap:exclude  on                               local

# zfs get all pool123 |grep -v default|grep -v '\-$'
NAME     PROPERTY               VALUE                  SOURCE
pool123  mountpoint             /srv/pool123           local
pool123  compression            on                     local
pool123  atime                  off                    local
pool123  com.sun:auto-snapshot  true                   local

with slow device present:

root@pool123:~# timeout 30 dd if=/srv/pool123/volatile/etc/win2k8r2-oem-neactivat.qcow2.lz4 of=/dev/null status=progress bs=16M
3103784960 bytes (3.1 GB, 2.9 GiB) copied, 29 s, 107 MB/s

root@pool123:~# sleep 30; iostat -x /dev/sdb8 /dev/dm-0 30
Linux 4.15.0-88-generic (pool123)       05/29/2020      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.19    0.39    1.57    0.24    0.00   95.61

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb8             2.30    0.34    380.25      7.82     0.01     0.00   0.32   0.12    1.19    2.08   0.00   165.05    23.18   0.80   0.21
dm-0             0.72    0.33    341.10      7.83     0.00     0.00   0.00   0.00   15.54    2.22   0.01   471.58    23.96   6.03   0.63

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.97    1.69   31.85   13.00    0.00   47.50

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb8           378.07    0.00  47773.87      0.00     1.87     0.00   0.49   0.00    0.79    0.00   0.30   126.36     0.00   0.34  12.68
dm-0           379.00    0.00  47659.82      0.00     0.00     0.00   0.00   0.00    7.67    0.00   2.92   125.75     0.00   2.30  87.00

with slow device absent:

root@pool123:~# timeout 30 dd if=/srv/pool123/volatile/etc/win2k8r2-oem-neactivat.qcow2.lz4 of=/dev/null status=progress bs=16M
10888413184 bytes (11 GB, 10 GiB) copied, 29 s, 375 MB/s

root@pool123:~# sleep 30; iostat -x /dev/sdb8 /dev/dm-0 30
Linux 4.15.0-88-generic (pool123)       05/29/2020      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.19    0.39    1.57    0.24    0.00   95.61

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb8             2.29    0.34    378.81      7.82     0.01     0.00   0.31   0.12    1.19    2.08   0.00   165.22    23.19   0.80   0.21
dm-0             0.72    0.33    341.11      7.83     0.00     0.00   0.00   0.00   15.54    2.22   0.01   471.58    23.97   6.03   0.63

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.90    0.00   10.85   14.32    0.00   73.93

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb8          2248.63    0.00 291888.55      0.00    50.07     0.00   2.18   0.00    1.14    0.00   2.56   129.81     0.00   0.30  68.49
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

@dm17

dm17 commented Dec 21, 2022

Will this be reopened as a feature request? I'm in a situation where I want mirroring, but one NVMe is a lot slower than the other due to limited bandwidth to one M.2 interface. I figured this was the only potential feature or functionality to remedy that while keeping mirroring.
Thanks!

@mailinglists35

mailinglists35 commented Dec 22, 2022

@behlendorf Brian, would you please review all the comments that have piled up since this was closed?

The commits mentioned as closing this issue don't solve the problem: the slowest device in the pool still bottlenecks reads.

@mailinglists35

mailinglists35 commented Dec 7, 2023

Hi @behlendorf,
the latest ZFS code still has the limitation of read speed depending on the slowest device.

Any chance this issue can be revisited and reopened?

Simple reproducer:
1. Create a mirrored pool of two devices, one local and one iSCSI.
2. Switch the network interface speed to 10MB.
3. Watch ZFS brought to its knees on reads from the pool.
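A sketch of those steps (device names, the iSCSI target, and the interface are examples; it assumes open-iscsi and a NIC that accepts a forced speed):

```shell
# Log in to the remote (slow) device over iSCSI.
iscsiadm -m node -T iqn.2023-12.example:slowdisk -p 192.168.1.10 --login

# Build a mirror from a fast local disk and the slow iSCSI one.
zpool create testpool mirror /dev/sdb /dev/sdc

# Throttle the network path, then observe read throughput collapse
# to roughly the iSCSI link speed.
ethtool -s eth0 speed 10 duplex full autoneg off
dd if=/testpool/bigfile of=/dev/null bs=16M status=progress
```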

@GregorKopka
Contributor

In case you can divide your devices into rotational (the slow ones) and non-rotational (the fast ones), you might be able to tune the rotational ones to a read cost that is high enough for them never to be selected, at least as long as the fast ones' queue does not fill completely (then all bets are off).

As a feature request there already is #3810

@behlendorf
Contributor

Reopening so this can be investigated.

@behlendorf behlendorf reopened this Dec 7, 2023
@dm17

dm17 commented Dec 7, 2023

I don't see why we should not be able to have a mirror that deprioritizes its slowest device. In reality, most laptops don't have two 2280 NVMe slots, but a lot more have a smaller-form-factor second NVMe slot (2230/2242), which would slow down the pool. Since ZFS is a lot less valuable when run on a single device, it would make sense to enable mirroring in situations where the slower mirror half would otherwise slow it down. The typical suggestion is "just use zfs send if you're going to do that", which makes sense, but it is clearly much easier to keep the data up to date in a mirror.
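For reference, the commonly suggested zfs send alternative looks roughly like this (pool and dataset names are examples); it trades the mirror's real-time redundancy for point-in-time replication:

```shell
# Take a recursive snapshot on the fast pool.
zfs snapshot -r fastpool/home@backup1

# Replicate it to a pool on the slower device.
zfs send -R fastpool/home@backup1 | zfs recv -F slowpool/home
```

Keeping the copy current then requires periodic incremental sends (zfs send -i), which is exactly the extra bookkeeping a mirror avoids.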
