
Feature: zpool get/set/list being able to address pool vdevs/vdev members #3810

Open
GregorKopka opened this issue Sep 21, 2015 · 8 comments
Labels
Status: Design Review Needed · Status: Inactive · Type: Feature

Comments

@GregorKopka
Contributor

It would be beneficial to allow zpool get/set/list operations to target vdevs and their members.

Get ashift of a vdev, or of the whole pool recursively (#2024):

$ zpool get ashift tank/raidz2-1
tank/raidz2-1 ashift 12
$ zpool get ashift tank -r
tank          ashift 9
tank/raidz2-0 ashift 9
tank/raidz2-1 ashift 12

Allow features like

No new allocations on a vdev (for whatever reason), either set manually or applied automatically while the vdev is degraded (for example while a replace of a member disk is running), while still updating the labels/uberblocks to keep the pool consistent:

$ zpool set readonly=on tank/raidz2-1

Do not serve reads from this vdev member while redundancy allows it (a slow drive in a mirror for additional cheap redundancy, #1742):

$ zpool set writeonly=on tank/mirror-0/slowdrive1

This would also provide the interface to tune features like the Metadata Allocation Class (#3779):

$ zpool set allocate=metadata tank/mirror-0

Combined, this would allow for interesting high-performance setups, for example a pool with:

  • Metadata on a 4-way mirror, served by a fast SSD, with additional cheap redundancy from HDDs
  • DDT on a 3-way mirror, served by a pair of fast SSDs, with additional cheap redundancy from HDDs
  • Data on raidz3 vdevs, one currently resilvering, thus no new allocations on that one
$ zpool list -o name,size,ashift,allocate,writeonly,readonly tank -r

NAME                            SIZE  ASHIFT    ALLOCATE  WRITEONLY  READONLY
tank                           <big>      12         all          -        off
tank/mirror-0                   500G       9    metadata          -        off
tank/mirror-0/SSD1              500G       9           -        off          -
tank/mirror-0/HDD1              500G       9           -         on          -
tank/mirror-0/HDD2              500G       9           -         on          -
tank/mirror-0/HDD3              500G       9           -         on          -
tank/mirror-1                   250G       9         ddt          -        off
tank/mirror-1/SSD2              250G       9           -        off          -
tank/mirror-1/SSD3              250G       9           -        off          -
tank/mirror-1/HDD5              250G       9           -         on          -
tank/mirror-1/HDD6              250G       9           -         on          -
tank/raidz3-0                    16T      12        data          -        off
   [...]
tank/raidz3-1                    16T      12        data          -         on
tank/raidz3-1/HDD3-1              4T      12        data          -          -
tank/raidz3-1/HDD3-2              4T      12        data          -          -
tank/raidz3-1/replacing-0         4T      12        data          -          -
tank/raidz3-1/replacing-0/HDD3-3  4T      12        data          -         on
tank/raidz3-1/replacing-0/new     4T      12        data          -          -
tank/raidz3-1/HDD3-4              4T      12        data          -          -
tank/raidz3-1/HDD3-5              4T      12        data          -          -
tank/raidz3-1/HDD3-6              4T      12        data          -          -
tank/raidz3-1/HDD3-7              4T      12        data          -          -
tank/raidz3-2                    16T      12        data          -        off
[...]
@GregorKopka
Contributor Author

To bump this a little after the OpenZFS Summit talk about vdev properties (YouTube):

@allanjude Thanks for working on this.
I think the user interface (as presented) needs a different approach.

As the vdevs form a hierarchical tree structure, wouldn't it make more sense to extend the existing interface of zpool get/set/list by allowing their [pool] argument to be poolname[/vdev-name[/...]] and adding -r and -d to recurse over children?

This would change the example in your talk from

zpool get size@ada0p3,alloc@ada0p3,free@ada0p3,ashift@ada0p3,state@ada0p3 zroot
NAME   PROPERTY       VALUE   SOURCE
zroot  size@ada0p3    45.5G   -
zroot  alloc@ada0p3   19.5G   -
zroot  free@ada0p3    26.0G   -
zroot  ashift@ada0p3  12      -
zroot  state@ada0p3   ONLINE  -

to

zpool get size,alloc,free,ashift,state zroot/ada0p3
NAME          PROPERTY  VALUE   SOURCE
zroot/ada0p3  size      45.5G   -
zroot/ada0p3  alloc     19.5G   -
zroot/ada0p3  free      26.0G   -
zroot/ada0p3  ashift    12      -
zroot/ada0p3  state     ONLINE  -

Which would be easier to read and far less to type (especially when querying multiple vdevs).

Plus this kind of information would also be accessible through zpool list, given -r|-d and -o arguments.
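
For illustration, a recursive listing under this addressing scheme could then look something like this (columns and exact output are just a sketch):

zpool list -r -o name,size,ashift,state zroot
NAME          SIZE  ASHIFT  STATE
zroot        45.5G      12  ONLINE
zroot/ada0p3 45.5G      12  ONLINE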

@allanjude
Contributor

We have been discussing this in the #vdev-properties channel on the OpenZFS slack, and we came up with:
zpool vdev-set <property>=<value> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
zpool vdev-get <property|all> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
Although with the new way of specifying the vdev, we could probably go back to just using get/set instead of the special vdev version.

We have been through about 5 iterations so far this week ;)
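
For concreteness, a hypothetical invocation of the current iteration (vdev names reused from the original post, properties and output format still undecided) might look like:

zpool vdev-get ashift tank:raidz2-0,raidz2-1
zpool vdev-set readonly=on tank:raidz2-1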

@allanjude
Contributor

allanjude commented Sep 14, 2018

About your other point, for 'write only' on members of a mirror, the current plan is to implement it something like this:
zpool set readbias=ssd1[,ssd2] zroot:mirror-0

That is to say, implement it as a property of the top-level vdev, rather than as a property of the leaf, so that it would persist through a zpool replace etc.
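
As a sketch (readbias itself is still a proposal and the device names are placeholders), the property would survive a leaf being swapped out because it is attached to mirror-0, not to a disk:

zpool set readbias=ssd1 zroot:mirror-0
zpool replace zroot hdd1 hdd2         # replace a leaf of mirror-0
zpool get readbias zroot:mirror-0     # still reports ssd1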

@behlendorf added the Type: Feature label Sep 14, 2018
@GregorKopka
Contributor Author

zpool vdev-set <property>=<value> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
zpool vdev-get <property|all> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
Although with the new way of specifying the vdev, we could probably go back to just using get/set instead of the special vdev version.

Apart from the question of what is wrong with / as a path separator, as it's universally used basically everywhere (filesystems, URIs, zfs datasets, whatnot):

What is the reasoning behind aiming for a flat addressing scheme for a hierarchical data structure?

zpool set readbias=ssd1[,ssd2] zroot:mirror-0
That is to say, implement it as a property of the top-level vdev, rather than as a property of the leaf, so that it would persist through a zpool replace etc.

Why the need to survive a replace?

And why put that into the mirror vdev instead of the member where it actually belongs (don't read from this unless needed), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? Rotational and other flags relevant to the disk are also kept in the mirror vdev leaves, not in their parent node...

@GregorKopka
Contributor Author

GregorKopka commented Sep 15, 2018

@allanjude a further thought on readbias:
If it's only a tendency, as the word implies, it's useless compared to writeonly, as the latter would make sure that no reads at all (unless needed for scrub or resilver) are queued to the tagged device.

Using a 7.2k RPM HDD with ~100 IOPS vs. an SSD with 250k read IOPS to make the point:
Have one data read queued to the HDD and the latency of that read increases by orders of magnitude; should it be a metadata demand read then throughput drops just as hard, as all further reads that depend on that metadata can't start.
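
Rough numbers behind that comparison, using the quoted IOPS figures:

HDD: 1 / 100 IOPS     ≈ 10 ms per read
SSD: 1 / 250,000 IOPS ≈ 4 µs per read   (a factor of roughly 2,500)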

A bias isn't what I (and I guess everyone else) want. Should readbias work like writeonly, just with reversed tagging: please give it a less misleading name to avoid confusion.

@allanjude
Contributor

And why put that into the mirror vdev instead of the member where it actually belongs (don't read from this unless needed), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? Rotational and other flags relevant to the disk are also kept in the mirror vdev leaves, not in their parent node...

It was interesting to learn from @grwilson that the queueing happens at the top-level vdev, not the leaf. So that was one of the reasons we decided we didn't need to worry about the leaf vdevs as much.

Currently, leaf vdev ZAPs are empty aside from vdev removal cases.

Based on the discussion, the first implementation will focus on top-level vdevs, and we'll look at leaf vdevs for an expanded version afterwards. The complexity of 'zpool replace', and of the promotion/demotion of top-level vdevs when you zpool attach/detach and turn single devices into mirrors (or mirrors into single devices), was greater than we expected when we whiteboarded it at the ZFS DevSummit Hackathon.
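
To make the attach/detach case concrete (device names are just placeholders): attaching a disk to a single-device top-level vdev promotes that device into a leaf of a newly created mirror, and detaching demotes it back, which is exactly where per-leaf vs. per-top-level properties get tricky:

zpool attach tank sda sdb    # sda was a top-level vdev, now a leaf of a new mirror-N
zpool detach tank sdb        # the mirror collapses back into a single-device top-level vdev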

@GregorKopka
Contributor Author

@allanjude Any progress on this?

@GregorKopka
Contributor Author

And why put that into the mirror vdev instead of the member where it actually belongs (don't read from this unless needed), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? Rotational and other flags relevant to the disk are also kept in the mirror vdev leaves, not in their parent node...

It was interesting to learn from @grwilson that the queueing happens at the top-level vdev, not the leaf. So that was one of the reasons we decided we didn't need to worry about the leaf vdevs as much.

To pick this topic up a little while later:
Yes, the queuing happens on the top-level vdev, but it uses the information on the leaves (rotational, queue length, ...).
Thus the information "do not read from this unless needed, because no other sources to fulfill the request exist" should IMHO be stored in the leaf, not the parent.

Currently, leaf vdev ZAPs are empty aside from vdev removal cases.

Based on the discussion, the first implementation will focus on top-level vdevs, and we'll look at leaf vdevs for an expanded version afterwards. The complexity of 'zpool replace', and of the promotion/demotion of top-level vdevs when you zpool attach/detach and turn single devices into mirrors (or mirrors into single devices), was greater than we expected when we whiteboarded it at the ZFS DevSummit Hackathon.

I think something like writeonly does not need to survive a replacement of the leaf (at least I can't see a use case where this would make sense or be needed; YMMV, and if it does, please enlighten me) or survive turning a mirror into a single-drive vdev.
