OpenZFS 7090 - zfs should throttle allocations #5258
Conversation
Authored by: George Wilson <[email protected]>
Reviewed by: Alex Reece <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Approved by: Matthew Ahrens <[email protected]>

When write I/Os are issued, they are issued in block order, but the ZIO
pipeline will drive them asynchronously through the allocation stage,
which can result in blocks being allocated out of order. It would be
nice to preserve as much of the logical order as possible.

In addition, allocations are scattered equally across all top-level
vdevs, but not all top-level vdevs are created equal. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced,
as fuller devices will tend to be slower than empty devices.

The change includes a new pool-wide allocation queue which throttles
and orders allocations in the ZIO pipeline. The queue is ordered by
issue time and offset and provides an initial amount of allocation work
to each top-level vdev. The allocation logic utilizes a reservation
system to reserve allocations that will be performed by the allocator.
Once an allocation is successfully completed it's scheduled on a given
top-level vdev. Each top-level vdev maintains a maximum number of
allocations that it can handle (mg_alloc_queue_depth). The pool-wide
reserved allocations (top-levels * mg_alloc_queue_depth) are
distributed across the top-level vdevs' metaslab groups, round-robin
across all eligible metaslab groups, to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

Ported-by: Don Brady <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: openzfs/openzfs@4756c3d7

Porting Notes:
- Maintained minimal stack in zio_done
- Preserve Linux-specific I/O sizes in zio_write_compress
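To make the reservation mechanism described above concrete, here is a minimal sketch in ZFS-style C of how a pool-wide slot reservation might work. The names alloc_throttle_t and alloc_throttle_reserve() are illustrative only and not the identifiers used by the patch, which implements this inside the metaslab/ZIO code:

/*
 * Sketch: before a write ZIO enters the allocation stage it must
 * reserve a slot against the pool-wide limit
 * (top-levels * mg_alloc_queue_depth).
 */
typedef struct alloc_throttle {
	kmutex_t	at_lock;
	uint64_t	at_slots;	/* reservations currently held */
	uint64_t	at_max_slots;	/* top-levels * mg_alloc_queue_depth */
} alloc_throttle_t;

static boolean_t
alloc_throttle_reserve(alloc_throttle_t *at, uint64_t slots)
{
	boolean_t granted = B_FALSE;

	mutex_enter(&at->at_lock);
	if (at->at_slots + slots <= at->at_max_slots) {
		at->at_slots += slots;	/* reservation granted */
		granted = B_TRUE;
	}
	mutex_exit(&at->at_lock);

	/*
	 * On failure the ZIO is queued on the pool-wide allocation tree
	 * (ordered by issue time and offset) and retried as slots free up.
	 */
	return (granted);
}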
@@ -24,7 +24,7 @@
  */

 /*
- * Copyright (c) 2013 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2015 by Delphix. All rights reserved.
This is dropping the 2013 copyright. Was that intentional, and if so, why?
Not certain about this case, but I've handled it the same way - e.g. when upstream commits from Illumos did it.
@behlendorf @ahrens it would be good to know what the general policy is for proceeding in these cases.
This change was intentional. We changed our copyright notice policy from "list only the most recent year of modification" to "list the first and last years of modification".
Can you clarify who "We" is (Delphix, Illumos, OpenZFS) in that policy? And you don't want to connect those years with a dash (e.g. 2012-2015)?
"We" is Delphix, the copyright holder.
Good observation. This change came directly from Delphix in their commit to OpenZFS: illumos/illumos-gate@4756c3d
Thanks. I see no reason to object to a copyright holder changing their own copyright line.
@rlaager The "first, last" year format is also what other projects I'm involved in use. It isn't correct to use "first-last", since there may not have been modifications in all of the intervening years. For copyright purposes, the in-between years have no meaning, and only the most recent year determines when the copyright will expire.
@@ -24,7 +24,7 @@
  */

 /*
- * Copyright (c) 2012, 2014 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2015 by Delphix. All rights reserved.
Likewise, this drops a 2014 copyright.
Just some Linux-specific changes.
@@ -1034,7 +1034,8 @@ typedef enum {
 	SPA_LOAD_IMPORT,	/* import in progress */
 	SPA_LOAD_TRYIMPORT,	/* tryimport in progress */
 	SPA_LOAD_RECOVER,	/* recovery requested */
-	SPA_LOAD_ERROR		/* load failed */
+	SPA_LOAD_ERROR,		/* load failed */
+	SPA_LOAD_CREATE		/* creation in progress */
This new state shouldn't cause an issue since it was added at the end, but older user-space utilities won't know how to interpret it.
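As an illustration of that concern, a hypothetical older user-space consumer would need to tolerate states it doesn't know about rather than assume the enum is exhaustive. This sketch is not from the patch; the function is invented for illustration:

static const char *
spa_load_state_name(spa_load_state_t state)
{
	switch (state) {
	case SPA_LOAD_NONE:		return ("none");
	case SPA_LOAD_OPEN:		return ("open");
	case SPA_LOAD_IMPORT:		return ("import");
	case SPA_LOAD_TRYIMPORT:	return ("tryimport");
	case SPA_LOAD_RECOVER:		return ("recover");
	case SPA_LOAD_ERROR:		return ("error");
	default:
		/* e.g. SPA_LOAD_CREATE, unknown to an older binary */
		return ("unknown");
	}
}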
int zfs_vdev_queue_depth_pct = 1000;
#else
int zfs_vdev_queue_depth_pct = 300;
#endif
zfs_vdev_queue_depth_pct should be added as a module option. We're undoubtedly going to want access to this for performance analysis. It should also be added to man/man5/zfs-module-parameters.5 with a description of what it does. The same goes for zio_dva_throttle_enabled.
Got it. Will add them.
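For reference, on Linux that would look roughly like the standard module-parameter declarations below; the permission bits and description strings are assumptions, not taken from the patch:

/* alongside the other module options at the bottom of the .c files */
module_param(zfs_vdev_queue_depth_pct, int, 0644);
MODULE_PARM_DESC(zfs_vdev_queue_depth_pct,
	"Queue depth percentage for each top-level vdev");

module_param(zio_dva_throttle_enabled, int, 0644);
MODULE_PARM_DESC(zio_dva_throttle_enabled,
	"Throttle block allocations in the ZIO pipeline");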
"iss", "iss_h", "int", "int_h" | ||
#else | ||
"issue", "issue_high", "intr", "intr_high" | ||
#endif |
Just for the record, these names were shortened for Linux so they'd still be descriptive when truncated. I don't have any objection to this change, but it seems unrelated to the rest of the patch.
The change in zio.c for zio_type_name was the motivation. I had to go chase down why they didn't match OpenZFS. I'd be happy to just make it a comment for future integrators.
Yeah. A comment would be clearer.
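Something along these lines could work (the wording is only a suggestion):

/*
 * On Linux these taskq type names are shortened relative to upstream
 * OpenZFS ("issue" -> "iss", etc.) so they stay descriptive when the
 * kernel truncates thread names.  Future integrators: this divergence
 * is intentional.
 */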
@@ -6469,6 +6478,10 @@ spa_sync(spa_t *spa, uint64_t txg)
 	spa->spa_syncing_txg = txg;
 	spa->spa_sync_pass = 0;

+	mutex_enter(&spa->spa_alloc_lock);
+	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
+	mutex_exit(&spa->spa_alloc_lock);
This feels pretty heavy-handed for a release build. I'd be tempted to wrap it in ZFS_DEBUG, but there's no point diverging here unless it's actually a problem.
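For reference, the debug-only variant being suggested would amount to wrapping the block, e.g.:

#ifdef ZFS_DEBUG
	mutex_enter(&spa->spa_alloc_lock);
	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
	mutex_exit(&spa->spa_alloc_lock);
#endif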
@@ -6676,6 +6721,10 @@ spa_sync(spa_t *spa, uint64_t txg)

 	dsl_pool_sync_done(dp, txg);

+	mutex_enter(&spa->spa_alloc_lock);
+	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
+	mutex_exit(&spa->spa_alloc_lock);
Same.
	return (0);
}
For Linux we optimized the other AVL tree comparator functions in ee36c70. We should give this one the same treatment.
Ah, very interesting. Applied the AVL compare macros.
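For reference, a comparator ordered by issue time and then offset, using the AVL_CMP/AVL_PCMP macros from ee36c70, might look like the sketch below; the function and field names are illustrative rather than copied from the patch:

static int
zio_alloc_compare(const void *x1, const void *x2)
{
	const zio_t *z1 = x1;
	const zio_t *z2 = x2;
	int cmp;

	/* AVL_CMP yields -1/0/+1 without risking subtraction overflow */
	cmp = AVL_CMP(z1->io_queued_timestamp, z2->io_queued_timestamp);
	if (cmp != 0)
		return (cmp);

	cmp = AVL_CMP(z1->io_offset, z2->io_offset);
	if (cmp != 0)
		return (cmp);

	/* break ties on the pointers so no two entries compare equal */
	return (AVL_PCMP(z1, z2));
}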