
OpenZFS 7090 - zfs should throttle allocations #5258

Merged: 3 commits, Oct 14, 2016
Conversation

don-brady (Contributor)

Authored by: George Wilson [email protected]
Reviewed by: Alex Reece [email protected]
Reviewed by: Christopher Siden [email protected]
Reviewed by: Dan Kimmel [email protected]
Reviewed by: Matthew Ahrens [email protected]
Reviewed by: Paul Dagnelie [email protected]
Reviewed by: Prakash Surya [email protected]
Reviewed by: Sebastien Roy [email protected]

Approved by: Matthew Ahrens [email protected]

When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.

In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.

The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issue time and offset and would provide an initial amount of
allocation work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully reserved it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs' metaslab groups and
round-robined across all eligible metaslab groups to distribute the
work. As top-levels complete their work, they receive additional work
from the pool-wide allocation queue until the allocation queue is
emptied.

Ported-by: Don Brady [email protected]
OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: openzfs/openzfs@4756c3d7

Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
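
The reservation mechanism described above can be summarized with a small
standalone sketch. This is not the metaslab.c implementation; the type and
function names (alloc_throttle_t, alloc_reserve, alloc_release) are
hypothetical, and the per-vdev round-robin and the offset-ordered AVL queue
are omitted.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pool-wide allocation reservation counter.  A write must reserve a slot
 * before entering the allocation stage; the slot is released once the
 * top-level vdev has completed the allocation.
 */
typedef struct alloc_throttle {
	pthread_mutex_t	at_lock;
	uint64_t	at_max_slots;	/* top-levels * mg_alloc_queue_depth */
	uint64_t	at_reserved;	/* currently reserved allocations */
} alloc_throttle_t;

/* Try to reserve one allocation; fails when the queue is full. */
static bool
alloc_reserve(alloc_throttle_t *at)
{
	bool granted;

	pthread_mutex_lock(&at->at_lock);
	granted = (at->at_reserved < at->at_max_slots);
	if (granted)
		at->at_reserved++;
	pthread_mutex_unlock(&at->at_lock);
	return (granted);
}

/* Release a reservation after the top-level vdev finishes the work. */
static void
alloc_release(alloc_throttle_t *at)
{
	pthread_mutex_lock(&at->at_lock);
	at->at_reserved--;
	pthread_mutex_unlock(&at->at_lock);
}

A write ZIO that fails to get a reservation stays on the offset-ordered
queue and is retried as earlier allocations complete, which is what produces
the throttling and ordering behavior described in the commit message.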

@@ -24,7 +24,7 @@
  */
 
 /*
- * Copyright (c) 2013 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2015 by Delphix. All rights reserved.

Member

This is dropping the 2013 copyright. Was that intentional, and if so, why?

Contributor

Not certain about this particular case, but I've also handled it that way when upstream commits from illumos did it.

@behlendorf @ahrens it would be good to know what the general policy is for proceeding in cases like these.

Member

This change was intentional. We changed our copyright notice policy from "list only the most recent year of modification" to "list the first and last years of modification".

Member

Can you clarify who "We" is (Delphix, Illumos, OpenZFS) in that policy? And you don't want to connect those years with a dash (e.g. 2012-2015)?

Member

"We" is Delphix, the copyright holder.

Contributor Author

Good observation. This change came directly from Delphix in their commit to OpenZFS: illumos/illumos-gate@4756c3d

Member

Thanks. I see no reason to object to a copyright holder changing their own copyright line.

Contributor

@rlaager The "first, last" year format is also what is used in other projects I'm involved in. It isn't correct to use a first-last range since there may not have been modifications in every intervening year. For copyright purposes the in-between years have no meaning, and only the most recent year determines when the copyright will expire.

@@ -24,7 +24,7 @@
  */
 
 /*
- * Copyright (c) 2012, 2014 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2015 by Delphix. All rights reserved.

Member

Likewise, this drops a 2014 copyright.

Contributor

@behlendorf behlendorf left a comment

Just some Linux-specific changes.

@@ -1034,7 +1034,8 @@ typedef enum {
 	SPA_LOAD_IMPORT,	/* import in progress */
 	SPA_LOAD_TRYIMPORT,	/* tryimport in progress */
 	SPA_LOAD_RECOVER,	/* recovery requested */
-	SPA_LOAD_ERROR		/* load failed */
+	SPA_LOAD_ERROR,		/* load failed */
+	SPA_LOAD_CREATE		/* creation in progress */

Contributor

This new state shouldn't cause an issue since it was added at the end, but older user-space utilities won't know how to interpret it.

int zfs_vdev_queue_depth_pct = 1000;
#else
int zfs_vdev_queue_depth_pct = 300;
#endif

Contributor

zfs_vdev_queue_depth_pct should be added as a module option. We're undoubtedly going to want access to this for performance analysis. It should also be added to man/man5/zfs-module-parameters.5 with a description of what it does.

The same goes for zio_dva_throttle_enabled.

Contributor Author

Got it. Will add them.
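
For reference, ZFS on Linux exposes this kind of tunable with the kernel's
module_param()/MODULE_PARM_DESC() macros at the bottom of the file that
defines it. A rough sketch of what that could look like here; the guard and
the description strings are illustrative, not the merged text:

#if defined(_KERNEL)
/* vdev_queue.c */
module_param(zfs_vdev_queue_depth_pct, int, 0644);
MODULE_PARM_DESC(zfs_vdev_queue_depth_pct,
	"Queue depth percentage for each top-level vdev");

/* zio.c */
module_param(zio_dva_throttle_enabled, int, 0644);
MODULE_PARM_DESC(zio_dva_throttle_enabled,
	"Throttle block allocations in the ZIO pipeline");
#endif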

"iss", "iss_h", "int", "int_h"
#else
"issue", "issue_high", "intr", "intr_high"
#endif

Contributor

Just for the record, these names were shortened for Linux so they'd still be descriptive when truncated. I don't have any objection to this change, but it seems unrelated to the rest of the patch.

Contributor Author

The change in zio.c for zio_type_name was the motivation. I had to go chase down why they didn't match OpenZFS. I'd be happy to just make it a comment for future integrators.

Contributor

Yeah. A comment would be clearer.
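
Something along these lines might serve as the comment for future
integrators (the wording is only a suggestion):

/*
 * Note for integrators: keep these taskq type names in sync with
 * upstream OpenZFS.  The Linux port previously used shortened names
 * ("iss", "iss_h", "int", "int_h") so that derived taskq thread names
 * stayed descriptive when truncated.
 */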

@@ -6469,6 +6478,10 @@ spa_sync(spa_t *spa, uint64_t txg)
 	spa->spa_syncing_txg = txg;
 	spa->spa_sync_pass = 0;
 
+	mutex_enter(&spa->spa_alloc_lock);
+	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
+	mutex_exit(&spa->spa_alloc_lock);

Contributor

This feels pretty heavy-handed for a release build. I'd be tempted to wrap it in ZFS_DEBUG, but there's no point diverging here unless it's actually a problem.
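
For what it's worth, the ZFS_DEBUG-only variant suggested above would
presumably look like this:

#ifdef ZFS_DEBUG
	mutex_enter(&spa->spa_alloc_lock);
	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
	mutex_exit(&spa->spa_alloc_lock);
#endif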

@@ -6676,6 +6721,10 @@ spa_sync(spa_t *spa, uint64_t txg)
 
 	dsl_pool_sync_done(dp, txg);
 
+	mutex_enter(&spa->spa_alloc_lock);
+	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
+	mutex_exit(&spa->spa_alloc_lock);

Contributor

Same.


return (0);
}

Contributor

For Linux we optimized the other AVL tree comparator functions in ee36c70. We should give this one the same treatment.

Contributor Author

Ah, very interesting. Applied the AVL compare macros.
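
For illustration, a bookmark-ordered comparator written in the style of
ee36c70 might look roughly like the following. The AVL_CMP()/AVL_PCMP()
helpers come from that commit; the function body here is a sketch rather
than a copy of the merged code:

/*
 * Order allocations by logical bookmark, breaking any remaining ties by
 * ZIO address so the ordering is total.
 */
static int
zio_bookmark_compare(const void *x1, const void *x2)
{
	const zio_t *z1 = (const zio_t *)x1;
	const zio_t *z2 = (const zio_t *)x2;
	int cmp;

	cmp = AVL_CMP(z1->io_bookmark.zb_objset, z2->io_bookmark.zb_objset);
	if (cmp != 0)
		return (cmp);

	cmp = AVL_CMP(z1->io_bookmark.zb_object, z2->io_bookmark.zb_object);
	if (cmp != 0)
		return (cmp);

	cmp = AVL_CMP(z1->io_bookmark.zb_level, z2->io_bookmark.zb_level);
	if (cmp != 0)
		return (cmp);

	cmp = AVL_CMP(z1->io_bookmark.zb_blkid, z2->io_bookmark.zb_blkid);
	if (cmp != 0)
		return (cmp);

	return (AVL_PCMP(z1, z2));
}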

@behlendorf behlendorf merged commit 3dfb57a into openzfs:master Oct 14, 2016