OpenZFS 7090 - zfs should throttle allocations #5258
Conversation
Authored by: George Wilson <[email protected]>
Reviewed by: Alex Reece <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Approved by: Matthew Ahrens <[email protected]>

When write I/Os are issued, they are issued in block order, but the ZIO
pipeline will drive them asynchronously through the allocation stage,
which can result in blocks being allocated out of order. It would be
nice to preserve as much of the logical order as possible.

In addition, allocations are scattered equally across all top-level
vdevs, but not all top-level vdevs are created equal. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced,
as fuller devices will tend to be slower than empty devices.

The change includes a new pool-wide allocation queue which throttles
and orders allocations in the ZIO pipeline. The queue is ordered by
issue time and offset and provides an initial amount of allocation work
to each top-level vdev. The allocation logic utilizes a reservation
system to reserve allocations that will be performed by the allocator.
Once an allocation is successfully completed it's scheduled on a given
top-level vdev. Each top-level vdev maintains a maximum number of
allocations that it can handle (mg_alloc_queue_depth). The pool-wide
reserved allocations (top-levels * mg_alloc_queue_depth) are
distributed across the top-level vdevs' metaslab groups, round-robin
across all eligible metaslab groups, to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

Ported-by: Don Brady <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: openzfs/openzfs@4756c3d7

Porting Notes:
- Maintained minimal stack in zio_done
- Preserve Linux-specific I/O sizes in zio_write_compress
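To make the reservation mechanism described above concrete, here is a minimal sketch in ZFS-style C of how a pool-wide slot reservation might work. The names alloc_throttle_t and alloc_throttle_reserve() are illustrative only and not the identifiers used by the patch, which implements this inside the metaslab/ZIO code:

/*
 * Sketch: before a write ZIO enters the allocation stage it must
 * reserve a slot against the pool-wide limit
 * (top-levels * mg_alloc_queue_depth).
 */
typedef struct alloc_throttle {
	kmutex_t	at_lock;
	uint64_t	at_slots;	/* reservations currently held */
	uint64_t	at_max_slots;	/* top-levels * mg_alloc_queue_depth */
} alloc_throttle_t;

static boolean_t
alloc_throttle_reserve(alloc_throttle_t *at, uint64_t slots)
{
	boolean_t granted = B_FALSE;

	mutex_enter(&at->at_lock);
	if (at->at_slots + slots <= at->at_max_slots) {
		at->at_slots += slots;	/* reservation granted */
		granted = B_TRUE;
	}
	mutex_exit(&at->at_lock);

	/*
	 * On failure the ZIO is queued on the pool-wide allocation tree
	 * (ordered by issue time and offset) and retried as slots free up.
	 */
	return (granted);
}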
@@ -24,7 +24,7 @@
  */

 /*
- * Copyright (c) 2013 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2015 by Delphix. All rights reserved.
This is dropping the 2013 copyright. Was that intentional, and if so, why?
Not certain about this case, but I've handled it the same way - e.g. when upstream commits from Illumos did it.
@behlendorf @ahrens it would be good to know what the general policy is for proceeding in these cases.
This change was intentional. We changed our copyright notice policy from "list only the most recent year of modification" to "list the first and last years of modification".
Can you clarify who "We" is (Delphix, Illumos, OpenZFS) in that policy? And you don't want to connect those years with a dash (e.g. 2012-2015)?
"We" is Delphix, the copyright holder.
Good observation. This change came directly from Delphix in their commit to OpenZFS: illumos/illumos-gate@4756c3d
Thanks. I see no reason to object to a copyright holder changing their own copyright line.
@rlaager The "first, last" year format is also what other projects I'm involved in use. It isn't correct to use "first-last", since there may not have been modifications in all of the intervening years. For copyright purposes, the in-between years have no meaning, and only the most recent year determines when the copyright will expire.
@@ -24,7 +24,7 @@
  */

 /*
- * Copyright (c) 2012, 2014 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2015 by Delphix. All rights reserved.
Likewise, this drops a 2014 copyright.
Just some Linux-specific changes.
@@ -1034,7 +1034,8 @@ typedef enum {
 	SPA_LOAD_IMPORT,	/* import in progress */
 	SPA_LOAD_TRYIMPORT,	/* tryimport in progress */
 	SPA_LOAD_RECOVER,	/* recovery requested */
-	SPA_LOAD_ERROR		/* load failed */
+	SPA_LOAD_ERROR,		/* load failed */
+	SPA_LOAD_CREATE		/* creation in progress */
This new state shouldn't cause an issue since it was added at the end, but older user-space utilities won't know how to interpret it.
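As an illustration of that concern, a hypothetical older user-space consumer would need to tolerate states it doesn't know about rather than assume the enum is exhaustive. This sketch is not from the patch; the function is invented for illustration:

static const char *
spa_load_state_name(spa_load_state_t state)
{
	switch (state) {
	case SPA_LOAD_NONE:		return ("none");
	case SPA_LOAD_OPEN:		return ("open");
	case SPA_LOAD_IMPORT:		return ("import");
	case SPA_LOAD_TRYIMPORT:	return ("tryimport");
	case SPA_LOAD_RECOVER:		return ("recover");
	case SPA_LOAD_ERROR:		return ("error");
	default:
		/* e.g. SPA_LOAD_CREATE, unknown to an older binary */
		return ("unknown");
	}
}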
int zfs_vdev_queue_depth_pct = 1000;
#else
int zfs_vdev_queue_depth_pct = 300;
#endif
zfs_vdev_queue_depth_pct should be added as a module option. We're undoubtedly going to want access to this for performance analysis. It should also be added to man/man5/zfs-module-parameters.5 with a description of what it does. The same goes for zio_dva_throttle_enabled.
Got it. Will add them.
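For reference, on Linux that would look roughly like the standard module-parameter declarations below; the permission bits and description strings are assumptions, not taken from the patch:

/* alongside the other module options at the bottom of the .c files */
module_param(zfs_vdev_queue_depth_pct, int, 0644);
MODULE_PARM_DESC(zfs_vdev_queue_depth_pct,
	"Queue depth percentage for each top-level vdev");

module_param(zio_dva_throttle_enabled, int, 0644);
MODULE_PARM_DESC(zio_dva_throttle_enabled,
	"Throttle block allocations in the ZIO pipeline");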
"iss", "iss_h", "int", "int_h" | ||
#else | ||
"issue", "issue_high", "intr", "intr_high" | ||
#endif |
Just for the record, these names were shortened for Linux so they'd still be descriptive when truncated. I don't have any objection to this change, but it seems unrelated to the rest of the patch.
The change in zio.c for zio_type_name was the motivation. I had to go chase down why they didn't match OpenZFS. I'd be happy to just make it a comment for future integrators.
Yeah. A comment would be clearer.
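Something along these lines could work (the wording is only a suggestion):

/*
 * On Linux these taskq type names are shortened relative to upstream
 * OpenZFS ("issue" -> "iss", etc.) so they stay descriptive when the
 * kernel truncates thread names.  Future integrators: this divergence
 * is intentional.
 */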
@@ -6469,6 +6478,10 @@ spa_sync(spa_t *spa, uint64_t txg)
 	spa->spa_syncing_txg = txg;
 	spa->spa_sync_pass = 0;

+	mutex_enter(&spa->spa_alloc_lock);
+	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
+	mutex_exit(&spa->spa_alloc_lock);
This feels pretty heavy-handed for a release build. I'd be tempted to wrap it in ZFS_DEBUG, but there's no point diverging here unless it's actually a problem.
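For reference, the debug-only variant being suggested would amount to wrapping the block, e.g.:

#ifdef ZFS_DEBUG
	mutex_enter(&spa->spa_alloc_lock);
	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
	mutex_exit(&spa->spa_alloc_lock);
#endif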
@@ -6676,6 +6721,10 @@ spa_sync(spa_t *spa, uint64_t txg)

 	dsl_pool_sync_done(dp, txg);

+	mutex_enter(&spa->spa_alloc_lock);
+	VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
+	mutex_exit(&spa->spa_alloc_lock);
Same.
	return (0);
}
For Linux we optimized the other AVL tree comparator functions in ee36c70. We should give this one the same treatment.
Ah, very interesting. Applied the AVL compare macros.
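For reference, a comparator ordered by issue time and then offset, using the AVL_CMP/AVL_PCMP macros from ee36c70, might look like the sketch below; the function and field names are illustrative rather than copied from the patch:

static int
zio_alloc_compare(const void *x1, const void *x2)
{
	const zio_t *z1 = x1;
	const zio_t *z2 = x2;
	int cmp;

	/* AVL_CMP yields -1/0/+1 without risking subtraction overflow */
	cmp = AVL_CMP(z1->io_queued_timestamp, z2->io_queued_timestamp);
	if (cmp != 0)
		return (cmp);

	cmp = AVL_CMP(z1->io_offset, z2->io_offset);
	if (cmp != 0)
		return (cmp);

	/* break ties on the pointers so no two entries compare equal */
	return (AVL_PCMP(z1, z2));
}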