OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations #5931

Merged: 3 commits merged into openzfs:master on Apr 10, 2017

Conversation

@ahrens (Member) commented Mar 27, 2017

Background

RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.
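
To make the rounding concrete, here is a minimal sketch (hypothetical names, not the actual ZFS allocator code) that rounds an allocation up to a multiple of P+1 sectors; the difference between the rounded and original sizes is the pad, at most P sectors.

/*
 * Minimal sketch, not ZFS code: round a RAID-Z allocation (data plus
 * parity) up to a multiple of nparity + 1 sectors, where a sector is
 * 1 << ashift bytes.  Function and variable names are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
raidz_round_asize(uint64_t asize, uint64_t nparity, uint64_t ashift)
{
	uint64_t sector = 1ULL << ashift;
	uint64_t sectors = (asize + sector - 1) / sector;	/* size in sectors */
	uint64_t mult = nparity + 1;				/* 2, 3, or 4 */

	/* Round the sector count up to the next multiple of nparity + 1. */
	sectors = ((sectors + mult - 1) / mult) * mult;
	return (sectors * sector);
}

int
main(void)
{
	/* RAIDZ2 (nparity = 2) with 4KB sectors: 320 sectors round up to 321. */
	uint64_t rounded = raidz_round_asize(320ULL << 12, 2, 12);

	printf("%llu bytes (%llu sectors)\n",
	    (unsigned long long)rounded, (unsigned long long)(rounded >> 12));
	return (0);
}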

To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks. The contents of these pad sectors are not used, so we do not
need to read or write these sectors. However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors. If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance. The gap-filling performance improvement was introduced in
July 2009.

Problem

Writing the optional zio is done by the io aggregation code in
vdev_queue.c. The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB. For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.

Situations in which the problem occurs

The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks). It
cannot occur with the default recordsize=128KB. If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.

The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore
this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem".
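
As a back-of-the-envelope illustration of that canonical case, the sketch below (hypothetical names, simplified layout that ignores exact row-by-row column and skip-sector placement) computes the per-leaf write size and pad for a 1MB block on a 10-wide RAIDZ2 with 4KB sectors and compares it against the default 128KB aggregation limit.

/*
 * Back-of-the-envelope sketch, not ZFS code: per-leaf write size and pad
 * sectors for one 1MB block on a 10-wide RAIDZ2 with 4KB sectors, compared
 * against the default zfs_vdev_aggregation_limit of 128KB.  The layout is
 * simplified and the names are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t recordsize = 1ULL << 20;	/* 1MB logical block, compression=off */
	uint64_t ashift = 12;			/* 4KB sectors */
	uint64_t nparity = 2;			/* RAIDZ2 */
	uint64_t width = 10;			/* disks in the group */
	uint64_t agg_limit = 128 * 1024;	/* default zfs_vdev_aggregation_limit */

	uint64_t sector = 1ULL << ashift;
	uint64_t data_cols = width - nparity;			/* 8 data disks */
	uint64_t data_sectors = recordsize / sector;		/* 256 sectors */
	uint64_t per_leaf = (data_sectors / data_cols) * sector; /* 128KB per data disk */

	/* Total sectors (data + parity) must be a multiple of nparity + 1. */
	uint64_t total = data_sectors + nparity * (data_sectors / data_cols);
	uint64_t pad = (nparity + 1 - total % (nparity + 1)) % (nparity + 1);

	printf("per-leaf data: %llu bytes, pad sectors for this block: %llu\n",
	    (unsigned long long)per_leaf, (unsigned long long)pad);
	printf("data plus one adjacent pad sector %s the %llu-byte limit\n",
	    (per_leaf + sector > agg_limit) ? "exceeds" : "fits within",
	    (unsigned long long)agg_limit);
	return (0);
}

Each data column already carries 128KB, so merging even a single adjacent 4KB pad sector would push the aggregate past the default limit, which is why the optional write is dropped.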

The problem also occurs with the following configurations:

With recordsize=512KB or 256KB, compression=off, the problem occurs only
in rarely-used configurations:

  • 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors)
  • 4-wide RAIDZ2 (either recordsize, either ashift)
  • 5-wide RAIDZ2 with recordsize=512KB (either ashift)
  • 6-wide RAIDZ2 with recordsize=512KB (either ashift)

With recordsize=1MB, compression=off, ashift=9 (512B sectors)

  • RAIDZ1 with 4 or 8 disks
  • RAIDZ2 with 4, 8, or 10 disks
  • RAIDZ3 with 6, 8, 9, or 10 disks

With recordsize=1MB, compression=off, ashift=12 (4KB sectors)

  • RAIDZ1 with 7 or 8 disks
  • RAIDZ2 with 4, 5, or 10 disks
  • RAIDZ3 with 6, 9, or 10 disks

With recordsize=2MB and larger (which can only be selected by changing
kernel tunables), many configurations are affected, including with
higher numbers of disks (up to 18 disks with recordsize=2MB).

Workaround

Increase zfs_vdev_aggregation_limit to allow the optional zio to be
aggregated, thus eliminating the problem. Setting it to 256KB fixes all
commonly-used configurations.

Solution

The solution is to aggregate optional zio's regardless of the
aggregation size limit.
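
A self-contained toy model of that idea is sketched below. It is not the vdev_queue.c patch itself (the real code also handles reads, gaps, and trailing optional I/Os, and all names here are hypothetical); it only shows how exempting optional gap-filling writes from the size limit lets the pad sector join the aggregate, so no hole is left between the two required writes.

/*
 * Toy model, not the actual vdev_queue.c code: when building an aggregate
 * write, a required i/o is merged only while the aggregate stays within the
 * size limit, but an optional gap-filling i/o (a pad sector) is merged
 * regardless of the limit.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
	uint64_t offset;	/* byte offset on the leaf device */
	uint64_t size;		/* bytes */
	bool	 optional;	/* gap-filling pad write that may be skipped */
} pending_io_t;

/*
 * Merge contiguous i/os starting at index 'start' into one aggregate;
 * return the index one past the last i/o merged.
 */
static int
aggregate(const pending_io_t *ios, int nios, int start, uint64_t limit)
{
	uint64_t first = ios[start].offset;
	int last = start;

	for (int i = start + 1; i < nios; i++) {
		if (ios[i].offset != ios[last].offset + ios[last].size)
			break;	/* not contiguous with the aggregate */
		uint64_t span = ios[i].offset + ios[i].size - first;
		if (span > limit && !ios[i].optional)
			break;	/* a required i/o must respect the limit */
		last = i;	/* optional i/os are exempt from the limit */
	}
	return (last + 1);
}

int
main(void)
{
	uint64_t limit = 128 * 1024;	/* default zfs_vdev_aggregation_limit */

	/* One leaf: 128KB column of block N, its 4KB pad sector, 128KB of block N+1. */
	pending_io_t ios[] = {
		{ 0,      131072, false },
		{ 131072, 4096,   true  },
		{ 135168, 131072, false },
	};

	int end = aggregate(ios, 3, 0, limit);
	printf("first aggregate merges i/os 0..%d; the pad sector %s included\n",
	    end - 1, (end > 1) ? "is" : "is not");
	return (0);
}

With the limit applied to optional i/os as well (the old behavior), the same input would stop the aggregate at the first 128KB write and leave the 4KB pad unwritten, producing the on-disk gap described above.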

Analysis sponsored by Intel Corp.

Results

(two result images attached to the PR)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • Change has been approved by a ZFS on Linux member.

@mention-bot commented:

@ahrens, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @mkjorling and @dpquigl to be potential reviewers.

@behlendorf added this to the 0.7.0 milestone Mar 27, 2017
@behlendorf added the Type: Performance label Mar 27, 2017
@adilger (Contributor) commented Mar 27, 2017

Brian, once this has soaked a bit on master, it would be great if it could be included in a 0.6.x maintenance release, since my understanding is that 0.7.0 is still a ways away and this impacts Lustre quite often.

@behlendorf (Contributor) commented:

@adilger I agree, since the improvement here is significant and the change itself is so straightforward.

@behlendorf (Contributor) left a review comment:

We're going to want to reconcile this fix with commit a58df6f, which introduced some bounds checking for zfs_vdev_aggregation_limit. It would be great to use this as an opportunity to bring the OpenZFS and Linux versions of this function back into sync.

Specifically, we want to keep the bounds checking which prevents users from setting zfs_vdev_aggregation_limit to something unsafe. And we want to ensure the same value of zfs_vdev_aggregation_limit is used for the duration of vdev_queue_aggregate(), otherwise it's possible to trigger the last ASSERT.

The current version of the patch hits this same ASSERT:

[ 2006.830120] VERIFY3(size <= limit) failed (131584 <= 131072)
[ 2006.835771] PANIC at vdev_queue.c:634:vdev_queue_aggregate()
[ 2006.841140] Showing stack for process 4367
[ 2006.845889] CPU: 0 PID: 4367 Comm: z_wr_int_4 Tainted: P           OE  ------------   3.10.0-514.10.2.el7.x86_64 #1
[ 2006.854425] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[ 2006.860213]  ffffffffa07e9b64 0000000080d33e64 ffff8801bf303ae8 ffffffff816864ef
[ 2006.866869]  ffff8801bf303af8 ffffffffa04b0e24 ffff8801bf303c80 ffffffffa04b0eef
[ 2006.873433]  0000000000000092 0000000000000030 ffff8801bf303c90 ffff8801bf303c30
[ 2006.880114] Call Trace:
[ 2006.883749]  [<ffffffff816864ef>] dump_stack+0x19/0x1b
[ 2006.888901]  [<ffffffffa04b0e24>] spl_dumpstack+0x44/0x50 [spl]
[ 2006.894391]  [<ffffffffa04b0eef>] spl_panic+0xbf/0xf0 [spl]
[ 2006.899519]  [<ffffffff81698831>] ? ftrace_call+0x5/0x2f
[ 2006.904563]  [<ffffffff81698831>] ? ftrace_call+0x5/0x2f
[ 2006.909529]  [<ffffffffa06c54b8>] vdev_queue_io_to_issue+0x998/0xc90 [zfs]
[ 2006.915318]  [<ffffffffa06c5e3b>] vdev_queue_io_done+0x1cb/0x370 [zfs]
[ 2006.920959]  [<ffffffffa0715a48>] zio_vdev_io_done+0xe8/0x210 [zfs]
[ 2006.926412]  [<ffffffffa0716d8e>] zio_execute+0xee/0x300 [zfs]
[ 2006.931663]  [<ffffffffa04adeba>] taskq_thread+0x28a/0x590 [spl]
[ 2006.937029]  [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
[ 2006.942178]  [<ffffffffa04adc30>] ? taskq_thread_should_stop+0xa0/0xa0 [spl]
[ 2006.947958]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 2006.952591]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 2006.957379]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 2006.962110]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
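
For illustration only (this is not the actual patch), the usual way to satisfy the "same value for the duration of the function" requirement above is to sample the tunable once into a local variable, as in this hypothetical sketch:

/*
 * Hypothetical sketch, not the actual patch: a runtime-tunable limit is
 * sampled once into a local variable at the top of the function, so every
 * size check and any later assertion compare against the same value even
 * if an administrator changes the tunable mid-call.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the module parameter; it can be changed at any time. */
static volatile uint64_t zfs_vdev_aggregation_limit = 131072;

static uint64_t
build_aggregate(uint64_t span)
{
	/* Sample the tunable once; every check below uses this snapshot. */
	uint64_t limit = zfs_vdev_aggregation_limit;

	if (span > limit)
		span = limit;
	/* ... the tunable may change here without affecting this call ... */
	assert(span <= limit);	/* always compares against the same snapshot */
	return (span);
}

int
main(void)
{
	printf("%llu\n", (unsigned long long)build_aggregate(200000));
	return (0);
}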

@ahrens (Member, Author) commented Mar 27, 2017

@behlendorf You're right that the assertion needs to be removed/relaxed. I think the other things you're asking for are done (keep the bounds checking of zfs_vdev_aggregation_limit), but it's true we could bring that to illumos.

@@ -631,7 +631,6 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio)
 		return (NULL);
 
 	size = IO_SPAN(first, last);
-	ASSERT3U(size, <=, limit);
Inline review comment from a contributor on the removed line:

We should drop this anyway since it's redundant with the VERIFY checks in abd_alloc_linear() / abd_alloc() which more correctly check that the size is less than SPA_MAXBLOCKSIZE.

@behlendorf (Contributor) commented:

@ahrens not a blocker, but just FYI: if you rebase this patch on master it will resolve the Kernel.org build failures.

ahrens added 2 commits March 27, 2017 15:36
behlendorf pushed a commit to LLNL/zfs that referenced this pull request Mar 28, 2017

Reviewed by: Saso Kiselkov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Issue openzfs#5931
@don-brady (Contributor) left a comment:

LGTM

@behlendorf merged commit 8542ef8 into openzfs:master Apr 10, 2017
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jun 7, 2017
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jun 8, 2017
tonyhutter pushed a commit that referenced this pull request Jun 9, 2017
OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations

Authored by: Matt Ahrens <[email protected]>
Reviewed by: Saso Kiselkov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Ported-by: Matt Ahrens <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/8005
OpenZFS-commit: openzfs/openzfs#321
Closes #5931
@RubenKelevra commented:

Thanks a lot! This fixed my poor raidz performance on one machine. :)

@ahrens deleted the hdd branch June 27, 2020 03:59