ZIL performance improvements #1013

dechamps · 2012-10-04T15:22:14Z

This brings some ZIL performance optimizations, with significant increases in synchronous write performance for some workloads and pool configurations.

See the individual commit messages for details.

This branch just passed two hours of ztest; I'll be running it overnight to confirm stability.

Both the SPL and the ZFS libspl export most of the atomic_* functions, except atomic_sub_* functions which are only exported by the SPL, not by libspl. This patch remedies that by implementing atomic_sub_* functions in libspl.

Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks accross vdevs is important for performance: indeed, using mutliple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implictly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that linguering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes.

In the current code, logbias=throughput implies the following: 1) All synchronous writes are logged in indirect mode. 2) The slog is not used. (1) makes sense because it avoids writing the data twice, which is obviously a good thing when the user wants maximum pool throughput. (2), however, is a surprising decision. Considering all writes are indirect, the log record doesn't contain the actual data, only pointers to DMU blocks. As a result, log records written in logbias=throughput mode are quite small, and as such, it doesn't make any sense to write them to the main pool since slogs are usually optimized for small synchronous writes. In fact, the current behavior is actually harmful for performance, because log blocks and data blocks from dmu_sync() seldom have the same allocation size and as a result are usually allocated from different metaslabs. This means that if a spindle has to write both log blocks and DMU blocks (which is likely to happen under heavy load), it will have to seek between the two. Allocating the log blocks from the slog pool instead of the main pool avoids these unnecessary seeks. This commit makes ZFS use the slog on datasets with logbias=throughput. Real-life performance testing shows a 50% synchronous write performance increase with some large commit sizes, and no negative effect in other cases.

htkmmo · 2012-10-04T20:12:28Z

This is great news :) What kernel/setup are you using for testing?

I'm a recent user of ZFS and I'm quite thrilled with its capabilities (root disk is a 2xIntel X25-E SSD with gzip-4 compression). Works great!

I do have a problem though, since the upgrade on kernel 3.5.3 (and latter 3.5.5) the server started experiencing sudden deaths (complete hang-up, no error log, no BMC error, nothing visible ...). It can happen once or 2-3 times a day (the machine is mostly idle). I had ZFS as modules and latter built-in. This is not necessarily a ZFS issue, could be hardware, real bugs (spiders), etc ... No idea what to do except maybe switch to a kernel you have thoroughly tested on. (if you read this far, "Thanks!")

dechamps · 2012-10-04T20:23:59Z

This has nothing to do with the pull request. If you're experiencing issues, then please post on the zfs-discuss mailing list, or create an issue on this tracker if you have detailed information to provide.

(a side note though: if you're using SSDs as your disks, you might be interested in #924.)

htkmmo · 2012-10-04T20:34:35Z

Sorry. I read your comment and just wanted to know your test kernel version. (I didn't want to open an issue until I'm certain it's ZFS related).

dechamps · 2012-10-04T20:37:19Z

Most of my testing is done on 3.2.28. 3.5 is supposed to be supported, so if you have reason to believe there is an issue with ZFS on 3.5, then by all means, go ahead.

htkmmo · 2012-10-04T20:53:39Z

Thanks! I'll spend some time with 3.2.28 then.

I have a reason to suspect ZFS since the server was rock stable before including the ZFS kernel modules (and latter as built-in). Perhaps it's 3.5.x related ... The machine is an Intel S5000PSL board 24GB ECC-RAM, X5470 CPU so hardware wise rather solid. The problem is I can't get any info from the syslog when it hangs and none is visible when keeping a syslog trace on a remote machine. Not related with any user activity either so no idea what can trigger it :(

TRIM support for SSD's is great (X25-E's don't have it though so I'll keep it disabled also in the future testing).

behlendorf · 2012-10-04T21:08:53Z

These are some potentially huge performance improvements, but it's going to take me a while to digest them.

dechamps · 2012-10-04T21:14:03Z

I understand. I would have published these sooner but I didn't get permission from my employer until very recently. I was planning to fix #1012 at the same time but I won't have time to do that.

Both the SPL and the ZFS libspl export most of the atomic_* functions, except atomic_sub_* functions which are only exported by the SPL, not by libspl. This patch remedies that by implementing atomic_sub_* functions in libspl. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013

Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks accross vdevs is important for performance: indeed, using mutliple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implictly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that linguering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013

In the current code, logbias=throughput implies the following: 1) All synchronous writes are logged in indirect mode. 2) The slog is not used. (1) makes sense because it avoids writing the data twice, which is obviously a good thing when the user wants maximum pool throughput. (2), however, is a surprising decision. Considering all writes are indirect, the log record doesn't contain the actual data, only pointers to DMU blocks. As a result, log records written in logbias=throughput mode are quite small, and as such, it doesn't make any sense to write them to the main pool since slogs are usually optimized for small synchronous writes. In fact, the current behavior is actually harmful for performance, because log blocks and data blocks from dmu_sync() seldom have the same allocation size and as a result are usually allocated from different metaslabs. This means that if a spindle has to write both log blocks and DMU blocks (which is likely to happen under heavy load), it will have to seek between the two. Allocating the log blocks from the slog pool instead of the main pool avoids these unnecessary seeks. This commit makes ZFS use the slog on datasets with logbias=throughput. Real-life performance testing shows a 50% synchronous write performance increase with some large commit sizes, and no negative effect in other cases. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013

behlendorf · 2012-10-17T18:08:04Z

Merged after minor formatting fixing, and some extensive testing with and without ZIL devices. Thanks for pushing these. Concerning the TRIM support, I still want to get that in, perhaps for -rc13. However, it's proving a bit harder to test than I'd originally anticipated.

needs to work to compile

Both the SPL and the ZFS libspl export most of the atomic_* functions, except atomic_sub_* functions which are only exported by the SPL, not by libspl. This patch remedies that by implementing atomic_sub_* functions in libspl. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#1013

Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks accross vdevs is important for performance: indeed, using mutliple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implictly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that linguering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#1013

In the current code, logbias=throughput implies the following: 1) All synchronous writes are logged in indirect mode. 2) The slog is not used. (1) makes sense because it avoids writing the data twice, which is obviously a good thing when the user wants maximum pool throughput. (2), however, is a surprising decision. Considering all writes are indirect, the log record doesn't contain the actual data, only pointers to DMU blocks. As a result, log records written in logbias=throughput mode are quite small, and as such, it doesn't make any sense to write them to the main pool since slogs are usually optimized for small synchronous writes. In fact, the current behavior is actually harmful for performance, because log blocks and data blocks from dmu_sync() seldom have the same allocation size and as a result are usually allocated from different metaslabs. This means that if a spindle has to write both log blocks and DMU blocks (which is likely to happen under heavy load), it will have to seek between the two. Allocating the log blocks from the slog pool instead of the main pool avoids these unnecessary seeks. This commit makes ZFS use the slog on datasets with logbias=throughput. Real-life performance testing shows a 50% synchronous write performance increase with some large commit sizes, and no negative effect in other cases. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#1013

…nt (openzfs#1013) Bumps [serde_path_to_error](https://github.com/dtolnay/path-to-error) from 0.1.12 to 0.1.13. - [Release notes](https://github.com/dtolnay/path-to-error/releases) - [Commits](dtolnay/path-to-error@0.1.12...0.1.13) --- updated-dependencies: - dependency-name: serde_path_to_error dependency-type: indirect update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

dechamps added 3 commits October 4, 2012 14:39

Add atomic_sub_* functions to libspl.

b3b0114

Both the SPL and the ZFS libspl export most of the atomic_* functions, except atomic_sub_* functions which are only exported by the SPL, not by libspl. This patch remedies that by implementing atomic_sub_* functions in libspl.

ryao mentioned this pull request Oct 11, 2012

zfs-0.6.0rc8 (repo) mkdir can hang on ARM 32bit system #749

Closed

behlendorf closed this in 658a014 Oct 17, 2012

pyavdr mentioned this pull request Mar 26, 2013

Large synchronous writes are slow when a slog is present #1012

Closed

ryao referenced this pull request in openzfsonosx/zfs May 2, 2013

Bring in IllumOS zio.c, as it has different reexecute logic. Still

52abc31

needs to work to compile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZIL performance improvements #1013

ZIL performance improvements #1013

dechamps commented Oct 4, 2012

htkmmo commented Oct 4, 2012

dechamps commented Oct 4, 2012

htkmmo commented Oct 4, 2012

dechamps commented Oct 4, 2012

htkmmo commented Oct 4, 2012

behlendorf commented Oct 4, 2012

dechamps commented Oct 4, 2012

behlendorf commented Oct 17, 2012

ZIL performance improvements #1013

ZIL performance improvements #1013

Conversation

dechamps commented Oct 4, 2012

htkmmo commented Oct 4, 2012

dechamps commented Oct 4, 2012

htkmmo commented Oct 4, 2012

dechamps commented Oct 4, 2012

htkmmo commented Oct 4, 2012

behlendorf commented Oct 4, 2012

dechamps commented Oct 4, 2012

behlendorf commented Oct 17, 2012