zfs sync parallelism #15197
Conversation
@pcd1193182 I have been seeing intermittent pass/fails with some tests, though this one has consistently passed for me locally. Is there a way to narrow that down when running in the CI framework?
Please note, this has been addressed by the following commit:
@ednadolski-ix just a suggestion - you could try re-running your benchmarks with blk-mq enabled. No guarantee it's going to go faster though.
We are going to start testing this code internally and I will also start the review.
@tonyhutter I wonder how it is supposed to be related? It is a different layer of the stack. This patch is supposed to break the ~350K blocks per second async write issue limit, which is trivially reachable even if you write sequentially with large 16MB application writes, at which point zvol blk-mq should be irrelevant. Sure, blk-mq may have an effect if your application writes are also small, but that is a completely different problem.
@amotin ah ok, I must have misunderstood. I saw that these were zvol benchmarks, but it sounds like that's not necessarily important in the context of what this PR is trying to do.
@tonyhutter zvols here are important only from the perspective that they always have only one (serious) object, so unlike several files on one dataset, before this patch it was impossible to sync several of them at the same time. Several files on different datasets (one file per dataset) have the same problem, though. But if you write to several zvols at the same time to benefit from this patch, you likely won't be limited by the zvols themselves.
I'm still working my way through this but I posted a few initial comments.
As part of transaction group commit, dsl_pool_sync() sequentially calls dsl_dataset_sync() for each dirty dataset, which subsequently calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU cores to run sync_dnodes_task() in taskq threads to sync the dirty dnodes (files).

There are two problems:

1. Each ZVOL in a pool is a separate dataset/objset having a single dnode. This means the objsets are synchronized serially, which leads to a bottleneck of ~330K blocks written per second per pool.

2. In the case of multiple dirty dnodes/files on a dataset/objset on a big system, they will be sync'd in parallel taskq threads. However, it is inefficient to use 75% of CPU cores of a big system to do that, because of (a) bottlenecks on a single write issue taskq, and (b) allocation throttling. In addition, if not for the allocation throttling sorting write requests by bookmarks (logical address), writes for different files may reach space allocators interleaved, leading to unwanted fragmentation.

The solution to both problems is to always sync no more and (if possible) no fewer dnodes at the same time than there are allocators in the pool.

Signed-off-by: Edmund Nadolski <[email protected]>
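To make the scheduling idea concrete, here is a minimal userspace sketch of the policy described above (this is not the actual OpenZFS code, and all sim_* names are hypothetical): never run more concurrent dnode-sync workers than there are allocators, and keep each worker tied to one allocator so its writes stay grouped rather than interleaved.

```c
/* Toy model only: simulates capping sync parallelism at the allocator count. */
#include <pthread.h>
#include <stdio.h>

#define	SIM_ALLOCATORS		4	/* stand-in for the pool's allocator count */
#define	SIM_DIRTY_DNODES	16	/* dirty dnodes in this simulated txg */

typedef struct {
	int allocator_id;	/* the one allocator this worker issues to */
	int first_dnode;	/* first dnode index assigned to this worker */
	int stride;		/* workers walk dnodes in strides of SIM_ALLOCATORS */
} sim_worker_t;

static void *
sim_sync_worker(void *arg)
{
	sim_worker_t *w = arg;

	/* Each worker syncs only its own share of the dirty dnodes. */
	for (int dn = w->first_dnode; dn < SIM_DIRTY_DNODES; dn += w->stride)
		printf("allocator %d: syncing dnode %d\n", w->allocator_id, dn);
	return (NULL);
}

int
main(void)
{
	pthread_t tid[SIM_ALLOCATORS];
	sim_worker_t w[SIM_ALLOCATORS];

	/* Never start more sync workers than there are allocators. */
	for (int i = 0; i < SIM_ALLOCATORS; i++) {
		w[i] = (sim_worker_t){ .allocator_id = i, .first_dnode = i,
		    .stride = SIM_ALLOCATORS };
		pthread_create(&tid[i], NULL, sim_sync_worker, &w[i]);
	}
	for (int i = 0; i < SIM_ALLOCATORS; i++)
		pthread_join(tid[i], NULL);
	return (0);
}
```

The fixed worker-to-allocator pairing is the point of the fragmentation argument above: when one worker feeds one allocator, writes for different files are less likely to reach an allocator interleaved.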
- Given a reasonable number of syncthreads, assign each syncthread its own allocator.
- Create a separate write issue taskq for a given number of CPUs and statically assign each taskq to a specified syncthread.

Signed-off-by: Edmund Nadolski <[email protected]>
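A rough sketch of the static binding in the second bullet (again a hypothetical userspace illustration, not the ZFS taskq API): previously every sync thread fed a single shared write-issue taskq; with this change each sync thread maps to its own taskq.

```c
#include <stdio.h>

#define	SIM_SYNCTHREADS	4	/* hypothetical number of sync threads */

/* Old behaviour: all sync threads funnel writes into taskq 0. */
static int
sim_shared_taskq(int syncthread_id)
{
	(void) syncthread_id;
	return (0);
}

/* New behaviour: each sync thread is statically bound to its own taskq. */
static int
sim_bound_taskq(int syncthread_id)
{
	return (syncthread_id);
}

int
main(void)
{
	for (int st = 0; st < SIM_SYNCTHREADS; st++)
		printf("syncthread %d: shared taskq %d, bound taskq %d\n",
		    st, sim_shared_taskq(st), sim_bound_taskq(st));
	return (0);
}
```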
As part of transaction group commit, dsl_pool_sync() sequentially calls dsl_dataset_sync() for each dirty dataset, which subsequently calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU cores to run sync_dnodes_task() in taskq threads to sync the dirty dnodes (files).

There are two problems:

1. Each ZVOL in a pool is a separate dataset/objset having a single dnode. This means the objsets are synchronized serially, which leads to a bottleneck of ~330K blocks written per second per pool.

2. In the case of multiple dirty dnodes/files on a dataset/objset on a big system, they will be sync'd in parallel taskq threads. However, it is inefficient to use 75% of CPU cores of a big system to do that, because of (a) bottlenecks on a single write issue taskq, and (b) allocation throttling. In addition, if not for the allocation throttling sorting write requests by bookmarks (logical address), writes for different files may reach space allocators interleaved, leading to unwanted fragmentation.

The solution to both problems is to always sync no more and (if possible) no fewer dnodes at the same time than there are allocators in the pool.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Edmund Nadolski <[email protected]>
Closes openzfs#15197
Does this fix #10110?
It is not intended to address that.
Hi,
Thanks.
@datacore-rm It applies to any volblocksize; it is just that at 128KB, 330K blocks per second would be 40GB/s, which is difficult to reach on current hardware. At smaller blocks it is easier. The bottleneck is caused by all block writes being issued by only one sync thread. This patch allows using several CPU cores/threads if there are several dirty ZVOLs.
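For a rough sense of scale (my own back-of-the-envelope arithmetic, assuming the ~330K blocks per second per-pool limit from the commit message):

$$
330{,}000\ \text{blocks/s} \times 128\ \text{KiB} \approx 40\ \text{GiB/s}
\qquad\qquad
330{,}000\ \text{blocks/s} \times 4\ \text{KiB} \approx 1.26\ \text{GiB/s}
$$

The 4 KiB figure is roughly in line with the ~1267 MiB/s fio baseline reported in the PR description below.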
@tonyhutter @behlendorf @amotin My apologies if I overlooked it, but is this functionality / pull request (#15197) a candidate for OpenZFS 2.3.0? I would have tagged this prior to the OpenZFS 2.3.0-rc1 release, but I didn't see a PR for 2.3 like there were for the 2.2.x patch sets. In the PR (#16107) for the zfs-2.2.4 patch set, he had mentioned he wasn't comfortable including it in 2.2.x, as it was a fairly major change, but would consider it for ZFS 2.3.x per: #16107 (comment)
For my curiosity - does this fix/improve #10110?
@dmb1372 yes, this change is included in the OpenZFS 2.3.0-rc1 tag.
@devZer0 No, they are not really related. It is confusing, but sync threads in ZFS handle async writes. ;) The sync threads issue asynchronous data writes onto stable storage as part of transaction group commit, while #10110 is about sync writes being duplicated into the ZIL so they don't have to wait for the TXG commit to complete, which may take several seconds. The only way this patch may help there is if multi-threaded write issue reduces the TXG commit time and so reduces the average amount of backlogged data that still has to be written to the ZIL on an fsync() request.
Thank you for making it clear!
These changes address some performance and scaling limitations in the sync path.
Motivation and Context
Write performance scalability.
The targeted case is improving blocks per second for volblocksize=4k.
Description
Per commit descriptions.
How Has This Been Tested?
Local zfs-tests and buildbot tests.
Fio performance testing with multiple zvols in a pool.
Testing includes runs with --enable-debug set.
System: 32 cores (AMD EPYC 7313 16-core processors), 256 GB DRAM
Drives: 2x Samsung PM173X NVMe SSDs (3x 2TB namespaces each)
OS: Ubuntu 22.04.1, kernel 6.2.0-26-generic
fio with 12x zvols, prior to changes:
Run status group 0 (all jobs):
WRITE: bw=1267MiB/s (1328MB/s), 1267MiB/s-1267MiB/s (1328MB/s-1328MB/s), io=148GiB (159GB), run=120009-120009msec
fio with changes applied:
Run status group 0 (all jobs):
WRITE: bw=2380MiB/s (2495MB/s), 2380MiB/s-2380MiB/s (2495MB/s-2495MB/s), io=279GiB (299GB), run=120007-120007msec
Types of changes
Checklist: