zfs sync parallelism #15197
Conversation
@pcd1193182 I have been seeing intermittent pass/fails with some tests, though this one has consistently passed for me locally. Is there a way to narrow that down when running in the CI framework?
Please note, this has been addressed by the following commit:
@ednadolski-ix just a suggestion - you could try re-running your benchmarks with blk-mq enabled. No guarantee it's going to go faster though.
We are going to start testing this code internally and I will also start the review.
@tonyhutter I wonder how it is supposed to be related? It is a different layer of the stack. This patch is supposed to break the ~350K blocks per second async write issue limit, which is trivially reachable even if you write sequentially with large 16MB application writes, at which point zvol blk-mq should be irrelevant. Sure, blk-mq may have an effect if your application writes are also small, but that is a completely different problem.
@amotin ah ok, I must have misunderstood. I saw that these were zvol benchmarks, but it sounds like that's not necessarily important in the context of what this PR is trying to do.
@tonyhutter zvols here are important only from the perspective that they always have only one (serious) object, so unlike several files on one dataset, before this patch it was impossible to sync several of them at the same time. Several files on different datasets (one file per dataset) have the same problem, though. But if you write to several zvols at the same time to benefit from this patch, you likely won't be limited by the zvols themselves.
I'm still working my way through this but I posted a few initial comments.
As part of transaction group commit, dsl_pool_sync() sequentially calls dsl_dataset_sync() for each dirty dataset, which subsequently calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU cores to run sync_dnodes_task() in taskq threads to sync the dirty dnodes (files).

There are two problems:

1. Each ZVOL in a pool is a separate dataset/objset having a single dnode. This means the objsets are synchronized serially, which leads to a bottleneck of ~330K blocks written per second per pool.

2. In the case of multiple dirty dnodes/files on a dataset/objset on a big system, they will be sync'd in parallel taskq threads. However, it is inefficient to use 75% of CPU cores of a big system to do that, because of (a) bottlenecks on a single write issue taskq, and (b) allocation throttling. In addition, if not for the allocation throttling sorting write requests by bookmarks (logical address), writes for different files may reach space allocators interleaved, leading to unwanted fragmentation.

The solution to both problems is to always sync no more and (if possible) no fewer dnodes at the same time than there are allocators in the pool.

Signed-off-by: Edmund Nadolski <[email protected]>
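To make the scheduling idea concrete, here is a minimal userspace sketch of the policy described above (this is not the actual OpenZFS code, and all sim_* names are hypothetical): never run more concurrent dnode-sync workers than there are allocators, and keep each worker tied to one allocator so its writes stay grouped rather than interleaved.

```c
/* Toy model only: simulates capping sync parallelism at the allocator count. */
#include <pthread.h>
#include <stdio.h>

#define	SIM_ALLOCATORS		4	/* stand-in for the pool's allocator count */
#define	SIM_DIRTY_DNODES	16	/* dirty dnodes in this simulated txg */

typedef struct {
	int allocator_id;	/* the one allocator this worker issues to */
	int first_dnode;	/* first dnode index assigned to this worker */
	int stride;		/* workers walk dnodes in strides of SIM_ALLOCATORS */
} sim_worker_t;

static void *
sim_sync_worker(void *arg)
{
	sim_worker_t *w = arg;

	/* Each worker syncs only its own share of the dirty dnodes. */
	for (int dn = w->first_dnode; dn < SIM_DIRTY_DNODES; dn += w->stride)
		printf("allocator %d: syncing dnode %d\n", w->allocator_id, dn);
	return (NULL);
}

int
main(void)
{
	pthread_t tid[SIM_ALLOCATORS];
	sim_worker_t w[SIM_ALLOCATORS];

	/* Never start more sync workers than there are allocators. */
	for (int i = 0; i < SIM_ALLOCATORS; i++) {
		w[i] = (sim_worker_t){ .allocator_id = i, .first_dnode = i,
		    .stride = SIM_ALLOCATORS };
		pthread_create(&tid[i], NULL, sim_sync_worker, &w[i]);
	}
	for (int i = 0; i < SIM_ALLOCATORS; i++)
		pthread_join(tid[i], NULL);
	return (0);
}
```

The fixed worker-to-allocator pairing is the point of the fragmentation argument above: when one worker feeds one allocator, writes for different files are less likely to reach an allocator interleaved.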
- Given a reasonable number of syncthreads, assign each syncthread its own allocator.
- Create a separate write issue taskq for a given number of CPUs and statically assign each taskq to a specified syncthread.

Signed-off-by: Edmund Nadolski <[email protected]>
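A rough sketch of the static binding in the second bullet (again a hypothetical userspace illustration, not the ZFS taskq API): previously every sync thread fed a single shared write-issue taskq; with this change each sync thread maps to its own taskq.

```c
#include <stdio.h>

#define	SIM_SYNCTHREADS	4	/* hypothetical number of sync threads */

/* Old behaviour: all sync threads funnel writes into taskq 0. */
static int
sim_shared_taskq(int syncthread_id)
{
	(void) syncthread_id;
	return (0);
}

/* New behaviour: each sync thread is statically bound to its own taskq. */
static int
sim_bound_taskq(int syncthread_id)
{
	return (syncthread_id);
}

int
main(void)
{
	for (int st = 0; st < SIM_SYNCTHREADS; st++)
		printf("syncthread %d: shared taskq %d, bound taskq %d\n",
		    st, sim_shared_taskq(st), sim_bound_taskq(st));
	return (0);
}
```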
As part of transaction group commit, dsl_pool_sync() sequentially calls dsl_dataset_sync() for each dirty dataset, which subsequently calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU cores to run sync_dnodes_task() in taskq threads to sync the dirty dnodes (files).

There are two problems:

1. Each ZVOL in a pool is a separate dataset/objset having a single dnode. This means the objsets are synchronized serially, which leads to a bottleneck of ~330K blocks written per second per pool.

2. In the case of multiple dirty dnodes/files on a dataset/objset on a big system, they will be sync'd in parallel taskq threads. However, it is inefficient to use 75% of CPU cores of a big system to do that, because of (a) bottlenecks on a single write issue taskq, and (b) allocation throttling. In addition, if not for the allocation throttling sorting write requests by bookmarks (logical address), writes for different files may reach space allocators interleaved, leading to unwanted fragmentation.

The solution to both problems is to always sync no more and (if possible) no fewer dnodes at the same time than there are allocators in the pool.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Edmund Nadolski <[email protected]>
Closes openzfs#15197
Does this fix #10110?
It is not intended to address that.
Hi,
Thanks.
@datacore-rm It applies to any volblocksize; it is just that at 128KB, 330K blocks per second would be 40GB/s, which is difficult to reach on current hardware. At smaller blocks it is easier. The bottleneck is caused by all block writes being issued by only one sync thread. This patch allows using several CPU cores/threads if there are several dirty ZVOLs.
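For a rough sense of scale (my own back-of-the-envelope arithmetic, assuming the ~330K blocks per second per-pool limit from the commit message):

$$
330{,}000\ \text{blocks/s} \times 128\ \text{KiB} \approx 40\ \text{GiB/s}
\qquad\qquad
330{,}000\ \text{blocks/s} \times 4\ \text{KiB} \approx 1.26\ \text{GiB/s}
$$

The 4 KiB figure is roughly in line with the ~1267 MiB/s fio baseline reported in the PR description below.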
@tonyhutter @behlendorf @amotin My apologies if I overlooked it, but is this functionality / pull request (#15197) a candidate for OpenZFS 2.3.0? I would have tagged this prior to the OpenZFS 2.3.0-rc1 release, but I didn't see a PR for 2.3 like there were for the 2.2.x patch sets. In the PR (#16107) for the zfs-2.2.4 patch set, he had mentioned he wasn't comfortable including it in 2.2.x, as it was a fairly major change, but would consider it for ZFS 2.3.x per: #16107 (comment)
For my curiosity - does this fix/improve #10110?
@dmb1372 yes, this change is included in the OpenZFS 2.3.0-rc1 tag.
@devZer0 No, they are not really related. It is confusing, but sync threads in ZFS handle async writes. ;) The sync threads issue asynchronous data writes onto stable storage as part of transaction group commit, while #10110 is about sync writes being duplicated into the ZIL so they don't have to wait for the TXG commit to complete, which may take several seconds. The only way this patch may help there is if multi-threaded write issue reduces the TXG commit time and so reduces the average amount of backlogged data that still has to be written to the ZIL on an fsync() request.
Thank you for making it clear!
These changes address some performance and scaling limitations in the sync path.
Motivation and Context
Write performance scalability.
The targeted case is improving blocks per second for volblocksize=4k.
Description
Per commit descriptions.
How Has This Been Tested?
Local zfs-tests and buildbot tests.
Fio performance testing with multiple zvols in a pool.
Testing includes runs with --enable-debug set.
System: 32 cores (AMD EPYC 7313 16-core processors), 256 GB DRAM
Drives: 2x Samsung PM173X NVMe SSDs (3x 2TB namespaces each)
OS: Ubuntu 22.04.1, kernel 6.2.0-26-generic
fio with 12x zvols, prior to changes:
Run status group 0 (all jobs):
WRITE: bw=1267MiB/s (1328MB/s), 1267MiB/s-1267MiB/s (1328MB/s-1328MB/s), io=148GiB (159GB), run=120009-120009msec
fio with changes applied:
Run status group 0 (all jobs):
WRITE: bw=2380MiB/s (2495MB/s), 2380MiB/s-2380MiB/s (2495MB/s-2495MB/s), io=279GiB (299GB), run=120007-120007msec
Types of changes
Checklist: