
Improve zfs receive performance with lightweight write #11105

Merged 1 commit into openzfs:master on Dec 11, 2020

Conversation

ahrens
Member

@ahrens ahrens commented Oct 23, 2020

Motivation and Context

The performance of zfs receive can be bottlenecked on the CPU consumed
by the receive_writer thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each WRITE record in the send
stream.

I described this work in my talk at the 2020 OpenZFS Developer Summit: slides (link should start you at slide 25) and video (link should start you at 21:52)

Description

This commit introduces the concept of "lightweight writes", which allows
zfs receive to write to the DMU by providing an ABD and instantiating
only a new type of dbuf_dirty_record_t. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.

Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, zfs receive is write-only, so this works fine.

Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.

The code is also restructured in a few ways:

Added a dr_dnode field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs DB_DNODE_ENTER() and related
routines. The new field is needed by the lightweight-type dirty record.

To ensure that the dr_dnode field remains valid until the dirty record
is freed, we have to ensure that dnode_move() doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(userquota_updates_task()); this commit extends that so that the dnode
hold is always kept until zio completion (see dnode_rele_task()).

dn_dirty_txg was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new dnode_rele_task() code.

Removed some dead code related to DRR_WRITE_BYREF (dedup receive).

How Has This Been Tested?

Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.

Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%


Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ahrens added the Component: Send/Recv ("zfs send/recv" feature), Status: Code Review Needed (Ready for review and testing), and Type: Performance (Performance improvement or performance problem) labels on Oct 23, 2020
@codecov

codecov bot commented Oct 24, 2020

Codecov Report

Merging #11105 (44016c6) into master (e071625) will increase coverage by 0.13%.
The diff coverage is 91.98%.


@@            Coverage Diff             @@
##           master   #11105      +/-   ##
==========================================
+ Coverage   79.51%   79.65%   +0.13%     
==========================================
  Files         400      398       -2     
  Lines      127585   125866    -1719     
==========================================
- Hits       101449   100253    -1196     
+ Misses      26136    25613     -523     
Flag Coverage Δ
kernel 80.42% <93.38%> (-0.24%) ⬇️
user 65.40% <18.32%> (+1.58%) ⬆️


Impacted Files Coverage Δ
include/sys/dbuf.h 100.00% <ø> (ø)
module/zfs/dnode_sync.c 93.58% <ø> (ø)
module/zfs/dmu_recv.c 77.31% <82.92%> (+0.94%) ⬆️
module/zfs/dbuf.c 93.29% <95.00%> (+0.16%) ⬆️
module/zfs/dmu.c 87.15% <100.00%> (+0.31%) ⬆️
module/zfs/dmu_objset.c 91.65% <100.00%> (-0.14%) ⬇️
module/zfs/dnode.c 94.70% <100.00%> (+0.19%) ⬆️
module/zfs/dsl_dataset.c 92.48% <100.00%> (+0.32%) ⬆️
module/zfs/dsl_pool.c 94.35% <100.00%> (+0.04%) ⬆️
module/os/linux/spl/spl-zlib.c 55.35% <0.00%> (-28.58%) ⬇️
... and 123 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ahrens
Member Author

ahrens commented Nov 2, 2020

@behlendorf It looks like some of the ZTS runs are reporting memory leaks. Do you have any tricks for debugging that? E.g. can I get the stack traces where the leaked memory was allocated?

@behlendorf
Contributor

behlendorf commented Nov 3, 2020

Yes, in fact you can. The coverage builder's kmemleak checker will dump to the console log where the leaked memory was allocated. I believe this is the stack you're looking for (extracted and cleaned up from the following log).

http://build.zfsonlinux.org/builders/Ubuntu%2018.04%20x86_64%20Coverage%20%28TEST%29/builds/3522/steps/shell_9/logs/console

unreferenced object 0xffff880282fc84d8 (size 248):
  comm "zfs", pid 4996, jiffies 4301774189 (age 3484.756s)
  hex dump (first 32 bytes):
    03 00 00 00 00 00 10 00 00 01 00 00 00 00 ad de  ................
    00 02 00 00 00 00 ad de 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8185e02a>] kmemleak_alloc+0x7a/0x100
    [<ffffffff8130248f>] kmem_cache_alloc+0x1ef/0x370
    [<ffffffffa0a16e34>] spl_kmem_cache_alloc+0x94/0x370 [spl]
    [<ffffffffa0e6fd55>] abd_alloc_struct+0x25/0xe0 [zfs]
    [<ffffffffa0c7ef63>] abd_alloc_linear+0x23/0x120 [zfs]
    [<ffffffffa0ce44e0>] receive_read_record+0x120/0x470 [zfs]
    [<ffffffffa0ce7311>] dmu_recv_stream+0x4a1/0xac0 [zfs]
    [<ffffffffa0e32747>] zfs_ioc_recv_impl+0x3d7/0xe50 [zfs]
    [<ffffffffa0e3376d>] zfs_ioc_recv+0x1cd/0x360 [zfs]
    [<ffffffffa0e36272>] zfsdev_ioctl_common+0x852/0x8b0 [zfs]
    [<ffffffffa0e855c0>] zfsdev_ioctl+0x70/0x130 [zfs]
    [<ffffffff813578b1>] do_vfs_ioctl+0xa1/0x9c0
    [<ffffffff8135827d>] SyS_ioctl+0xad/0xe0
    [<ffffffff8186ebb7>] entry_SYSCALL_64_fastpath+0x1a/0xa5
    [<ffffffffffffffff>] 0xffffffffffffffff

@ahrens
Member Author

ahrens commented Nov 11, 2020

I fixed the leak (which occurs when receiving an incremental large-block stream into a dataset that previously did a non-large-block receive), and addressed code review comments that I received out-of-band.

@behlendorf
Contributor

@grwilson @pcd1193182 if possible would you mind wrapping up your reviews of this PR. Thanks.

@behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed label on Nov 20, 2020
@ahrens
Member Author

ahrens commented Dec 2, 2020

Rebased. @grwilson @pcd1193182 could you take a look at this?

The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes.  Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.

This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`.  The dbuf and arc buf for this
"dirty leaf block" are not instantiated.

Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data).  Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.

Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.

Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU.  On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.

Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%

The code is also restructured in a few ways:

Added a `dr_dnode` field to the dbuf_dirty_record_t.  This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines.  The new field is needed by the lightweight-type dirty record.

To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t.  To do this we keep a hold on the dnode until its zios have
completed.  This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that the dnode
hold is always kept until zio completion (see `dnode_rele_task()`).

`dn_dirty_txg` was previously zeroed when the dnode was synced.  This
was not necessary, since its meaning can be "when was this dnode last
dirtied".  This change simplifies the new `dnode_rele_task()` code.

Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).

Signed-off-by: Matthew Ahrens <[email protected]>
@behlendorf behlendorf merged commit ba67d82 into openzfs:master Dec 11, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#11105
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
Labels
Component: Send/Recv ("zfs send/recv" feature), Status: Accepted (Ready to integrate: reviewed, tested), Type: Performance (Performance improvement or performance problem)

5 participants