dsl_dataset: put IO-inducing frees on the pool deadlist
dsl_free() calls zio_free() to free the block. For most blocks, this
simply calls metaslab_free() without doing any IO or putting anything on
the IO pipeline.

Some blocks however require additional IO to free. This at least
includes gang, dedup and cloned blocks. For those, zio_free() will issue
a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible
for dsl_dataset_block_kill() to be called millions of times in a single
transaction (e.g. a 2T object of 128K blocks is 16M blocks). If those are
all IO-inducing frees, that then becomes 16M FREE IOs placed on the
pipeline. At the time of writing, a zio_t is 1280 bytes, so for just one
2T object that requires a 20G allocation of resident memory from the
zio_cache. If the kernel can't satisfy that, an out-of-memory condition
is raised.

This would be better handled by improving the cases that the
dmu_tx_assign() throttle will handle, or by reducing the overheads
required by the IO pipeline, or with a better central facility for
freeing blocks.

For now, we simply check for the cases that would cause zio_free() to
create a FREE IO, and instead put the block on the pool's free list
(dp_free_bpobj). This is the same place that blocks from destroyed
datasets go, and the async destroy machinery will automatically see
them and trickle them out as normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
robn authored and behlendorf committed Nov 12, 2024
1 parent 3bc2bea commit a824df7
1 changed file, 28 changes (26 additions, 2 deletions): module/zfs/dsl_dataset.c
@@ -68,6 +68,7 @@
 #include <sys/zio_compress.h>
 #include <zfs_fletcher.h>
 #include <sys/zio_checksum.h>
+#include <sys/brt.h>
 
 /*
  * The SPA supports block sizes up to 16MB. However, very large blocks
@@ -289,18 +290,41 @@ dsl_dataset_block_kill(dsl_dataset_t *ds, const blkptr_t *bp, dmu_tx_t *tx,
 	if (BP_GET_LOGICAL_BIRTH(bp) > dsl_dataset_phys(ds)->ds_prev_snap_txg) {
 		int64_t delta;
 
-		dprintf_bp(bp, "freeing ds=%llu", (u_longlong_t)ds->ds_object);
-		dsl_free(tx->tx_pool, tx->tx_txg, bp);
+		/*
+		 * Put blocks that would create IO on the pool's deadlist for
+		 * dsl_process_async_destroys() to find. This is to prevent
+		 * zio_free() from creating a ZIO_TYPE_FREE IO for them, which
+		 * are very heavy and can lead to out-of-memory conditions if
+		 * something tries to free millions of blocks on the same txg.
+		 */
+		boolean_t defer = spa_version(spa) >= SPA_VERSION_DEADLISTS &&
+		    (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
+		    brt_maybe_exists(spa, bp));
+
+		if (defer) {
+			dprintf_bp(bp, "putting on free list: %s", "");
+			bpobj_enqueue(&ds->ds_dir->dd_pool->dp_free_bpobj,
+			    bp, B_FALSE, tx);
+		} else {
+			dprintf_bp(bp, "freeing ds=%llu",
+			    (u_longlong_t)ds->ds_object);
+			dsl_free(tx->tx_pool, tx->tx_txg, bp);
+		}
+
 		mutex_enter(&ds->ds_lock);
 		ASSERT(dsl_dataset_phys(ds)->ds_unique_bytes >= used ||
 		    !DS_UNIQUE_IS_ACCURATE(ds));
 		delta = parent_delta(ds, -used);
 		dsl_dataset_phys(ds)->ds_unique_bytes -= used;
 		mutex_exit(&ds->ds_lock);
 
 		dsl_dir_diduse_transfer_space(ds->ds_dir,
 		    delta, -compressed, -uncompressed, -used,
 		    DD_USED_REFRSRV, DD_USED_HEAD, tx);
+
+		if (defer)
+			dsl_dir_diduse_space(tx->tx_pool->dp_free_dir,
+			    DD_USED_HEAD, used, compressed, uncompressed, tx);
 	} else {
 		dprintf_bp(bp, "putting on dead list: %s", "");
 		if (async) {
