Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Module Parameters Regarding Log Size Limit: Throttle write when the limit reached. #11876

Closed
wants to merge 1 commit into from

Conversation

jxdking
Copy link
Contributor

@jxdking jxdking commented Apr 10, 2021

I add 2 module parameters regarding total log data size. These can prevent slog device overflow, and long hang on fsync.
zfs_wrlog_data_max
zfs_wrlog_data_sync_percent

Motivation and Context

It resolves long hang regarding fsync under some specific work load. The issue can be reliably produced with fio.
Here is the script.
fio --name=fio-seq-write --rw=write --bs=1M --direct=1 --numjobs=1 --time_based=1 --runtime=30 --fsync=1000 --filename=fio-seq-write --size=20M --ioengine=libaio --iodepth=16
The command repeatedly overwrites a small file. Since the dirty data space never exceed zfs_dirty_data_sync_percent, write operations are cached in the memory and are never throttled. You will see over 1GB/s write speed on reasonable hardware. If slog device cannot keep up the speed, all the write operations are backed up in ram. That cause long system hang and huge memory usage.
The solution is find a way to "kick" txg sync based on current total size of zil log data (log could be either on memory, or on the slog device), and also throttle write when txg sync doesn't help.

Description

I add 2 module parameters regarding total log data size.
zfs_wrlog_data_max
The upper limit of TX_WRITE log data. Once it is reached, write operation is blocked, until log data is cleared out after txg sync.
It only counts TX_WRITE log with WR_COPIED or WR_NEED_COPY.
This defaults to the same value of zfs_dirty_data_max
zfs_wrlog_data_sync_percent
The least TX_WRITE log data (as a percentage of zfs_wrlog_data_max) to kick a txg sync.

How Has This Been Tested?

Tested with fio, please see the example above.

System:
Ubuntu groovy running in kvm, 8 cores, 2GB memory. Host is using HDD (not SSD)
Created a zpool based on file with 1GB slog.
Run fio on the pool.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@jxdking
Copy link
Contributor Author

jxdking commented Apr 10, 2021

This should be a serious concern for any server running zfs. During the test, I find vm can trickle down fsync to the host. If there is no disk throttle in placed on the vm, the host could experience long hang.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Apr 12, 2021
@jxdking jxdking marked this pull request as ready for review April 12, 2021 12:07
@jxdking jxdking force-pushed the master branch 3 times, most recently from 8cd5a3d to df43d02 Compare April 13, 2021 19:19
include/sys/dsl_pool.h Outdated Show resolved Hide resolved
module/zfs/dsl_pool.c Outdated Show resolved Hide resolved
module/zfs/dsl_pool.c Outdated Show resolved Hide resolved

if (write_state == WR_COPIED || write_state == WR_NEED_COPY) {
dsl_pool_wrlog_delay(size, zilog->zl_dmu_pool, tx->tx_txg);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unconditionally blocking here on the txg sync may introduce some subtle problems since it will result in zfs_write() holding open its transaction handle until the txg sync completes. That can be problematic since if the pool gets suspended and the txg can't be written there's no way to unwind the in progress write.

A better place to wait (and kick) would be in dmu_tx_assign() which is already set up to delay when we're over the dirty limit and on error all callers are setup to unwind any open transactions. However, updating the wrlog accounting there would be tricky so it probably makes good sense to leave that bit here.

@behlendorf behlendorf requested review from ahrens and amotin April 13, 2021 22:28
@jxdking jxdking force-pushed the master branch 3 times, most recently from 24d74f8 to ee5f60c Compare April 15, 2021 15:20
@jxdking jxdking requested a review from behlendorf April 15, 2021 19:52
zfs_wrlog_data_max
The upper limit of TX_WRITE log data. Once it is reached,
write operation is blocked, until log data is cleared out
after txg sync. It only counts TX_WRITE log with WR_COPIED
or WR_NEED_COPY.
This defaults to the same value of zfs_dirty_data_max

zfs_wrlog_data_sync_percent
The least TX_WRITE log data (as a percentage
of zfs_wrlog_data_max) to kick a txg sync.

Add zfs_wrlog_data_max and zfs_wrlog_data_sync_percent
to man/man5/zfs-module-parameters.5

Add write-transaction zil log data counter at end of the body of
zfs_log_write() and zvol_log_write().

Add delay logic into dmu_tx_try_assign().

Merge remote-tracking branch 'upstream/master'

Signed-off-by: jxdking <[email protected]>
@jxdking
Copy link
Contributor Author

jxdking commented Apr 16, 2021

Accidentally merged with others changes in to my fork. This pull request is closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants