Introduce write throttle smoothing #12868

allanjude · 2021-12-16T16:41:22Z

To avoid stalls and uneven performance during heavy write workloads,
continue to apply the write throttling even when the amount of dirty data
has temporarily dipped below zfs_delay_min_dirty_percent for up to
zfs_write_smoothing seconds.

Motivation and Context

Consider a case where new data is written to the ZFS dirty buffers at the rate that the backing storage cannot keep up with. ZFS allows writing data at full speed until the amount of dirty data in memory reach the threshold (zfs_delay_min_dirty_percent, default is 60%). When this point is reached, ZFS starts to apply artificial delays. The applied delays scales based on the amount of dirty data and is described by the formula:

delay_ns = zfs_delay_scale * (amount_of_dirty_data - delay_min_bytes) / (zfs_dirty_data_max - amount_of_dirty_data)

While data is added at a somewhat consistent rate, data is drained from the dirty buffers in large swaths as each transaction group is flushed to disk. This can result in the amount of dirty data dropping below the 60% threshold again, resulting in no delay being applied, and ZFS accepts data at full speed.

This causes applications to experience significant changes in write throughput depending on the amount of dirty data, especially on very heavy write workloads. The behavior is noticeable because the delay is applied only when the threshold is exceeded.

Description

The idea is to continue to apply the write throttle (adding delay to every write) for a short time even after the amount of dirty data has fallen below the threshold. The smoothing mechanism continues to be applied for zfs_write_smoothing seconds after the last time when the amount of dirty data was above the threshold. Thanks to this, applications will not be able to burst writes again if the backing storage is still possibly unable to keep up with the level of incoming writes.
The additional delay will be applied only when the backing storage has been overloaded within the last few seconds.

The original delaying algorithm works as before.

The formula used for the smoothing algorithm is as follow:

delay_ns = zfs_delay_scale * (amount_of_dirty_data) / ((zfs_dirty_data_max - amount_of_dirty_data));

The characteristic of this formula is shown in the Figure below in yellow color. It’s more aggressive than the previous one because when the dirty data exceeds the threshold, we apply the previous formula.

How Has This Been Tested?

We performed several tests using the fio utility. We observed the amount of dirty data during these tests and applied delays. Both those observation was done using eBPF.
Example of eBPF to observe the change in the amount of data buffers:

bpftrace -e ‘kprobe:dsl_pool_dirty_delta {@val[tid] = (int8 *)arg0 + 0x248;} kretprobe:dsl_pool_dirty_delta {printf(“%lld %lld\n”, nsecs, *(uint64 *)@val[tid]); delete(@val[tid]); }’

To observe the applied delay:

bpftrace -e ‘kprobe:dmu_tx_assign { @time[tid] = 0 } kprobe:dmu_tx_wait { @tmp[tid] = nsecs } kretprobe:dmu_tx_wait /@tmp[tid]/ { @time[tid] += nsecs - @tmp
[tid] } kretprobe:dmu_tx_assign { @ns = hist(@time[tid]); delete(@time[tid]) } END { clear(@time); clear(@tmp); }’

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

man/man4/zfs.4

amotin · 2021-12-27T17:09:15Z

module/zfs/dmu_tx.c

+	}
+	if (last_smooth > now) {
+		smoothing_time = zfs_smoothing_scale *
+		    (dirty >> 1) / ((delay_min_bytes - dirty) << 1);


What is the magic with the shifts? Wouldn't decreasing zfs_smoothing_scale 4 times do the same?

If dirty == delay_min_bytes here should be division by zero. If it goes higher, then smoothing time will be zero due to negative overflow.

Yea, the magic with the number is over complicated.
I will simplify that, thank you!
Dividing by zero is a nice catch - It showed us that we made a mistake when putting this commit together. We have used different formulas in the descriptiona and final commit.

amotin · 2021-12-27T17:11:51Z

module/zfs/dmu_tx.c


-	if (dirty <= delay_min_bytes)
+	now = gethrtime();


Additional gethrtime() per I/O may be not huge, but not free also.

This one actually isn't per IO - this one is in one that we "almost know" that we have to delay.
The one per IO would be in dsl_pool_need_dirty_delay.
But in general I agree that this isn't free.

amotin · 2021-12-27T17:31:35Z

The problem of throughput fluctuation with txg commits is known for a long time. But IIRC Matt Ahrens already tried to fix it by decrementing ammount of dirty data on write ZIOs completion. If you still observe significant correlation between the dirty data fluctuation and TXG commits, then I guess it may be some issue with that write completion accounting (there are no leaks by design, since on each TXG commit all that left is cleared any way), or on opposite side there could be periods of time at the end of TXG commit where little data written, but much time spent, including on cache syncs. Or there may be some other sources of fluctuation, such as dedup tables, which update may take significant time at each TXG commit.

man/man4/zfs.4

oshogbo

Thank you for the review!

man/man4/zfs.4

oshogbo · 2022-01-08T00:45:07Z

module/zfs/dmu_tx.c


-	if (dirty <= delay_min_bytes)
+	now = gethrtime();


This one actually isn't per IO - this one is in one that we "almost know" that we have to delay.
The one per IO would be in dsl_pool_need_dirty_delay.
But in general I agree that this isn't free.

oshogbo · 2022-01-08T00:45:52Z

module/zfs/dmu_tx.c

+	}
+	if (last_smooth > now) {
+		smoothing_time = zfs_smoothing_scale *
+		    (dirty >> 1) / ((delay_min_bytes - dirty) << 1);


Yea, the magic with the number is over complicated.
I will simplify that, thank you!
Dividing by zero is a nice catch - It showed us that we made a mistake when putting this commit together. We have used different formulas in the descriptiona and final commit.

amotin · 2022-02-02T17:59:54Z

module/zfs/dsl_pool.c

@@ -103,6 +103,7 @@ unsigned long zfs_dirty_data_max = 0;
 unsigned long zfs_dirty_data_max_max = 0;
 int zfs_dirty_data_max_percent = 10;
 int zfs_dirty_data_max_max_percent = 25;
+unsigned long zfs_smoothing_write = 0;


I don't like when users are required to tune things for getting good results. If this patch is so good, why is it disabled by default?

And as I've written in my comment, it would be good to understand the actual cause of fluctuation. Why added write accounting did not make it smooth enough? Unavoidable fluctuations due to metadata updates and cache flushes during commit?

Alexander, thank you for raising your concerns.

In general, the existing mechanism should be enough in most cases.
We can tune it to slow down writes from some threshold, and the delay will increase over time.
Our test in our configuration means applying the delay very early and with more significant delays, which causes a drop in overlay performance.
But this ofc would make the writes more stable.

As you mention, the fluctuation with txg commits has been known for a long time, and this is a small
an implementation which in some cases may help users.
The issue is that the current mechanism is not adaptive. You apply delays without looking
at the current overload. If there is a big chunk of data and then nothing, do we
want to use the delay? And on the other hand, if we just flushed a big piece of data
and there is another chunk of data and another, should we apply the delays?
The current algorithm looks only at one variable, the current usage of the dirty buffers.

This patch is non-ideal - we agree. But we try to address this issue somehow and give at least some mechanism to test.
We can also look at this as the historical data to determine if we should or should not apply pressure.

Why is this disabled by default? We successfully tested it on our installation, but we are unsure how all configurations will react. We prefer to start small and see how the broader community will respond to this approach. We are more than happy to enable it by default at some point. On the other hand, if there will be a better approach or it seems unpracticed on most installations, we are more than happy to remove it. Fortunately, the mechanism isn't very intrusive. It is not an on-disk format change, so if we decide to adapt it, or replace it with another mechanism in the future, we won't have any problems doing that.

If you have another idea of dealing with fluctuation, we are more than happy to try it.

To avoid stalls and uneven performance during heavy write workloads, continue to apply write throttling even when the amount of dirty data has temporarily dipped below zfs_delay_min_dirty_percent for up to zfs_write_smoothing seconds. Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]>

gmelikov · 2022-03-02T09:33:08Z

man/man4/zfs.4

@@ -15,7 +15,7 @@
 .\" own identifying information:
 .\" Portions Copyright [yyyy] [name of copyright owner]
 .\"
-.Dd June 1, 2021
+.Dd January 8, 2021


Looks like unnecessary change.

Hym, but when we change the man page shouldn't we bump the .Dd?

Yeah, this is fine.

Sorry, should be more specific - January 8, 2021 is older than June 1, 2021. Or maybe it's a typo and it should be 2022?

Yep it should be 2022. Thanks!

problame · 2022-03-22T09:26:51Z

module/zfs/dmu_tx.c

@@ -1089,10 +1103,11 @@ dmu_tx_wait(dmu_tx_t *tx)
 			DMU_TX_STAT_BUMP(dmu_tx_dirty_over_max);
 		while (dp->dp_dirty_total >= zfs_dirty_data_max)
 			cv_wait(&dp->dp_spaceavail_cv, &dp->dp_lock);
+		last_smooth = dp->dp_last_smooth;


Why is the read of dp_last_smooth not being done in dmu_tx_delay, like the update?

Is it because we need to hold the mutex, and don't want to pay the cost of re-acquiring in dmu_tx_delay just to read the value?
Does it actually make a perf difference in practice?

Yes, its because we would need to take again a mutex.
This method is equivalent to dp_dirty_total which also requires mutex.

problame · 2022-03-22T09:40:24Z

include/sys/dsl_pool.h

@@ -115,6 +117,7 @@ typedef struct dsl_pool {
 	kcondvar_t dp_spaceavail_cv;
 	uint64_t dp_dirty_pertxg[TXG_SIZE];
 	uint64_t dp_dirty_total;
+	hrtime_t dp_last_smooth;


doc comment suggestion, if I understand the code correctly:

The point in time time after which the write smoothing mechanism has no effect.

If that interpretation of the variable's role is correct, I'd prefer a different name.
For example, dp_stop_write_smooth_after.

This isn't a variable described in doc.
This is variable to store a time of last smoothing.

problame · 2022-03-22T09:44:40Z

module/zfs/dmu_tx.c

-	if (dirty <= delay_min_bytes)
+	now = gethrtime();
+
+	if (dirty <= delay_min_bytes && last_smooth <= now)


We can avoid this early exit if we can take the "performance hit" of the two zero initializations in line 799 and 800, and the MIN(MAX(...)) code in line 811.

Doesn't it make code more readable?

problame · 2022-03-22T09:46:04Z

module/zfs/dmu_tx.c

+
+	if (dirty > delay_min_bytes) {
+		delay_time = zfs_delay_scale *
+		    (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty);


Should probably use the occasion to READ_ONCE(zfs_dirty_data_max) at the top of the function, into a const value.
Then make all use sites of that variable in this function use the const variable

Yes, I can try that.

oshogbo

Address comments from @problame

oshogbo · 2022-04-01T17:50:21Z

include/sys/dsl_pool.h

@@ -115,6 +117,7 @@ typedef struct dsl_pool {
 	kcondvar_t dp_spaceavail_cv;
 	uint64_t dp_dirty_pertxg[TXG_SIZE];
 	uint64_t dp_dirty_total;
+	hrtime_t dp_last_smooth;


This isn't a variable described in doc.
This is variable to store a time of last smoothing.

oshogbo · 2022-04-01T17:50:49Z

module/zfs/dmu_tx.c

-	if (dirty <= delay_min_bytes)
+	now = gethrtime();
+
+	if (dirty <= delay_min_bytes && last_smooth <= now)


Doesn't it make code more readable?

oshogbo · 2022-04-01T17:50:59Z

module/zfs/dmu_tx.c

+
+	if (dirty > delay_min_bytes) {
+		delay_time = zfs_delay_scale *
+		    (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty);


Yes, I can try that.

oshogbo · 2022-04-01T17:52:18Z

module/zfs/dmu_tx.c

@@ -1089,10 +1103,11 @@ dmu_tx_wait(dmu_tx_t *tx)
 			DMU_TX_STAT_BUMP(dmu_tx_dirty_over_max);
 		while (dp->dp_dirty_total >= zfs_dirty_data_max)
 			cv_wait(&dp->dp_spaceavail_cv, &dp->dp_lock);
+		last_smooth = dp->dp_last_smooth;


Yes, its because we would need to take again a mutex.
This method is equivalent to dp_dirty_total which also requires mutex.

amotin · 2022-05-17T18:57:02Z

I suspect that problem "fixed" in this PR was introduced by #12284 . If workload repeatedly overwrites the same data over and over again, log size grows faster than amount of dirty data in ARC, and instead of smooth write throttling in dmu_tx_wait() the mentioned PR triggers txg_wait_synced(dp, spa_last_synced_txg(spa) + 1), that blocks any further user writes until all dirty data is written and TXG synced, exactly as we could see on the graph provided during the previous monthly call. I am now working on amending the mentioned PR to make it also throttle writes smoothly.

amotin · 2022-05-18T16:16:51Z

I think #13476 should solve the problem better and this PR should be closed, unless you know some other pathological scenario still not covered.

allanjude · 2022-09-04T15:39:42Z

This pull request has been overcome in favour of:
#13838
and
#13839

not our work, just trying it out openzfs#12868 Signed-off-by: Pavel Snajdr <[email protected]>

behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 17, 2021

allanjude force-pushed the smoothing branch 2 times, most recently from 1384257 to bb2e82b Compare December 24, 2021 15:21

amotin reviewed Dec 27, 2021

View reviewed changes

nabijaczleweli reviewed Jan 4, 2022

View reviewed changes

man/man4/zfs.4 Outdated Show resolved Hide resolved

ahrens assigned mmaybee Jan 4, 2022

ahrens requested review from ahrens and grwilson January 4, 2022 22:28

oshogbo force-pushed the smoothing branch from bb2e82b to 98e009a Compare January 8, 2022 01:21

oshogbo reviewed Jan 8, 2022

View reviewed changes

oshogbo force-pushed the smoothing branch from 98e009a to ad3ee4a Compare January 8, 2022 01:45

allanjude force-pushed the smoothing branch from ad3ee4a to 7bc8852 Compare February 1, 2022 14:58

amotin reviewed Feb 2, 2022

View reviewed changes

allanjude force-pushed the smoothing branch from 7bc8852 to 0e0b869 Compare February 23, 2022 15:10

gmelikov reviewed Mar 2, 2022

View reviewed changes

problame reviewed Mar 22, 2022

View reviewed changes

oshogbo reviewed Apr 1, 2022

View reviewed changes

amotin mentioned this pull request May 18, 2022

Refactor Log Size Limit. #13476

Merged

13 tasks

This was referenced Sep 4, 2022

quota: disable quota check for ZVOL #13838

Merged

quota: extend quota for dataset #13839

Merged

allanjude closed this Sep 4, 2022

snajpa added a commit to vpsfreecz/zfs that referenced this pull request Sep 21, 2022

write throttle smoothing optional, default disabled

c5c3a1a

not our work, just trying it out openzfs#12868 Signed-off-by: Pavel Snajdr <[email protected]>

andrewd-zededa mentioned this pull request Feb 28, 2024

Update to OpenZFS 2.2.2 lf-edge/eve#3779

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce write throttle smoothing #12868

Introduce write throttle smoothing #12868

allanjude commented Dec 16, 2021 •

edited

Loading

amotin Dec 27, 2021

amotin Dec 27, 2021

oshogbo Jan 8, 2022

amotin Dec 27, 2021

oshogbo Jan 8, 2022

amotin commented Dec 27, 2021

oshogbo left a comment

oshogbo Jan 8, 2022

oshogbo Jan 8, 2022

amotin Feb 2, 2022

oshogbo Feb 14, 2022

gmelikov Mar 2, 2022

oshogbo Mar 2, 2022

nabijaczleweli Mar 2, 2022

gmelikov Mar 2, 2022

oshogbo Mar 2, 2022

problame Mar 22, 2022

oshogbo Apr 1, 2022

problame Mar 22, 2022

oshogbo Apr 1, 2022

problame Mar 22, 2022

oshogbo Apr 1, 2022

problame Mar 22, 2022

oshogbo Apr 1, 2022

oshogbo left a comment

oshogbo Apr 1, 2022

oshogbo Apr 1, 2022

oshogbo Apr 1, 2022

oshogbo Apr 1, 2022

amotin commented May 17, 2022

amotin commented May 18, 2022

allanjude commented Sep 4, 2022

Introduce write throttle smoothing #12868

Introduce write throttle smoothing #12868

Conversation

allanjude commented Dec 16, 2021 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amotin commented Dec 27, 2021

oshogbo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

oshogbo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amotin commented May 17, 2022

amotin commented May 18, 2022

allanjude commented Sep 4, 2022

allanjude commented Dec 16, 2021 •

edited

Loading