Introduce write throttle smoothing #12868
Conversation
Force-pushed from 1384257 to bb2e82b
module/zfs/dmu_tx.c
Outdated
	}
	if (last_smooth > now) {
		smoothing_time = zfs_smoothing_scale *
		    (dirty >> 1) / ((delay_min_bytes - dirty) << 1);
What is the magic with the shifts? Wouldn't decreasing zfs_smoothing_scale by a factor of 4 do the same?
If dirty == delay_min_bytes here, there will be a division by zero. If dirty goes higher, the smoothing time will be zero due to negative overflow.
Yeah, the magic with the numbers is overcomplicated.
I will simplify that, thank you!
Dividing by zero is a nice catch - it showed us that we made a mistake when putting this commit together. We used different formulas in the description and the final commit.
	if (dirty <= delay_min_bytes)
		now = gethrtime();
An additional gethrtime() per I/O may not be huge, but it is not free either.
This one actually isn't per I/O - it is on the path where we "almost know" that we have to delay.
The per-I/O call would be in dsl_pool_need_dirty_delay.
But in general I agree that this isn't free.
The problem of throughput fluctuation with txg commits has been known for a long time. But IIRC Matt Ahrens already tried to fix it by decrementing the amount of dirty data on write ZIO completion. If you still observe a significant correlation between the dirty-data fluctuation and TXG commits, then I guess it may be some issue with that write-completion accounting (there are no leaks by design, since on each TXG commit everything that is left is cleared anyway), or, on the opposite side, there could be periods at the end of a TXG commit where little data is written but much time is spent, including on cache syncs. Or there may be other sources of fluctuation, such as dedup tables, whose update may take significant time at each TXG commit.
Thank you for the review!
@@ -103,6 +103,7 @@ unsigned long zfs_dirty_data_max = 0;
 unsigned long zfs_dirty_data_max_max = 0;
 int zfs_dirty_data_max_percent = 10;
 int zfs_dirty_data_max_max_percent = 25;
+unsigned long zfs_smoothing_write = 0;
I don't like it when users are required to tune things to get good results. If this patch is so good, why is it disabled by default?
And as I've written in my comment, it would be good to understand the actual cause of the fluctuation. Why did the added write accounting not make it smooth enough? Unavoidable fluctuations due to metadata updates and cache flushes during commit?
Alexander, thank you for raising your concerns.
In general, the existing mechanism should be enough in most cases. We can tune it to slow down writes from some threshold, and the delay will increase over time. In our configuration that would mean applying the delay very early and with more significant delays, which causes a drop in overall performance. But this, of course, would make the writes more stable.
As you mention, the fluctuation with txg commits has been known for a long time, and this is a small implementation which in some cases may help users.
The issue is that the current mechanism is not adaptive. It applies delays without looking at the current overload. If there is a big chunk of data and then nothing, do we want to apply the delay? And on the other hand, if we just flushed a big piece of data and there is another chunk of data, and another, should we apply the delays? The current algorithm looks at only one variable, the current usage of the dirty buffers.
This patch is not ideal - we agree. But we are trying to address this issue somehow and give at least some mechanism to test. We can also look at this as using historical data to determine whether we should apply pressure.
Why is this disabled by default? We successfully tested it on our installation, but we are unsure how all configurations will react. We prefer to start small and see how the broader community responds to this approach. We are more than happy to enable it by default at some point. On the other hand, if there is a better approach, or it proves impractical on most installations, we are more than happy to remove it. Fortunately, the mechanism isn't very intrusive. It is not an on-disk format change, so if we decide to adapt it, or replace it with another mechanism in the future, we won't have any problems doing that.
If you have another idea for dealing with the fluctuation, we are more than happy to try it.
To avoid stalls and uneven performance during heavy write workloads, continue to apply write throttling even when the amount of dirty data has temporarily dipped below zfs_delay_min_dirty_percent for up to zfs_write_smoothing seconds.

Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mariusz Zaborski <[email protected]>
@@ -15,7 +15,7 @@
 .\" own identifying information:
 .\" Portions Copyright [yyyy] [name of copyright owner]
 .\"
-.Dd June 1, 2021
+.Dd January 8, 2021
Looks like an unnecessary change.
Hmm, but when we change the man page, shouldn't we bump the .Dd?
Yeah, this is fine.
Sorry, I should be more specific - January 8, 2021 is older than June 1, 2021. Or maybe it's a typo and it should be 2022?
Yep, it should be 2022. Thanks!
@@ -1089,10 +1103,11 @@ dmu_tx_wait(dmu_tx_t *tx)
 		DMU_TX_STAT_BUMP(dmu_tx_dirty_over_max);
 		while (dp->dp_dirty_total >= zfs_dirty_data_max)
 			cv_wait(&dp->dp_spaceavail_cv, &dp->dp_lock);
+		last_smooth = dp->dp_last_smooth;
Why is the read of dp_last_smooth not being done in dmu_tx_delay, like the update?
Is it because we need to hold the mutex, and don't want to pay the cost of re-acquiring it in dmu_tx_delay just to read the value?
Does it actually make a perf difference in practice?
Yes, it's because we would need to take the mutex again.
This is equivalent to dp_dirty_total, which also requires the mutex.
@@ -115,6 +117,7 @@ typedef struct dsl_pool {
 	kcondvar_t dp_spaceavail_cv;
 	uint64_t dp_dirty_pertxg[TXG_SIZE];
 	uint64_t dp_dirty_total;
+	hrtime_t dp_last_smooth;
Doc comment suggestion, if I understand the code correctly:
"The point in time after which the write smoothing mechanism has no effect."
If that interpretation of the variable's role is correct, I'd prefer a different name. For example, dp_stop_write_smooth_after.
This isn't a variable described in the docs.
It is a variable to store the time of the last smoothing.
	if (dirty <= delay_min_bytes)
		now = gethrtime();

	if (dirty <= delay_min_bytes && last_smooth <= now)
We can avoid this early exit if we can take the "performance hit" of the two zero initializations in lines 799 and 800, and the MIN(MAX(...)) code in line 811.
Doesn't it make the code more readable?
	if (dirty > delay_min_bytes) {
		delay_time = zfs_delay_scale *
		    (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty);
Should probably use the occasion to READ_ONCE(zfs_dirty_data_max) at the top of the function, into a const value. Then make all use sites of that variable in this function use the const variable.
Yes, I can try that.
Address comments from @problame
I suspect that the problem "fixed" in this PR was introduced by #12284. If a workload repeatedly overwrites the same data over and over again, the log size grows faster than the amount of dirty data in ARC, and instead of smooth write throttling in dmu_tx_wait() the mentioned PR triggers txg_wait_synced(dp, spa_last_synced_txg(spa) + 1), which blocks any further user writes until all dirty data is written and the TXG synced, exactly as we could see on the graph provided during the previous monthly call. I am now working on amending the mentioned PR to make it also throttle writes smoothly.
I think #13476 should solve the problem better and this PR should be closed, unless you know some other pathological scenario still not covered.
not our work, just trying it out openzfs#12868

Signed-off-by: Pavel Snajdr <[email protected]>
To avoid stalls and uneven performance during heavy write workloads, continue to apply the write throttling even when the amount of dirty data has temporarily dipped below zfs_delay_min_dirty_percent, for up to zfs_write_smoothing seconds.

Sponsored by: Zededa Inc.
Motivation and Context
Consider a case where new data is written to the ZFS dirty buffers at a rate that the backing storage cannot keep up with. ZFS allows writing data at full speed until the amount of dirty data in memory reaches a threshold (zfs_delay_min_dirty_percent, default 60%). When this point is reached, ZFS starts to apply artificial delays, scaled based on the amount of dirty data.

While data is added at a somewhat consistent rate, data is drained from the dirty buffers in large swaths as each transaction group is flushed to disk. This can result in the amount of dirty data dropping below the 60% threshold again, resulting in no delay being applied, and ZFS accepting data at full speed again.
This causes applications to experience significant changes in write throughput depending on the amount of dirty data, especially on very heavy write workloads. The behavior is noticeable because the delay is applied only when the threshold is exceeded.
Description
The idea is to continue to apply the write throttle (adding delay to every write) for a short time even after the amount of dirty data has fallen below the threshold. The smoothing mechanism continues to be applied for zfs_write_smoothing seconds after the last time the amount of dirty data was above the threshold. Thanks to this, applications will not be able to burst writes again while the backing storage is possibly still unable to keep up with the level of incoming writes. The additional delay is applied only when the backing storage has been overloaded within the last few seconds.
The original delaying algorithm works as before.
The formula used for the smoothing algorithm is as follows (the same one that appears in the dmu_tx.c hunk above):

smoothing_time = zfs_smoothing_scale * (dirty >> 1) / ((delay_min_bytes - dirty) << 1)

The characteristic of this formula is shown in the figure below in yellow. It is more aggressive than the previous one, because once the dirty data exceeds the threshold we apply the previous formula.
How Has This Been Tested?
We performed several tests using the fio utility. During these tests we observed the amount of dirty data and the applied delays. Both observations were made using eBPF.
Example of eBPF to observe the change in the amount of data buffers:
To observe the applied delay: