
Slowdown IO scheduler based on dispatched/completed ratio #1766

Merged
merged 7 commits into scylladb:master from br-io-sched-no-capping on Oct 17, 2023

Conversation

xemul
Contributor

@xemul xemul commented Aug 2, 2023

There are three places where the IO dispatch loop is throttled:

  • self-throttling with token bucket according to math model
  • per-shard one-tick threshold
  • 2-bucket approach when tokens are replenished only after they are released from disk

This PR removes the last one, because it leads to self-slowdown in case of reactor stalls. This back-link was introduced to catch the case when the disk suddenly slows down, to stop the dispatcher from overloading it with requests, but effectively this back-link measures not the real disk dispatch rate, but the disk+kernel+reactor dispatch rate. Although the "kernel" part is tiny, the reactor part can grow large, triggering the self-slowdown effect.

Here's some math.

Let's assume that at some point the scheduler dispatched N_d requests. It means that it was able to grab N_d tokens in T_d duration; the rate of dispatch is R_d = N_d/T_d. The requests are to be completed by the reactor next tick. Let's assume it takes T_c time until the reactor gets there and completes N_c requests. The rate of completion is thus R_c = N_c/T_c. Apparently, N_c <= N_d, because the kernel cannot complete more requests than were queued.

In case the reactor experiences a stall during the completion tick, T_c > T_d, and since N_c <= N_d, consequently N_d/T_d > N_c/T_c. In case the reactor doesn't stall, the number of requests that will complete is N_c = N_d/T_d * T_c, because this is how the dispatch rate is defined. This is equivalent to N_c/T_c = N_d/T_d.

Finally: R_d >= R_c, i.e. the dispatch rate is equal to or greater than the completion rate, where the "equal" part is less likely and holds only if the reactor runs like clockwork and doesn't stall.

The mentioned back-link makes sure that R_d <= R_c; coupled with the stalls (even the small ones) this drives R_d down each tick, causing R_c to go down as well, and so on.

The removed fuse is replaced with a flow monitor based on the dispatch-to-completion ratio. Normally, the number of requests dispatched over a certain duration divided by the number of requests completed over the same duration must be 1.0; otherwise it would mean that requests accumulate in the disk. However, this ratio cannot be exactly 1.0 immediately, and in the longer run it tends to be slightly greater than 1.0: no matter how often the reactor polls the kernel for IO completions, it won't get more completions than requests were dispatched, but even a small delay in polling makes Nr_completed / duration smaller because of the larger denominator.

Having said that, the new back-link is based on the flow ratio. When the "average" value of the dispatched/completed ratio exceeds some threshold (configurable, 1.5 by default), the "cost" of individual requests increases, thus reducing the dispatch rate.

The main difference from the current implementation is that the new back-link is not "immediate". The averaging is an exponential moving average filter with 100ms updates and a 0.95 smoothing factor. The current back-link is immediate in the sense that a delay in delivering a completion immediately slows down the next tick's dispatch, thus accumulating spontaneous reactor micro-stalls.
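For illustration, here is a minimal sketch of such a monitor (the names, structure and exact update rule are my assumptions, not the actual Seastar code):

```cpp
// A sketch of the flow-ratio monitor idea: every ~100ms the dispatched/completed
// ratio of the last period is fed into an exponential moving average; once the
// smoothed value exceeds the threshold, it is used as a multiplier for request costs.
#include <cstdint>

struct flow_ratio_monitor {
    double smoothing = 0.95;   // EMA smoothing factor
    double threshold = 1.5;    // slow-down threshold (configurable)
    double ema = 1.0;          // smoothed dispatched/completed ratio
    uint64_t prev_dispatched = 0;
    uint64_t prev_completed = 0;

    // Called once per update period (~100ms) with cumulative counters.
    void update(uint64_t dispatched_total, uint64_t completed_total) {
        uint64_t d = dispatched_total - prev_dispatched;
        uint64_t c = completed_total - prev_completed;
        prev_dispatched = dispatched_total;
        prev_completed = completed_total;
        if (c == 0) {
            return; // nothing completed in this period -- skip the sample
        }
        ema = smoothing * ema + (1.0 - smoothing) * (double(d) / double(c));
    }

    // Multiplier applied to the per-request cost; 1.0 means "no slow-down".
    double cost_multiplier() const {
        return ema > threshold ? ema : 1.0;
    }
};
```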

This can be reproduced by the test introduced in #1724. It's not (yet) in the PR, but making the token release loop artificially release ~1% more tokens fixes this case, which also supports the theory that the reduced completion rate is the culprit. BTW, it cannot be the fix, because the ... over-release factor is not constant and it's hard to calculate.

fixes: #1641
refs: #1311
refs: #1492 (*) in fact, this is the metric that correlates with the flow ratio growing above 1.0, but this metric is sort of a look at quota violations from the IO angle
refs: #1774 this PR has attached metrics screenshots demonstrating the effect on stressed scylla

@avikivity
Member

I don't understand. What if the disk is 5% slower than the setting in io_properties.yaml? Won't the queue keep growing and latency grow with it?

@avikivity
Member

The mentioned back-link makes sure that R_d <= R_c, coupled with the stalls (even the small ones) this drives the R_d down each tick, causing the R_c to go down as well, then again.

I don't understand this either. After a reactor stall, all pending requests will complete, because the disk hasn't stalled. The scheduler is in pristine state.

It's true that stalls > io_latency_goal have a huge impact on throughput. But it could be compensated by increasing io_latency_goal (if throughput is more important than latency) or by fixing the stalls.

@xemul
Contributor Author

xemul commented Aug 2, 2023

Yes, it will. My point is that currently we cannot tell "the disk is 5% slower than the setting" from "the reactor delivers completions a bit later and this looks like the disk is 5% slower". And the attempt to slow down the dispatch rate leads to a lower completion rate (literally, because fewer requests are dispatched) which, in turn, slows down the dispatch rate even further.

@xemul
Contributor Author

xemul commented Aug 2, 2023

I don't understand this either. After a reactor stall, all pending requests will complete, because the disk hasn't stalled. The scheduler is in pristine state.

No it isn't. The scheduler assumes that N submitted requests will complete in T duration. They do, but the disk and the reactor may add "t" duration on top. The resulting completion rate is N / (T + t) < N / T.

@xemul
Contributor Author

xemul commented Aug 2, 2023

But it could be compensated by increasing io_latency_goal (if throughput is more important than latency) or by fixing the stalls.

When I say "stall" I don't mean "stalls that are reported". I mean -- the reactor polls for completions notably later than the disk completes the requests.

Let's take a disk of 1GB/s. Per 0.75ms (the default io latency goal) it should do 1GB * 0.75 / 1000 = ~786KB. Now the scheduler submitted them all in one poll. The expectation is that the token bucket will replenish all 786k tokens in 0.75ms. Now the reactor completes the requests in 0.80ms, just 50usec later. The token bucket will replenish only ~736k tokens, because that's the completion rate. Next tick the scheduler will be able to submit only 736k tokens, which is ~958MB/s, not 1GB/s. Next tick it will lose 7% more.
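To see how the loss compounds, here is a tiny toy model of the above arithmetic (the units and the constant 50usec delay are taken from the example; the code is only an illustration, not the scheduler's actual replenish logic):

```cpp
// Toy model of the compounding effect: tokens are replenished at the observed
// completion rate, and completions are always 50usec late.
#include <cstdio>

int main() {
    const double goal = 0.75e-3;   // io latency goal, seconds
    const double delay = 50e-6;    // extra completion delay per tick
    double tokens = 786e3;         // tokens dispatched in the first tick (~786KB worth)
    for (int tick = 1; tick <= 5; tick++) {
        double completion_rate = tokens / (goal + delay);  // tokens per second
        std::printf("tick %d: dispatched %.0f tokens, effective rate %.0f MB/s\n",
                    tick, tokens, completion_rate / 1e6);
        tokens = completion_rate * goal;  // next tick replenishes at the completion rate
    }
}
```

Each iteration loses another factor of 0.75/0.80, i.e. a few percent per tick, with no mechanism to recover it while the bucket keeps being drained.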

@avikivity
Member

Yes, it will. My point is that currently we cannot tell "the disk is 5% slower than the setting" from "the reactor delivers completions a bit later and this looks like the disk is 5% slower".

We have metrics for quota violations.

And "a bit" doesn't hurt as long as the io latency goal is large enough.

And the attempt to slow down the dispatch rate leads to a lower completion rate (literally, because fewer requests are dispatched) which, in turn, slows down the dispatch rate even further.

I don't understand. We should stuff enough requests to fill up the latency goal. Small stalls shouldn't affect it.

@xemul
Contributor Author

xemul commented Aug 2, 2023

I don't understand. We should stuff enough requests to fill up the latency goal.

We do, but "enough" == according to the number of tokens in a bucket.

Small stalls shouldn't affect it.

Small stalls put just a bit fewer tokens into the bucket during replenishment. These bits accumulate, and eventually the replenishment rate is notably lower.

@avikivity
Member

But it could be compensated by increasing io_latency_goal (if throughput is more important than latency) or by fixing the stalls.

When I say "stall" I don't mean "stalls that are reported". I mean -- the reactor polls for completions notably later than the disk completes the requests.

That's almost always true. Small reads take ~0.1ms. The task quota is 0.5ms.

Let's take a disk of 1GB/s. Per 0.75ms (the default io latency goal) it should do 1GB * 0.75 / 1000 = ~786KB. Now the scheduler submitted them all in one poll. The expectation is that the token bucket will replenish all 786k tokens in 0.75ms. Now the reactor completes the requests in 0.80ms, just 50usec later. The token bucket will replenish only ~736k tokens, because that's the completion rate. Next tick the scheduler will be able to submit only 736k tokens, which is ~958MB/s, not 1GB/s. Next tick it will lose 7% more.

If we regularly have a 0.3ms stall (regularly = almost every poll period), that's bad. If we have such a hiccup every 100 periods, it's unnoticeable.

Which is it?

We should increase the io latency goal. 0.75ms is too close to the task quota period.

@xemul
Contributor Author

xemul commented Aug 2, 2023

If we regularly have a 0.3ms stall (regularly = almost every poll period), that's bad. If we have such a hiccup every 100 periods, it's unnoticeable.

Which is it?

It doesn't really matter how frequent 0.3ms stalls are (in fact, the bad clusters I've seen all had ~10% of task-quota violations). If we replace 50usec with 5usec in the above math, the result will be the same, but the scheduler will lose 0.7% of tokens every tick, not 7%. The result is the same, but slower.

We should increase the io latency goal. 0.75ms is too close to the task quota period.

Every time a request is completed X% later than the math thinks it will be, the token bucket loses ~X% of the tokens in it. The percentages accumulate over time.

@avikivity
Member

I don't understand. We should stuff enough requests to fill up the latency goal.

We do, but "enough" == according to the number of tokens in a bucket.

Small stalls shouldn't affect it.

Small stalls put just a bit fewer tokens into the bucket during replenishment. These bits accumulate, and eventually the replenishment rate is notably lower.

I don't understand. A stall will give the disk time to catch up and complete all requests. After we recover from the stall, the buckets are fully replenished.

@avikivity
Member

If we regularly have a 0.3ms stall (regularly = almost every poll period), that's bad. If we have such a hiccup every 100 periods, it's unnoticeable.
Which is it?

It doesn't really matter how frequent 0.3ms stalls are (in fact, the bad clusters I've seen all had ~10% of task-quota violations). If we replace 50usec with 5usec in the above math, the result will be the same, but the scheduler will lose 0.7% of tokens every tick, not 7%. The result is the same, but slower.

How can we lose a token? We take it from the top bucket, wait for completion, put it in the bottom bucket, move it again. It's fully recycled.

We should increase the io latency goal. 0.75ms is too close to the task quota period.

Every time a request is completed X% later than the math thinks it will be, the token bucket loses ~X% of the tokens in it. The percentages accumulate over time.

Well that's broken, but I don't understand why.

@xemul
Contributor Author

xemul commented Aug 2, 2023

I don't understand. A stall will give the disk time to catch up and complete all requests.

Yes. The disk will complete all the requests. Then the replenisher will try to put more tokens into the bucket, because more time has passed since last time, but it won't be able to, because N_requests / Duration < N_requests / IO_quota.

After we recover from the stall, the buckets are fully replenished.

Provided nobody grabs from it, yes. If the IO stops for a while, the bucket is fully replenished.

@xemul
Contributor Author

xemul commented Aug 2, 2023

How can we lose a token? We take it from the top bucket, wait for completion, put it in the bottom bucket, move it again. It's fully recycled.

Token != request
Token == request / duration
We take request / task_quota, put back request / (task_quota + extra_delay)
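For example, with the default 0.5ms task quota and a hypothetical 50usec extra delay (illustrative numbers, not from the PR):

$$ \text{taken: } \frac{1\ \text{request}}{0.5\,\text{ms}} = 2000\ \text{req/s}, \qquad \text{put back: } \frac{1\ \text{request}}{0.5\,\text{ms} + 0.05\,\text{ms}} \approx 1818\ \text{req/s} $$

i.e. roughly 9% of the rate evaporates on that single delayed completion.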

@avikivity
Member

Sorry, I'm really confused. We probably have different mental models of how this works. Since you wrote it, your model is probably correct, but it doesn't help me understand what's going on.

@avikivity
Member

How can we lose a token? We take it from the top bucket, wait for completion, put it in the bottom bucket, move it again. It's fully recycled.

Token != request Token == request / duration We take request / task_quota, put back request / (task_quota + extra_delay)

Ah. That's wrong.

Token should be "normalized disk ability to do something". If you take it, you have to put it back.

Pulling something from the token bucket at a steady rate = queuing work at a disk at a steady rate.

@xemul
Contributor Author

xemul commented Aug 2, 2023

The "ability" nowadays means IOPS and bandwidth, which are speeds, not raw "number of requests" and "number of bytes" metrics. One cannot take "1 operation per second" thing from a bucket, instead, it takes an "1 operation" from a bucket, but it appears the re once every second. If it appears once every 1.01 seconds, then one will take "0.99 operations per second", which, in turn, will make it appear back in 1.02 seconds, and so on. This effect disappears in case token bucket is left idle for some time to fully replenish

@xemul
Contributor Author

xemul commented Aug 2, 2023

Let me try to demonstrate it with some ... weird artificial numbers, just to take a look at it from a different angle.

Let's take a t.b. with a 10 tokens/second rate. And the input flow into the scheduler at the same rate -- 10 requests/second. A tick is every second.

tick-1: 10 requests in the queue, 10 tokens in a bucket. All can be and are dispatched
tick-2: 10 requests complete, 10 tokens are put into the 'replenishing bucket'. And again, 10 requests in a queue, 10 tokens replenished, all can be and all are dispatched
tick-N: the same

This is how it should work. Now let's imagine that tick-X takes 1.1 seconds.

tick-x: 10 requests completed, 10 tokens are put into the replenishing bucket. 11 requests in the queue. 11 tokens are to be replenished (because 1.1 seconds passed since the last replenish), but (!) only 10 can be -- 1 token is lost, only 10 are replenished and only 10 (!) requests are submitted. The queue grew by 1 request.

None of the next ticks will drain this queue, because each second only 10 requests are completed and only 10 tokens are replenished. We've "lost" 1 token and can only get it back if 9 or fewer requests are queued, no other way. Even if tick-y takes 0.9 seconds it won't help, because we'll have 9 requests completed, 9 queued and ~~8~~ 9 tokens replenished.

upd: one misprint above at the "89 tokens"
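Below is a toy C++ simulation of this tick sequence (my own simplification of the 2-bucket behavior: arrivals come in at 10 req/s, the disk completes everything dispatched in the previous tick, and replenishment is capped by the previous tick's completions):

```cpp
// One tick of 1.1s leaves one request stuck in the queue forever, because the
// back-link never lets more tokens in than were completed.
#include <algorithm>
#include <cstdio>

int main() {
    const double rate = 10.0;                 // tokens per second
    const double tick_len[] = {1.0, 1.0, 1.1, 1.0, 1.0, 1.0};
    double queued = 0;                        // requests waiting to be dispatched
    double in_flight = 0;                     // dispatched, not yet completed
    for (int i = 0; i < 6; i++) {
        double completed = in_flight;         // disk finishes everything dispatched last tick
        queued += rate * tick_len[i];         // new arrivals during this tick
        double replenished = (i == 0) ? rate  // the bucket starts full
                           : std::min(rate * tick_len[i], completed);
        double dispatched = std::min(queued, replenished);
        queued -= dispatched;
        in_flight = dispatched;
        std::printf("tick %d (%.1fs): completed %2.0f, dispatched %2.0f, queue %2.0f\n",
                    i + 1, tick_len[i], completed, dispatched, queued);
    }
}
```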

@avikivity
Member

The "ability" nowadays means IOPS and bandwidth, which are speeds, not raw "number of requests" and "number of bytes" metrics. One cannot take "1 operation per second" thing from a bucket, instead, it takes an "1 operation" from a bucket, but it appears the re once every second. If it appears once every 1.01 seconds, then one will take "0.99 operations per second", which, in turn, will make it appear back in 1.02 seconds, and so on. This effect disappears in case token bucket is left idle for some time to fully replenish

We drain tokens at some rate. So if a token represents work, draining at some rate represents work per unit time.

@xemul
Contributor Author

xemul commented Aug 2, 2023

We drain tokens at some rate.

Yes, exactly. And we replenish tokens at some rate, and in order for it to work in sync the rate of filling the 2nd bucket should be at least the same, or faster. But since it's a rate, a delay in completing a request gives us a larger denominator in the "work / duration" fraction and a lower replenishing rate.

So if a token represents work, draining at some rate represents work per unit time.

@avikivity
Member

Let me try to demonstrate it with some .. weird artificial numbers just to take a look at it from the different angle.

Let's take a t.b. with 10 token / second rate. And the input flow into the scheduler at the same rate -- 10 requests / second. Tick is every second.

tick-1: 10 requests in the queue, 10 tokens in a bucket. All can be and are dispatched tick-2: 10 requests complete, 10 tokens are put into the 'replenishing bucket'. And again, 10 requests in a queue, 10 tokens replenished, all can be and all are dispatched tick-N: the same

This is how it should work. Now let's imagine that tick-X takes 1.1 seconds.

tick-x: 10 requests completed, 10 tokens are put into replenishing bucket. 11 requests in a queue. 11 tokens are to be replenished

Why do we want to replenish 11 and not 10?

A token bucket is supposed to represent a bucket filled with tokens, not a bucket filled with "work per unit time".

(because 1.1 seconds passed since the last replenish), but (!) only 10 can be -- 1 token is lost, only 10 are replenished and only 10 (!) requests are submitted. The queue grew by 1 request.

None of the next ticks will drain this queue, because each second only 10 requests are completed and only 10 tokens are replenished. We've "lost" 1 token and can only get it back if 9 or fewer requests are queued, no other way. Even if tick-y takes 0.9 seconds it won't help, because we'll have 9 requests completed, 9 queued and ~~8~~ 9 tokens replenished.

upd: one misprint above at the "89 tokens"

@xemul
Contributor Author

xemul commented Aug 2, 2023

Why do we want to replenish 11 and not 10?

Because we replenish "the amount of tokens for 1.1 second" which is time * rate = 1.1 * 10 = 11

@avikivity
Member

Why do we want to replenish 11 and not 10?

Because we replenish "the amount of tokens for 1.1 second" which is time * rate = 1.1 * 10 = 11

We should replenish the amount of tokens we got.

@xemul
Contributor Author

xemul commented Aug 2, 2023

We should replenish the amount of tokens we got.

This is going to be an advanced --max-io-requests thing, not the rate limiter based on ... cyan-purple diskplorer plots that tie the rates of {requests, bytes} x {read, write} to each other.

@xemul
Contributor Author

xemul commented Aug 2, 2023

We should replenish the amount of tokens we got.

By the way, we got 10 tokens in the above example. 11 is the amount of tokens we want as per time-based replenishment.

@xemul
Contributor Author

xemul commented Aug 3, 2023

One more example of how the back-link makes things worse.

Let's assume that we have 5000 requests to dispatch and each tick the scheduler can push 500 as per the rate limiter. And (!) each tick delays completion of 5% of the dispatched requests. In case the dispatch loop just does what it needs -- dispatches 500 requests at a time -- the ticks would look like this:

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11
dispatched | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 |  
completed  | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 250

It takes 10 ticks to dispatch everything plus one extra tick to pick up the delayed completions. Not a big deal. And the average dispatch rate of 5000/10=500 req/tick is almost on par with the completion rate of 5000/11=454 req/tick.

Now if we add a "back-link", i.e. each tick the scheduler can dispatch min(500, completed_on_previous_tick) requests, the result is notably worse:

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11 |  12 |  13 |  14 | 15
dispatched | 500 | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 134 |  
completed  | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 257 | 127 | 250

It takes 14 ticks to dispatch everything, plus the same unnoticeable extra tick to pick up those 250 delayed requests. The dispatch rate is 5000/14=357 req/tick and the completion rate is 5000/15=333 req/tick. ~30% of the throughput ability is lost.

This model has a flaw in the sense that in the latter case delayed requests would get completed two/three/... ticks later, yes, but it's a simplification. For more ... real-ish numbers I've created a test in #1724; it shows a very similar effect, it takes more "ticks" to reveal itself, but since each tick is 0.5ms, in real time it shows up within seconds.
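For reference, a small C++ sketch that reproduces both schedules above (it follows the same simplified model: 5% of every tick's dispatch slips to the final pick-up tick; per-tick numbers differ slightly from the tables because of integer rounding):

```cpp
// 5000 requests, at most 500 dispatched per tick. In the "back-link" variant
// the next dispatch is also capped by the previous tick's completions.
#include <algorithm>
#include <cstdio>

static int run(bool backlink) {
    int remaining = 5000, prev_completed = 500, ticks = 0;
    while (remaining > 0) {
        int cap = backlink ? std::min(500, prev_completed) : 500;
        int dispatched = std::min(remaining, cap);
        remaining -= dispatched;
        prev_completed = dispatched - dispatched / 20;  // 5% delayed until the end
        ticks++;
    }
    return ticks + 1;  // one extra tick picks up all the delayed completions
}

int main() {
    std::printf("without back-link: %d ticks\n", run(false));  // 11
    std::printf("with back-link:    %d ticks\n", run(true));   // 15
}
```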

@avikivity
Member

One more example of how the back-link makes things worse.

Let's assume that we've 5000 requests to dispatch and each tick a scheduler can push 500 as per the rate-limiter. And (!) each tick delays completion of 5% of the dispatched requests. In case dispatch loop just does what it needs -- dispatches 500 requests at a time, the ticks would look like this:

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11
dispatched | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 |  
completed  | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 250

It takes 10 ticks to dispatch everything plus one extra tick to pick up the delayed completions. Not a big deal. And the average dispatch rate of 5000/10=500 req/tick is almost on par with the completion rate of 5000/11=454 req/tick.

This is fine and after some time we'll dispatch and complete 475 each tick. But this corresponds to the current implementation as I understand it.

Now if adding a "back-link", i.e. -- each tick the scheduler can dispatch min(500, completed_on_previous_tick) requests, the result is notably worse

Well, why would we do that?

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11 |  12 |  13 |  14 | 15
dispatched | 500 | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 134 |  
completed  | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 257 | 127 | 250

It takes 14 ticks to dispatch everything, plus the same unnoticeable extra tick to pick up those 250 delayed requests. The dispatch rate is 5000/14=357 req/tick and the completion rate is 5000/15=333 req/tick. ~30% of the throughput ability is lost.

This model has a flaw in a sense that in the latter case delayed requests will get completed two/three/... ticks later, yes, but it's a simplification. For more ... real-ish numbers I've created a test in #1724 , it shows very similar effect, it takes more "ticks" to reveal itself, but since each tick is 0.5ms, in real time it will show up in seconds.

@xemul
Contributor Author

xemul commented Aug 3, 2023

Well, why would we do that?

Because that's how the 2 buckets work. A bucket doesn't get more tokens than were returned upon completion. I.e. if we dispatched 20, then 19 were completed and we replenish the t.b.: even if the time-based replenisher says we may have 20, we'd still replenish only 19.

@xemul
Contributor Author

xemul commented Oct 6, 2023

How do you prove that it stabilizes? And what's the motivation for increasing cost by flow_ratio?

The rate limiter's main equation is

$$ \frac{iops}{iops_{max}} + \frac{bandwidth}{bandwidth_{max}} \le 1 $$

or

$$ \frac{d}{dt} \left( \frac{ops}{iops_{max}} + \frac{bytes}{bandwidth_{max}} \right) \le 1 $$

We implement it via a token bucket with a rate of 1.0 and the cost of an individual request (cost == the amount of tokens it needs to carry) being

$$ cost = \frac{1}{iops_{max}} + \frac{size}{bandwidth_{max}} $$

Now let's assume that at some point the rate should be lowered (I'll get back to how we assume that below), because the disk doesn't keep up. It means that in the first formula we want to decrease the iops_max and bandwidth_max denominators.

$$ \frac{d}{dt} \left( \frac{ops}{iops_{max} \cdot \alpha} + \frac{bytes}{bandwidth_{max} \cdot \beta} \right) \le 1 , \quad \alpha < 1.0 , \; \beta < 1.0 $$

In this PR I say that

$$ \alpha = \beta = \frac {1}{\gamma} , \gamma > 1.0$$

because we don't know whether it was the IOPS or the bandwidth that overwhelmed the disk (spoiler: actually most likely it's the latter, but still), which means the new token bucket equation is

$$ \frac{d}{dt} \left( \gamma \cdot \left( \frac{ops}{iops_{max}} + \frac{bytes}{bandwidth_{max}} \right) \right) \le 1, \quad \gamma > 1.0 $$

The last equation means that, in order to make the scheduler work with lower IOPS and bandwidth, we may multiply the individual request cost by some number larger than 1.0.

Now how do we "assume that IOPS and bandwidth should be lowered"? There are also many ways to assume that, but in this PR I say that: the IOPS and bandwidth should be lowered in case scheduler submits more requests per second than the kernel completes requests per second in the long run. This is where the flow ratio came from.

Summary:

  • in our case, decreasing the t.b. rate can be achieved (though it's not strictly equivalent) by multiplying the request cost by some value larger than 1.0
  • the sign of "disk is overloaded" is when it completes requests at a lower rate than we dispatch

(upd: s/bandwidthbytes/bandwidth/ in some formulas)
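A minimal sketch of what this cost scaling amounts to (illustrative names only, not the actual Seastar code):

```cpp
// The per-request cost from the rate-limiter formula is multiplied by the
// smoothed flow ratio (gamma) once it crosses the configured threshold.
double request_cost(double size, double iops_max, double bandwidth_max,
                    double flow_ratio, double threshold) {
    double cost = 1.0 / iops_max + size / bandwidth_max;  // fraction of the 1.0 t.b. rate
    if (flow_ratio > threshold) {
        cost *= flow_ratio;  // gamma > 1.0 acts as lowering iops_max and bandwidth_max
    }
    return cost;
}
```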

@avikivity
Member

+2 points for inline equations

The last equation means, that in order to make scheduler work with lower IOPS and bandwidth, we may multiply individual request cost by some number larger than 1.0.

That's pretty simple, now that you've explained it so. Another way to look at it is that we make the consumption rate smaller than alpha (alpha<1) instead of smaller than 1.

Now how do we "assume that IOPS and bandwidth should be lowered"? There are also many ways to assume that, but in this PR I say that: the IOPS and bandwidth should be lowered in case scheduler submits more requests per second than the kernel completes requests per second in the long run. This is where the flow ratio came from.

But measured in consumption units, yes? especially if the bandwidth is the root cause.

But in the really long run, submits have to equal completes, no? So the flow ratio measures if we're in the process of moving from one equilibrium (where submits equal completes) to another (where they are also equal, but latency is much higher). I guess in the really long run, we achieve equilibrium by running out of concurrency, whereas we want to achieve equilibrium by limiting latency.

Summary:

  • in our case, decreasing the t.b. rate can be achieved (though it's not strictly equivalent) by multiplying the request cost by some value larger than 1.0
  • the sign of "disk is overloaded" is when it completes requests at a lower rate than we dispatch

(upd: s/bandwidthbytes/bandwidth/ in some formulas)

Is the algorithm stable? Suppose we increased the cost and achieved a flow ratio of 1. We must have -- we can't have a flow ratio > 1 for long periods. Then we jump back to gamma=1. Will we fluctuate between the two modes?

One way to test it is to run a test with iotune numbers adjusted upwards by 20%, so we know the disk is slower than what the scheduler thinks it is.

Maybe we should monitor latency directly, though that's a very noisy measurement. If moving_average(latency) > latency_goal, increase cost.

We also need to parametrize the algorithm. 100ms works for SSD (maybe too large), won't work for HDD, and then I have long conversations with @horschi.

@xemul
Contributor Author

xemul commented Oct 6, 2023

Another way to look at it is that we make the consumption rate smaller than alpha (alpha<1) instead of smaller than 1.

Possible, but the 1.0 on the right side of the equation is the token bucket filling rate. It's "shared" between shards, and tuning it in a share-nothing approach is more challenging than letting each shard increase its requests' cost.

But measured in consumption units, yes? especially if the bandwidth is the root cause.

No, measured in the number of requests

But in the really long run, submits have to equal completes, no? So the flow ratio measures if we're in the process of moving from one equilibrium (where submits equal completes) to another (where they are also equal, but latency is much higher). I guess in the really long run, we achieve equilibrium by running out of concurrency, whereas we want to achieve equilibrium by limiting latency.

In the really long run -- probably. But if we push more requests into the disk/kernel than the disk can handle, what will happen? AFAIK (but I can be wrong) the amount of data the disk and kernel can queue up internally is pretty large, if not "infinite". With a 10GB/s disk and 100ms averaging, in case the disk just stops, the kernel will need to queue just 1GB; if the disk is "suddenly" back to normal it will drain within the next 100ms.

Is the algorithm stable? suppose we increased the cost and achieved a flow ratio of 1. We must have, we can't have a flow ratio > 1 for long periods. Then we jump back to gamma=1. Will we fluctuate between the two modes?

Once every 100ms, it is possible in theory, yes. But why is that a problem? Is your point that "we better settle at gamma of, say, 1.01, rather than switching between 1.0 and, say, 1.1"?

Maybe we should monitor latency directly, though that's a very noisy measurement. If moving_average(latency) > latency_goal, increase cost.

I had this idea, but requests of different sizes and directions would have different latencies, say a 4k write is 50us, a 128k read can be 200us. If the former's latency goes up 3x, it's still below the latency goal, but the disk is likely in trouble already.

We also need to parametrize the algorithm. 100ms works for SSD (maybe too large), won't work for HDD

It is all configurable already, but I just noticed that I forgot to expose the averaging period as an option, only as an io-queue config parameter (the cost-increase threshold is an option in this PR).

@avikivity
Member

Another way to look at it is that we make the consumption rate smaller than alpha (alpha<1) instead of smaller than 1.

Possible, but the 1.0 on the right side of the equation is the token bucket filling rate. It's "shared" between shards, and tuning it in a share-nothing approach is more challenging than letting each shard increase its requests' cost.

Sure, I was more concerned in explaining it to myself than suggesting a code change.

But measured in consumption units, yes? especially if the bandwidth is the root cause.

No, measured in the number of requests

But in the really long run, submits have to equal completes, no? So the flow ratio measures if we're in the process of moving from one equilibrium (where submits equal completes) to another (where they are also equal, but latency is much higher). I guess in the really long run, we achieve equilibrium by running out of concurrency, whereas we want to achieve equilibrium by limiting latency.

In the really long run -- probably. But if we push more requests into the disk/kernel than the disk can handle, what will happen? AFAIK (but I can be wrong) the amount of data the disk and kernel can queue up internally is pretty large, if not "infinite". With a 10GB/s disk and 100ms averaging, in case the disk just stops, the kernel will need to queue just 1GB; if the disk is "suddenly" back to normal it will drain within the next 100ms.

The kernel queue is really the size of the aio context we allocate (summed across all shards). So we can control submit rate until we fill up all aio slots. Once we reach the limit, submit rate becomes equal to the completion rate.

Is the algorithm stable? suppose we increased the cost and achieved a flow ratio of 1. We must have, we can't have a flow ratio > 1 for long periods. Then we jump back to gamma=1. Will we fluctuate between the two modes?

Once every 100ms, it is possible in theory, yes. But why is that a problem? Is your point that "we better settle at gamma of, say, 1.01, rather than switching between 1.0 and, say, 1.1"?

Maybe we should monitor latency directly, though that's a very noisy measurement. If moving_average(latency) > latency_goal, increase cost.

I had this idea, but requests of different sizes and directions would have different latencies, say a 4k write is 50us, a 128k read can be 200us. If the former's latency goes up 3x, it's still below the latency goal, but the disk is likely in trouble already.

Why do we care? We have a latency goal.

Note if we have small requests, we also send very many of them, so their latency will also grow.

We also need to parametrize the algorithm. 100ms works for SSD (maybe too large), won't work for HDD

It is all configurable already, but I just noticed that I forgot to expose the averaging period as an option, only as io-queue config parameter (the cost-increase threshold is an option in this PR)

I'd still like to see a test with artificially raised io_properties.yaml. Will it survive?

@xemul
Contributor Author

xemul commented Oct 10, 2023

The kernel queue is really the size of the aio context we allocate (summed across all shards). So we can control submit rate until we fill up all aio slots. Once we reach the limit, submit rate becomes equal to the completion rate.

Yes, that's true

Why do we care? We have a latency goal.
Note if we have small requests, we also send very many of them, so their latency will also grow.

Because a one-shot measurement is not reliable. E.g. with latencies we don't measure disk latency, we measure disk + seastar latency, so even if a one-time measurement goes up it doesn't mean the disk got slow. This is exactly what the current approach has trouble with -- it notices that the short-term completion rate gets lower (because the reactor is a bit late in completing requests). For per-request latencies we need some long-term aggregate, but what can it be? The "expected" latency is not a single number we can take e.g. an average against.

@avikivity
Member

The kernel queue is really the size of the aio context we allocate (summed across all shards). So we can control submit rate until we fill up all aio slots. Once we reach the limit, submit rate becomes equal to the completion rate.

Yes, that's true

Why do we care? We have a latency goal.
Note if we have small requests, we also send very many of them, so their latency will also grow.

Because a one-shot measurement is not reliable. E.g. with latencies we don't measure disk latency, we measure disk + seastar latency, so even if a one-time measurement goes up it doesn't mean the disk got slow. This is exactly what the current approach has trouble with -- it notices that the short-term completion rate gets lower (because the reactor is a bit late in completing requests). For per-request latencies we need some long-term aggregate, but what can it be? The "expected" latency is not a single number we can take e.g. an average against.

Agree we can't use point-in-time measurements. But I want to see some kind of logic that explains the idea, not heuristics.

@avikivity
Member

Perhaps we should "learn" the actual disk capacity using a long-term moving average. But wait - that's effectively what you do. But I don't see why the flow ratio is the right trigger.

@xemul
Contributor Author

xemul commented Oct 10, 2023

(Another explanation from today's call) (Not yet the logic that explains the idea)

If the disk is working slower than we think it should from the io-properties.yaml file, then there can be two reasons for that.

  1. It's some temporary glitch. In that case the flow-ratio-based back-link expects that lowering the dispatch rate (which is the input for the disk) would help the disk to "take a breath". After a while the dispatch-to-completion ratio would recover and the scheduler would get back to its original rates.

  2. It's permanent, i.e. we overestimated the disk, and the numbers in the io-properties.yaml file are indeed too large. In that case the flow-ratio-based back-link indeed won't help. The disk will not get back to normal, it will continue to complete requests at the speed it can. The dispatch rate will be lowered and will become close to that "real" completion rate, the flow ratio will become 1.0 again and the scheduler will again start oversubscribing the disk.

In other words -- the proposed flow-ratio back-link is a way to overcome a temporary slowdown, but not a way to re-evaluate the disk capacity at run time.

@xemul
Contributor Author

xemul commented Oct 10, 2023

Perhaps we should "learn" the actual disk capacity using a long-term moving average.

This undermines the whole point of io-properties.yaml. If we can learn the actual disk capacity at runtime, then we can start with a default hard-coded throughput of 100MB/s and 100 IOPS and "learn" the real numbers eventually.

I'm not against that, but this sounds like completely different direction.

@xemul
Contributor Author

xemul commented Oct 12, 2023

The math behind flow-ratio.

di -- the number of requests dispatched at tick i
pi -- the number of requests processed by the disk at tick i
ci -- the number of requests completed by the reactor loop at tick i

We can observe di and ci in the dispatcher, but not pi, because we don't have direct access to the disk's queues.

After n ticks we have

Dn -- the total number of requests dispatched,
Pn -- the total number of requests processed,
Cn -- the total number of requests completed:

$$ D_n = \sum_{i=0}^n d_i $$

$$ P_n = \sum_{i=0}^n p_i $$

$$ C_n = \sum_{i=0}^n c_i $$

  • The disk cannot process more than was dispatched to it, but it can process less, "accumulating" a queue

$$ d_i \ge p_i \Rightarrow D_n \ge P_n$$

  • The reactor cannot complete more than was processed by the disk either

$$ p_i \ge c_i \Rightarrow P_n \ge C_n $$

Note that Dn > Pn means that the disk is falling behind. Our goal is to make sure the disk doesn't do that and doesn't accumulate a queue, i.e. Dn = Pn, but we cannot observe Pn directly, only Cn.

Next

$$ D_n - P_n = Qd_n \ge 0 $$

$$ P_n - C_n = Qc_n \ge 0 $$

$$ D_n - C_n = Qd_n + Qc_n \ge 0 $$

Qdn is the queue accumulated in the disk; we try to avoid it. Qcn is the accumulated delta between processed (by disk) and completed (by reactor) requests. We cannot reliably say that it goes to zero over time, because there's always some unknown amount of processed but not yet completed requests. So the above formula contains two unknowns and we cannot solve it. Let's try the other way:

$$ \frac {D_n} {P_n} = Rd_n \ge 1 $$

$$ \frac {P_n} {C_n} = Rc_n \ge 1 $$

$$ \frac {D_n} {C_n} = Rd_n \times Rc_n \ge 1 $$

Rdn is the ratio of dispatched to processed requests. If it's 1, we're OK; if it's greater, the disk is accumulating a queue, which we try to avoid. Rcn is the ratio between processed (by disk) and completed (by reactor) requests. It has a not immediately apparent, but very pretty property:

$$ \lim_{n\to\infty} Rc_n = \lim_{n\to\infty} \frac {P_n} {C_n} = \lim_{n\to\infty} \frac {C_n + Qc_n} { C_n} = \lim_{n\to\infty} (1 + \frac {Qc_n} {C_n} ) = 1 + \lim_{n\to\infty} \frac {Qc_n} {C_n}$$

Qcn doesn't grow over time: it's non-zero, but it's bounded above by some value, while Cn grows without bound. Respectively:

$$ \lim_{n\to\infty} Rc_n = 1 $$

$$ \lim_{n\to\infty} \frac {D_n} {C_n} = \lim_{n\to\infty} ( Rd_n \times Rc_n ) = \lim_{n\to\infty} Rd_n $$

IOW -- we can say whether the disk is accumulating a queue or not by observing the dispatched-to-completed ratio (to completed, not processed) over a long enough time.
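A small numeric illustration of that limit (toy numbers of my choosing: the disk processes 5% less than is dispatched, and the processed-but-not-completed backlog is bounded by 100 requests):

```cpp
// With a bounded backlog Qc, the observable D_n/C_n converges to the
// unobservable D_n/P_n as n grows.
#include <cstdio>

int main() {
    const double d = 500;   // dispatched per tick
    const double p = 475;   // processed per tick (disk 5% behind, queue grows)
    const double qc = 100;  // bounded "processed but not completed" backlog
    for (int n = 1; n <= 100000; n *= 10) {
        double D = d * n, P = p * n, C = P - qc;
        std::printf("n=%6d  D/P=%.4f  D/C=%.4f\n", n, D / P, D / C);
    }
}
```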

xemul added 7 commits October 12, 2023 12:37
It will be used later, but it is implicitly used already to print the
auto-adjust message.

BTW, the message can be printed twice, once for each stream, but streams
are configured using the same config, so the latency-goal will be the
same as well.

Signed-off-by: Pavel Emelyanov <[email protected]>
The monitor evaluates an EMA value for the dispatch-speed to completion-speed
ratio. The ratio is expected to be 1.0, but according to the patch
series description, sometimes it can grow larger.

The ratio of dispatch and completion rates is fed into an EMA filter every
100ms with a smoothing factor of 0.95; both values are placed in
io_queue::config and can later be added to io_properties or options.

The calculated value is exported as an io_queue_flow_control metric
labeled with mountpoint, shard and IO-group.

Signed-off-by: Pavel Emelyanov <[email protected]>
... and export flow_ratio.

Currently there are only per-class metrics, but the flow-rate is
evaluated per-queue.

Signed-off-by: Pavel Emelyanov <[email protected]>
According to the patch-series description, the completion rate is always
less than or equal to the dispatch rate. Making the dispatch rate
not exceed the completion rate makes both creep down in an endless
loop, so remove this back-link and make the IO scheduler work at the
configured rate all the time.

Signed-off-by: Pavel Emelyanov <[email protected]>
Make the slow-down link less immediate. When the disk is not overloaded
it should, in theory, complete as many requests per tick as were
dispatched to it previously. IOW, the ratio of dispatched requests to
completed requests should be 1.0 or, in the worst case, close to it.
In the "long run" it should be 1.0. If it's not, it means that requests
tend to accumulate in the disk, and that's a reason to slow down.

This ratio is tracked with the flow-rate monitor. It estimates the
ratio with an exponential moving average filter, adding new samples
once every 100ms with a smoothing factor of 0.95.

Next, the current formula for the rate limiter is

    IOPS / IOPS_max + Bandwidth / Bandwidth_max <= 1.0

and it's implemented as a token bucket with a replenish rate of 1 token
per second, with each request requiring

    capacity = weight / A + size / B

tokens (fraction) to proceed.  To slow things down there are two options:

 - reduce the 1.0 constant
 - increase capacities of requests

This set walks the second route, as touching the t.b. rate might be too
tricky, because the t.b. is shared across shards and shards would step on
each other's toes doing that. So instead, the patch increases
the requests' cost, multiplying it by the aforementioned ratio. Since in
real load the ratio is always strictly greater than 1.0, a threshold is
added below which the scheduler continues working at the configured
rate.

Signed-off-by: Pavel Emelyanov <[email protected]>
It includes the basic rate-limiter equation, a brief description of
how it maps to the token bucket algo, how the slow-down is performed by just
adjusting the request cost, and the justification for the flow ratio.

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul xemul force-pushed the br-io-sched-no-capping branch from bda1287 to 9a154ab on October 12, 2023 10:51
@xemul
Contributor Author

xemul commented Oct 12, 2023

upd:

  • make flow-ratio duration depend on io_latency_goal
  • reduce slow-down threshold to 1.1
  • added doc file describing math behind the model

@avikivity avikivity merged commit eefa837 into scylladb:master Oct 17, 2023
xemul added a commit to scylladb/scylla-seastar that referenced this pull request Feb 8, 2024
…o branch-5.2

This backports scylladb/seastar#1766

* br-double-bucket-slowdown-backport-5.2:
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
  fair_queue: Do not re-evaluate request capacity twice
xemul added a commit to scylladb/scylla-seastar that referenced this pull request Feb 8, 2024
… branch-5.4

This backports scylladb/seastar#1766

* br-double-bucket-slowdown-backport-5.4:
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
graphcareful pushed a commit to graphcareful/seastar that referenced this pull request Mar 20, 2024
…m Pavel Emelyanov

Closes scylladb#1766

* github.com:scylladb/seastar:
  doc: Add document describing all the math behind IO scheduler
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
avikivity added a commit that referenced this pull request Aug 8, 2024
The scheduler's goal is to make sure that dispatched requests complete no later than the io-latency-goal duration (which defaults to 1.5 times the reactor CPU latency goal). In cases when the disk or kernel slows down, there's a flow-ratio guard that slows down the dispatch rate as well [1]. However, sometimes it may not be enough and requests delay their execution even more. Slowly executing requests are indirectly reported via the total-disk-delay metric [2], but this metric accumulates several requests into one counter, potentially smoothing out spikes with good requests. This detector is aimed at detecting individual slow requests and logging them along with some related statistics that should help to understand what's going on.

refs: #1766 [1] (eefa837) [2] (0238d25)
fixes: #1311
closes: #1609

Closes #2371

* github.com:scylladb/seastar:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)
xemul referenced this pull request in michoecho/seastar Dec 23, 2024