
Slowdown IO scheduler based on dispatched/completed ratio #1766

Merged
merged 7 commits into scylladb:master from br-io-sched-no-capping on Oct 17, 2023

Conversation

xemul
Contributor

@xemul xemul commented Aug 2, 2023

There are three places where the IO dispatch loop is throttled:

  • self-throttling with token bucket according to math model
  • per-shard one-tick threshold
  • 2-bucket approach when tokens are replenished only after they are released from disk

This PR removes the last one, because it leads to self-slowdown in case of reactor stalls. This back-link was introduced to catch the case when the disk suddenly slows down, to stop the dispatcher from overloading it with requests, but effectively this back-link measures not the real disk dispatch rate, but the disk+kernel+reactor dispatch rate. Although the "kernel" part is tiny, the reactor part can grow large, triggering the self-slowdown effect.

Here's some math.

Let's assume that at some point the scheduler dispatched N_d requests. It means that it was able to grab N_d tokens in T_d duration; the rate of dispatch is R_d = N_d/T_d. The requests are to be completed by the reactor next tick. Let's assume it takes T_c time until the reactor gets there and completes N_c requests. The rate of completion is thus R_c = N_c/T_c. Apparently, N_c <= N_d, because the kernel cannot complete more requests than were queued.

In case the reactor experiences a stall during the completion tick, T_c > T_d, and since N_c <= N_d, consequently N_d/T_d > N_c/T_c. In case the reactor doesn't stall, the number of requests that will complete is N_c = N_d/T_d * T_c, because this is how the dispatch rate is defined. This is equivalent to N_c/T_c = N_d/T_d.

Finally: R_d >= R_c, i.e. the dispatch rate is equal to or greater than the completion rate, where the "equal" part is less likely and holds only if the reactor runs like clockwork and doesn't stall.

The mentioned back-link makes sure that R_d <= R_c; coupled with the stalls (even the small ones) this drives R_d down each tick, causing R_c to go down as well, and so on.

The removed fuse is replaced with a flow monitor based on the dispatch-to-completion ratio. Normally, the number of requests dispatched over a certain duration divided by the number of requests completed over the same duration must be 1.0; otherwise it would mean that requests accumulate in the disk. However, this ratio cannot be exactly 1.0 immediately, and in the longer run it tends to be slightly greater than 1.0: no matter how often the reactor polls the kernel for IO completions, it won't get more completions than requests were dispatched, but even a small delay in polling makes Nr_completed / duration smaller because of the larger denominator.

Having said that, the new back-link is based on the flow ratio. When the "average" value of the dispatched/completed ratio exceeds some threshold (configurable, 1.5 by default), the "cost" of individual requests increases, thus reducing the dispatch rate.

The main difference from the current implementation is that the new back-link is not "immediate". The averaging is an exponential moving average filter with 100ms updates and a 0.95 smoothing factor. The current back-link is immediate in the sense that a delay in delivering a completion immediately slows down the next tick's dispatch, thus accumulating spontaneous reactor micro-stalls.
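For illustration, here is a minimal sketch of such a monitor (the names, structure and exact update rule are my assumptions, not the actual Seastar code):

```cpp
// A sketch of the flow-ratio monitor idea: every ~100ms the dispatched/completed
// ratio of the last period is fed into an exponential moving average; once the
// smoothed value exceeds the threshold, it is used as a multiplier for request costs.
#include <cstdint>

struct flow_ratio_monitor {
    double smoothing = 0.95;   // EMA smoothing factor
    double threshold = 1.5;    // slow-down threshold (configurable)
    double ema = 1.0;          // smoothed dispatched/completed ratio
    uint64_t prev_dispatched = 0;
    uint64_t prev_completed = 0;

    // Called once per update period (~100ms) with cumulative counters.
    void update(uint64_t dispatched_total, uint64_t completed_total) {
        uint64_t d = dispatched_total - prev_dispatched;
        uint64_t c = completed_total - prev_completed;
        prev_dispatched = dispatched_total;
        prev_completed = completed_total;
        if (c == 0) {
            return; // nothing completed in this period -- skip the sample
        }
        ema = smoothing * ema + (1.0 - smoothing) * (double(d) / double(c));
    }

    // Multiplier applied to the per-request cost; 1.0 means "no slow-down".
    double cost_multiplier() const {
        return ema > threshold ? ema : 1.0;
    }
};
```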

This can be reproduced by the test introduced in #1724. It's not (yet) in the PR, but making the token release loop artificially release ~1% more tokens fixes this case, which also supports the theory that the reduced completion rate is the culprit. BTW, it cannot be the fix, because the ... over-release factor is not constant and it's hard to calculate.

fixes: #1641
refs: #1311
refs: #1492 (*) in fact, this is the metric that correlates with the flow ratio growing above 1.0, but this metric is sort of a look at quota violations from the IO angle
refs: #1774 this PR has attached metrics screenshots demonstrating the effect on stressed scylla

@avikivity
Member

I don't understand. What if the disk is 5% slower than the setting in io_properties.yaml? Won't the queue keep growing and latency grow with it?

@avikivity
Member

The mentioned back-link makes sure that R_d <= R_c, coupled with the stalls (even the small ones) this drives the R_d down each tick, causing the R_c to go down as well, then again.

I don't understand this either. After a reactor stall, all pending requests will complete, because the disk hasn't stalled. The scheduler is in pristine state.

It's true that stalls > io_latency_goal have a huge impact on throughput. But it could be compensated by increasing io_latency_goal (if throughput is more important than latency) or by fixing the stalls.

@xemul
Contributor Author

xemul commented Aug 2, 2023

Yes, it will. My point is that currently we cannot tell "the disk is 5% slower than the setting" from "the reactor delivers completions a bit later and this looks like the disk is 5% slower". And the attempt to slow down the dispatch rate leads to a lower completion rate (literally, because fewer requests are dispatched) which, in turn, slows down the dispatch rate even further.

@xemul
Contributor Author

xemul commented Aug 2, 2023

I don't understand this either. After a reactor stall, all pending requests will complete, because the disk hasn't stalled. The scheduler is in pristine state.

No it isn't. The scheduler assumes that N submitted requests will complete in T duration. They do, but the disk and the reactor may add "t" duration on top. The resulting completion rate is N / (T + t) < N / T.

@xemul
Contributor Author

xemul commented Aug 2, 2023

But it could be compensated by increasing io_latency_goal (if throughput is more important than latency) or by fixing the stalls.

When I say "stall" I don't mean "stalls that are reported". I mean -- the reactor polls for completions notably later than the disk completes the requests.

Let's take a disk of 1GB/s. Per 0.75ms (the default io latency goal) it should do 1GB * 0.75 / 1000 = ~786KB. Now the scheduler submitted them all in one poll. The expectation is that the token bucket will replenish all 786k tokens in 0.75ms. Now the reactor completes the requests in 0.80ms, just 50usec later. The token bucket will replenish only ~736k tokens, because that's the completion rate. Next tick the scheduler will be able to submit only 736k tokens, which is ~958MB/s, not 1GB/s. Next tick it will lose 7% more.
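To see how the loss compounds, here is a tiny toy model of the above arithmetic (the units and the constant 50usec delay are taken from the example; the code is only an illustration, not the scheduler's actual replenish logic):

```cpp
// Toy model of the compounding effect: tokens are replenished at the observed
// completion rate, and completions are always 50usec late.
#include <cstdio>

int main() {
    const double goal = 0.75e-3;   // io latency goal, seconds
    const double delay = 50e-6;    // extra completion delay per tick
    double tokens = 786e3;         // tokens dispatched in the first tick (~786KB worth)
    for (int tick = 1; tick <= 5; tick++) {
        double completion_rate = tokens / (goal + delay);  // tokens per second
        std::printf("tick %d: dispatched %.0f tokens, effective rate %.0f MB/s\n",
                    tick, tokens, completion_rate / 1e6);
        tokens = completion_rate * goal;  // next tick replenishes at the completion rate
    }
}
```

Each iteration loses another factor of 0.75/0.80, i.e. a few percent per tick, with no mechanism to recover it while the bucket keeps being drained.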

@avikivity
Member

Yes, it will. My point is that currently we cannot tell "the disk is 5% slower than the setting" from "the reactor delivers completions a bit later and this looks like the disk is 5% slower".

We have metrics for quota violations.

And "a bit" doesn't hurt as long as the io latency goal is large enough.

And the attempt to slow down the dispatch rate leads to a lower completion rate (literally, because fewer requests are dispatched) which, in turn, slows down the dispatch rate even further.

I don't understand. We should stuff enough requests to fill up the latency goal. Small stalls shouldn't affect it.

@xemul
Contributor Author

xemul commented Aug 2, 2023

I don't understand. We should stuff enough requests to fill up the latency goal.

We do, but "enough" == according to the number of tokens in a bucket.

Small stalls shouldn't affect it.

Small stalls put just a bit fewer tokens into the bucket during replenishment. These bits accumulate, and eventually the replenishment rate is notably lower.

@avikivity
Member

But it could be compensated by increasing io_latency_goal (if throughput is more important than latency) or by fixing the stalls.

When I say "stall" I don't mean "stalls that are reported". I mean -- the reactor polls for completions notably later than the disk completes the requests.

That's almost always true. Small reads take ~0.1ms. The task quota is 0.5ms.

Let's take a disk of 1GB/s. Per 0.75ms (the default io latency goal) it should do 1GB * 0.75 / 1000 = ~786KB. Now the scheduler submitted them all in one poll. The expectation is that the token bucket will replenish all 786k tokens in 0.75ms. Now the reactor completes the requests in 0.80ms, just 50usec later. The token bucket will replenish only ~736k tokens, because that's the completion rate. Next tick the scheduler will be able to submit only 736k tokens, which is ~958MB/s, not 1GB/s. Next tick it will lose 7% more.

If we regularly have a 0.3ms stall (regularly = almost every poll period), that's bad. If we have such a hiccup every 100 periods, it's unnoticeable.

Which is it?

We should increase the io latency goal. 0.75ms is too close to the task quota period.

@xemul
Contributor Author

xemul commented Aug 2, 2023

If we regularly have a 0.3ms stall (regularly = almost every poll period), that's bad. If we have such a hiccup every 100 periods, it's unnoticeable.

Which is it?

It doesn't really matter how frequent 0.3ms stalls are (in fact, the bad clusters I've seen all had ~10% of task-quota violations). If we replace 50usec with 5usec in the above math, the result will be the same, but the scheduler will lose 0.7% of tokens every tick, not 7%. The result is the same, but slower.

We should increase the io latency goal. 0.75ms is too close to the task quota period.

Every time a request is completed X% later than the math thinks it will be, the token bucket loses ~X% of the tokens in it. The percentages accumulate over time.

@avikivity
Member

I don't understand. We should stuff enough requests to fill up the latency goal.

We do, but "enough" == according to the number of tokens in a bucket.

Small stalls shouldn't affect it.

Small stalls put just a bit fewer tokens into the bucket during replenishment. These bits accumulate, and eventually the replenishment rate is notably lower.

I don't understand. A stall will give the disk time to catch up and complete all requests. After we recover from the stall, the buckets are fully replenished.

@avikivity
Member

If we regularly have a 0.3ms stall (regularly = almost every poll period), that's bad. If we have such a hiccup every 100 periods, it's unnoticeable.
Which is it?

It doesn't really matter how frequent 0.3ms stalls are (in fact, the bad clusters I've seen all had ~10% of task-quota violations). If we replace 50usec with 5usec in the above math, the result will be the same, but the scheduler will lose 0.7% of tokens every tick, not 7%. The result is the same, but slower.

How can we lose a token? We take it from the top bucket, wait for completion, put it in the bottom bucket, move it again. It's fully recycled.

We should increase the io latency goal. 0.75ms is too close to the task quota period.

Every time a request is completed X% later than the math thinks it will be, the token bucket loses ~X% of the tokens in it. The percentages accumulate over time.

Well that's broken, but I don't understand why.

@xemul
Contributor Author

xemul commented Aug 2, 2023

I don't understand. A stall will give the disk time to catch up and complete all requests.

Yes. The disk will complete all the requests. Then the replenisher will try to put more tokens into the bucket, because more time has passed since last time, but it won't be able to, because N_requests / Duration < N_requests / IO_quota.

After we recover from the stall, the buckets are fully replenished.

Provided nobody grabs from it, yes. If the IO stops for a while, the bucket is fully replenished.

@xemul
Contributor Author

xemul commented Aug 2, 2023

How can we lose a token? We take it from the top bucket, wait for completion, put it in the bottom bucket, move it again. It's fully recycled.

Token != request
Token == request / duration
We take request / task_quota, put back request / (task_quota + extra_delay)
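For example, with the default 0.5ms task quota and a hypothetical 50usec extra delay (illustrative numbers, not from the PR):

$$ \text{taken: } \frac{1\ \text{request}}{0.5\,\text{ms}} = 2000\ \text{req/s}, \qquad \text{put back: } \frac{1\ \text{request}}{0.5\,\text{ms} + 0.05\,\text{ms}} \approx 1818\ \text{req/s} $$

i.e. roughly 9% of the rate evaporates on that single delayed completion.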

@avikivity
Member

Sorry, I'm really confused. We probably have different mental models of how this works. Since you wrote it, your model is probably correct, but it doesn't help me understand what's going on.

@avikivity
Member

How can we lose a token? We take it from the top bucket, wait for completion, put it in the bottom bucket, move it again. It's fully recycled.

Token != request Token == request / duration We take request / task_quota, put back request / (task_quota + extra_delay)

Ah. That's wrong.

Token should be "normalized disk ability to do something". If you take it, you have to put it back.

Pulling something from the token bucket at a steady rate = queuing work at a disk at a steady rate.

@xemul
Contributor Author

xemul commented Aug 2, 2023

The "ability" nowadays means IOPS and bandwidth, which are speeds, not raw "number of requests" and "number of bytes" metrics. One cannot take "1 operation per second" thing from a bucket, instead, it takes an "1 operation" from a bucket, but it appears the re once every second. If it appears once every 1.01 seconds, then one will take "0.99 operations per second", which, in turn, will make it appear back in 1.02 seconds, and so on. This effect disappears in case token bucket is left idle for some time to fully replenish

@xemul
Contributor Author

xemul commented Aug 2, 2023

Let me try to demonstrate it with some ... weird artificial numbers, just to take a look at it from a different angle.

Let's take a t.b. with a 10 tokens/second rate. And the input flow into the scheduler at the same rate -- 10 requests/second. A tick is every second.

tick-1: 10 requests in the queue, 10 tokens in a bucket. All can be and are dispatched
tick-2: 10 requests complete, 10 tokens are put into the 'replenishing bucket'. And again, 10 requests in a queue, 10 tokens replenished, all can be and all are dispatched
tick-N: the same

This is how it should work. Now let's imagine that tick-X takes 1.1 seconds.

tick-x: 10 requests completed, 10 tokens are put into the replenishing bucket. 11 requests in the queue. 11 tokens are to be replenished (because 1.1 seconds passed since the last replenish), but (!) only 10 can be -- 1 token is lost, only 10 are replenished and only 10 (!) requests are submitted. The queue grew by 1 request.

None of the next ticks will drain this queue, because each second only 10 requests are completed and only 10 tokens are replenished. We've "lost" 1 token and can only get it back if 9 or fewer requests are queued, no other way. Even if tick-y takes 0.9 seconds it won't help, because we'll have 9 requests completed, 9 queued and ~~8~~ 9 tokens replenished.

upd: one misprint above at the "89 tokens"
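Below is a toy C++ simulation of this tick sequence (my own simplification of the 2-bucket behavior: arrivals come in at 10 req/s, the disk completes everything dispatched in the previous tick, and replenishment is capped by the previous tick's completions):

```cpp
// One tick of 1.1s leaves one request stuck in the queue forever, because the
// back-link never lets more tokens in than were completed.
#include <algorithm>
#include <cstdio>

int main() {
    const double rate = 10.0;                 // tokens per second
    const double tick_len[] = {1.0, 1.0, 1.1, 1.0, 1.0, 1.0};
    double queued = 0;                        // requests waiting to be dispatched
    double in_flight = 0;                     // dispatched, not yet completed
    for (int i = 0; i < 6; i++) {
        double completed = in_flight;         // disk finishes everything dispatched last tick
        queued += rate * tick_len[i];         // new arrivals during this tick
        double replenished = (i == 0) ? rate  // the bucket starts full
                           : std::min(rate * tick_len[i], completed);
        double dispatched = std::min(queued, replenished);
        queued -= dispatched;
        in_flight = dispatched;
        std::printf("tick %d (%.1fs): completed %2.0f, dispatched %2.0f, queue %2.0f\n",
                    i + 1, tick_len[i], completed, dispatched, queued);
    }
}
```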

@avikivity
Member

The "ability" nowadays means IOPS and bandwidth, which are speeds, not raw "number of requests" and "number of bytes" metrics. One cannot take "1 operation per second" thing from a bucket, instead, it takes an "1 operation" from a bucket, but it appears the re once every second. If it appears once every 1.01 seconds, then one will take "0.99 operations per second", which, in turn, will make it appear back in 1.02 seconds, and so on. This effect disappears in case token bucket is left idle for some time to fully replenish

We drain tokens at some rate. So if a token represents work, draining at some rate represents work per unit time.

@xemul
Contributor Author

xemul commented Aug 2, 2023

We drain tokens at some rate.

Yes, exactly. And we replenish tokens at some rate, and in order for it to work in sync the rate of filling the 2nd bucket should be at least the same, or faster. But since it's a rate, a delay in completing a request gives us a larger denominator in the "work / duration" fraction and a lower replenishing rate.

So if a token represents work, draining at some rate represents work per unit time.

@avikivity
Member

Let me try to demonstrate it with some .. weird artificial numbers just to take a look at it from the different angle.

Let's take a t.b. with 10 token / second rate. And the input flow into the scheduler at the same rate -- 10 requests / second. Tick is every second.

tick-1: 10 requests in the queue, 10 tokens in a bucket. All can be and are dispatched tick-2: 10 requests complete, 10 tokens are put into the 'replenishing bucket'. And again, 10 requests in a queue, 10 tokens replenished, all can be and all are dispatched tick-N: the same

This is how it should work. Now let's imagine that tick-X takes 1.1 seconds.

tick-x: 10 requests completed, 10 tokens are put into replenishing bucket. 11 requests in a queue. 11 tokens are to be replenished

Why do we want to replenish 11 and not 10?

A token bucket is supposed to represent a bucket filled with tokens, not a bucket filled with "work per unit time".

(because 1.1 seconds passed since the last replenish), but (!) only 10 can be -- 1 token is lost, only 10 are replenished and only 10 (!) requests are submitted. The queue grew by 1 request.

None of the next ticks will drain this queue, because each second only 10 requests are completed and only 10 tokens are replenished. We've "lost" 1 token and can only get it back if 9 or fewer requests are queued, no other way. Even if tick-y takes 0.9 seconds it won't help, because we'll have 9 requests completed, 9 queued and ~~8~~ 9 tokens replenished.

upd: one misprint above at the "89 tokens"

@xemul
Contributor Author

xemul commented Aug 2, 2023

Why do we want to replenish 11 and not 10?

Because we replenish "the amount of tokens for 1.1 second" which is time * rate = 1.1 * 10 = 11

@avikivity
Member

Why do we want to replenish 11 and not 10?

Because we replenish "the amount of tokens for 1.1 second" which is time * rate = 1.1 * 10 = 11

We should replenish the amount of tokens we got.

@xemul
Contributor Author

xemul commented Aug 2, 2023

We should replenish the amount of tokens we got.

This is going to be an advanced --max-io-requests thing, not the rate limiter based on ... cyan-purple diskplorer plots that tie the rates of {requests, bytes} x {read, write} to each other.

@xemul
Contributor Author

xemul commented Aug 2, 2023

We should replenish the amount of tokens we got.

By the way, we got 10 tokens in the above example. 11 is the amount of tokens we want as per time-based replenishment.

@xemul
Contributor Author

xemul commented Aug 3, 2023

One more example of how the back-link makes things worse.

Let's assume that we have 5000 requests to dispatch and each tick the scheduler can push 500 as per the rate limiter. And (!) each tick delays completion of 5% of the dispatched requests. In case the dispatch loop just does what it needs -- dispatches 500 requests at a time -- the ticks would look like this:

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11
dispatched | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 |  
completed  | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 250

It takes 10 ticks to dispatch everything plus one extra tick to pick up the delayed completions. Not a big deal. And the average dispatch rate of 5000/10=500 req/tick is almost on par with the completion rate of 5000/11=454 req/tick.

Now if we add a "back-link", i.e. each tick the scheduler can dispatch min(500, completed_on_previous_tick) requests, the result is notably worse:

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11 |  12 |  13 |  14 | 15
dispatched | 500 | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 134 |  
completed  | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 257 | 127 | 250

It takes 14 ticks to dispatch everything, plus the same unnoticeable extra tick to pick up those 250 delayed requests. The dispatch rate is 5000/14=357 req/tick and the completion rate is 5000/15=333 req/tick. ~30% of the throughput ability is lost.

This model has a flaw in the sense that in the latter case delayed requests would get completed two/three/... ticks later, yes, but it's a simplification. For more ... real-ish numbers I've created a test in #1724; it shows a very similar effect, it takes more "ticks" to reveal itself, but since each tick is 0.5ms, in real time it shows up within seconds.
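For reference, a small C++ sketch that reproduces both schedules above (it follows the same simplified model: 5% of every tick's dispatch slips to the final pick-up tick; per-tick numbers differ slightly from the tables because of integer rounding):

```cpp
// 5000 requests, at most 500 dispatched per tick. In the "back-link" variant
// the next dispatch is also capped by the previous tick's completions.
#include <algorithm>
#include <cstdio>

static int run(bool backlink) {
    int remaining = 5000, prev_completed = 500, ticks = 0;
    while (remaining > 0) {
        int cap = backlink ? std::min(500, prev_completed) : 500;
        int dispatched = std::min(remaining, cap);
        remaining -= dispatched;
        prev_completed = dispatched - dispatched / 20;  // 5% delayed until the end
        ticks++;
    }
    return ticks + 1;  // one extra tick picks up all the delayed completions
}

int main() {
    std::printf("without back-link: %d ticks\n", run(false));  // 11
    std::printf("with back-link:    %d ticks\n", run(true));   // 15
}
```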

@avikivity
Member

One more example of how the back-link makes things worse.

Let's assume that we've 5000 requests to dispatch and each tick a scheduler can push 500 as per the rate-limiter. And (!) each tick delays completion of 5% of the dispatched requests. In case dispatch loop just does what it needs -- dispatches 500 requests at a time, the ticks would look like this:

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11
dispatched | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 |  
completed  | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 475 | 250

It takes 10 ticks to dispatch everything plus one extra tick to pick up the delayed completions. Not a big deal. And the average dispatch rate of 5000/10=500 req/tick is almost on par with the completion rate of 5000/11=454 req/tick.

This is fine and after some time we'll dispatch and complete 475 each tick. But this corresponds to the current implementation as I understand it.

Now if adding a "back-link", i.e. -- each tick the scheduler can dispatch min(500, completed_on_previous_tick) requests, the result is notably worse

Well, why would we do that?

tick       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11 |  12 |  13 |  14 | 15
dispatched | 500 | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 134 |  
completed  | 475 | 451 | 429 | 407 | 387 | 368 | 349 | 332 | 315 | 299 | 284 | 270 | 257 | 127 | 250

It takes 14 ticks to dispatch everything, plus the same unnoticeable extra tick to pick up those 250 delayed requests. The dispatch rate is 5000/14=357 req/tick and the completion rate is 5000/15=333 req/tick. ~30% of the throughput ability is lost.

This model has a flaw in a sense that in the latter case delayed requests will get completed two/three/... ticks later, yes, but it's a simplification. For more ... real-ish numbers I've created a test in #1724 , it shows very similar effect, it takes more "ticks" to reveal itself, but since each tick is 0.5ms, in real time it will show up in seconds.

@xemul
Contributor Author

xemul commented Aug 3, 2023

Well, why would we do that?

Because that's how the 2 buckets work. A bucket doesn't get more tokens than were returned upon completion. I.e. if we dispatched 20, then 19 were completed and we replenish the t.b.: even if the time-based replenisher says we may have 20, we'd still replenish only 19.

@xemul
Contributor Author

xemul commented Oct 6, 2023

How do you prove that it stabilizes? And what's the motivation for increasing cost by flow_ratio?

The rate limiter's main equation is

$$ \frac{iops}{iops_{max}} + \frac{bandwidth}{bandwidth_{max}} \le 1 $$

or

$$ \frac{d}{dt} \left( \frac{ops}{iops_{max}} + \frac{bytes}{bandwidth_{max}} \right) \le 1 $$

We implement it via a token bucket with a rate of 1.0 and the cost of an individual request (cost == the amount of tokens it needs to carry) being

$$ cost = \frac{1}{iops_{max}} + \frac{size}{bandwidth_{max}} $$

Now let's assume that at some point the rate should be lowered (I'll get back to how we assume that below), because the disk doesn't keep up. It means that in the first formula we want to decrease the iops_max and bandwidth_max denominators.

$$ \frac{d}{dt} \left( \frac{ops}{iops_{max} \cdot \alpha} + \frac{bytes}{bandwidth_{max} \cdot \beta} \right) \le 1 , \quad \alpha < 1.0 , \; \beta < 1.0 $$

In this PR I say that

$$ \alpha = \beta = \frac {1}{\gamma} , \gamma > 1.0$$

because we don't know whether it was the IOPS or the bandwidth that overwhelmed the disk (spoiler: actually most likely it's the latter, but still), which means the new token bucket equation is

$$ \frac{d}{dt} \left( \gamma \cdot \left( \frac{ops}{iops_{max}} + \frac{bytes}{bandwidth_{max}} \right) \right) \le 1, \quad \gamma > 1.0 $$

The last equation means that, in order to make the scheduler work with lower IOPS and bandwidth, we may multiply the individual request cost by some number larger than 1.0.

Now how do we "assume that IOPS and bandwidth should be lowered"? There are also many ways to assume that, but in this PR I say that: the IOPS and bandwidth should be lowered in case scheduler submits more requests per second than the kernel completes requests per second in the long run. This is where the flow ratio came from.

Summary:

  • in our case, decreasing the t.b. rate can be achieved (though it's not strictly equivalent) by multiplying the request cost by some value larger than 1.0
  • the sign of "disk is overloaded" is when it completes requests at a lower rate than we dispatch

(upd: s/bandwidthbytes/bandwidth/ in some formulas)
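A minimal sketch of what this cost scaling amounts to (illustrative names only, not the actual Seastar code):

```cpp
// The per-request cost from the rate-limiter formula is multiplied by the
// smoothed flow ratio (gamma) once it crosses the configured threshold.
double request_cost(double size, double iops_max, double bandwidth_max,
                    double flow_ratio, double threshold) {
    double cost = 1.0 / iops_max + size / bandwidth_max;  // fraction of the 1.0 t.b. rate
    if (flow_ratio > threshold) {
        cost *= flow_ratio;  // gamma > 1.0 acts as lowering iops_max and bandwidth_max
    }
    return cost;
}
```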

@avikivity
Member

+2 points for inline equations

The last equation means, that in order to make scheduler work with lower IOPS and bandwidth, we may multiply individual request cost by some number larger than 1.0.

That's pretty simple, now that you've explained it so. Another way to look at it is that we make the consumption rate smaller than alpha (alpha<1) instead of smaller than 1.

Now how do we "assume that IOPS and bandwidth should be lowered"? There are also many ways to assume that, but in this PR I say that: the IOPS and bandwidth should be lowered in case scheduler submits more requests per second than the kernel completes requests per second in the long run. This is where the flow ratio came from.

But measured in consumption units, yes? especially if the bandwidth is the root cause.

But in the really long run, submits have to equal completes, no? So the flow ratio measures if we're in the process of moving from one equilibrium (where submits equal completes) to another (where they are also equal, but latency is much higher). I guess in the really long run, we achieve equilibrium by running out of concurrency, whereas we want to achieve equilibrium by limiting latency.

Summary:

  • in our case, decreasing the t.b. rate can be achieved (though it's not strictly equivalent) by multiplying the request cost by some value larger than 1.0
  • the sign of "disk is overloaded" is when it completes requests at a lower rate than we dispatch

(upd: s/bandwidthbytes/bandwidth/ in some formulas)

Is the algorithm stable? Suppose we increased the cost and achieved a flow ratio of 1. We must have -- we can't have a flow ratio > 1 for long periods. Then we jump back to gamma=1. Will we fluctuate between the two modes?

One way to test it is to run a test with iotune numbers adjusted upwards by 20%, so we know the disk is slower than what the scheduler thinks it is.

Maybe we should monitor latency directly, though that's a very noisy measurement. If moving_average(latency) > latency_goal, increase cost.

We also need to parametrize the algorithm. 100ms works for SSD (maybe too large), won't work for HDD, and then I have long conversations with @horschi.

@xemul
Contributor Author

xemul commented Oct 6, 2023

Another way to look at it is that we make the consumption rate smaller than alpha (alpha<1) instead of smaller than 1.

Possible, but the 1.0 on the right side of the equation is the token bucket filling rate. It's "shared" between shards, and tuning it in a share-nothing approach is more challenging than letting each shard increase its requests' cost.

But measured in consumption units, yes? especially if the bandwidth is the root cause.

No, measured in the number of requests

But in the really long run, submits have to equal completes, no? So the flow ratio measures if we're in the process of moving from one equilibrium (where submits equal completes) to another (where they are also equal, but latency is much higher). I guess in the really long run, we achieve equilibrium by running out of concurrency, whereas we want to achieve equilibrium by limiting latency.

In the really long run -- probably. But if we push more requests into the disk/kernel than the disk can handle, what will happen? AFAIK (but I can be wrong) the amount of data the disk and kernel can queue up internally is pretty large, if not "infinite". With a 10GB/s disk and 100ms averaging, in case the disk just stops, the kernel will need to queue just 1GB; if the disk is "suddenly" back to normal it will drain within the next 100ms.

Is the algorithm stable? suppose we increased the cost and achieved a flow ratio of 1. We must have, we can't have a flow ratio > 1 for long periods. Then we jump back to gamma=1. Will we fluctuate between the two modes?

Once every 100ms, it is possible in theory, yes. But why is that a problem? Is your point that "we better settle at gamma of, say, 1.01, rather than switching between 1.0 and, say, 1.1"?

Maybe we should monitor latency directly, though that's a very noisy measurement. If moving_average(latency) > latency_goal, increase cost.

I had this idea, but requests of different sizes and directions would have different latencies, say a 4k write is 50us, a 128k read can be 200us. If the former's latency goes up 3x, it's still below the latency goal, but the disk is likely in trouble already.

We also need to parametrize the algorithm. 100ms works for SSD (maybe too large), won't work for HDD

It is all configurable already, but I just noticed that I forgot to expose the averaging period as an option, only as an io-queue config parameter (the cost-increase threshold is an option in this PR).

@avikivity
Member

Another way to look at it is that we make the consumption rate smaller than alpha (alpha<1) instead of smaller than 1.

Possible, but the 1.0 on the right side of the equation is the token bucket filling rate. It's "shared" between shards, and tuning it in a share-nothing approach is more challenging than letting each shard increase its requests' cost.

Sure, I was more concerned in explaining it to myself than suggesting a code change.

But measured in consumption units, yes? especially if the bandwidth is the root cause.

No, measured in the number of requests

But in the really long run, submits have to equal completes, no? So the flow ratio measures if we're in the process of moving from one equilibrium (where submits equal completes) to another (where they are also equal, but latency is much higher). I guess in the really long run, we achieve equilibrium by running out of concurrency, whereas we want to achieve equilibrium by limiting latency.

In the really long run -- probably. But if we push more requests into the disk/kernel than the disk can handle, what will happen? AFAIK (but I can be wrong) the amount of data the disk and kernel can queue up internally is pretty large, if not "infinite". With a 10GB/s disk and 100ms averaging, in case the disk just stops, the kernel will need to queue just 1GB; if the disk is "suddenly" back to normal it will drain within the next 100ms.

The kernel queue is really the size of the aio context we allocate (summed across all shards). So we can control submit rate until we fill up all aio slots. Once we reach the limit, submit rate becomes equal to the completion rate.

Is the algorithm stable? suppose we increased the cost and achieved a flow ratio of 1. We must have, we can't have a flow ratio > 1 for long periods. Then we jump back to gamma=1. Will we fluctuate between the two modes?

Once every 100ms, it is possible in theory, yes. But why is that a problem? Is your point that "we better settle at gamma of, say, 1.01, rather than switching between 1.0 and, say, 1.1"?

Maybe we should monitor latency directly, though that's a very noisy measurement. If moving_average(latency) > latency_goal, increase cost.

I had this idea, but requests of different sizes and directions would have different latencies, say a 4k write is 50us, a 128k read can be 200us. If the former's latency goes up 3x, it's still below the latency goal, but the disk is likely in trouble already.

Why do we care? We have a latency goal.

Note if we have small requests, we also send very many of them, so their latency will also grow.

We also need to parametrize the algorithm. 100ms works for SSD (maybe too large), won't work for HDD

It is all configurable already, but I just noticed that I forgot to expose the averaging period as an option, only as io-queue config parameter (the cost-increase threshold is an option in this PR)

I'd still like to see a test with artificially raised io_properties.yaml. Will it survive?

@xemul
Contributor Author

xemul commented Oct 10, 2023

The kernel queue is really the size of the aio context we allocate (summed across all shards). So we can control submit rate until we fill up all aio slots. Once we reach the limit, submit rate becomes equal to the completion rate.

Yes, that's true

Why do we care? We have a latency goal.
Note if we have small requests, we also send very many of them, so their latency will also grow.

Because a one-shot measurement is not reliable. E.g. with latencies we don't measure disk latency, we measure disk + seastar latency, so even if a one-time measurement goes up it doesn't mean the disk got slow. This is exactly what the current approach has trouble with -- it notices that the short-term completion rate gets lower (because the reactor is a bit late in completing requests). For per-request latencies we need some long-term aggregate, but what can it be? The "expected" latency is not a single number we can take e.g. an average against.

@avikivity
Member

The kernel queue is really the size of the aio context we allocate (summed across all shards). So we can control submit rate until we fill up all aio slots. Once we reach the limit, submit rate becomes equal to the completion rate.

Yes, that's true

Why do we care? We have a latency goal.
Note if we have small requests, we also send very many of them, so their latency will also grow.

Because a one-shot measurement is not reliable. E.g. with latencies we don't measure disk latency, we measure disk + seastar latency, so even if a one-time measurement goes up it doesn't mean the disk got slow. This is exactly what the current approach has trouble with -- it notices that the short-term completion rate gets lower (because the reactor is a bit late in completing requests). For per-request latencies we need some long-term aggregate, but what can it be? The "expected" latency is not a single number we can take e.g. an average against.

Agree we can't use point-in-time measurements. But I want to see some kind of logic that explains the idea, not heuristics.

@avikivity
Member

Perhaps we should "learn" the actual disk capacity using a long-term moving average. But wait - that's effectively what you do. But I don't see why the flow ratio is the right trigger.

@xemul
Contributor Author

xemul commented Oct 10, 2023

(Another explanation from today's call) (Not yet the logic that explains the idea)

If the disk is working slower than we think it should from the io-properties.yaml file, then there can be two reasons for that.

  1. It's some temporary glitch. In that case the flow-ratio-based back-link expects that lowering the dispatch rate (which is the input for the disk) would help the disk to "take a breath". After a while the dispatch-to-completion ratio would recover and the scheduler would get back to its original rates.

  2. It's permanent, i.e. we overestimated the disk, and the numbers in the io-properties.yaml file are indeed too large. In that case the flow-ratio-based back-link indeed won't help. The disk will not get back to normal, it will continue to complete requests at the speed it can. The dispatch rate will be lowered and will become close to that "real" completion rate, the flow ratio will become 1.0 again and the scheduler will again start oversubscribing the disk.

In other words -- the proposed flow-ratio back-link is a way to overcome a temporary slowdown, but not a way to re-evaluate the disk capacity at run time.

@xemul
Contributor Author

xemul commented Oct 10, 2023

Perhaps we should "learn" the actual disk capacity using a long-term moving average.

This undermines the whole point of io-properties.yaml. If we can learn the actual disk capacity at runtime, then we can start with a default hard-coded throughput of 100MB/s and 100 IOPS and "learn" the real numbers eventually.

I'm not against that, but this sounds like completely different direction.

@xemul
Contributor Author

xemul commented Oct 12, 2023

The math behind flow-ratio.

di -- the number of requests dispatched at tick i
pi -- the number of requests processed by the disk at tick i
ci -- the number of requests completed by the reactor loop at tick i

We can observe di and ci in the dispatcher, but not pi, because we don't have direct access to the disk's queues.

After n ticks we have

Dn -- the total number of requests dispatched,
Pn -- the total number of requests processed,
Cn -- the total number of requests completed:

$$ D_n = \sum_{i=0}^n d_i $$

$$ P_n = \sum_{i=0}^n p_i $$

$$ C_n = \sum_{i=0}^n c_i $$

  • The disk cannot process more than was dispatched to it, but it can process less, "accumulating" a queue

$$ d_i \ge p_i \Rightarrow D_n \ge P_n$$

  • The reactor cannot complete more than was processed by the disk either

$$ p_i \ge c_i \Rightarrow P_n \ge C_n $$

Note that Dn > Pn means that the disk is falling behind. Our goal is to make sure the disk doesn't do that and doesn't accumulate a queue, i.e. Dn = Pn, but we cannot observe Pn directly, only Cn.

Next

$$ D_n - P_n = Qd_n \ge 0 $$

$$ P_n - C_n = Qc_n \ge 0 $$

$$ D_n - C_n = Qd_n + Qc_n \ge 0 $$

Qdn is the queue accumulated in the disk; we try to avoid it. Qcn is the accumulated delta between processed (by disk) and completed (by reactor) requests. We cannot reliably say that it goes to zero over time, because there's always some unknown amount of processed but not yet completed requests. So the above formula contains two unknowns and we cannot solve it. Let's try the other way:

$$ \frac {D_n} {P_n} = Rd_n \ge 1 $$

$$ \frac {P_n} {C_n} = Rc_n \ge 1 $$

$$ \frac {D_n} {C_n} = Rd_n \times Rc_n \ge 1 $$

Rdn is the ratio of dispatched to processed requests. If it's 1, we're OK; if it's greater, the disk is accumulating a queue, which we try to avoid. Rcn is the ratio between processed (by disk) and completed (by reactor) requests. It has a not immediately apparent, but very pretty property:

$$ \lim_{n\to\infty} Rc_n = \lim_{n\to\infty} \frac {P_n} {C_n} = \lim_{n\to\infty} \frac {C_n + Qc_n} { C_n} = \lim_{n\to\infty} (1 + \frac {Qc_n} {C_n} ) = 1 + \lim_{n\to\infty} \frac {Qc_n} {C_n}$$

Qcn doesn't grow over time: it's non-zero, but it's bounded above by some value, while Cn grows without bound. Respectively:

$$ \lim_{n\to\infty} Rc_n = 1 $$

$$ \lim_{n\to\infty} \frac {D_n} {C_n} = \lim_{n\to\infty} ( Rd_n \times Rc_n ) = \lim_{n\to\infty} Rd_n $$

IOW -- we can say whether the disk is accumulating a queue or not by observing the dispatched-to-completed ratio (to completed, not processed) over a long enough time.
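A small numeric illustration of that limit (toy numbers of my choosing: the disk processes 5% less than is dispatched, and the processed-but-not-completed backlog is bounded by 100 requests):

```cpp
// With a bounded backlog Qc, the observable D_n/C_n converges to the
// unobservable D_n/P_n as n grows.
#include <cstdio>

int main() {
    const double d = 500;   // dispatched per tick
    const double p = 475;   // processed per tick (disk 5% behind, queue grows)
    const double qc = 100;  // bounded "processed but not completed" backlog
    for (int n = 1; n <= 100000; n *= 10) {
        double D = d * n, P = p * n, C = P - qc;
        std::printf("n=%6d  D/P=%.4f  D/C=%.4f\n", n, D / P, D / C);
    }
}
```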

xemul added 7 commits October 12, 2023 12:37
It will be used later, but it is implicitly used already to print the
auto-adjust message.

BTW, the message can be printed twice, once for each stream, but streams
are configured using the same config, so the latency-goal will be the
same as well.

Signed-off-by: Pavel Emelyanov <[email protected]>
The monitor evaluates an EMA value for the dispatch-speed to completion-speed
ratio. The ratio is expected to be 1.0, but according to the patch
series description, sometimes it can grow larger.

The ratio of dispatch and completion rates is fed into an EMA filter every
100ms with a smoothing factor of 0.95; both values are placed in
io_queue::config and can later be added to io_properties or options.

The calculated value is exported as an io_queue_flow_control metric
labeled with mountpoint, shard and IO-group.

Signed-off-by: Pavel Emelyanov <[email protected]>
... and export flow_ratio.

Currently there are only per-class metrics, but the flow-rate is
evaluated per-queue.

Signed-off-by: Pavel Emelyanov <[email protected]>
According to the patch-series description, the completion rate is always
less than or equal to the dispatch rate. Making the dispatch rate
not exceed the completion rate makes both creep down in an endless
loop, so remove this back-link and make the IO scheduler work at the
configured rate all the time.

Signed-off-by: Pavel Emelyanov <[email protected]>
Make the slow-down link less immediate. When the disk is not overloaded
it should, in theory, complete as many requests per tick as were
dispatched to it previously. IOW, the ratio of dispatched requests to
completed requests should be 1.0 or, in the worst case, close to it.
In the "long run" it should be 1.0. If it's not, it means that requests
tend to accumulate in the disk, and that's a reason to slow down.

This ratio is tracked with the flow-rate monitor. It estimates the
ratio with an exponential moving average filter, adding new samples
once every 100ms with a smoothing factor of 0.95.

Next, the current formula for the rate limiter is

    IOPS / IOPS_max + Bandwidth / Bandwidth_max <= 1.0

and it's implemented as a token bucket with a replenish rate of 1 token
per second, with each request requiring

    capacity = weight / A + size / B

tokens (fraction) to proceed.  To slow things down there are two options:

 - reduce the 1.0 constant
 - increase capacities of requests

This set walks the second route, as touching the t.b. rate might be too
tricky, because the t.b. is shared across shards and shards would step on
each other's toes doing that. So instead, the patch increases
the requests' cost, multiplying it by the aforementioned ratio. Since in
real load the ratio is always strictly greater than 1.0, a threshold is
added below which the scheduler continues working at the configured
rate.

Signed-off-by: Pavel Emelyanov <[email protected]>
It includes the basic rate-limiter equation, a brief description of
how it maps to the token bucket algo, how the slow-down is performed by just
adjusting the request cost, and the justification for the flow ratio.

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul xemul force-pushed the br-io-sched-no-capping branch from bda1287 to 9a154ab on October 12, 2023 10:51
@xemul
Contributor Author

xemul commented Oct 12, 2023

upd:

  • make flow-ratio duration depend on io_latency_goal
  • reduce slow-down threshold to 1.1
  • added doc file describing math behind the model

@avikivity avikivity merged commit eefa837 into scylladb:master Oct 17, 2023
xemul added a commit to scylladb/scylla-seastar that referenced this pull request Feb 8, 2024
…o branch-5.2

This backports scylladb/seastar#1766

* br-double-bucket-slowdown-backport-5.2:
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
  fair_queue: Do not re-evaluate request capacity twice
xemul added a commit to scylladb/scylla-seastar that referenced this pull request Feb 8, 2024
… branch-5.4

This backports scylladb/seastar#1766

* br-double-bucket-slowdown-backport-5.4:
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
graphcareful pushed a commit to graphcareful/seastar that referenced this pull request Mar 20, 2024
…m Pavel Emelyanov

Closes scylladb#1766

* github.com:scylladb/seastar:
  doc: Add document describing all the math behind IO scheduler
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
avikivity added a commit that referenced this pull request Aug 8, 2024
The scheduler's goal is to make sure that dispatched requests complete no later than the io-latency-goal duration (which defaults to 1.5 times the reactor CPU latency goal). In cases when the disk or kernel slows down, there's a flow-ratio guard that slows down the dispatch rate as well [1]. However, sometimes it may not be enough and requests delay their execution even more. Slowly executing requests are indirectly reported via the total-disk-delay metric [2], but this metric accumulates several requests into one counter, potentially smoothing out spikes with good requests. This detector is aimed at detecting individual slow requests and logging them along with some related statistics that should help to understand what's going on.

refs: #1766 [1] (eefa837) [2] (0238d25)
fixes: #1311
closes: #1609

Closes #2371

* github.com:scylladb/seastar:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)
xemul referenced this pull request in michoecho/seastar Dec 23, 2024