Slowdown IO scheduler based on dispatched/completed ratio #1766
Conversation
I don't understand. What if the disk is 5% slower than the setting in io_properties.yaml? Won't the queue keep growing and latency grow with it?
I don't understand this either. After a reactor stall, all pending requests will complete, because the disk hasn't stalled. The scheduler is in a pristine state. It's true that stalls > io_latency_goal have a huge impact on throughput. But it could be compensated by increasing io_latency_goal (if throughput is more important than latency) or by fixing the stalls.
Yes, it will. My point is that currently we cannot tell "disk is 5% slower than the setting" from "the reactor delivers completions a bit later and this looks like the disk is 5% slower". And the attempt to slow down the dispatch rate leads to a lower completion rate (literally, because fewer requests are dispatched) which, in turn, slows down the dispatch rate even further.
No it isn't. The scheduler assumes that N submitted requests will complete in T duration. They do, but the disk and the reactor may add "t" duration on top. The resulting completion rate is N / (T + t), lower than the assumed N / T.
When I tell "stall" I don't mean "stalls that are reported". I mean -- reactor poll completion notably later than disk completes the request. Let's take a disk of 1Gb/s. Per 0.75ms (default io latency goal) it should do 1Gb * 0.75 / 1000 = 786kb/s. Now the scheduler submitted them all in one poll. The expectation is that the token bucket will replenish all 786k tokens in 0.75ms. Now the reactor completes requests in 0.80ms, just 50usec later. The token bucket will replenish ~736k tokens, because this is the completion rate Next tick the scheduler will be able to submit only 736k tokens, which is ~958Mb/s, not 1Gb/s. Next tick it will lose 7% more. |
We have metrics for quota violations. And "a bit" doesn't hurt as long as the io latency goal is large enough.
I don't understand. We should stuff enough requests to fill up the latency goal. Small stalls shouldn't affect it.
We do, but "enough" == according to the number of tokens in a bucket.
Small stalls put just a bit fewer tokens into the bucket during replenishment. These bits accumulate and eventually the replenishment rate is notably lower.
That's almost always true. Small reads take ~0.1ms. The task quota is 0.5ms.
If we regularly have a 0.3ms stall (regularly = almost every poll period), that's bad. If we have such a hiccup every 100 periods, it's unnoticeable. Which is it? We should increase the io latency goal. 0.75ms is too close to the task quota period.
It doesn't really matter how frequent the 0.3ms stalls are (in fact, the bad clusters I've seen all had ~10% of task-quota violations). If we replace 50usec with 5usec in the above math, the scheduler will lose 0.7% of its tokens every tick instead of 7%. The end result is the same, it just takes longer to show up.
Every time a request is completed X% later than the math thinks it will be, the token bucket loses ~X% of the tokens in it. The percentages accumulate over time.
I don't understand. A stall will give the disk time to catch up and complete all requests. After we recover from the stall, the buckets are fully replenished.
How can we lose a token? We take it from the top bucket, wait for completion, put it in the bottom bucket, move it again. It's fully recycled.
Well that's broken, but I don't understand why.
Yes. The disk will complete all the requests. Then the replenisher will try to put more tokens into the bucket, because more time has passed since last time, but it won't be able to, because N_requests / Duration < N_requests / IO_quota
Provided nobody grabs from it, yes. If the IO stops for a while, the bucket is fully replenished.
Sorry, I'm really confused. We probably have different mental models of how this works. Since you wrote it, your model is probably correct, but it doesn't help me understand what's going on.
Ah. That's wrong. A token should be "a normalized disk ability to do something". If you take it, you have to put it back. Pulling something from the token bucket at a steady rate = queuing work at the disk at a steady rate.
The "ability" nowadays means IOPS and bandwidth, which are speeds, not raw "number of requests" and "number of bytes" metrics. One cannot take "1 operation per second" thing from a bucket, instead, it takes an "1 operation" from a bucket, but it appears the re once every second. If it appears once every 1.01 seconds, then one will take "0.99 operations per second", which, in turn, will make it appear back in 1.02 seconds, and so on. This effect disappears in case token bucket is left idle for some time to fully replenish |
Let me try to demonstrate it with some ... weird artificial numbers, just to look at it from a different angle. Let's take a t.b. with a 10 tokens/second rate, and an input flow into the scheduler at the same rate -- 10 requests/second. A tick happens every second.

tick-1: 10 requests in the queue, 10 tokens in the bucket. All can be and are dispatched.

This is how it should work. Now let's imagine that tick-X takes 1.1 seconds.

tick-x: 10 requests completed, 10 tokens are put into the replenishing bucket. 11 requests in the queue. 11 tokens are to be replenished (because 1.1 seconds passed since the last replenish), but (!) only 10 can be, 1 token is lost; only 10 are replenished and only 10 (!) requests are submitted. The queue grew by 1 request.

None of the next ticks will drain this queue, because each second only 10 requests are completed and only 10 tokens are replenished. We've "lost" 1 token and can only get it back if 9 or fewer requests are queued, no other way. Even if tick-y takes 0.9 seconds it won't help, because we'll have 9 requests completed, 9 queued and

upd: one misprint above at the "
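To double-check the arithmetic, here is a tiny standalone simulation of this toy two-bucket model (all names and the structure are mine, not the real io_queue code): after the one long tick, one request stays queued forever.

```cpp
#include <algorithm>
#include <cstdio>

// Toy model of the example above (my own sketch, not the real io_queue code):
// a bucket with a 10 token/s rate, 10 incoming requests per second, tick = 1s.
// One tick (tick 3 here) takes 1.1s instead of 1.0s.
int main() {
    const double rate = 10.0;        // tokens per second
    const double arrival = 10.0;     // requests per second
    double queued = 0.0;             // requests waiting in the scheduler
    double returned = 10.0;          // tokens released by completions last tick

    for (int tick = 1; tick <= 8; ++tick) {
        double duration = (tick == 3) ? 1.1 : 1.0;     // the "long" tick
        queued += arrival * duration;                   // new requests arrive
        double want = rate * duration;                  // time-based replenishment: 11 on the long tick
        double replenished = std::min(want, returned);  // ...but capped by what completions returned
        double dispatched = std::min(queued, replenished);
        queued -= dispatched;
        returned = dispatched;  // simplification: everything dispatched completes within the tick
        std::printf("tick %d: want %.0f, replenished %.0f, dispatched %.0f, still queued %.0f\n",
                    tick, want, replenished, dispatched, queued);
    }
}
```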
We drain tokens at some rate. So if a token represents work, draining at some rate represents work per unit time.
Yes, exactly. And we replenish tokens at some rate, and in order for it to work in sync, the rate of filling the 2nd bucket should be at least the same, or faster. But since it's a rate, a delay in completing a request gives us a larger denominator in the "work / duration" fraction and a lower replenishing rate.
> 11 tokens are to be replenished (because 1.1 seconds passed since the last replenish), but (!) only 10 can be, 1 token is lost; only 10 are replenished and only 10 (!) requests are submitted. The queue grew by 1 request.

Why do we want to replenish 11 and not 10? A token bucket is supposed to represent a bucket filled with tokens, not a bucket filled with "work per unit time".
Because we replenish "the amount of tokens for 1.1 second" which is time * rate = 1.1 * 10 = 11
We should replenish the amount of tokens we got.
This is going to be advanced
By the way, we got 10 tokens in the above example. 11 is the amount of tokens we want as per time-replenishment
One more example of how the back-link makes things worse. Let's assume that we have 5000 requests to dispatch and each tick the scheduler can push 500, as per the rate-limiter. And (!) each tick delays completion of 5% of the dispatched requests. If the dispatch loop just does what it needs -- dispatches 500 requests at a time -- the ticks would look like this:

It takes 11 ticks to process everything: 10 ticks to dispatch plus one extra tick to pick up the delayed completions. Not a big deal. And the average dispatch rate of 5000/10=500 req/sec is almost on par with the completion rate of 5000/11=454 req/sec. Now, if we add a "back-link", i.e. each tick the scheduler can dispatch min(500, completed_on_previous_tick) requests, the result is notably worse:

It takes 15 ticks now, of which one is still the barely noticeable extra tick picking up those 250 delayed requests. The dispatch rate is 357 req/sec and the completion rate is 333 req/sec -- 30% of the throughput ability is lost. This model has a flaw in the sense that in the latter case delayed requests would get completed two/three/... ticks later, yes, but it's a simplification. For more ... real-ish numbers I've created a test in #1724; it shows a very similar effect. It takes more "ticks" to reveal itself, but since each tick is 0.5ms, in real time it shows up in seconds.
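For reference, here is a small standalone sketch reproducing the simplified model above (names are mine; as in the simplification, all delayed requests are picked up in one final extra tick): without the back-link it takes 10 ticks to dispatch and 11 to complete, with the back-link 14 and 15 respectively.

```cpp
#include <algorithm>
#include <cstdio>

// Simplified model from the example above (my sketch): 5000 requests,
// at most 500 dispatched per tick, 5% of each tick's dispatch is delayed,
// and all delayed requests complete in one extra tick at the end.
static void run(bool backlink) {
    double remaining = 5000, completed_prev = 500;
    int ticks = 0;
    while (remaining > 0) {
        double cap = backlink ? std::min(500.0, completed_prev) : 500.0;
        double dispatched = std::min(remaining, cap);
        completed_prev = dispatched * 0.95;  // 5% of this tick's dispatch is delayed
        remaining -= dispatched;
        ++ticks;
    }
    int total = ticks + 1;  // one extra tick picks up all the delayed requests
    std::printf("%s: %d ticks to dispatch, %d to complete, %.0f / %.0f req per tick\n",
                backlink ? "with back-link" : "no back-link",
                ticks, total, 5000.0 / ticks, 5000.0 / total);
}

int main() {
    run(false);   // ~500 dispatch / ~454 completion rate
    run(true);    // ~357 dispatch / ~333 completion rate
}
```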
This is fine and after some time we'll dispatch and complete 475 each tick. But this corresponds to the current implementation as I understand it.
Well, why would we do that?
Because that's how the 2 buckets work. A bucket doesn't get more tokens than were returned upon completion. I.e. if we dispatched 20, then 19 were completed and we replenish the t.b.; even if the time-based replenisher tells us we have 20, we'd still replenish 19.
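A minimal sketch of that rule (my names, not the actual fair_queue code): the effective replenishment is the minimum of what the clock allows and what completions have returned, so a late completion permanently shaves tokens off.

```cpp
#include <algorithm>

// Sketch of the 2-bucket replenishment rule discussed above (my names):
// the head bucket never receives more tokens than completions have returned
// into the tail bucket, even if the clock would allow more.
double replenish(double elapsed_seconds, double rate, double returned_tokens) {
    double time_based = elapsed_seconds * rate;     // e.g. 1.1 s * 10 tok/s = 11
    return std::min(time_based, returned_tokens);   // ...but only 10 were returned
}
```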
The rate-limiter's main equation is

IOPS / IOPS_max + Bandwidth / Bandwidth_max <= 1.0

or, in per-request terms,

sum(weight_i / IOPS_max + size_i / Bandwidth_max) / duration <= 1.0

We implement it via a token bucket with the rate of 1.0 token per second, where each request needs capacity = weight / IOPS_max + size / Bandwidth_max tokens to proceed.

Now let's assume that at some point the rate should be lowered (I'll get back to how we assume it below), because the disk doesn't keep up. It means that in the first formula we want to decrease IOPS_max and Bandwidth_max. In this PR I say that both are scaled down by the same factor gamma > 1.0, because we don't know which of the IOPS or bandwidth limits was overwhelmed for the disk (spoiler: actually most likely it's the latter, but still), which means the new token bucket equation is

IOPS / (IOPS_max / gamma) + Bandwidth / (Bandwidth_max / gamma) <= 1.0

The last equation means that, in order to make the scheduler work with lower IOPS and bandwidth, we may multiply the individual request cost by some number (gamma) larger than 1.0.

Now how do we "assume that IOPS and bandwidth should be lowered"? There are many ways to assume that, but in this PR I say: the IOPS and bandwidth should be lowered in case the scheduler submits more requests per second than the kernel completes per second in the long run. This is where the flow ratio came from.

Summary:
(upd: s/bandwidthbytes/bandwidth/ in some formulas)
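A minimal sketch of the cost-inflation idea under the formulas above (names are mine, not the actual io_queue API): the base capacity formula stays the same, and the slowdown is applied by multiplying it by gamma >= 1 derived from the flow ratio once the ratio exceeds the threshold.

```cpp
#include <cstdint>

// Hypothetical sketch of the idea above; names are mine, not the io_queue API.
struct rate_limits {
    double iops_max;        // maximum requests per second from io_properties
    double bandwidth_max;   // maximum bytes per second from io_properties
};

// Base cost of a request: capacity = weight / IOPS_max + size / Bandwidth_max.
double request_capacity(double weight, uint64_t size, const rate_limits& lim) {
    return weight / lim.iops_max + double(size) / lim.bandwidth_max;
}

// Slowed-down cost: multiplying the capacity by gamma > 1.0 is equivalent to
// dividing both IOPS_max and Bandwidth_max by gamma in the rate-limiter equation.
double slowed_capacity(double weight, uint64_t size, const rate_limits& lim,
                       double flow_ratio, double threshold = 1.5) {
    double gamma = (flow_ratio > threshold) ? flow_ratio : 1.0;
    return gamma * request_capacity(weight, size, lim);
}
```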
+2 points for inline equations
That's pretty simple, now that you've explained it this way. Another way to look at it is that we make the consumption rate smaller than alpha (alpha < 1) instead of smaller than 1.
But measured in consumption units, yes? Especially if the bandwidth is the root cause. But in the really long run, submits have to equal completes, no? So the flow ratio measures whether we're in the process of moving from one equilibrium (where submits equal completes) to another (where they are also equal, but latency is much higher). I guess in the really long run we achieve equilibrium by running out of concurrency, whereas we want to achieve equilibrium by limiting latency.
Is the algorithm stable? Suppose we increased the cost and achieved a flow ratio of 1. We must have, since we can't have a flow ratio > 1 for long periods. Then we jump back to gamma=1. Will we fluctuate between the two modes? One way to test it is to run a test with iotune numbers adjusted upwards by 20%, so we know the disk is slower than what the scheduler thinks it is. Maybe we should monitor latency directly, though that's a very noisy measurement. If moving_average(latency) > latency_goal, increase the cost. We also need to parametrize the algorithm. 100ms works for SSD (maybe too large), but won't work for HDD, and then I have long conversations with @horschi.
Possible, but the 1.0 on the right side of the equation is the token bucket filling rate. It's "shared" between shards, and tuning it in a share-nothing approach is more challenging than letting each shard increase its requests' cost.
No, measured in the number of requests
In the really long run -- probably. But if we push more requests into
Once every 100ms, it is possible in theory, yes. But why is that a problem? Is your point that "we better settle at gamma of, say, 1.01, rather than switching between 1.0 and, say, 1.1"?
I had this idea, but requests of different size and direction would have different latencies; say a 4k write is 50us, a 128k read can be 200us. If the former's latency goes up 3x it's still below the latency goal, but the disk is likely in trouble already.
It is all configurable already, but I just noticed that I forgot to expose the averaging period as an option, only as an io-queue config parameter (the cost-increase threshold is an option in this PR)
Sure, I was more concerned with explaining it to myself than with suggesting a code change.
The kernel queue is really the size of the aio context we allocate (summed across all shards). So we can control submit rate until we fill up all aio slots. Once we reach the limit, submit rate becomes equal to the completion rate.
Why do we care? We have a latency goal. Note if we have small requests, we also send very many of them, so their latency will also grow.
I'd still like to see a test with artificially raised io_properties.yaml. Will it survive?
Yes, that's true
Because a one-shot measurement is not reliable. E.g. with latencies we don't measure disk latency, we measure disk + seastar latency, so even if a one-time measurement goes up it doesn't mean the disk got slow. This is exactly what the current approach has trouble with -- it notices that the short-term completion rate gets lower (because the reactor is a bit late in completing requests). For per-request latencies we need some long-term aggregate, but what can it be? The "expected" latency is not a single number we can get, e.g., some average of.
Agree we can't use point-in-time measurements. But I want to see some kind of logic that explains the idea, not heuristics.
Perhaps we should "learn" the actual disk capacity using a long-term moving average. But wait - that's effectively what you do. But I don't see why the flow ratio is the right trigger. |
(Another explanation from today's call) (Not yet the logic that explains the idea) If the disk is working slower than we think it should according to the io-properties.yaml file, then there can be two reasons for that: either the disk is temporarily overloaded and will get back to its configured speed, or its real capacity is lower than what io-properties.yaml says.

In other words -- the proposed flow-ratio back-link is a way to overcome a temporary slowdown, but not a way to re-evaluate the disk capacity at run-time.
This undermines the whole point of io-properties.yaml. If we can learn the actual disk capacity at runtime, then we can start with a hard-coded default throughput of 100Mb/s and 100 IOPS and "learn" the real numbers eventually. I'm not against that, but this sounds like a completely different direction.
The math behind flow-ratio.

d_i -- the amount of requests dispatched at tick i
p_i -- the amount of requests processed by the disk at tick i
c_i -- the amount of requests completed (observed by the reactor) at tick i

We can observe d_i and c_i in the dispatcher, but not p_i, because we don't have direct access to the disk's queues.

After n ticks we have

D_n = sum(d_i) -- total amount of requests dispatched
P_n = sum(p_i) -- total amount of requests processed by the disk
C_n = sum(c_i) -- total amount of requests completed
Note that D_n > P_n means that the disk is falling behind. Our goal is to make sure the disk doesn't do that and doesn't accumulate a queue, i.e. D_n = P_n, but we cannot observe P_n directly, only C_n.

Next,

D_n = P_n + Qd_n = C_n + Qc_n + Qd_n

where Qd_n is the queue accumulated in the disk (we try to avoid it) and Qc_n is the accumulated delta between processed (by disk) and completed (by reactor) requests. We cannot reliably say that Qc_n goes to zero over time, because there's always some unknown amount of processed but not yet completed requests. So the above formula contains two unknowns and we cannot solve it. Let's try the other way:

D_n / C_n = (D_n / P_n) * (P_n / C_n) = Rd_n * Rc_n

where Rd_n is the ratio of dispatched to processed requests. If it's 1, we're OK; if it's greater, the disk is accumulating a queue, which we try to avoid. Rc_n is the ratio between processed (by disk) and completed (by reactor) requests. It has a not immediately apparent, but very pretty property:

Rc_n = P_n / C_n = (C_n + Qc_n) / C_n = 1 + Qc_n / C_n -> 1 as n grows

because Qc_n doesn't grow over time -- it's non-zero, but it's upper-bounded by some value, while C_n keeps growing. Respectively,

D_n / C_n -> Rd_n in the long run

IOW -- we can say whether the disk is accumulating a queue or not by observing the dispatched-to-completed ratio (to completed, not processed) over a long enough time.
It will be used later, but it is implicitly used already to print the auto-adjust message. BTW, the message can be printed twice, once for each stream, but streams are configured using the same config, so the latency-goal will be the same as well. Signed-off-by: Pavel Emelyanov <[email protected]>
To be used by the next patch. Signed-off-by: Pavel Emelyanov <[email protected]>
The monitor evaluates the EMA value of the dispatch-speed to completion-speed ratio. The ratio is expected to be 1.0, but according to the patch series description, sometimes it can grow larger. The ratio of dispatch and completion rates is fed into the EMA filter every 100ms with a smoothing factor of 0.95; both values are placed in io_queue::config and can later be added to io_properties or options. The calculated value is exported as an io_queue_flow_control metric labeled with mountpoint, shard and IO-group. Signed-off-by: Pavel Emelyanov <[email protected]>
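A rough sketch of what such a monitor could look like (names and structure are mine, not the actual Seastar class): total dispatched/completed counters are sampled periodically, e.g. every 100ms, and the ratio of the per-period deltas is folded into an exponential moving average; I'm assuming the 0.95 smoothing factor is the weight given to the previous value.

```cpp
#include <cstdint>

// Hypothetical flow-ratio monitor sketch; not the real io_queue code.
class flow_ratio_monitor {
    double _ema = 1.0;                 // smoothed dispatched/completed ratio
    const double _smoothing = 0.95;    // weight of the previous EMA value (assumed interpretation)
    uint64_t _last_dispatched = 0;
    uint64_t _last_completed = 0;
public:
    // Called once per period (e.g. every 100ms) with the total counters so far.
    void tick(uint64_t total_dispatched, uint64_t total_completed) {
        uint64_t d = total_dispatched - _last_dispatched;
        uint64_t c = total_completed - _last_completed;
        _last_dispatched = total_dispatched;
        _last_completed = total_completed;
        if (c == 0) {
            return; // nothing completed this period -- skip the sample
        }
        double ratio = double(d) / double(c);
        _ema = _smoothing * _ema + (1.0 - _smoothing) * ratio;
    }
    double flow_ratio() const { return _ema; }
};
```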
... and export flow_ratio. Currently there are only per-class metrics, but the flow-rate is evaluated per-queue. Signed-off-by: Pavel Emelyanov <[email protected]>
According to the patch-series description, the completion rate is always less than or equal to the dispatch rate. Making the dispatch rate not exceed the completion rate makes both creep down in an infinite loop, so remove this backlink and make the IO scheduler work at the configured rate all the time. Signed-off-by: Pavel Emelyanov <[email protected]>
Make the slow-down link less immediate. When the disk is not overloaded it should, in theory, complete as many requests per tick as were dispatched to it previously. IOW, the ratio of dispatched requests to completed requests should be 1.0 or, in the worst case, close to it. In the "long run" it should be 1.0. If it's not, it means that requests tend to accumulate in the disk and that's a reason to slow down. This ratio is tracked with the flow-rate monitor. It estimates the ratio with an exponential moving average filter, adding new drops once every 100ms with a smoothing factor of 0.95.

Next, the current formula for the rate limiter is

IOPS / IOPS_max + Bandwidth / Bandwidth_max <= 1.0

and it's implemented as a token bucket with a replenish rate of 1 token per second and each request requiring capacity = weight / A + size / B tokens (a fraction) to proceed. To slow things down there are two options:

- reduce the 1.0 constant
- increase the capacities of requests

This set walks the second route, as touching the t.b. rate might be too tricky, because the t.b. is shared across shards and shards would step on each other's toes doing that. So instead of doing that, the patch increases the requests' cost by multiplying it by the aforementioned ratio. Since under real load the ratio is always strictly greater than 1.0, a threshold is added below which the scheduler continues working at the configured rate.

Signed-off-by: Pavel Emelyanov <[email protected]>
It includes the basic rate-limiter equation, a brief description of how it maps to the token bucket algo, how the slow-down is performed by just adjusting the request cost, and the flow-ratio justification. Signed-off-by: Pavel Emelyanov <[email protected]>
…o branch-5.2

This backports scylladb/seastar#1766

* br-double-bucket-slowdown-backport-5.2:
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
  fair_queue: Do not re-evaluate request capacity twice
… branch-5.4

This backports scylladb/seastar#1766

* br-double-bucket-slowdown-backport-5.4:
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
…m Pavel Emelyanov

fixes: scylladb#1641
refs: scylladb#1311
refs: scylladb#1492
refs: scylladb#1774

Closes scylladb#1766

* github.com:scylladb/seastar:
  doc: Add document describing all the math behind IO scheduler
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
The scheduler's goal is to make sure that dispatched requests complete no later than after the io-latency-goal duration (which defaults to 1.5 times the reactor CPU latency goal). In cases when the disk or kernel slows down, there's a flow-ratio guard that slows down the dispatch rate as well [1]. However, sometimes it may not be enough and requests delay their execution even more. Slowly executing requests are indirectly reported via the total-disk-delay metric [2], but this metric accumulates several requests into one counter, potentially smoothing out spikes with good requests. This detector is aimed at detecting individual slow requests and logging them along with some related statistics that should help to understand what's going on.

refs: #1766
[1] (eefa837)
[2] (0238d25)
fixes: #1311
closes: #1609

Closes #2371

* github.com:scylladb/seastar:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)
There are three places where the IO dispatch loop is throttled:

* self-throttling with token bucket according to the math model
* per-shard one-tick threshold
* 2-bucket approach when tokens are replenished only after they are released from the disk
This PR removes the last one, because it leads to self-slowdown in case of reactor stalls. This back-link was introduced to catch the case when the disk suddenly slows down, in order to stop the dispatcher from over-loading it with requests, but effectively this back-link measures not the disk's real rate, but the combined disk+kernel+reactor rate. Although the "kernel" part is tiny, the reactor part can grow large, triggering the self slow-down effect.
Here's some math.
Let's assume that at some point the scheduler dispatched N_d requests. It means that it was able to grab N_d tokens in T_d duration; the rate of dispatch is R_d = N_d/T_d. The requests are to be completed by the reactor the next tick. Let's assume it takes T_c time until the reactor gets there and it completes N_c requests. The rate of completion is thus R_c = N_c/T_c. Apparently, N_c <= N_d, because the kernel cannot complete more requests than were queued.
In case the reactor experiences a stall during the completion tick, T_c > T_d, and since N_c <= N_d, consequently N_d/T_d > N_c/T_c. In case the reactor doesn't stall, the number of requests that will complete is N_c = N_d/T_d * T_c, because this is how the dispatch rate is defined. This is equivalent to N_c/T_c = N_d/T_d.
Finally: R_d >= R_c, i.e. the dispatch rate is equal to or greater than the completion rate, where the "equal" part is less likely and only holds if the reactor works like clockwork and doesn't stall.
The mentioned back-link makes sure that R_d <= R_c; coupled with stalls (even small ones) this drives R_d down each tick, causing R_c to go down as well, and so on.
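Written out compactly (same symbols as above, my transcription):

```latex
% The argument above, condensed:
R_d = \frac{N_d}{T_d}, \qquad R_c = \frac{N_c}{T_c}, \qquad N_c \le N_d
% reactor stall during the completion tick:  T_c > T_d  \Rightarrow  R_d > R_c
% no stall:  N_c = \frac{N_d}{T_d} \cdot T_c  \Rightarrow  R_c = R_d
\Rightarrow\quad R_d \ge R_c
% The removed back-link enforces R_d \le R_c on the next tick, so with stalls
% R_d ratchets down tick after tick, dragging R_c down with it.
```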
The removed fuse is replaced with the flow-monitor based on the dispatch-to-completion rate. Normally, the number of requests dispatched over a certain duration divided by the number of requests completed over the same duration must be 1.0. Otherwise that would mean that requests accumulate in the disk. However, this ratio cannot be such immediately, and in the longer run it tends to be slightly greater than 1.0: even if the reactor polls the kernel for IO completions more often, it won't get more completions than requests were dispatched, but even a small delay in polling makes Nr_completed / duration smaller because of the larger denominator value.
Having said that, the new backlink is based on the flow-ratio. When the "average" value of dispatched/completed rates exceeds some threshold (configurable, 1.5 by default), the "cost" of individual requests increases, thus reducing the dispatch rate.
The main difference from the current implementation is that the new backlink is not "immediate". The averaging is an exponential moving average filter with 100ms updates and a 0.95 smoothing factor. The current backlink is immediate in the sense that a delay in delivering a completion immediately slows down the next tick's dispatch, thus accumulating spontaneous reactor micro-stalls.
This can be reproduced by the test introduced in #1724 . It's not (yet) in the PR, but making the token release loop artificially release ~1% more tokens fixes this case, which also supports the theory that the reduced completion rate is the culprit. BTW, it cannot be the fix, because the ... over-release factor is not constant and it's hard to calculate.
fixes: #1641
refs: #1311
refs: #1492 (*) in fact, this is the metric that correlates with the flow ratio growing above 1.0, but this metric is sort of a look at quota violations from the IO angle
refs: #1774 (that PR has attached metrics screenshots demonstrating the effect on stressed scylla)