Disk stall detector #1311

Closed
xemul opened this issue Nov 28, 2022 · 4 comments · Fixed by #2371

Comments


xemul commented Nov 28, 2022

Similarly to the CPU stall detector, which prints warnings and call traces when the reactor doesn't tick when it should, it would be nice to have an IO stall detector that warns us about the disk being slow.

The scylla_io_queue_total_exec_sec metric exists, but short spikes can still fly under the radar.


bhalevy commented Apr 11, 2023

@xemul what else needs to be done here other than #1492?


xemul commented Apr 11, 2023

I have the branch already :) Need to polish it and push it.

xemul added a commit to xemul/seastar that referenced this issue Apr 11, 2023
The stall threshold is the duration within which a request is expected to get
dispatched and executed. In the current model it's expected to be the
io-latency-goal value, but it's better to have it configurable.

When the threshold is exceeded, the report helper is called and the
threshold is doubled. Once in a while the threshold is lowered again to
allow for more reports in the future.

Reported are the number of queued requests and the number of currently
executing requests, both for the whole queue and per class.

fixes: scylladb#1311

Signed-off-by: Pavel Emelyanov <[email protected]>
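
As a rough illustration of the scheme described in the commit message above, here is a minimal sketch; the class and member names (io_stall_detector, on_request_complete, relax) are hypothetical, and the actual Seastar code may be structured differently.

```cpp
// Hypothetical sketch of the adaptive stall-threshold logic described above.
// Names are illustrative, not Seastar's API.
#include <algorithm>
#include <chrono>

class io_stall_detector {
    using clock = std::chrono::steady_clock;
    clock::duration _base_threshold;   // e.g. the io-latency-goal value
    clock::duration _threshold;        // current (possibly doubled) threshold
public:
    explicit io_stall_detector(clock::duration goal)
        : _base_threshold(goal), _threshold(goal) {}

    // Called when a request completes; 'latency' is its dispatch-to-completion time.
    template <typename Reporter>
    void on_request_complete(clock::duration latency, Reporter&& report) {
        if (latency > _threshold) {
            report(latency, _threshold);   // e.g. log queued/executing request counts
            _threshold *= 2;               // back off so reports don't flood the log
        }
    }

    // Called once in a while to re-arm more sensitive reporting.
    void relax() {
        _threshold = std::max(_base_threshold, _threshold / 2);
    }
};
```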

mykaul commented Apr 23, 2023

With the issues we've seen with some customer setups, I wonder if this could be helpful. If so, @xemul - please try to push this (so it gets into 5.3).


xemul commented Apr 24, 2023

With the issues we've seen with some customer setup, I wonder if this could be helpful?

This was implemented this way precisely after looking at the issue on that "some customer" cluster.

avikivity added a commit that referenced this issue Oct 17, 2023
…m Pavel Emelyanov

There are three places where the IO dispatch loop is throttled:

  * self-throttling with a token bucket according to the math model (see the sketch below)
  * a per-shard one-tick threshold
  * a 2-bucket approach where tokens are replenished only after they are released by the disk

This PR removes the last one, because it leads to self-slowdown in case of reactor stalls. This back-link was introduced to catch the case when the disk suddenly slows down, so as to stop dispatching and not over-load it with requests, but effectively it measures not the real disk dispatch rate, but the disk+kernel+reactor dispatch rate. Although the "kernel" part is tiny, the reactor part can grow large, triggering the self-slowdown effect.
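
For reference, the first bullet's token-bucket style throttling can be sketched roughly as follows. This is a generic illustration, not Seastar's actual implementation; the rate and capacity parameters stand in for whatever values the math model produces.

```cpp
// Generic token-bucket throttler sketch (illustrative only).
#include <algorithm>
#include <chrono>

class token_bucket {
    using clock = std::chrono::steady_clock;
    double _rate;            // tokens replenished per second (from the math model)
    double _limit;           // bucket capacity
    double _tokens;          // currently available tokens
    clock::time_point _last; // last replenish time
public:
    token_bucket(double rate, double limit)
        : _rate(rate), _limit(limit), _tokens(limit), _last(clock::now()) {}

    // Try to grab 'cost' tokens for a request; returning false delays the dispatch.
    bool try_dispatch(double cost) {
        auto now = clock::now();
        std::chrono::duration<double> elapsed = now - _last;
        _last = now;
        _tokens = std::min(_limit, _tokens + _rate * elapsed.count());
        if (_tokens < cost) {
            return false;
        }
        _tokens -= cost;
        return true;
    }
};
```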

Here's some math.

Let's assume that at some point the scheduler dispatched N_d requests. It means that it was able to grab N_d tokens within a duration T_d, so the dispatch rate is R_d = N_d/T_d. The requests are to be completed by the reactor on its next tick. Let's assume it takes T_c time until the reactor gets there and completes N_c requests. The completion rate is thus R_c = N_c/T_c. Obviously N_c <= N_d, because the kernel cannot complete more requests than were queued to it.

In case the reactor experiences a stall during the completion tick, T_c > T_d, and since N_c <= N_d it follows that N_d/T_d > N_c/T_c. In case the reactor doesn't stall, the number of requests that will complete is N_c = N_d/T_d * T_c, because this is how the dispatch rate is defined. This is equivalent to N_c/T_c = N_d/T_d.

Finally: R_d >= R_c, i.e. the dispatch rate is greater than or equal to the completion rate, where the "equal" part is less likely and only holds if the reactor runs like clockwork and doesn't stall.

The mentioned back-link makes sure that R_d <= R_c; coupled with stalls (even small ones), this drives R_d down each tick, causing R_c to go down as well, and so on.
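
For readability, the same argument restated as compact formulas (nothing is assumed beyond the definitions above):

```latex
% Definitions from the prose above.
R_d = \frac{N_d}{T_d}, \qquad R_c = \frac{N_c}{T_c}, \qquad N_c \le N_d
% If the reactor stalls during the completion tick:
T_c > T_d \;\wedge\; N_c \le N_d \;\Rightarrow\; \frac{N_d}{T_d} > \frac{N_c}{T_c}
% If the reactor doesn't stall:
N_c = \frac{N_d}{T_d}\, T_c \;\Rightarrow\; \frac{N_c}{T_c} = \frac{N_d}{T_d}
% In either case:
R_d \ge R_c
```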

The removed fuse is replaced with a flow monitor based on the dispatch-to-completion ratio. Normally, the number of requests dispatched over a certain duration divided by the number of requests completed over the same duration must be 1.0; otherwise requests would accumulate in the disk. However, the ratio cannot hold that value immediately, and in the longer run it tends to be slightly greater than 1.0: even if the reactor polls the kernel for IO completions more often, it won't get more completions than requests were dispatched, while even a small delay in polling makes Nr_completed / duration smaller because of the larger denominator.

Having said that, the new back-link is based on this flow ratio. When the "average" value of the dispatched/completed ratio exceeds some threshold (configurable, 1.5 by default), the "cost" of individual requests increases, thus reducing the dispatch rate.

The main difference from the current implementation is that the new back-link is not "immediate". The averaging is an exponential moving average filter with 100ms updates and a 0.95 smoothing factor. The current back-link is immediate in the sense that a delay in delivering a completion immediately slows down the next tick's dispatch, thus accumulating spontaneous reactor micro-stalls.
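
A rough sketch of such a flow monitor is shown below. The class and member names are hypothetical; the 100ms update period, 0.95 smoothing factor and 1.5 threshold are the values quoted above.

```cpp
// Illustrative flow-ratio back-link sketch; not Seastar's actual io_queue code.
#include <cstdint>

class flow_monitor {
    static constexpr double smoothing = 0.95;   // EMA smoothing factor (per the PR text)
    static constexpr double threshold = 1.5;    // default flow-ratio threshold (per the PR text)
    uint64_t _prev_dispatched = 0;
    uint64_t _prev_completed = 0;
    double _ratio_ema = 1.0;
public:
    // Called every ~100ms with running totals of dispatched and completed requests.
    void update(uint64_t total_dispatched, uint64_t total_completed) {
        uint64_t d = total_dispatched - _prev_dispatched;
        uint64_t c = total_completed - _prev_completed;
        _prev_dispatched = total_dispatched;
        _prev_completed = total_completed;
        if (c == 0) {
            return; // nothing completed in this window; avoid division by zero
        }
        double ratio = double(d) / double(c);
        _ratio_ema = smoothing * _ratio_ema + (1.0 - smoothing) * ratio;
    }

    // When the averaged ratio exceeds the threshold, request costs are scaled up,
    // which in turn lowers the dispatch rate.
    double cost_multiplier() const {
        return _ratio_ema > threshold ? _ratio_ema : 1.0;
    }
};
```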

This can be reproduced by the test introduced in #1724. It's not (yet) in the PR, but making the token release loop artificially release ~1% more tokens fixes this case, which also supports the theory that the reduced completion rate is the culprit. BTW, it cannot be the fix, because the ... over-release factor is not constant and is hard to calculate.

fixes: #1641
refs: #1311
refs: #1492 (*) in fact, _this_ is the metric that correlates with the flow ratio growing above 1.0, but this metric is sort of a view of quota violations from the IO angle
refs: #1774 this PR has attached metrics screenshots demonstrating the effect on a stressed scylla

Closes #1766

* github.com:scylladb/seastar:
  doc: Add document describing all the math behind IO scheduler
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
graphcareful pushed a commit to graphcareful/seastar that referenced this issue Mar 20, 2024