
admission: ioLoadListener compaction token calculation is too abrupt #91519

Closed
sumeerbhola opened this issue Nov 8, 2022 · 7 comments · Fixed by #104577
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Nov 8, 2022

ioLoadListener calculates multiple types of tokens, one of which is based on compaction bandwidth out of L0. Compaction bandwidth capacity out of L0 is hard to predict.

  • Pebble may not be using all the compaction concurrency available. And if Pebble were to use all the compaction concurrency (which itself may be variable in the future), it is hard to know how much more will be given to L0, since there is sophisticated scoring happening in the level compaction decision-making. Note that this is unlike flushes, where we do have a dedicated concurrency of 1 and do make predictions based on idle time.
  • Related to the scoring, the allocation of compaction capacity to L0 can vary.

For these reasons we have used a measurement-based approach with exponential smoothing, where the measurements are taken only when we know there is some backlog, so all compactions ought to be running. At a high level I think we can continue with this approach. The problem is that we have abrupt behavior:
above an unhealthy threshold (actually a predicate defined by the disjunction sublevel-count > L or file-count > F), we use the compaction bandwidth (C) to allocate C/2 tokens. Below the unhealthy threshold, the token count is infinity.

This results in bursty admission behavior where we go over the threshold, restrict tokens for a few intervals (each interval is 15s long), then go below the threshold and have unlimited tokens and admit everything, which again puts us above the threshold. It is typical to see something like 2-3 intervals above the threshold and then 1 interval below. This is bad, but the badness is somewhat limited because (a) the admitted requests have to evaluate, which steals time away from the admitting logic, and (b) our typical workloads don't have huge concurrency, so the waiting requests are limited by that concurrency.
With replication admission control we will make this worse by doing logical admission of all the waiting requests when we switch from above the threshold to below, causing another big fan-in burst (https://docs.google.com/document/d/1iCfSlTO0P6nvoGC6sLGB5YqMOcO047CMpREgG_NSLCw/edit#heading=h.sw7pci2vwkk3).
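
For concreteness, here is a minimal Go sketch of the current abrupt token function as described above; the function and parameter names are illustrative only, not the actual ioLoadListener code:

```go
package main

import (
	"fmt"
	"math"
)

// abruptCompactionTokens illustrates the current behavior: above the
// unhealthy threshold (sublevel-count > l || file-count > f) we hand out
// c/2 tokens per interval, where c is the smoothed compaction bandwidth
// out of L0; below the threshold, tokens are effectively unlimited.
func abruptCompactionTokens(c float64, sublevels, files, l, f int) float64 {
	if sublevels > l || files > f {
		return c / 2
	}
	return math.Inf(1)
}

func main() {
	c := float64(1 << 30) // e.g. 1 GiB compacted out of L0 in a 15s interval
	fmt.Println(abruptCompactionTokens(c, 21, 400, 20, 1000)) // over threshold: c/2
	fmt.Println(abruptCompactionTokens(c, 19, 400, 20, 1000)) // under threshold: +Inf
}
```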

Instead we should switch to a piecewise linear function for defining the tokens. Let us define a sub-level count threshold L and a file-count threshold F at which we would like to be roughly stable under overload. Say L=10 and F=500. These are half the current defaults of 20 and 1000 since (a) the current thresholds are higher than what we would like to sustain, and (b) we will keep the current C/2 logic at 2L and 2F. Regardless, L and F are configurable.

Then we define a score = max(sublevel-count/L, file-count/F). The compaction token function is:

  • score < 1 : unlimited
  • score in [1, 2): tokens = -C/2 x score + 3C/2
    This means C tokens when score=1, decreasing linearly to C/2 tokens at score=2.
  • score >= 2: tokens = C/2
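
A minimal Go sketch of this proposed piecewise linear function (illustrative names, not the actual ioLoadListener code; L=10 and F=500 are the example values above):

```go
package main

import (
	"fmt"
	"math"
)

// proposedCompactionTokens computes the piecewise linear token function:
// unlimited tokens below score 1, a linear ramp from c down to c/2 on
// [1, 2), and c/2 at score >= 2. c is the smoothed compaction bandwidth
// out of L0.
func proposedCompactionTokens(c, sublevels, files, l, f float64) float64 {
	score := math.Max(sublevels/l, files/f)
	switch {
	case score < 1:
		return math.Inf(1) // unlimited
	case score < 2:
		return -c/2*score + 3*c/2 // c at score=1, c/2 at score=2
	default:
		return c / 2
	}
}

func main() {
	c := float64(1 << 30)
	fmt.Println(proposedCompactionTokens(c, 5, 100, 10, 500))  // score=0.5: +Inf
	fmt.Println(proposedCompactionTokens(c, 15, 100, 10, 500)) // score=1.5: 0.75*c
	fmt.Println(proposedCompactionTokens(c, 25, 100, 10, 500)) // score=2.5: c/2
}
```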

Jira issue: CRDB-21299

Epic CRDB-25469

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Nov 8, 2022
@irfansharif
Contributor

I see the following throughput graph for kv0 that we run nightly. Once we smooth out token calculations, I wonder if we'll see throughput smoothing here too. Also see this internal thread where we observe throughput oscillations in a closed-loop YCSB run when IO tokens get exhausted.


@sumeerbhola
Collaborator Author

> Once we smooth out token calculations, I wonder if we'll see throughput smoothing here too

Correct. I have been abusing these jagged graphs to figure out which of the roachperf workloads are IO limited.

@irfansharif
Contributor

Something else to investigate when working on this. @andreimatei was running write-heavy YCSB with AC switched off, and then suddenly switched on, after which throughput completely collapsed for 2m. Was this due to a lack of #95563? Or something else? Discussed internally here.


@bananabrick
Contributor

Ran kv0 against #104577, and I still see too much fluctuation in the write throughput. Will figure out the problem tomorrow.

@bananabrick
Contributor

bananabrick commented Jun 13, 2023

Posting results of kv0/enc=false/nodes=32/cpu=32/size=4kb for master vs #104577.

master vs #104577
[Screenshot 2023-06-12 at 11:51:54 PM]

We see a few new behaviours:

  1. The ops/sec is consistently slightly lower, and the latencies are consistently slightly higher. I think this makes sense given that we start throttling at 500 files or 10 sublevels, whereas previously we would only throttle once we were over 1000 files or 20 sublevels:

    master
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0       13020962         7233.9     26.5     16.8    100.7    130.0   4831.8  write

    #104577
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0       12334182         6852.3     28.0     16.8     83.9    113.2  14495.5  write

  2. While the result is mostly smoothed, there are still some spikes. I think these are unavoidable as long as we give out unlimited tokens when the score we compute is < 1, i.e. max(sublevels/10, files/500) < 1. We could probably avoid the spikes by giving out smoothedIntL0CompactedBytes tokens when the score is < 1, but I'm not sure if we want that.
  • Note that each spike lasts for a shorter duration. This is because we start restricting tokens at 10 sublevels rather than 20.
  • There are fewer spikes. This is because there is a smaller range of sublevels (0-10) where we grant unlimited tokens.
  3. There are two places where the ops/sec drops to 0. This isn't consistently reproducible across runs, and we see the ops/sec also drop to 0 on master.

I also had a DB Console link with some metrics for both master and #104577, but the cluster got wiped.

@bananabrick
Contributor

Flushing comment from slack:

Was trying to figure out the 3-second drop in ops/sec which we discussed in the meeting: https://grafana.testeng.crdb.io/d/ngnKs_j7x/admission-control?orgId=1&var-cluster=nair-test&var-instances=All&from=1686617577813&to=1686630838914. The beginning of the drop correlates with a drop in the "Leaseholders" dashboard for node 3, and then we have another drop for a few seconds when the "Leaseholders" count for node 3 starts increasing again.

In my second run of the experiment with smoothing enabled, we don't see any drops. I don't think the drops were caused by the smoothing.

[Screenshot 2023-06-13 at 11:51:50 AM]

I'm going to try and figure out the delays/exhaustion.

@bananabrick
Contributor

bananabrick commented Jun 14, 2023

[Screenshot 2023-06-14 at 7:46:22 PM]

Token exhaustion correlates directly with tokens taken without permission.

craig bot pushed a commit that referenced this issue Jun 17, 2023
104577: admission: smooth compaction token calculation r=bananabrick a=bananabrick

We change the way we compute the score for compaction token
calculation, and use that score to smooth the calculation.

The score is calculated as max(sublevels/20, l0files/1000).

Prior to this commit, we were overloaded if score was above 1 and
underloaded if score was below 1. This meant that it was easy to
have an underloaded system with unlimited tokens, and then become
overloaded due to the unlimited tokens. Note that our score calculation
also uses the admission control overload thresholds of 20 sublevels and 1000 files.

In this PR, we keep the previous definition of score, but we are only
overloaded if score >= 1. We consider score < 0.5 as underload, and score
in [0.5, 1) as medium load.

During underload, we still give out unlimited tokens, and during
overload we still give out C/2 tokens, where C = smoothedIntL0CompactedBytes.

But during medium load we hand out (C/2, C]  tokens. This scheme will
ideally remove the sharp spikes in the granted tokens due to switching
back and forth between overloaded and underloaded systems.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-25469
Fixes: #91519
Release note: None

Co-authored-by: Arjun Nair <[email protected]>
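
For reference, a minimal Go sketch of the scheme described in the commit message above; the linear interpolation inside the medium-load band is an assumption for illustration, and the names are not the actual merged code:

```go
package main

import (
	"fmt"
	"math"
)

// mergedCompactionTokens follows the tiers from the commit message:
// score = max(sublevels/20, l0Files/1000); underload (score < 0.5) gets
// unlimited tokens, overload (score >= 1) gets c/2, and the medium-load
// band [0.5, 1) gets something in (c/2, c], here a linear interpolation.
func mergedCompactionTokens(c, sublevels, l0Files float64) float64 {
	score := math.Max(sublevels/20, l0Files/1000)
	switch {
	case score < 0.5:
		return math.Inf(1)
	case score < 1:
		return c * (1.5 - score) // c at score=0.5, approaching c/2 near score=1
	default:
		return c / 2
	}
}

func main() {
	c := float64(1 << 30) // c stands in for smoothedIntL0CompactedBytes per interval
	fmt.Println(mergedCompactionTokens(c, 5, 100))  // score=0.25: +Inf
	fmt.Println(mergedCompactionTokens(c, 15, 100)) // score=0.75: 0.75*c
	fmt.Println(mergedCompactionTokens(c, 25, 100)) // score=1.25: c/2
}
```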
@craig craig bot closed this as completed in 43cd5ac Jun 17, 2023