
admission: ioLoadListener compaction token calculation is too abrupt #91519

Closed
sumeerbhola opened this issue Nov 8, 2022 · 7 comments · Fixed by #104577
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Nov 8, 2022

ioLoadListener calculates multiple types of tokens, one of which is based on compaction bandwidth out of L0. Compaction bandwidth capacity out of L0 is hard to predict.

  • Pebble may not be using all the compaction concurrency available. And if Pebble were to use all the compaction concurrency (which itself may be variable in the future), it is hard to know how much more will be given to L0, since there is sophisticated scoring happening in the level compaction decision-making. Note that this is unlike flushes, where we do have a dedicated concurrency of 1 and do make predictions based on idle time.
  • Related to the scoring, the allocation of compaction capacity to L0 can vary.

For these reasons we have used a measurement-based approach with exponential smoothing, where the measurements are taken only when we know there is some backlog, so all compactions ought to be running. At a high level I think we can continue with this approach. The problem is that we have abrupt behavior:
above an unhealthy threshold (actually a predicate defined by the disjunction sublevel-count > L or file-count > F), we use the compaction bandwidth (C) to allocate C/2 tokens. Below the unhealthy threshold, the token count is infinity.

This results in bursty admission behavior where we go over the threshold, restrict tokens for a few intervals (each interval is 15s long), then go below the threshold and have unlimited tokens and admit everything, which again puts us above the threshold. It is typical to see something like 2-3 intervals above the threshold and then 1 interval below. This is bad, but the badness is somewhat limited because (a) the admitted requests have to evaluate, which steals time away from the admitting logic, and (b) our typical workloads don't have huge concurrency, so the waiting requests are limited by that concurrency.
With replication admission control we will make this worse by doing logical admission of all the waiting requests when we switch from above the threshold to below, causing another big fan-in burst (https://docs.google.com/document/d/1iCfSlTO0P6nvoGC6sLGB5YqMOcO047CMpREgG_NSLCw/edit#heading=h.sw7pci2vwkk3).
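
For concreteness, here is a minimal Go sketch of the current abrupt token function as described above; the function and parameter names are illustrative only, not the actual ioLoadListener code:

```go
package main

import (
	"fmt"
	"math"
)

// abruptCompactionTokens illustrates the current behavior: above the
// unhealthy threshold (sublevel-count > l || file-count > f) we hand out
// c/2 tokens per interval, where c is the smoothed compaction bandwidth
// out of L0; below the threshold, tokens are effectively unlimited.
func abruptCompactionTokens(c float64, sublevels, files, l, f int) float64 {
	if sublevels > l || files > f {
		return c / 2
	}
	return math.Inf(1)
}

func main() {
	c := float64(1 << 30) // e.g. 1 GiB compacted out of L0 in a 15s interval
	fmt.Println(abruptCompactionTokens(c, 21, 400, 20, 1000)) // over threshold: c/2
	fmt.Println(abruptCompactionTokens(c, 19, 400, 20, 1000)) // under threshold: +Inf
}
```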

Instead we should switch to a piecewise linear function for defining the tokens. Let us define a sub-level count threshold L and a file-count threshold F at which we would like to be roughly stable under overload. Say L=10 and F=500. These are half the current defaults of 20 and 1000 since (a) the current thresholds are higher than what we would like to sustain, and (b) we will keep the current C/2 logic at 2L and 2F. Regardless, L and F are configurable.

Then we define a score = max(sublevel-count/L, file-count/F). The compaction token function is:

  • score < 1 : unlimited
  • score in [1, 2): tokens = -C/2 x score + 3C/2
    This means C tokens when score=1, decreasing linearly to C/2 tokens at score=2.
  • score >= 2: tokens = C/2
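
A minimal Go sketch of this proposed piecewise linear function (illustrative names, not the actual ioLoadListener code; L=10 and F=500 are the example values above):

```go
package main

import (
	"fmt"
	"math"
)

// proposedCompactionTokens computes the piecewise linear token function:
// unlimited tokens below score 1, a linear ramp from c down to c/2 on
// [1, 2), and c/2 at score >= 2. c is the smoothed compaction bandwidth
// out of L0.
func proposedCompactionTokens(c, sublevels, files, l, f float64) float64 {
	score := math.Max(sublevels/l, files/f)
	switch {
	case score < 1:
		return math.Inf(1) // unlimited
	case score < 2:
		return -c/2*score + 3*c/2 // c at score=1, c/2 at score=2
	default:
		return c / 2
	}
}

func main() {
	c := float64(1 << 30)
	fmt.Println(proposedCompactionTokens(c, 5, 100, 10, 500))  // score=0.5: +Inf
	fmt.Println(proposedCompactionTokens(c, 15, 100, 10, 500)) // score=1.5: 0.75*c
	fmt.Println(proposedCompactionTokens(c, 25, 100, 10, 500)) // score=2.5: c/2
}
```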

Jira issue: CRDB-21299

Epic CRDB-25469

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Nov 8, 2022
@irfansharif
Contributor

I see the following throughput graph for kv0 that we run nightly. Once we smooth out token calculations, I wonder if we'll see throughput smoothing here too. Also see this internal thread where we observe throughput oscillations in a closed-loop YCSB run when IO tokens get exhausted.


@sumeerbhola
Collaborator Author

> Once we smooth out token calculations, I wonder if we'll see throughput smoothing here too

Correct. I have been abusing these jagged graphs to figure out which of the roachperf workloads are IO limited.

@irfansharif
Contributor

Something else to investigate when working on this. @andreimatei was running write-heavy YCSB with AC switched off, and then suddenly switched on, after which throughput completely collapsed for 2m. Was this due to a lack of #95563? Or something else? Discussed internally here.


@bananabrick
Contributor

Ran kv0 against #104577, and I still see too much fluctuation in the write throughput. Will figure out the problem tomorrow.

@bananabrick
Contributor

bananabrick commented Jun 13, 2023

Posting results of kv0/enc=false/nodes=32/cpu=32/size=4kb for master vs #104577.

master vs #104577
[Screenshot 2023-06-12 at 11:51:54 PM]

We see a few new behaviours:

  1. The ops/sec is consistently slightly lower, and the latencies are consistently slightly higher. I think this makes sense given that we start throttling at 500 files or 10 sublevels, whereas previously we would only throttle once we were over 1000 files or 20 sublevels:

    master
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0       13020962         7233.9     26.5     16.8    100.7    130.0   4831.8  write

    #104577
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0       12334182         6852.3     28.0     16.8     83.9    113.2  14495.5  write

  2. While the result is mostly smoothed, there are still some spikes. I think these are unavoidable as long as we give out unlimited tokens when the score we compute is < 1, i.e. max(sublevels/10, files/500) < 1. We could probably avoid the spikes by giving out smoothedIntL0CompactedBytes tokens when the score is < 1, but I'm not sure if we want that.
  • Note that each spike lasts for a shorter duration. This is because we start restricting tokens at 10 sublevels rather than 20.
  • There are fewer spikes. This is because there is a smaller range of sublevels (0-10) where we grant unlimited tokens.
  3. There are two places where the ops/sec drops to 0. This isn't consistently reproducible across runs, and we see the ops/sec also drop to 0 on master.

I also had a DB Console link with some metrics for both master and #104577, but the cluster got wiped.

@bananabrick
Contributor

Flushing comment from slack:

Was trying to figure out the 3-second drop in ops/sec which we discussed in the meeting: https://grafana.testeng.crdb.io/d/ngnKs_j7x/admission-control?orgId=1&var-cluster=nair-test&var-instances=All&from=1686617577813&to=1686630838914. The beginning of the drop correlates with a drop in the "Leaseholders" dashboard for node 3, and then we have another drop for a few seconds when the "Leaseholders" count for node 3 starts increasing again.

In my second run of the experiment with smoothing enabled, we don't see any drops. I don't think the drops were caused by the smoothing.

[Screenshot 2023-06-13 at 11:51:50 AM]

I'm going to try and figure out the delays/exhaustion.

@bananabrick
Contributor

bananabrick commented Jun 14, 2023

[Screenshot 2023-06-14 at 7:46:22 PM]

Token exhaustion correlates directly with tokens taken without permission.

craig bot pushed a commit that referenced this issue Jun 17, 2023
104577: admission: smooth compaction token calculation r=bananabrick a=bananabrick

We change the way we compute the score for compaction token
calculation, and use that score to smooth the calculation.

The score is calculated as max(sublevels/20, l0files/1000).

Prior to this commit, we were overloaded if score was above 1 and
underloaded if score was below 1. This meant that it was easy to
have an underloaded system with unlimited tokens, and then become
overloaded due to the unlimited tokens. Note that our score calculation
also uses the admission control overload thresholds of 20 sublevels and 1000 files.

In this PR, we keep the previous definition of score, but we are only
overloaded if score >= 1. We consider score < 0.5 as underload, and score
in [0.5, 1) as medium load.

During underload, we still give out unlimited tokens, and during
overload we still give out C/2 tokens, where C = smoothedIntL0CompactedBytes.

But during medium load we hand out (C/2, C]  tokens. This scheme will
ideally remove the sharp spikes in the granted tokens due to switching
back and forth between overloaded and underloaded systems.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-25469
Fixes: #91519
Release note: None

Co-authored-by: Arjun Nair <[email protected]>
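
For reference, a minimal Go sketch of the scheme described in the commit message above; the linear interpolation inside the medium-load band is an assumption for illustration, and the names are not the actual merged code:

```go
package main

import (
	"fmt"
	"math"
)

// mergedCompactionTokens follows the tiers from the commit message:
// score = max(sublevels/20, l0Files/1000); underload (score < 0.5) gets
// unlimited tokens, overload (score >= 1) gets c/2, and the medium-load
// band [0.5, 1) gets something in (c/2, c], here a linear interpolation.
func mergedCompactionTokens(c, sublevels, l0Files float64) float64 {
	score := math.Max(sublevels/20, l0Files/1000)
	switch {
	case score < 0.5:
		return math.Inf(1)
	case score < 1:
		return c * (1.5 - score) // c at score=0.5, approaching c/2 near score=1
	default:
		return c / 2
	}
}

func main() {
	c := float64(1 << 30) // c stands in for smoothedIntL0CompactedBytes per interval
	fmt.Println(mergedCompactionTokens(c, 5, 100))  // score=0.25: +Inf
	fmt.Println(mergedCompactionTokens(c, 15, 100)) // score=0.75: 0.75*c
	fmt.Println(mergedCompactionTokens(c, 25, 100)) // score=1.25: c/2
}
```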
@craig craig bot closed this as completed in 43cd5ac Jun 17, 2023