admission: ioLoadListener compaction token calculation is too abrupt #91519
Comments
I see the following throughput graph for kv0 that we run nightly. Once we smooth out token calculations, I wonder if we'll see throughput smoothing here too. Also see this internal thread where we observe throughput oscillations in a closed-loop YCSB run when IO tokens get exhausted.
Correct. I have been abusing these jagged graphs to figure out which of the roachperf workloads are IO limited.
Something else to investigate when working on this: @andreimatei was running write-heavy YCSB with AC switched off, and then suddenly switched it on, after which throughput completely collapsed for 2m. Was this due to a lack of #95563? Or something else? Discussed internally here.
Ran kv0 against #104577, and I still see too much fluctuation in the write throughput. Will figure out the problem tomorrow.
Posting results of kv0/enc=false/nodes=32/cpu=32/size=4kb for master vs #104577. We see a couple of new behaviours:
I also had a DB Console link with some metrics for both master and #104577, but the cluster got wiped.
Flushing a comment from Slack: I was trying to figure out the drop in ops/sec for 3 seconds which we discussed in the meeting: https://grafana.testeng.crdb.io/d/ngnKs_j7x/admission-control?orgId=1&var-cluster=nair-test&var-instances=All&from=1686617577813&to=1686630838914. The beginning of the drop correlates with a drop in the "Leaseholders" dashboard for node 3, and then we have another drop for a few seconds when the "Leaseholders" count for node 3 starts increasing again. In my second run of the experiment with smoothing enabled, we don't see any drops. I don't think the drops were caused by the smoothing. I'm going to try and figure out the delays/exhaustion.
104577: admission: smooth compaction token calculation r=bananabrick a=bananabrick

We change the way we compute the score for compaction token calculation, and use that score to smooth the calculation. The score is calculated as max(sublevels/20, l0files/1000). Prior to this commit, we were overloaded if the score was above 1 and underloaded if it was below 1. This meant that it was easy to have an underloaded system with unlimited tokens, and then become overloaded due to those unlimited tokens. Note that the score calculation uses the admission control overload thresholds of 20 and 1000.

In this PR, we keep the previous definition of the score, but we are only overloaded if score >= 1. We consider score < 0.5 as underload, and score in [0.5, 1) as medium load. During underload, we still give out unlimited tokens, and during overload we still give out C = smoothedIntL0CompactedBytes / 2 tokens. But during medium load we hand out (C/2, C] tokens. This scheme will ideally remove the sharp spikes in the granted tokens due to switching back and forth between overloaded and underloaded systems.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-25469
Fixes: #91519
Release note: None

Co-authored-by: Arjun Nair <[email protected]>
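To make the regimes concrete, here is a minimal Go sketch of the score and load classification described in the PR text above. The example inputs are illustrative, and the token amounts in the comments simply echo the PR's description rather than the actual implementation.

```go
package main

import (
	"fmt"
	"math"
)

// loadRegime classifies load the way the PR text describes: the score is
// max(sublevels/20, l0files/1000); score < 0.5 is underload (unlimited
// tokens), score in [0.5, 1) is medium load, and score >= 1 is overload
// (C = smoothedIntL0CompactedBytes / 2 tokens, per the PR text).
func loadRegime(sublevels, l0Files int) (score float64, regime string) {
	score = math.Max(float64(sublevels)/20, float64(l0Files)/1000)
	switch {
	case score < 0.5:
		regime = "underload"
	case score < 1:
		regime = "medium"
	default:
		regime = "overload"
	}
	return score, regime
}

func main() {
	// Illustrative sub-level and L0 file counts.
	for _, c := range [][2]int{{5, 100}, {12, 400}, {25, 1200}} {
		s, r := loadRegime(c[0], c[1])
		fmt.Printf("sublevels=%d l0files=%d score=%.2f regime=%s\n", c[0], c[1], s, r)
	}
}
```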
ioLoadListener calculates multiple types of tokens, one of which is based on compaction bandwidth out of L0. Compaction bandwidth capacity out of L0 is hard to predict. For this reason we have used a measurement-based approach with exponential smoothing, where the measurements are taken only when we know there is some backlog, so all compactions ought to be running. At a high level I think we can continue with this approach. The problem is that we have abrupt behavior:
Above an unhealthy threshold (actually a predicate defined by the disjunction sublevel-count > L or file-count > F), we use the compaction bandwidth C to allocate C/2 tokens. Below the unhealthy threshold, the token count is infinite.
This results in bursty admission behavior where we go over the threshold, restrict tokens for a few intervals (each interval is 15s long), and then go below the threshold and have unlimited tokens and admit everything, which again puts us above the threshold. It is typical to see something like 2-3 intervals above the threshold and then 1 interval below. This is bad, but the badness is somewhat restricted because (a) the admitted requests have to evaluate, which steals time away from the admitting logic, and (b) our typical workloads don't have huge concurrency, so the waiting requests are limited by that concurrency.
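To make the abrupt behavior concrete, here is a minimal Go sketch of the pre-change logic described above, under stated assumptions: the threshold values (20 sub-levels, 1000 files) are the defaults mentioned later in this issue, and the smoothing constant alpha is a placeholder rather than the value the implementation actually uses.

```go
package main

import (
	"fmt"
	"math"
)

const (
	sublevelThreshold  = 20   // default unhealthy sub-level count mentioned in this issue
	fileCountThreshold = 1000 // default unhealthy L0 file count mentioned in this issue
	alpha              = 0.5  // hypothetical exponential-smoothing constant for this sketch
)

// smoothCompactedBytes exponentially smooths the per-interval measurement of
// bytes compacted out of L0; measurements are taken only when there is a
// backlog, so all compactions ought to be running.
func smoothCompactedBytes(prevSmoothed, measured float64) float64 {
	return alpha*measured + (1-alpha)*prevSmoothed
}

// tokensBefore models the abrupt pre-change behavior: C/2 tokens above the
// unhealthy threshold (a disjunction over sub-level count and file count),
// and unlimited tokens below it.
func tokensBefore(sublevels, l0Files int, c float64) float64 {
	if sublevels > sublevelThreshold || l0Files > fileCountThreshold {
		return c / 2
	}
	return math.Inf(1) // effectively unlimited
}

func main() {
	c := smoothCompactedBytes(100<<20, 80<<20) // e.g. 100 MiB smoothed, 80 MiB measured
	fmt.Printf("above threshold: %.0f tokens\n", tokensBefore(25, 600, c))
	fmt.Printf("below threshold: %v tokens\n", tokensBefore(10, 600, c))
}
```

Driving this function with a load that hovers near the threshold reproduces the on/off pattern described above: a few capped intervals followed by an unlimited one.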
With replication admission control we will make this worse by doing logical admission of all the waiting requests when we switch from above the threshold to below, causing another big fan-in burst (https://docs.google.com/document/d/1iCfSlTO0P6nvoGC6sLGB5YqMOcO047CMpREgG_NSLCw/edit#heading=h.sw7pci2vwkk3).
Instead we should switch to a piecewise-linear function for defining the tokens. Let us define a sub-level count threshold L and a file-count threshold F that we would like to be roughly stable at under overload. Say L=10 and F=500. These are half the current defaults of 20 and 1000 since (a) the current thresholds are higher than what we would like to sustain at, and (b) we will keep the current C/2 logic at 2L and 2F. Regardless, L and F are configurable.
Then we define score = max(sublevel-count/L, file-count/F). The compaction token function gives C tokens when score=1 and linearly decreases to C/2 tokens when the score is 2.
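As a concrete illustration, here is a minimal Go sketch of that piecewise-linear token function using L=10 and F=500. Treating score < 1 as unlimited tokens and score >= 2 as C/2 are assumptions, following the existing below-threshold behavior and the "C/2 logic at 2L and 2F" mentioned above.

```go
package main

import (
	"fmt"
	"math"
)

// compactionTokens sketches the proposed piecewise-linear function:
// score = max(sublevel-count/L, file-count/F); C tokens at score=1,
// linearly decreasing to C/2 at score=2. Below score=1 we assume the
// existing unlimited-token behavior, and at or above score=2 we assume
// the existing C/2 logic.
func compactionTokens(sublevels, l0Files int, c float64) float64 {
	const l, f = 10.0, 500.0 // proposed thresholds; half the current defaults
	score := math.Max(float64(sublevels)/l, float64(l0Files)/f)
	switch {
	case score < 1:
		return math.Inf(1) // unlimited (assumed, as below the current threshold)
	case score < 2:
		// Linear ramp from C at score=1 down to C/2 at score=2.
		return c * (1 - (score-1)/2)
	default:
		return c / 2 // assumed: keep the current C/2 logic at and beyond 2L/2F
	}
}

func main() {
	const c = 512 << 20 // e.g. 512 MiB of smoothed L0 compaction bandwidth per interval
	for _, sublevels := range []int{5, 10, 15, 20, 30} {
		fmt.Printf("sublevels=%2d tokens=%.0f\n", sublevels, compactionTokens(sublevels, 0, c))
	}
}
```

Compared to the current scheme, the ramp from C down to C/2 between score 1 and 2 removes the cliff between unlimited tokens and C/2 that causes the oscillation described earlier.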
Jira issue: CRDB-21299
Epic: CRDB-25469