admission: metrics and tests for evaluating improvements #85469
Comments
One idea for this (what I had done at a previous company) is to create a matrix of "workload x condition" and measure the performance (*) at each of the combinations. Workloads would be things like
The conditions would be things like
The "sweet spot" for this is probably 5 workloads and 5 conditions, so 25 tests total. Each test is run twice: once to determine "max throughput" and once to determine "P99 latency". The max throughput version simply runs the workload as fast as possible, and the outcome is a single "test run time" number. The "P99 latency" version runs at 50-75% of "today's best throughput number" and records the median P99 latency across the test run. In total this gives 50 performance outcome numbers, which can then be shown on a "matrix bingo board" with the two old numbers and the two new numbers (a rough sketch of one way to organize such a harness follows this comment).

For a given change, it is unlikely (especially once the low-hanging fixes are done) that all 50 metrics will be positive. Coloring the matrix with the following:

- "Either improved >25%" - Dark green

Doing this consistently for features allows a very easy way to visualize the impact of a change and what needs to be investigated further.

A big caveat with this testing is its stability run-over-run. If the same build is run twice in a row, then the comparison needs to be 100% white (no changes >5%, and ideally <2%), or this becomes a less useful test. Anything that can be done to maximize that is important. Additionally, the 50 tests should run in ~8 hours (definitely <24 hours) so that developers can iterate rapidly enough. To accomplish this requires a few things
This would be a great project to do in the stabilization period after 8/15
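As an illustration of the matrix idea in the comment above, here is a minimal sketch in Go. None of this is existing CockroachDB or roachtest code: the `Workload`, `Condition`, and `Result` types, the `Run`/`Setup` hooks, and the 60% rate fraction are assumptions for illustration, and only the dark-green ">25% improvement" bucket of the color scale appears in the comment; the remaining buckets are an assumed extension of the >5% / <2% noise thresholds mentioned there.

```go
package main

import (
	"fmt"
	"time"
)

// Workload is a named workload generator (e.g. point reads, bulk writes).
// Run executes the workload and reports total run time and the median P99
// latency observed; a rateFraction <= 0 means "run as fast as possible".
type Workload struct {
	Name string
	Run  func(rateFraction float64) (runTime, medianP99 time.Duration)
}

// Condition is a cluster/environment configuration the workload runs under
// (e.g. an overloaded node, a slow disk).
type Condition struct {
	Name  string
	Setup func()
}

// Result holds the two outcome numbers for one cell of the matrix.
type Result struct {
	MaxThroughputRunTime time.Duration // wall-clock time of the unthrottled run
	MedianP99            time.Duration // median P99 at a fraction of best known throughput
}

// colorFor maps the relative change between an old and a new outcome (lower
// is better for both) to a "bingo board" color.
func colorFor(oldVal, newVal float64) string {
	delta := (newVal - oldVal) / oldVal
	switch {
	case delta <= -0.25:
		return "dark green" // "Either improved >25%"
	case delta <= -0.05:
		return "green"
	case delta < 0.05:
		return "white" // within expected run-over-run noise
	case delta < 0.25:
		return "yellow"
	default:
		return "red"
	}
}

func main() {
	workloads := []Workload{ /* ~5 workloads */ }
	conditions := []Condition{ /* ~5 conditions */ }

	results := make(map[string]Result)
	for _, w := range workloads {
		for _, c := range conditions {
			c.Setup()
			// Run 1: as fast as possible; the outcome is total run time.
			runTime, _ := w.Run(0)
			// Run 2: at 50-75% of today's best throughput; the outcome is
			// the median P99 latency across the run.
			_, p99 := w.Run(0.6)
			results[w.Name+"/"+c.Name] = Result{
				MaxThroughputRunTime: runTime,
				MedianP99:            p99,
			}
		}
	}
	fmt.Printf("%d cells -> %d outcome numbers\n", len(results), 2*len(results))
}
```

The point of the two-number `Result` per cell is that a 5x5 matrix yields exactly the 50 outcomes described above, and `colorFor` applied against a baseline run produces the colored comparison board.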
I forgot about this issue but have since cargo-culted a bunch of text in #89208 instead. I'm also referencing that in internal milestones as such, and noted a few targeted reproductions we want to flesh out first. I've also alluded to library components we'd want under
I agree it's a good idea, but I'm a bit wary of the matrix-style approach to start with given the $$$ spend + support burden -- getting pinged by the TeamCity bot on NxM variants of the same regression, to be dealt with by a team of 2-3 people, feels like a lot. Let's try the targeted form first to see what we learn, and come back to this matrix thing after. I imagine it would also need some support from @cockroachdb/test-eng folks.
Many of the improvements to admission control in the past have fixed an obvious gap, e.g. lack of awareness of a bottleneck resource, incorrect accounting for how many resources are consumed, etc.
As we evolve admission control, some improvements carry the risk of causing regressions in peak throughput and in isolation (both throughput and latency). Our current approach of running custom experiments for each PR and posting results (graphs) with reproduction steps is not sufficient going forward.
We need a set of tests that set up particular cluster configurations and run one or more workloads. Most interesting behavior in admission control requires multiple workloads of differing importance (e.g. regular SQL reads/writes and bulk work of various kinds), or different tenants. With the introduction of elastic work, even a single workload can be affected in terms of throughput, since we may leave resources under-utilized to avoid risking higher latency for regular work. We also need summary metrics for the test run, so we don't have to examine graphs: e.g. throughput, throughput variance, latency percentiles, etc. for each workload.
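To make the last point concrete, here is a minimal sketch of the kind of per-workload summary such a test could emit at the end of a run instead of requiring someone to read graphs. This is not existing CockroachDB code; the `WorkloadSummary` type, the `Summarize` function, and all sample values are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// WorkloadSummary is the set of numbers emitted per workload per test run,
// so a regression can be spotted without inspecting time-series graphs.
type WorkloadSummary struct {
	MeanThroughput     float64       // ops/sec averaged over the run
	ThroughputVariance float64       // variance of per-interval ops/sec (high = unstable)
	P50, P95, P99      time.Duration // latency percentiles over all requests
}

// Summarize reduces per-interval throughput samples and individual request
// latencies to a WorkloadSummary. Inputs are assumed to be non-empty.
func Summarize(opsPerSec []float64, latencies []time.Duration) WorkloadSummary {
	var sum float64
	for _, v := range opsPerSec {
		sum += v
	}
	mean := sum / float64(len(opsPerSec))

	var variance float64
	for _, v := range opsPerSec {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(opsPerSec))

	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	pct := func(p float64) time.Duration {
		idx := int(math.Ceil(p*float64(len(sorted)))) - 1
		if idx < 0 {
			idx = 0
		}
		return sorted[idx]
	}

	return WorkloadSummary{
		MeanThroughput:     mean,
		ThroughputVariance: variance,
		P50:                pct(0.50),
		P95:                pct(0.95),
		P99:                pct(0.99),
	}
}

func main() {
	// Hypothetical samples from one regular-priority workload; the throughput
	// dip and the latency outlier are exactly what the summary should surface.
	s := Summarize(
		[]float64{980, 1010, 995, 450, 1005},
		[]time.Duration{
			2 * time.Millisecond, 3 * time.Millisecond, 40 * time.Millisecond,
			4 * time.Millisecond, 5 * time.Millisecond,
		},
	)
	fmt.Printf("%+v\n", s)
}
```

A summary like this would be computed per workload (regular, bulk, per-tenant) so that throughput and isolation regressions show up as a handful of comparable numbers per test.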
This is an umbrella issue that we can close after we've developed an initial set of tests that we think give reasonable coverage for the current admission control implementation.
@irfansharif @tbg
Jira issue: CRDB-18257