admission: metrics and tests for evaluating improvements #85469
Comments
One idea for this (what I had done at a previous company) is to create a matrix of "workload x condition" and measure the performance (*) at each of the combinations. Workloads would be things like
The conditions would be things like
The "sweet spot" for this is probably 5 workloads and 5 conditions, so 25 tests total. Each test is run twice: once to determine "max throughput" and once to determine "P99 latency". The max throughput version simply runs the workload as fast as possible, and the outcome is a single "test run time" number. The "P99 latency" version runs at 50-75% of "today's best throughput number" and records the median P99 latency across the test run. In total this gives 50 performance outcome numbers, which can then be shown on a "matrix bingo board" with the two old numbers and the two new numbers (a rough sketch of one way to organize such a harness follows this comment).

For a given change, it is unlikely (especially once the low-hanging fixes are done) that all 50 metrics will be positive. Coloring the matrix with the following:

- "Either improved >25%" - Dark green

Doing this consistently for features allows a very easy way to visualize the impact of a change and what needs to be investigated further.

A big caveat with this testing is its stability run-over-run. If the same build is run twice in a row, then the comparison needs to be 100% white (no changes >5%, and ideally <2%), or this becomes a less useful test. Anything that can be done to maximize that is important. Additionally, the 50 tests should run in ~8 hours (definitely <24 hours) so that developers can iterate rapidly enough. To accomplish this requires a few things
This would be a great project to do in the stabilization period after 8/15
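As an illustration of the matrix idea in the comment above, here is a minimal sketch in Go. None of this is existing CockroachDB or roachtest code: the `Workload`, `Condition`, and `Result` types, the `Run`/`Setup` hooks, and the 60% rate fraction are assumptions for illustration, and only the dark-green ">25% improvement" bucket of the color scale appears in the comment; the remaining buckets are an assumed extension of the >5% / <2% noise thresholds mentioned there.

```go
package main

import (
	"fmt"
	"time"
)

// Workload is a named workload generator (e.g. point reads, bulk writes).
// Run executes the workload and reports total run time and the median P99
// latency observed; a rateFraction <= 0 means "run as fast as possible".
type Workload struct {
	Name string
	Run  func(rateFraction float64) (runTime, medianP99 time.Duration)
}

// Condition is a cluster/environment configuration the workload runs under
// (e.g. an overloaded node, a slow disk).
type Condition struct {
	Name  string
	Setup func()
}

// Result holds the two outcome numbers for one cell of the matrix.
type Result struct {
	MaxThroughputRunTime time.Duration // wall-clock time of the unthrottled run
	MedianP99            time.Duration // median P99 at a fraction of best known throughput
}

// colorFor maps the relative change between an old and a new outcome (lower
// is better for both) to a "bingo board" color.
func colorFor(oldVal, newVal float64) string {
	delta := (newVal - oldVal) / oldVal
	switch {
	case delta <= -0.25:
		return "dark green" // "Either improved >25%"
	case delta <= -0.05:
		return "green"
	case delta < 0.05:
		return "white" // within expected run-over-run noise
	case delta < 0.25:
		return "yellow"
	default:
		return "red"
	}
}

func main() {
	workloads := []Workload{ /* ~5 workloads */ }
	conditions := []Condition{ /* ~5 conditions */ }

	results := make(map[string]Result)
	for _, w := range workloads {
		for _, c := range conditions {
			c.Setup()
			// Run 1: as fast as possible; the outcome is total run time.
			runTime, _ := w.Run(0)
			// Run 2: at 50-75% of today's best throughput; the outcome is
			// the median P99 latency across the run.
			_, p99 := w.Run(0.6)
			results[w.Name+"/"+c.Name] = Result{
				MaxThroughputRunTime: runTime,
				MedianP99:            p99,
			}
		}
	}
	fmt.Printf("%d cells -> %d outcome numbers\n", len(results), 2*len(results))
}
```

The point of the two-number `Result` per cell is that a 5x5 matrix yields exactly the 50 outcomes described above, and `colorFor` applied against a baseline run produces the colored comparison board.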
I forgot about this issue but have since cargo-culted a bunch of text in #89208 instead. I'm also referencing that in internal milestones as such, and noted a few targeted reproductions we want to flesh out first. I've also alluded to library components we'd want under
I agree it's a good idea, but I'm a bit wary of the matrix-style approach to start with given the $$$ spend + support burden -- getting pinged by the TeamCity bot on NxM variants of the same regression, to be dealt with by a team of 2-3 people, feels like a lot. Let's try the targeted form first to see what we learn, and come back to this matrix thing after. I imagine it would also need some support from @cockroachdb/test-eng folks.
Many of the improvements to admission control in the past have fixed an obvious gap, e.g. lack of awareness of a bottleneck resource, incorrect accounting for how many resources are consumed, etc.
As we evolve admission control, some improvements carry the risk of causing regressions in peak throughput and in isolation (both throughput and latency). Our current approach of running custom experiments for each PR and posting results (graphs) with reproduction steps is not sufficient going forward.
We need a set of tests that set up particular cluster configurations and run one or more workloads. Most interesting behavior in admission control requires multiple workloads of differing importance (e.g. regular SQL reads/writes and bulk work of various kinds), or different tenants. With the introduction of elastic work, even a single workload can be affected in terms of throughput, since we may leave resources under-utilized to avoid risking higher latency for regular work. We also need summary metrics for the test run, so we don't have to examine graphs: e.g. throughput, throughput variance, latency percentiles, etc. for each workload.
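To make the last point concrete, here is a minimal sketch of the kind of per-workload summary such a test could emit at the end of a run instead of requiring someone to read graphs. This is not existing CockroachDB code; the `WorkloadSummary` type, the `Summarize` function, and all sample values are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// WorkloadSummary is the set of numbers emitted per workload per test run,
// so a regression can be spotted without inspecting time-series graphs.
type WorkloadSummary struct {
	MeanThroughput     float64       // ops/sec averaged over the run
	ThroughputVariance float64       // variance of per-interval ops/sec (high = unstable)
	P50, P95, P99      time.Duration // latency percentiles over all requests
}

// Summarize reduces per-interval throughput samples and individual request
// latencies to a WorkloadSummary. Inputs are assumed to be non-empty.
func Summarize(opsPerSec []float64, latencies []time.Duration) WorkloadSummary {
	var sum float64
	for _, v := range opsPerSec {
		sum += v
	}
	mean := sum / float64(len(opsPerSec))

	var variance float64
	for _, v := range opsPerSec {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(opsPerSec))

	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	pct := func(p float64) time.Duration {
		idx := int(math.Ceil(p*float64(len(sorted)))) - 1
		if idx < 0 {
			idx = 0
		}
		return sorted[idx]
	}

	return WorkloadSummary{
		MeanThroughput:     mean,
		ThroughputVariance: variance,
		P50:                pct(0.50),
		P95:                pct(0.95),
		P99:                pct(0.99),
	}
}

func main() {
	// Hypothetical samples from one regular-priority workload; the throughput
	// dip and the latency outlier are exactly what the summary should surface.
	s := Summarize(
		[]float64{980, 1010, 995, 450, 1005},
		[]time.Duration{
			2 * time.Millisecond, 3 * time.Millisecond, 40 * time.Millisecond,
			4 * time.Millisecond, 5 * time.Millisecond,
		},
	)
	fmt.Printf("%+v\n", s)
}
```

A summary like this would be computed per workload (regular, bulk, per-tenant) so that throughput and isolation regressions show up as a handful of comparable numbers per test.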
This is an umbrella issue that we can close after we've developed an initial set of tests that we think give reasonable coverage for the current admission control implementation.
@irfansharif @tbg
Jira issue: CRDB-18257