
admission: add integration tests #89208

Open · 10 of 21 tasks
irfansharif opened this issue Oct 3, 2022 · 1 comment
Assignees: aadityasondhi
Labels: A-admission-control, C-enhancement, T-admission-control

Comments


irfansharif commented Oct 3, 2022

This is a tracking issue for roachtests we want to introduce to validate existing/new AC machinery (subsuming #85469 which I forgot existed). They'll typically codify manual experiments that have been useful in developing said machinery. These roachtests should demonstrate performance isolation (throughput, latency) in the face of:

Next:

Later:

For some of these, we'll want variants that hit CPU and IO saturation separately. We would also like a multi-workload test with varying priorities, or with work originating from different tenants (e.g. NormalPri reads/writes and BulkNormalPri work from another tenant). We also want library functions in roachtests to enable better experimentation/testing: #89978.
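To make "performance isolation" concrete, here is a minimal sketch (plain Go, outside any roachtest framework; the helper name and the 1.2x tolerance are hypothetical) of the pass/fail criterion such tests could eventually codify: foreground p99 latency under background load should stay within some factor of the baseline p99.

```go
package main

import (
	"fmt"
	"time"
)

// isolated reports whether foreground p99 latency under background load
// stayed within an allowed degradation factor of the baseline p99. Both the
// helper and the 1.2x threshold are hypothetical; actual roachtests would
// derive these numbers from workload histograms.
func isolated(baselineP99, underLoadP99 time.Duration, maxFactor float64) bool {
	return float64(underLoadP99) <= float64(baselineP99)*maxFactor
}

func main() {
	baseline := 40 * time.Millisecond  // e.g. TPC-C p99 with no background work
	underLoad := 46 * time.Millisecond // e.g. TPC-C p99 during a backup/backfill
	fmt.Println("isolation held:", isolated(baseline, underLoad, 1.2))
}
```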

Jira issue: CRDB-20126


blathers-crl bot commented Oct 3, 2022

Hi @irfansharif, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

irfansharif added the C-enhancement label Oct 3, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 6, 2022
Informs cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU
machines running 1000-warehouse TPC-C with an aggressive (every 20m)
full backup schedule. We've observed latency spikes during backups
because of their CPU-heavy nature -- they can elevate CPU scheduling
latencies, which in turn translates to an increase in foreground latency.
In cockroachdb#86638 we introduced admission control mechanisms to dynamically pace
such work while maintaining acceptable CPU scheduling latencies (sub
millisecond p99s). This roachtest exercises that machinery. In future
commits we'll add libraries to the roachtest package to automatically
spit out the degree to which {CPU-scheduler,foreground} latencies are
protected.

Release note: None
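For context, the "aggressive (every 20m) full backup schedule" can be expressed with CockroachDB's scheduled backups; the sketch below shows the kind of statement involved (the destination URI, connection string, and cron spec are assumptions, not copied from the test).

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver; the roachtest uses its own helpers
)

func main() {
	// Placeholder connection string; the roachtest connects to its own cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Full backups every 20 minutes; destination and cron spec are assumptions.
	const stmt = `CREATE SCHEDULE FOR BACKUP INTO 'gs://<bucket>/backups?AUTH=implicit'
RECURRING '*/20 * * * *' FULL BACKUP ALWAYS
WITH SCHEDULE OPTIONS first_run = 'now'`
	if _, err := db.Exec(stmt); err != nil {
		log.Fatal(err)
	}
	log.Println("backup schedule created")
}
```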
craig bot pushed a commit that referenced this issue Oct 6, 2022
87763: changefeedccl: mark kv senderrors retryable r=samiskin a=samiskin

Resolves #87300

Changefeeds can encounter senderrors during a normal upgrade procedure and therefore should retry. This was handled in the kvfeed; however, that is apparently not high-level enough, as a send error was still observed to cause a permanent failure.

This PR moves the senderror checking to the top-level IsRetryable check so it is handled regardless of its source.

Release justification: low risk important bug fix

Release note (bug fix): Changefeeds will now never permanently error on a "failed to send RPC" error.
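A minimal sketch of the pattern described (a single top-level retryability check catching send errors regardless of where they originate); the `sendError` type and function names below are stand-ins, not the actual changefeedccl/kvclient identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// sendError stands in for the KV-layer "failed to send RPC" error; the real
// type lives in the KV client, this one exists only for illustration.
type sendError struct{ cause error }

func (e *sendError) Error() string { return "failed to send RPC: " + e.cause.Error() }

// isRetryableChangefeedError is a stand-in for a top-level IsRetryable check:
// rather than special-casing send errors in the kvfeed alone, every error
// bubbling up through the changefeed is classified in one place.
func isRetryableChangefeedError(err error) bool {
	var se *sendError
	if errors.As(err, &se) {
		return true
	}
	// ... other retryable classifications would live here.
	return false
}

func main() {
	err := fmt.Errorf("flushing: %w", &sendError{cause: errors.New("node draining")})
	fmt.Println("retryable:", isRetryableChangefeedError(err)) // true
}
```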

89445: opt: assert that inverted scans have inverted constraints r=mgartner a=mgartner

This commit adds an assertion to ensure that inverted index scans have inverted constraints. If they do not, there is likely a bug that can cause incorrect query results (e.g., #88047). This assertion is made in release builds, not just test builds, because it is cheap to perform.

Fixes #89440

Release note: None
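As a rough illustration of such an always-on (release-build) assertion, here is a sketch with toy types standing in for the optimizer's scan expression; this is not the actual opt package API.

```go
package main

import (
	"errors"
	"fmt"
)

// scanExpr is a toy stand-in for the optimizer's scan expression.
type scanExpr struct {
	usesInvertedIndex  bool
	invertedConstraint []string // nil means no inverted constraint was built
}

// validateScan mirrors the idea of a cheap assertion that runs in release
// builds too: an inverted index scan without an inverted constraint likely
// indicates a planning bug that could produce incorrect results.
func validateScan(s *scanExpr) error {
	if s.usesInvertedIndex && len(s.invertedConstraint) == 0 {
		return errors.New("internal error: inverted scan without inverted constraint")
	}
	return nil
}

func main() {
	fmt.Println(validateScan(&scanExpr{usesInvertedIndex: true})) // assertion fires
}
```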

89482: roachtests: introduce admission-control/elastic-backup r=irfansharif a=irfansharif

Informs #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C with an aggressive (every 20m) full backup schedule. We've observed latency spikes during backups because of their CPU-heavy nature -- they can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency. In #86638 we introduced admission control mechanisms to dynamically pace such work while maintaining acceptable CPU scheduling latencies (sub-millisecond p99s). This roachtest exercises that machinery. In future commits we'll add libraries to the roachtest package to automatically spit out the degree to which {CPU-scheduler,foreground} latencies are protected.

Release note: None

Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 17, 2022
Part of cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU
machines running 1000-warehouse TPC-C, and kicks off a few changefeed
backfills concurrently. We've observed latency spikes during backfills
because of their CPU/scan-heavy nature -- they can elevate CPU scheduling
latencies, which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup
that roachtests do to the logger in scope.

Release note: None
craig bot pushed a commit that referenced this issue Oct 17, 2022
88974: sql: add support for `DELETE FROM ... USING` r=faizaanmadhani a=faizaanmadhani

See commit messages for details.

Resolves: #40963

89459: metrics: expose pebble flush utilization r=jbowens a=coolcom200

Create a new `GaugeFloat64` metric for pebble’s flush utilization. This
metric is not cumulative; rather, it is computed over an interval. The
interval is determined by the `interval` parameter of the
`Node.startComputePeriodicMetrics` method.

In order to compute the metric over an interval the previous value of
the metric must be stored. As a result, a map is constructed that takes
a pointer to a store and maps it to a pointer to storage metrics:
`make(map[*kvserver.Store]*storage.Metrics)`. This map is passed to
`node.computeMetricsPeriodically` which gets the store to calculate its
metrics and then updates the previous metrics in the map.

Refactor `store.go`'s metric calculation by separating
`ComputeMetrics(ctx context.Context, tick int) error` into two methods:

* `ComputeMetrics(ctx context.Context) error`
* `ComputeMetricsPeriodically(ctx context.Context, prevMetrics
  *storage.Metrics, tick int) (m storage.Metrics, err error)`

Both methods call `computeMetrics`, which contains the code common to the
two. Previously, retrieving instantaneous metrics meant passing a tick
value such as `-1` or `0` to `ComputeMetrics(ctx context.Context, tick int)`;
now it can be done with a call to `ComputeMetrics(ctx context.Context)`.

The `store.ComputeMetricsPeriodically` method will also return the
latest storage metrics. These metrics are used to update the mapping
between stores and metrics used for computing the metric delta over an
interval.

Release Note: None

Resolves part of #85755
Depends on #88972, cockroachdb/pebble#2001
Epic: CRDB-17515
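A minimal sketch of the delta-over-interval pattern described above, with toy `Store`/`Metrics` types standing in for `kvserver.Store` and `storage.Metrics`; the field names and the utilization formula are illustrative assumptions, not the actual storage metrics.

```go
package main

import (
	"fmt"
	"time"
)

// Store and Metrics are toy stand-ins for kvserver.Store and storage.Metrics.
type Store struct{ id int }
type Metrics struct{ FlushWriteDuration time.Duration } // cumulative time spent flushing

// prevMetrics mirrors the map[*kvserver.Store]*storage.Metrics described in
// the PR: it remembers the last cumulative reading per store so that a
// non-cumulative, per-interval gauge can be derived on the next tick.
var prevMetrics = map[*Store]*Metrics{}

// computeMetricsPeriodically returns the flush utilization over the interval
// since the previous tick and updates the stored previous reading.
func computeMetricsPeriodically(s *Store, latest Metrics, interval time.Duration) float64 {
	var util float64
	if prev, ok := prevMetrics[s]; ok {
		util = float64(latest.FlushWriteDuration-prev.FlushWriteDuration) / float64(interval)
	}
	prevMetrics[s] = &latest
	return util
}

func main() {
	s := &Store{id: 1}
	computeMetricsPeriodically(s, Metrics{FlushWriteDuration: 2 * time.Second}, 10*time.Second)
	util := computeMetricsPeriodically(s, Metrics{FlushWriteDuration: 5 * time.Second}, 10*time.Second)
	fmt.Printf("flush utilization this interval: %.2f\n", util) // 0.30
}
```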


89656: roachtest: introduce admission-control/elastic-cdc r=irfansharif a=irfansharif

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of their CPU/scan-heavy nature -- they can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup that roachtests do to the logger in scope.

Release note: None

Co-authored-by: Faizaan Madhani <[email protected]>
Co-authored-by: Leon Fattakhov <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 24, 2022
These tests will only serve as coarse-grained benchmarks for things AC
cares about -- we don't need to run them nightly. They've spent ~2 weeks
through our nightly CI suite without flaking, so reduce the frequency to a
weekly cadence. We'll do the same for most tests added as part of
cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Nov 4, 2022
These tests will only serve as coarse-grained benchmarks for things AC
cares about -- we don't need to run them nightly. They've spent ~2 weeks
through our nightly CI suite without flaking, so reduce the frequency to a
weekly cadence. We'll do the same for most tests added as part of
cockroachdb#89208.

Release note: None
craig bot pushed a commit that referenced this issue Nov 4, 2022
90579: roachtest: reduce frequency of benchmark-only AC tests r=irfansharif a=irfansharif

First four commits are from #89709 and should be ignored here. These tests will only serve as coarse-grained benchmarks for things AC cares about -- we don't need to run them nightly. They've spent ~2 weeks through our nightly CI suite without flaking, so reduce the frequency to a weekly cadence. We'll do the same for most tests added as part of #89208.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 29, 2023
Part of cockroachdb#89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drops all share the same underlying code: we make
metadata-only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

Release note: None
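To make the mechanism concrete, here is a sketch of the sort of SQL involved; the database name, connection string, and shortened GC TTL are illustrative assumptions about how a test might avoid waiting out the default TTL, not necessarily what this roachtest does.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver; the roachtest uses its own helpers
)

func main() {
	// Placeholder connection string; the roachtest connects to its own cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Shorten the GC TTL so the post-drop range deletions (and the
		// compactions that reclaim space) kick in quickly. Hypothetical value.
		`ALTER DATABASE tpce CONFIGURE ZONE USING gc.ttlseconds = 600`,
		// The drop itself is a metadata-only change on the descriptors.
		`DROP DATABASE tpce CASCADE`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```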
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 31, 2023
Part of cockroachdb#89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drops all share the same underlying code: we make
metadata-only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

Release note: None
craig bot pushed a commit that referenced this issue May 31, 2023
104051: roachtest: add admission-control/database-drop r=irfansharif a=irfansharif

**roachtest: add admission-control/database-drop**

Part of #89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drops all share the same underlying code: we make
metadata-only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

**roachprod/gce: use prefix search in snapshot listing API**

Like the API intended. Only noticed in the subsequent commit when
introducing a second use of disk snapshots that happened to use
"tpcc-100k" in its name, somewhere.

**roachtest: update index-backfill roachtest**

Adding a bit more commentary, throwing in another concurrent index
backfill for free, and fighting some open CRDB bugs unrelated to the
test.

**roachtest: fix race with concurrent map writes**

This author forgot that struct copies weren't deep enough.

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 15, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 21, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 22, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 22, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 21, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
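One plausible way to implement such a 20MiB/s disk-bandwidth nemesis is a cgroup v2 `io.max` limit on the CockroachDB process; the sketch below is an assumption about the mechanism, not the roachtest's actual implementation, and the cgroup path and device major:minor numbers are placeholders.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// Cap write bandwidth on the store's block device to 20 MiB/s using the
	// cgroup v2 io.max controller. "8:16" (the device's major:minor) and the
	// cgroup path are placeholders for whatever the nemesis would target.
	const (
		cgroupIOMax = "/sys/fs/cgroup/system.slice/cockroach.service/io.max"
		limit       = 20 << 20 // 20 MiB/s in bytes
	)
	rule := fmt.Sprintf("8:16 wbps=%d", limit)
	if err := os.WriteFile(cgroupIOMax, []byte(rule), 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Println("applied io.max rule:", rule)
}
```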
aadityasondhi self-assigned this Oct 2, 2023
aadityasondhi added the T-admission-control label Oct 3, 2023
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue Oct 25, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue Oct 25, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue Oct 25, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
craig bot pushed a commit that referenced this issue Oct 25, 2023
111070: roachtest: add admission-overload/follower-overload r=sumeerbhola a=irfansharif

This is just resuscitating #81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of #89208.

Release note: None

112660: kvflowcontroller: fix logging of blocked streams r=pavelkalinnikov,aadityasondhi a=sumeerbhola

There were two bugs that are fixed:
- The blocked_stream_count metric was incorrectly capped at 100.
- Stats were being logged for streams that were never blocked.

Some additional improvements/fixes:
- Controller.mu was being unnecessarily acquired for read paths that don't care about concurrent additions to the map.
- WorkClass.SafeFormat was calling redact.SafePrinter.Print in some cases so "elastic" was not being treated as unsafe.

There is a unit test to test the overflow logic of the logs, and to verify that the metric is correct even when the logs overflow.

Epic: none

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
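A small sketch of the intent behind the blocked-streams fix: the metric counts every blocked stream, while log output is bounded, and never-blocked streams aren't logged at all. The types below are illustrative stand-ins; only the 100-entry bound comes from the description above.

```go
package main

import "fmt"

// stream is a toy stand-in for a replication flow-control stream.
type stream struct {
	name    string
	blocked bool
}

const maxLoggedStreams = 100 // bound on log verbosity, per the bug description

// blockedStreams returns the true count of blocked streams (used for the
// blocked_stream_count metric) alongside a bounded sample for logging. The
// point of the fix: the metric reflects every blocked stream even when the
// log output overflows, and never-blocked streams are not logged at all.
func blockedStreams(streams []stream) (count int, sample []string) {
	for _, s := range streams {
		if !s.blocked {
			continue
		}
		count++
		if len(sample) < maxLoggedStreams {
			sample = append(sample, s.name)
		}
	}
	return count, sample
}

func main() {
	streams := make([]stream, 250)
	for i := range streams {
		streams[i] = stream{name: fmt.Sprintf("t1/s%d", i+1), blocked: i%2 == 0}
	}
	count, sample := blockedStreams(streams)
	fmt.Println("blocked:", count, "logged:", len(sample)) // blocked: 125 logged: 100
}
```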