admission: add integration tests #89208
Labels
A-admission-control
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-admission-control: Admission Control
Comments
Hi @irfansharif, please add a C-ategory label to your issue. Check out the label system docs. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
irfansharif added the C-enhancement label on Oct 3, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Oct 6, 2022:
Informs cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C with an aggressive (every 20m) full backup schedule. We've observed latency spikes during backups because of their CPU-heavy nature -- backups can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency. In cockroachdb#86638 we introduced admission control mechanisms to dynamically pace such work while maintaining acceptable CPU scheduling latencies (sub-millisecond p99s). This roachtest exercises that machinery. In future commits we'll add libraries to the roachtest package to automatically spit out the degree to which {CPU-scheduler,foreground} latencies are protected. Release note: None
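Not from the issue thread itself: below is a minimal Go sketch (using database/sql with the lib/pq driver) of how an aggressive every-20-minutes full backup schedule like the one described above could be configured against a cluster. The connection string, schedule label, backup destination, and schedule options are placeholders/assumptions, not the roachtest's actual setup code.

```go
// Sketch only: create a recurring full backup schedule (every 20 minutes) so
// the CPU-heavy backup work overlaps with foreground TPC-C load. Connection
// string and backup destination are placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CRDB speaks the PostgreSQL wire protocol.
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const stmt = `
		CREATE SCHEDULE elastic_backup
		FOR BACKUP INTO 'gs://<bucket>/elastic-backup?AUTH=implicit'
		RECURRING '*/20 * * * *'
		FULL BACKUP ALWAYS
		WITH SCHEDULE OPTIONS first_run = 'now'`
	if _, err := db.Exec(stmt); err != nil {
		log.Fatal(err)
	}
	log.Println("created recurring full backup schedule (every 20m)")
}
```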
craig bot pushed a commit that referenced this issue on Oct 6, 2022:
87763: changefeedccl: mark kv senderrors retryable r=samiskin a=samiskin

Resolves #87300. Changefeeds can encounter senderrors during a normal upgrade procedure and therefore should retry. This was done in the kvfeed, but that is apparently not high-level enough, as a send error was still observed to cause a permanent failure. This PR moves the senderror checking to the top-level IsRetryable check to handle it regardless of its source. Release justification: low risk important bug fix. Release note (bug fix): Changefeeds will now never permanently error on a "failed to send RPC" error.

89445: opt: assert that inverted scans have inverted constraints r=mgartner a=mgartner

This commit adds an assertion to ensure that inverted index scans have inverted constraints. If they do not, there is likely a bug that can cause incorrect query results (e.g., #88047). This assertion is made in release builds, not just test builds, because it is cheap to perform. Fixes #89440. Release note: None

89482: roachtests: introduce admission-control/elastic-backup r=irfansharif a=irfansharif

Informs #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C with an aggressive (every 20m) full backup schedule. We've observed latency spikes during backups because of their CPU-heavy nature -- backups can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency. In #86638 we introduced admission control mechanisms to dynamically pace such work while maintaining acceptable CPU scheduling latencies (sub-millisecond p99s). This roachtest exercises that machinery. In future commits we'll add libraries to the roachtest package to automatically spit out the degree to which {CPU-scheduler,foreground} latencies are protected. Release note: None

Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Oct 17, 2022:
Part of cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of their CPU/scan-heavy nature -- they can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency. Also in this commit: routing std{err,out} from the prometheus/grafana setup that roachtests do to the in-scope logger. Release note: None
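As an illustration only, here is a hedged Go sketch of kicking off a few changefeed backfills concurrently (initial scans only), approximating the scan-heavy load the elastic-cdc test description mentions. The table names, sink URI, and connection string are assumptions, not the roachtest's actual code.

```go
// Sketch: kick off a few concurrent changefeed backfills (initial scans only)
// against large TPC-C tables. Table names and the sink URI are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"sync"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tables := []string{"stock", "order_line", "history"} // assumed large tables
	var wg sync.WaitGroup
	for _, tbl := range tables {
		tbl := tbl
		wg.Add(1)
		go func() {
			defer wg.Done()
			// initial_scan = 'only' runs just the backfill (the scan-heavy
			// phase) and then stops; that's the part that pressures CPU.
			stmt := fmt.Sprintf(
				`CREATE CHANGEFEED FOR TABLE %s INTO 'null://' WITH initial_scan = 'only'`, tbl)
			if _, err := db.Exec(stmt); err != nil {
				log.Printf("changefeed on %s failed: %v", tbl, err)
			}
		}()
	}
	wg.Wait()
}
```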
craig bot pushed a commit that referenced this issue on Oct 17, 2022:
88974: sql: add support for `DELETE FROM ... USING` r=faizaanmadhani a=faizaanmadhani

See commit messages for details. Resolves: #40963

89459: metrics: expose pebble flush utilization r=jbowens a=coolcom200

Create a new `GaugeFloat64` metric for pebble's flush utilization. This metric is not cumulative; rather, it is the metric over an interval. This interval is determined by the `interval` parameter of the `Node.startComputePeriodicMetrics` method. In order to compute the metric over an interval, the previous value of the metric must be stored. As a result, a map is constructed that takes a pointer to a store and maps it to a pointer to storage metrics: `make(map[*kvserver.Store]*storage.Metrics)`. This map is passed to `node.computeMetricsPeriodically`, which gets the store to calculate its metrics and then updates the previous metrics in the map.

Refactor `store.go`'s metric calculation by separating `ComputeMetrics(ctx context.Context, tick int) error` into two methods:

* `ComputeMetrics(ctx context.Context) error`
* `ComputeMetricsPeriodically(ctx context.Context, prevMetrics *storage.Metrics, tick int) (m storage.Metrics, err error)`

Both methods call `computeMetrics`, which contains the code common to the two calls. Previously, retrieving instantaneous metrics meant passing a tick value such as `-1` or `0` to `ComputeMetrics(ctx context.Context, tick int)`; now it can be done with a call to `ComputeMetrics(ctx context.Context)`. The `store.ComputeMetricsPeriodically` method will also return the latest storage metrics. These metrics are used to update the mapping between stores and metrics used for computing the metric delta over an interval.

Release Note: None. Resolves part of #85755. Depends on #88972, cockroachdb/pebble#2001. Epic: CRDB-17515

89656: roachtest: introduce admission-control/elastic-cdc r=irfansharif a=irfansharif

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of their CPU/scan-heavy nature -- they can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency. Also in this commit: routing std{err,out} from the prometheus/grafana setup that roachtests do to the in-scope logger. Release note: None

Co-authored-by: Faizaan Madhani <[email protected]>
Co-authored-by: Leon Fattakhov <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
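Not part of the PR text above: a simplified, self-contained Go sketch of the delta-over-an-interval pattern #89459 describes -- remembering each store's previous sample in a map and deriving a non-cumulative utilization gauge from the difference between cumulative counters. The `Store` and `Metrics` types and field names below are stand-ins, not the actual kvserver/storage types.

```go
// Simplified sketch (not the CockroachDB code): compute a "utilization over
// the last interval" gauge from cumulative counters by keeping the previous
// sample per store in a map.
package main

import "fmt"

type Store struct{ id int }

type Metrics struct {
	FlushWorkSeconds float64 // cumulative seconds spent flushing (stand-in field)
	FlushIdleSeconds float64 // cumulative seconds idle (stand-in field)
}

// flushUtilization returns the fraction of the last interval spent flushing,
// derived from the difference between current and previous cumulative values.
func flushUtilization(cur, prev Metrics) float64 {
	work := cur.FlushWorkSeconds - prev.FlushWorkSeconds
	idle := cur.FlushIdleSeconds - prev.FlushIdleSeconds
	if work+idle == 0 {
		return 0
	}
	return work / (work + idle)
}

func main() {
	prevByStore := map[*Store]*Metrics{} // previous sample per store

	s := &Store{id: 1}
	samples := []Metrics{
		{FlushWorkSeconds: 1.0, FlushIdleSeconds: 9.0},
		{FlushWorkSeconds: 4.0, FlushIdleSeconds: 16.0},
	}
	for _, cur := range samples {
		prev := prevByStore[s]
		if prev == nil {
			prev = &Metrics{}
		}
		// Prints 0.10 for the first tick, then 0.30 for the second interval.
		fmt.Printf("flush utilization over interval: %.2f\n", flushUtilization(cur, *prev))
		c := cur
		prevByStore[s] = &c // remember this sample for the next tick
	}
}
```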
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Oct 24, 2022:
These tests will only serve as coarse-grained benchmarks for things AC cares about -- we don't need to run them nightly. They've spent ~2 weeks in our nightly CI suite without flaking, so reduce the frequency to a weekly cadence. We'll do the same thing for most tests added as part of cockroachdb#89208. Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Nov 4, 2022:
These tests will only serve as coarse-grained benchmarks for things AC cares about -- we don't need to run them nightly. They've spent ~2 weeks in our nightly CI suite without flaking, so reduce the frequency to a weekly cadence. We'll do the same thing for most tests added as part of cockroachdb#89208. Release note: None
craig bot pushed a commit that referenced this issue on Nov 4, 2022:
90579: roachtest: reduce frequency of benchmark-only AC tests r=irfansharif a=irfansharif

First four commits are from #89709 and should be ignored here. These tests will only serve as coarse-grained benchmarks for things AC cares about -- we don't need to run them nightly. They've spent ~2 weeks in our nightly CI suite without flaking, so reduce the frequency to a weekly cadence. We'll do the same thing for most tests added as part of #89208. Release note: None

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue on May 29, 2023:
Part of cockroachdb#89208. We've seen incidents where a large index being dropped caused a moderate latency/throughput impact on foreground load. Index/table/database drops all share the same underlying code: we make metadata-only changes on the descriptor, wait out the GC TTL, and issue range deletion tombstones over the now-deleted keyspace. On subsequent pebble compactions, storage space is reclaimed. See pkg/sql/schema_changer.go for where all this work originates. This particular test makes two copies of the 4TiB replicated TPC-E 100k dataset, runs foreground workload against one, and drops the other. It makes use of disk snapshots, of course, because we're impatient. Release note: None
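For illustration, a minimal Go sketch of the drop itself under a shortened GC TTL, approximating the delete-heavy phase the admission-control/database-drop test exercises. The database name, TTL value, and connection string are placeholders; the real test drives this through the roachtest framework.

```go
// Sketch: drop a copy of a large database with a shortened GC TTL so range
// deletion tombstones (and the compactions that follow) kick in soon after
// the drop. Database name and TTL are placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Shorten the GC TTL so the post-drop cleanup starts quickly rather
		// than only after the default GC window.
		`ALTER DATABASE tpce_copy CONFIGURE ZONE USING gc.ttlseconds = 600`,
		// The drop itself is a metadata-only change on the descriptors; the
		// bulk of the work happens asynchronously in the schema changer/GC job.
		`DROP DATABASE tpce_copy CASCADE`,
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("%s: %v", stmt, err)
		}
	}
}
```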
irfansharif added a commit to irfansharif/cockroach that referenced this issue on May 31, 2023:
Part of cockroachdb#89208. We've seen incidents where a large index being dropped caused a moderate latency/throughput impact on foreground load. Index/table/database drops all share the same underlying code: we make metadata-only changes on the descriptor, wait out the GC TTL, and issue range deletion tombstones over the now-deleted keyspace. On subsequent pebble compactions, storage space is reclaimed. See pkg/sql/schema_changer.go for where all this work originates. This particular test makes two copies of the 4TiB replicated TPC-E 100k dataset, runs foreground workload against one, and drops the other. It makes use of disk snapshots, of course, because we're impatient. Release note: None
craig bot pushed a commit that referenced this issue on May 31, 2023:
104051: roachtest: add admission-control/database-drop r=irfansharif a=irfansharif

**roachtest: add admission-control/database-drop**

Part of #89208. We've seen incidents where a large index being dropped caused a moderate latency/throughput impact on foreground load. Index/table/database drops all share the same underlying code: we make metadata-only changes on the descriptor, wait out the GC TTL, and issue range deletion tombstones over the now-deleted keyspace. On subsequent pebble compactions, storage space is reclaimed. See pkg/sql/schema_changer.go for where all this work originates. This particular test makes two copies of the 4TiB replicated TPC-E 100k dataset, runs foreground workload against one, and drops the other. It makes use of disk snapshots, of course, because we're impatient.

**roachprod/gce: use prefix search in snapshot listing API**

Like the API intended. Only noticed in the subsequent commit when introducing a second use of disk snapshots that happened to use "tpcc-100k" in its name, somewhere.

**roachtest: update index-backfill roachtest**

Adding a bit more commentary, throwing in another concurrent index backfill for free, and fighting some open CRDB bugs unrelated to the test.

**roachtest: fix race with concurrent map writes**

This author forgot that struct copies weren't deep enough.

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Aug 15, 2023:
Rename + own as an AC integration test, similar to the ones we have for backups/changefeeds/etc. We'll integrate row-level TTL reads in the subsequent commit. Part of cockroachdb#89208. Release note: None
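As a hedged aside (not from the commit), a small Go sketch of creating a table with row-level TTL, the feature whose background reads/deletes this renamed AC integration test is meant to cover. The table schema, TTL interval, and cron spec below are assumptions for illustration.

```go
// Sketch: a table with row-level TTL. The periodic TTL job scans for and
// deletes expired rows, generating background work that competes with
// foreground traffic. Schema and intervals are placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Rows expire 10 minutes after insertion; the TTL job runs every minute.
	const stmt = `
		CREATE TABLE IF NOT EXISTS events (
			id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
			payload STRING,
			inserted_at TIMESTAMPTZ DEFAULT now()
		) WITH (ttl_expire_after = '10 minutes', ttl_job_cron = '* * * * *')`
	if _, err := db.Exec(stmt); err != nil {
		log.Fatal(err)
	}
	log.Println("created table with row-level TTL")
}
```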
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Aug 21, 2023:
Rename + own as an AC integration test, similar to the ones we have for backups/changefeeds/etc. We'll integrate row-level TTL reads in the subsequent commit. Part of cockroachdb#89208. Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Aug 22, 2023:
Rename + own as an AC integration test, similar to the ones we have for backups/changefeeds/etc. We'll integrate row-level TTL reads in the subsequent commit. Part of cockroachdb#89208. Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Aug 22, 2023:
Rename + own as an AC integration test, similar to the ones we have for backups/changefeeds/etc. We'll integrate row-level TTL reads in the subsequent commit. Part of cockroachdb#89208. Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Sep 21, 2023:
This is just resuscitating cockroachdb#81516, tests we added when developing follower pausing but never checked in. Some of this work is applicable in the context of replication admission control, when we apply flow control for regular writes. The tests add an IO nemesis on n3/s3, restricting disk bandwidth to 20MiB/s. They then observe what happens when remote nodes n1 and n2 issue follower write traffic to it, either across many ranges or just 1 (potentially triggering the per-replica proposal quota pool). We can observe LSM state on n3/s3 under such conditions, and also what would happen if the concentrated follower node also serves some foreground load. Part of cockroachdb#89208. Release note: None
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue on Oct 25, 2023:
This is just resuscitating cockroachdb#81516, tests we added when developing follower pausing but never checked in. Some of this work is applicable in the context of replication admission control, when we apply flow control for regular writes. The tests add an IO nemesis on n3/s3, restricting disk bandwidth to 20MiB/s. They then observe what happens when remote nodes n1 and n2 issue follower write traffic to it, either across many ranges or just 1 (potentially triggering the per-replica proposal quota pool). We can observe LSM state on n3/s3 under such conditions, and also what would happen if the concentrated follower node also serves some foreground load. Part of cockroachdb#89208. Release note: None
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue on Oct 25, 2023:
This is just resuscitating cockroachdb#81516, tests we added when developing follower pausing but never checked in. Some of this work is applicable in the context of replication admission control, when we apply flow control for regular writes. The tests add an IO nemesis on n3/s3, restricting disk bandwidth to 20MiB/s. They then observe what happens when remote nodes n1 and n2 issue follower write traffic to it, either across many ranges or just 1 (potentially triggering the per-replica proposal quota pool). We can observe LSM state on n3/s3 under such conditions, and also what would happen if the concentrated follower node also serves some foreground load. Part of cockroachdb#89208. Release note: None
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue on Oct 25, 2023:
This is just resuscitating cockroachdb#81516, tests we added when developing follower pausing but never checked in. Some of this work is applicable in the context of replication admission control, when we apply flow control for regular writes. The tests add an IO nemesis on n3/s3, restricting disk bandwidth to 20MiB/s. They then observe what happens when remote nodes n1 and n2 issue follower write traffic to it, either across many ranges or just 1 (potentially triggering the per-replica proposal quota pool). We can observe LSM state on n3/s3 under such conditions, and also what would happen if the concentrated follower node also serves some foreground load. Part of cockroachdb#89208. Release note: None
craig bot pushed a commit that referenced this issue on Oct 25, 2023:
111070: roachtest: add admission-overload/follower-overload r=sumeerbhola a=irfansharif

This is just resuscitating #81516, tests we added when developing follower pausing but never checked in. Some of this work is applicable in the context of replication admission control, when we apply flow control for regular writes. The tests add an IO nemesis on n3/s3, restricting disk bandwidth to 20MiB/s. They then observe what happens when remote nodes n1 and n2 issue follower write traffic to it, either across many ranges or just 1 (potentially triggering the per-replica proposal quota pool). We can observe LSM state on n3/s3 under such conditions, and also what would happen if the concentrated follower node also serves some foreground load. Part of #89208. Release note: None

112660: kvflowcontroller: fix logging of blocked streams r=pavelkalinnikov,aadityasondhi a=sumeerbhola

There were two bugs that are fixed:
- The blocked_stream_count metric was incorrectly capped to 100.
- Streams were being logged with stats that were never blocked.

Some additional improvements/fixes:
- Controller.mu was being unnecessarily acquired for read paths that don't care about concurrent additions to the map.
- WorkClass.SafeFormat was calling redact.SafePrinter.Print in some cases, so "elastic" was not being treated as unsafe.

There is a unit test to test the overflow logic of the logs, and to verify that the metric is correct even when the logs overflow. Epic: none. Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
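The PR text doesn't say how the IO nemesis is implemented. As one possible sketch, assuming cgroup v2, a pre-created cgroup, and root privileges, the write bandwidth of the disk backing a store could be capped to 20 MiB/s by writing to that cgroup's io.max file. The cgroup path and device major:minor below are placeholders, and the roachtest may use an entirely different mechanism.

```go
// Sketch (assumes cgroup v2 and root): cap write bandwidth of the disk
// backing a store to 20 MiB/s, similar in spirit to the IO nemesis applied
// to n3/s3. Paths and device numbers are placeholders.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	const (
		cgroupDir = "/sys/fs/cgroup/crdb-nemesis" // assumed pre-created cgroup
		device    = "259:0"                       // MAJ:MIN of the store's disk (placeholder)
		wbps      = 20 << 20                      // 20 MiB/s write-bandwidth cap
	)

	// cgroup v2 io.max format: "<MAJ>:<MIN> key=value ...".
	limit := fmt.Sprintf("%s wbps=%d", device, wbps)
	if err := os.WriteFile(cgroupDir+"/io.max", []byte(limit), 0644); err != nil {
		log.Fatalf("applying io.max limit: %v", err)
	}
	log.Printf("capped %s writes to %d bytes/s via %s/io.max", device, wbps, cgroupDir)
}
```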
This is a tracking issue for roachtests we want to introduce to validate existing/new AC machinery (subsuming #85469, which I forgot existed). They'll typically codify manual experiments that have been useful in developing said machinery. These roachtests should demonstrate performance isolation (throughput, latency) in the face of:

Next:
- `INSERT INTO newtable SELECT * FROM oldtable`, where oldtable is ~1.5TB (a sketch of this scenario follows below). In https://github.com/cockroachlabs/support/issues/2102 we saw the large insert cause a lot of open write intents, and intent resolution is not subject to AC. We saw this peg CPU at 100% and cause IO token exhaustion. See "admission: make intent resolution subject to admission control" #97108. (https://github.com/cockroachlabs/support/issues/2102, https://github.com/cockroachlabs/support/issues/2237, https://github.com/cockroachlabs/support/issues/2249, https://github.com/cockroachlabs/support/issues/2240)

Later:
- `IMPORT`s can cause IO token exhaustion + LSM inversion, and delays of 10+s in admission wait queues (https://github.com/cockroachlabs/support/issues/2156). This might be related to "admission: lack of intra-tenant prioritization for IO work" #95678.

For some of these, we'll want variants that hit CPU and IO saturation separately. We would also like a multi-workload test with varying priorities, or originating from different tenants (e.g. NormalPri reads/writes and BulkNormalPri work from another tenant). We also want library functions in roachtests to enable better experimentation/tests: #89978.
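To make the first scenario concrete, here is a minimal Go sketch (connection string and table names are assumptions) that runs the large copy while sampling foreground latency with a trivial query once a second; how flat the probe latency stays relative to an idle baseline is the isolation property these roachtests should demonstrate.

```go
// Sketch only: run the bulk copy while a goroutine probes foreground latency.
// Connection string and table names are placeholders.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	done := make(chan struct{})
	go func() {
		defer close(done)
		// The bulk copy: per the issue, this creates many open write intents
		// whose resolution is not (yet) subject to admission control.
		if _, err := db.Exec(`INSERT INTO newtable SELECT * FROM oldtable`); err != nil {
			log.Printf("bulk insert failed: %v", err)
		}
	}()

	// Foreground probe: a trivial round-trip once a second; latency spikes
	// here (relative to an idle baseline) indicate poor isolation.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			start := time.Now()
			var one int
			if err := db.QueryRow(`SELECT 1`).Scan(&one); err != nil {
				log.Printf("probe failed: %v", err)
				continue
			}
			log.Printf("foreground probe latency: %s", time.Since(start))
		}
	}
}
```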
Jira issue: CRDB-20126