
admission: add integration tests #89208

Open · 10 of 21 tasks
irfansharif opened this issue Oct 3, 2022 · 1 comment
Assignees: aadityasondhi
Labels: A-admission-control, C-enhancement, T-admission-control

Comments


irfansharif commented Oct 3, 2022

This is a tracking issue for roachtests we want to introduce to validate existing/new AC machinery (subsuming #85469 which I forgot existed). They'll typically codify manual experiments that have been useful in developing said machinery. These roachtests should demonstrate performance isolation (throughput, latency) in the face of:

Next:

Later:

For some of these, we'll want variants that hit CPU and IO saturation separately. We would also like a multi-workload test with varying priorities, or with work originating from different tenants (e.g. NormalPri reads/writes and BulkNormalPri work from another tenant). We also want library functions in roachtests to enable better experimentation/testing: #89978.
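To make "performance isolation" concrete, here is a minimal sketch (plain Go, outside any roachtest framework; the helper name and the 1.2x tolerance are hypothetical) of the pass/fail criterion such tests could eventually codify: foreground p99 latency under background load should stay within some factor of the baseline p99.

```go
package main

import (
	"fmt"
	"time"
)

// isolated reports whether foreground p99 latency under background load
// stayed within an allowed degradation factor of the baseline p99. Both the
// helper and the 1.2x threshold are hypothetical; actual roachtests would
// derive these numbers from workload histograms.
func isolated(baselineP99, underLoadP99 time.Duration, maxFactor float64) bool {
	return float64(underLoadP99) <= float64(baselineP99)*maxFactor
}

func main() {
	baseline := 40 * time.Millisecond  // e.g. TPC-C p99 with no background work
	underLoad := 46 * time.Millisecond // e.g. TPC-C p99 during a backup/backfill
	fmt.Println("isolation held:", isolated(baseline, underLoad, 1.2))
}
```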

Jira issue: CRDB-20126


blathers-crl bot commented Oct 3, 2022

Hi @irfansharif, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

irfansharif added the C-enhancement label Oct 3, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 6, 2022
Informs cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU
machines running 1000-warehouse TPC-C with an aggressive (every 20m)
full backup schedule. We've observed latency spikes during backups
because of their CPU-heavy nature -- they can elevate CPU scheduling
latencies, which in turn translates to an increase in foreground latency.
In cockroachdb#86638 we introduced admission control mechanisms to dynamically pace
such work while maintaining acceptable CPU scheduling latencies (sub
millisecond p99s). This roachtest exercises that machinery. In future
commits we'll add libraries to the roachtest package to automatically
spit out the degree to which {CPU-scheduler,foreground} latencies are
protected.

Release note: None
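For context, the "aggressive (every 20m) full backup schedule" can be expressed with CockroachDB's scheduled backups; the sketch below shows the kind of statement involved (the destination URI, connection string, and cron spec are assumptions, not copied from the test).

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver; the roachtest uses its own helpers
)

func main() {
	// Placeholder connection string; the roachtest connects to its own cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Full backups every 20 minutes; destination and cron spec are assumptions.
	const stmt = `CREATE SCHEDULE FOR BACKUP INTO 'gs://<bucket>/backups?AUTH=implicit'
RECURRING '*/20 * * * *' FULL BACKUP ALWAYS
WITH SCHEDULE OPTIONS first_run = 'now'`
	if _, err := db.Exec(stmt); err != nil {
		log.Fatal(err)
	}
	log.Println("backup schedule created")
}
```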
craig bot pushed a commit that referenced this issue Oct 6, 2022
87763: changefeedccl: mark kv senderrors retryable r=samiskin a=samiskin

Resolves #87300

Changefeeds can encounter senderrors during a normal upgrade procedure and therefore should retry. This was handled in the kvfeed; however, that is apparently not high-level enough, as a send error was still observed to cause a permanent failure.

This PR moves the senderror checking to the top-level IsRetryable check so it is handled regardless of its source.

Release justification: low risk important bug fix

Release note (bug fix): Changefeeds will now never permanently error on a "failed to send RPC" error.
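A minimal sketch of the pattern described (a single top-level retryability check catching send errors regardless of where they originate); the `sendError` type and function names below are stand-ins, not the actual changefeedccl/kvclient identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// sendError stands in for the KV-layer "failed to send RPC" error; the real
// type lives in the KV client, this one exists only for illustration.
type sendError struct{ cause error }

func (e *sendError) Error() string { return "failed to send RPC: " + e.cause.Error() }

// isRetryableChangefeedError is a stand-in for a top-level IsRetryable check:
// rather than special-casing send errors in the kvfeed alone, every error
// bubbling up through the changefeed is classified in one place.
func isRetryableChangefeedError(err error) bool {
	var se *sendError
	if errors.As(err, &se) {
		return true
	}
	// ... other retryable classifications would live here.
	return false
}

func main() {
	err := fmt.Errorf("flushing: %w", &sendError{cause: errors.New("node draining")})
	fmt.Println("retryable:", isRetryableChangefeedError(err)) // true
}
```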

89445: opt: assert that inverted scans have inverted constraints r=mgartner a=mgartner

This commit adds an assertion to ensure that inverted index scans have inverted constraints. If they do not, there is likely a bug that can cause incorrect query results (e.g., #88047). This assertion is made in release builds, not just test builds, because it is cheap to perform.

Fixes #89440

Release note: None
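As a rough illustration of such an always-on (release-build) assertion, here is a sketch with toy types standing in for the optimizer's scan expression; this is not the actual opt package API.

```go
package main

import (
	"errors"
	"fmt"
)

// scanExpr is a toy stand-in for the optimizer's scan expression.
type scanExpr struct {
	usesInvertedIndex  bool
	invertedConstraint []string // nil means no inverted constraint was built
}

// validateScan mirrors the idea of a cheap assertion that runs in release
// builds too: an inverted index scan without an inverted constraint likely
// indicates a planning bug that could produce incorrect results.
func validateScan(s *scanExpr) error {
	if s.usesInvertedIndex && len(s.invertedConstraint) == 0 {
		return errors.New("internal error: inverted scan without inverted constraint")
	}
	return nil
}

func main() {
	fmt.Println(validateScan(&scanExpr{usesInvertedIndex: true})) // assertion fires
}
```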

89482: roachtests: introduce admission-control/elastic-backup r=irfansharif a=irfansharif

Informs #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C with an aggressive (every 20m) full backup schedule. We've observed latency spikes during backups because of their CPU-heavy nature -- they can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency. In #86638 we introduced admission control mechanisms to dynamically pace such work while maintaining acceptable CPU scheduling latencies (sub-millisecond p99s). This roachtest exercises that machinery. In future commits we'll add libraries to the roachtest package to automatically spit out the degree to which {CPU-scheduler,foreground} latencies are protected.

Release note: None

Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 17, 2022
Part of cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU
machines running 1000-warehouse TPC-C, and kicks off a few changefeed
backfills concurrently. We've observed latency spikes during backfills
because of their CPU/scan-heavy nature -- they can elevate CPU scheduling
latencies, which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup
that roachtests do to the logger in scope.

Release note: None
craig bot pushed a commit that referenced this issue Oct 17, 2022
88974: sql: add support for `DELETE FROM ... USING` r=faizaanmadhani a=faizaanmadhani

See commit messages for details.

Resolves: #40963

89459: metrics: expose pebble flush utilization r=jbowens a=coolcom200

Create a new `GaugeFloat64` metric for pebble’s flush utilization. This
metric is not cumulative; rather, it is computed over an interval. The
interval is determined by the `interval` parameter of the
`Node.startComputePeriodicMetrics` method.

In order to compute the metric over an interval the previous value of
the metric must be stored. As a result, a map is constructed that takes
a pointer to a store and maps it to a pointer to storage metrics:
`make(map[*kvserver.Store]*storage.Metrics)`. This map is passed to
`node.computeMetricsPeriodically` which gets the store to calculate its
metrics and then updates the previous metrics in the map.

Refactor `store.go`'s metric calculation by separating
`ComputeMetrics(ctx context.Context, tick int) error` into two methods:

* `ComputeMetrics(ctx context.Context) error`
* `ComputeMetricsPeriodically(ctx context.Context, prevMetrics
  *storage.Metrics, tick int) (m storage.Metrics, err error)`

Both methods call `computeMetrics`, which contains the code common to the
two. Previously, retrieving instantaneous metrics meant passing a tick
value such as `-1` or `0` to `ComputeMetrics(ctx context.Context, tick int)`;
now it can be done with a call to `ComputeMetrics(ctx context.Context)`.

The `store.ComputeMetricsPeriodically` method will also return the
latest storage metrics. These metrics are used to update the mapping
between stores and metrics used for computing the metric delta over an
interval.

Release Note: None

Resolves part of #85755
Depends on #88972, cockroachdb/pebble#2001
Epic: CRDB-17515
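A minimal sketch of the delta-over-interval pattern described above, with toy `Store`/`Metrics` types standing in for `kvserver.Store` and `storage.Metrics`; the field names and the utilization formula are illustrative assumptions, not the actual storage metrics.

```go
package main

import (
	"fmt"
	"time"
)

// Store and Metrics are toy stand-ins for kvserver.Store and storage.Metrics.
type Store struct{ id int }
type Metrics struct{ FlushWriteDuration time.Duration } // cumulative time spent flushing

// prevMetrics mirrors the map[*kvserver.Store]*storage.Metrics described in
// the PR: it remembers the last cumulative reading per store so that a
// non-cumulative, per-interval gauge can be derived on the next tick.
var prevMetrics = map[*Store]*Metrics{}

// computeMetricsPeriodically returns the flush utilization over the interval
// since the previous tick and updates the stored previous reading.
func computeMetricsPeriodically(s *Store, latest Metrics, interval time.Duration) float64 {
	var util float64
	if prev, ok := prevMetrics[s]; ok {
		util = float64(latest.FlushWriteDuration-prev.FlushWriteDuration) / float64(interval)
	}
	prevMetrics[s] = &latest
	return util
}

func main() {
	s := &Store{id: 1}
	computeMetricsPeriodically(s, Metrics{FlushWriteDuration: 2 * time.Second}, 10*time.Second)
	util := computeMetricsPeriodically(s, Metrics{FlushWriteDuration: 5 * time.Second}, 10*time.Second)
	fmt.Printf("flush utilization this interval: %.2f\n", util) // 0.30
}
```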


89656: roachtest: introduce admission-control/elastic-cdc r=irfansharif a=irfansharif

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of their CPU/scan-heavy nature -- they can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup that roachtests do to the logger in scope.

Release note: None

Co-authored-by: Faizaan Madhani <[email protected]>
Co-authored-by: Leon Fattakhov <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 24, 2022
These tests will only serve as coarse-grained benchmarks for things AC
cares about -- we don't need to run them nightly. They've spent ~2 weeks
through our nightly CI suite without flaking, so reduce the frequency to a
weekly cadence. We'll do the same for most tests added as part of
cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Nov 4, 2022
These tests will only serve as coarse-grained benchmarks for things AC
cares about -- we don't need to run them nightly. They've spent ~2 weeks
through our nightly CI suite without flaking, so reduce the frequency to a
weekly cadence. We'll do the same for most tests added as part of
cockroachdb#89208.

Release note: None
craig bot pushed a commit that referenced this issue Nov 4, 2022
90579: roachtest: reduce frequency of benchmark-only AC tests r=irfansharif a=irfansharif

First four commits are from #89709 and should be ignored here. These tests will only serve as coarse-grained benchmarks for things AC cares about -- we don't need to run them nightly. They've spent ~2 weeks through our nightly CI suite without flaking, so reduce the frequency to a weekly cadence. We'll do the same for most tests added as part of #89208.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 29, 2023
Part of cockroachdb#89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drops all share the same underlying code: we make
metadata-only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

Release note: None
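To make the mechanism concrete, here is a sketch of the sort of SQL involved; the database name, connection string, and shortened GC TTL are illustrative assumptions about how a test might avoid waiting out the default TTL, not necessarily what this roachtest does.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver; the roachtest uses its own helpers
)

func main() {
	// Placeholder connection string; the roachtest connects to its own cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Shorten the GC TTL so the post-drop range deletions (and the
		// compactions that reclaim space) kick in quickly. Hypothetical value.
		`ALTER DATABASE tpce CONFIGURE ZONE USING gc.ttlseconds = 600`,
		// The drop itself is a metadata-only change on the descriptors.
		`DROP DATABASE tpce CASCADE`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```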
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 31, 2023
Part of cockroachdb#89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drops all share the same underlying code: we make
metadata-only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

Release note: None
craig bot pushed a commit that referenced this issue May 31, 2023
104051: roachtest: add admission-control/database-drop r=irfansharif a=irfansharif

**roachtest: add admission-control/database-drop**

Part of #89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drops all share the same underlying code: we make
metadata-only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

**roachprod/gce: use prefix search in snapshot listing API**

Like the API intended. Only noticed in the subsequent commit when
introducing a second use of disk snapshots that happened to use
"tpcc-100k" in its name, somewhere.

**roachtest: update index-backfill roachtest**

Adding a bit more commentary, throwing in another concurrent index
backfill for free, and fighting some open CRDB bugs unrelated to the
test.

**roachtest: fix race with concurrent map writes**

This author forgot that struct copies weren't deep enough.

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 15, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 21, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 22, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 22, 2023
Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit.

Part of cockroachdb#89208.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 21, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
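One plausible way to implement such a 20MiB/s disk-bandwidth nemesis is a cgroup v2 `io.max` limit on the CockroachDB process; the sketch below is an assumption about the mechanism, not the roachtest's actual implementation, and the cgroup path and device major:minor numbers are placeholders.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// Cap write bandwidth on the store's block device to 20 MiB/s using the
	// cgroup v2 io.max controller. "8:16" (the device's major:minor) and the
	// cgroup path are placeholders for whatever the nemesis would target.
	const (
		cgroupIOMax = "/sys/fs/cgroup/system.slice/cockroach.service/io.max"
		limit       = 20 << 20 // 20 MiB/s in bytes
	)
	rule := fmt.Sprintf("8:16 wbps=%d", limit)
	if err := os.WriteFile(cgroupIOMax, []byte(rule), 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Println("applied io.max rule:", rule)
}
```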
aadityasondhi self-assigned this Oct 2, 2023
aadityasondhi added the T-admission-control label Oct 3, 2023
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue Oct 25, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue Oct 25, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
aadityasondhi pushed a commit to irfansharif/cockroach that referenced this issue Oct 25, 2023
This is just resuscitating cockroachdb#81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of cockroachdb#89208.

Release note: None
craig bot pushed a commit that referenced this issue Oct 25, 2023
111070: roachtest: add admission-overload/follower-overload r=sumeerbhola a=irfansharif

This is just resuscitating #81516, tests we added when developing
follower pausing but never checked in. Some of this work is applicable
in the context of replication admission control, when we apply flow
control for regular writes. The tests add an IO nemesis on n3/s3,
restricting disk bandwidth to 20MiB/s. They then observe what happens
when remote nodes n1 and n2 issue follower write traffic to it, either
across many ranges or just 1 (potentially triggering the per-replica
proposal quota pool). We can observe LSM state on n3/s3 under such
conditions, and also what would happen if the concentrated follower node
also serves some foreground load.

Part of #89208.

Release note: None

112660: kvflowcontroller: fix logging of blocked streams r=pavelkalinnikov,aadityasondhi a=sumeerbhola

There were two bugs that are fixed:
- The blocked_stream_count metric was incorrectly capped at 100.
- Stats were being logged for streams that were never blocked.

Some additional improvements/fixes:
- Controller.mu was being unnecessarily acquired for read paths that don't care about concurrent additions to the map.
- WorkClass.SafeFormat was calling redact.SafePrinter.Print in some cases so "elastic" was not being treated as unsafe.

There is a unit test to test the overflow logic of the logs, and to verify that the metric is correct even when the logs overflow.

Epic: none

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
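A small sketch of the intent behind the blocked-streams fix: the metric counts every blocked stream, while log output is bounded, and never-blocked streams aren't logged at all. The types below are illustrative stand-ins; only the 100-entry bound comes from the description above.

```go
package main

import "fmt"

// stream is a toy stand-in for a replication flow-control stream.
type stream struct {
	name    string
	blocked bool
}

const maxLoggedStreams = 100 // bound on log verbosity, per the bug description

// blockedStreams returns the true count of blocked streams (used for the
// blocked_stream_count metric) alongside a bounded sample for logging. The
// point of the fix: the metric reflects every blocked stream even when the
// log output overflows, and never-blocked streams are not logged at all.
func blockedStreams(streams []stream) (count int, sample []string) {
	for _, s := range streams {
		if !s.blocked {
			continue
		}
		count++
		if len(sample) < maxLoggedStreams {
			sample = append(sample, s.name)
		}
	}
	return count, sample
}

func main() {
	streams := make([]stream, 250)
	for i := range streams {
		streams[i] = stream{name: fmt.Sprintf("t1/s%d", i+1), blocked: i%2 == 0}
	}
	count, sample := blockedStreams(streams)
	fmt.Println("blocked:", count, "logged:", len(sample)) // blocked: 125 logged: 100
}
```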