
kvserver: disable merge queue until kvsubscriber has updated #78122

Merged
merged 1 commit (e25031a) into cockroachdb:master from 220319.kvsubscriber-no-merge
Mar 21, 2022

Conversation

irfansharif
Contributor

If we don't have any span configs available, enabling range merges would
be extremely dangerous -- we could collapse everything into a single
range. We've observed this happen when the kvsubscriber's initial scan
overflows its bounded buffer, preventing it from ever getting a
snapshot. A future commit will fix the bounded memory issue, but the
side-effect pointed out the need for this important safeguard.

Informs #77687.

Release justification: bug fix
Release note: None
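
To make the safeguard concrete, here is a minimal self-contained sketch of the shape of this check (the `confReader`/`Subscribed` names are illustrative stand-ins, not the actual CockroachDB API): until the kvsubscriber reports that its initial snapshot has been applied, the merge queue treats itself as disabled.

```go
package main

import "fmt"

// confReader is a hypothetical stand-in for the store's span-config
// reader; Subscribed reports whether the kvsubscriber has applied its
// initial snapshot of span configs.
type confReader struct{ subscribed bool }

func (c *confReader) Subscribed() bool { return c.subscribed }

// mergeQueueEnabled folds the new guard into the existing enablement
// check: with no span configs available, the fallback static config
// could let the queue collapse everything into a single range, so
// merges stay off until subscription is established.
func mergeQueueEnabled(settingEnabled bool, cr *confReader) bool {
	return settingEnabled && cr.Subscribed()
}

func main() {
	cr := &confReader{}
	fmt.Println(mergeQueueEnabled(true, cr)) // false: initial scan not yet applied
	cr.subscribed = true
	fmt.Println(mergeQueueEnabled(true, cr)) // true: safe to consider merges
}
```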

@irfansharif irfansharif requested a review from ajwerner March 19, 2022 06:39
@irfansharif irfansharif requested a review from a team as a code owner March 19, 2022 06:39
@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 220319.kvsubscriber-no-merge branch from 583fd11 to 1c47b74 on March 19, 2022 22:00
@tbg
Member

tbg commented Mar 21, 2022

Drive-by comment which you can feel free to disregard: where does the merge queue interact with the span configs? And is that perhaps a place where the API can be extended to indicate that the spans aren't known yet?

@irfansharif
Contributor Author

There are two places. When we receive configs over a keyspan, we enqueue the corresponding ranges into the merge and split queues in case the config necessitates a split or allows a merge:

s.mergeQueue.Async(replCtx, "span config update", true /* wait */, func(ctx context.Context, h queueHelper) {

When the merge queue processes replicas in its own background thread, it checks whether extending a replica's keyspan one key further would necessitate a split:

if confReader.NeedsSplit(ctx, desc.StartKey, desc.EndKey.Next()) {

Right above processing individual replicas we check whether the merge queue is enabled. Given that, folding this check in there felt reasonable and minimally invasive.
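
For intuition, here is a toy version of that second check (plain string keys, not the real `NeedsSplit` signature): the queue probes one key past the replica's end and skips the merge if the extended span would cross a config boundary, i.e. would immediately need to be split again.

```go
package main

import "fmt"

// needsSplit is a toy stand-in for the span-config reader's NeedsSplit:
// it reports whether any config boundary falls strictly inside [start, end).
func needsSplit(boundaries []string, start, end string) bool {
	for _, b := range boundaries {
		if start < b && b < end {
			return true
		}
	}
	return false
}

func main() {
	boundaries := []string{"/Table/2"} // a span config changes at this key
	start, end := "/Table/1", "/Table/2"
	// Probe one key past the replica's end ("\x00" is the key successor):
	// if the extended span would need a split, merging is pointless.
	fmt.Println(needsSplit(boundaries, start, end+"\x00")) // true: skip the merge
}
```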

@irfansharif irfansharif requested a review from arulajmani March 21, 2022 14:32
Collaborator

@arulajmani arulajmani left a comment


:lgtm:

Reviewed 2 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@irfansharif
Contributor Author

CI failure looks unrelated.

```
fatal error: concurrent map read and map write

goroutine 13768 [running]:
runtime.throw({0x413c25e, 0xc0178a6258})
	/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc024d1def8 sp=0xc024d1dec8 pc=0x452331
runtime.mapaccess2(0x3fe4d00, 0xb, 0xc024d1df78)
	/usr/local/go/src/runtime/map.go:469 +0x205 fp=0xc024d1df38 sp=0xc024d1def8 pc=0x429a45
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl.fieldIsExported({0x53c7ff0, 0x3d4a9e0}, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl/walk.go:133 +0x77 fp=0xc024d1e0b0 sp=0xc024d1df38 pc=0x15faff7
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl.walk.func4({0x3d4a9e0, 0xc0178a6240, 0xc00afa1c00})
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl/walk.go:106 +0x1a5 fp=0xc024d1e158 sp=0xc024d1e0b0 pc=0x15fa9e5
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl.walk({0x53c7ff0, 0x3b014c0}, {0x3fb02a0, 0xc0178a6240}, 0xc024d1e2c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl/walk.go:125 +0x3a9 fp=0xc024d1e278 sp=0xc024d1e158 pc=0x15fa789
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl.WalkDescIDs({0x5354668, 0xc0178a6240}, 0x226b4e5)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl/walk.go:26 +0x88 fp=0xc024d1e2d0 sp=0xc024d1e278 pc=0x15fa0a8
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl.AllTargetDescIDs({{0xc026bcf000, 0x10, 0x20}, {0xc01bf06930, 0x1, 0x1}, {{0xc0196cb042, 0x4}, {0x0, 0x0}}})
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/screl/scalars.go:52 +0x125 fp=0xc024d1e338 sp=0xc024d1e2d0 pc=0x15f9d85
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage.makeDescriptorStates(0xa5a4819d1138001, 0x0, {{0xc026bcf000, 0x10, 0x20}, {0xc01bf06930, 0x1, 0x1}, {{0xc0196cb042, 0x4}, ...}}, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage/build.go:535 +0x78 fp=0xc024d1e608 sp=0xc024d1e338 pc=0x226d378
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage.buildContext.setJobStateOnDescriptorOps({0x0, 0xc0206c4a00, 0xc01a7459f8, 0x1, {{0xc026bcf000, 0x10, 0x20}, {0xc01bf06930, 0x1, 0x1}, ...}, ...}, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage/build.go:518 +0x95 fp=0xc024d1e6f0 sp=0xc024d1e608 pc=0x226d035
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage.stageBuilder.computeExtraJobOps({{0x0, 0xc0206c4a00, 0xc01a7459f8, 0x1, {{0xc026bcf000, 0x10, 0x20}, {0xc01bf06930, 0x1, 0x1}, ...}, ...}, ...}, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage/build.go:445 +0x2f5 fp=0xc024d1e898 sp=0xc024d1e6f0 pc=0x226c875
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage.stageBuilder.build({{0x0, 0xc0206c4a00, 0xc01a7459f8, 0x1, {{0xc026bcf000, 0x10, 0x20}, {0xc01bf06930, 0x1, 0x1}, ...}, ...}, ...})
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/internal/scstage/build.go:423 +0x39b fp=0xc024d1eb18 sp=0xc024d1e898 pc=0x226c4bb
```

bors r+

@craig
Contributor

craig bot commented Mar 21, 2022

Build succeeded:

@craig craig bot merged commit e25031a into cockroachdb:master Mar 21, 2022
@irfansharif irfansharif deleted the 220319.kvsubscriber-no-merge branch March 21, 2022 18:03
@andreimatei
Contributor

Would it make sense to block for the kvsubscriber connection in the server's join phase?

@ajwerner
Contributor

I think it might.


@andreimatei andreimatei left a comment


Irfan, have you considered doing that instead?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained
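
For reference, the alternative being floated here might look something like the following sketch (hypothetical shape; the merged PR gates the merge queue rather than blocking startup): the server's startup path waits on a channel the kvsubscriber closes once its initial snapshot lands, bounded by a context deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForSpanConfigs blocks server startup until the kvsubscriber has
// received its initial span-config snapshot, or the context expires.
// (Hypothetical sketch of the suggested alternative.)
func waitForSpanConfigs(ctx context.Context, subscribed <-chan struct{}) error {
	select {
	case <-subscribed:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("startup: span configs never arrived: %w", ctx.Err())
	}
}

func main() {
	subscribed := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // simulate the initial scan completing
		close(subscribed)
	}()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(waitForSpanConfigs(ctx, subscribed)) // <nil>
}
```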

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Aug 19, 2022
Fixes cockroachdb#73107. Fixes cockroachdb#79109. Fixes cockroachdb#85372.

It's possible that under certain conditions the merge queue doesn't
immediately merge what these tests are expecting. One possible reason is
span config asynchrony; the merge queue is actually disabled until each
store learns about at least some span config state: cockroachdb#78122 (if we don't
have any span configs available, enabling range merges would be
extremely dangerous -- we could collapse everything into a single
range).

Release justification: stability work
Release note: None
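
The usual remedy for this kind of asynchrony in tests is to retry the assertion until it holds, as CockroachDB tests commonly do with testutils.SucceedsSoon. A simplified, self-contained stand-in for that pattern (a sketch only, not the real helper):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// succeedsSoon retries fn until it returns nil or the deadline passes,
// mirroring the retry-until-merged shape used to deflake these tests.
func succeedsSoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition never met: %w", err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	attempts := 0
	err := succeedsSoon(5*time.Second, func() error {
		attempts++
		if attempts < 3 { // pretend the merge queue needs a few passes
			return errors.New("not yet merged down to one range")
		}
		return nil
	})
	fmt.Println(err, attempts) // <nil> 3
}
```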
craig bot pushed a commit that referenced this pull request Aug 19, 2022
86325: ui: add workload insight details page v1 r=ericharmeling a=ericharmeling

This commit adds the v1 Workload Insight Details page to the DB Console, via the `cluster-ui` package. The v1 Workload Insight Details page only includes details about a specific High Wait Time transaction insight event, populated with information served by a new "endpoint" built on top of the SQL-over-HTTP API.

Note that this v1 page only includes information available on contention for individual transaction executions recorded in the `crdb_internal.transaction_contention_events` table. As most transaction statistics are aggregated across multiple transaction executions, some data is missing in this v1, most notably: end time, rows processed, priority, full scan, retry count, last retry reason, session id.

Fixes #83775.

https://www.loom.com/share/3f7dbd1326954c4ca7949ed8539ae4e9

Release note (ui change): Added new Workload Insight Details page to DB Console

Release justification: low-risk updates to new functionality

86412: kvserver: deflake TestMergeQueue{...} tests r=irfansharif a=irfansharif

Fixes #73107. Fixes #79109. Fixes #85372.

It's possible that under certain conditions the merge queue doesn't
immediately merge what these tests are expecting. One possible reason is
span config asynchrony; the merge queue is actually disabled until each
store learns about at least some span config state: #78122 (if we don't
have any span configs available, enabling range merges would be
extremely dangerous -- we could collapse everything into a single
range).

Release justification: stability work
Release note: None

86416: bazel: remove `devdarwinx86_64` toolchain r=rail,dt,jlinder a=rickystewart

We haven't come up with a `clang` 10 toolchain for macOS/arm64 because
LLVM doesn't publish `clang` archives for that architecture. Instead of
letting the two platforms differ to this extent, just remove the `dev`
toolchain for x86_64.

Closes #72828.

Release justification: non-production code changes
Release note: None

86421: kvevent: implement chunked blocking buffer r=jayshrivastava a=jayshrivastava

### **kvevent: refactor blocking buffer benchmark**
This change updates the blocking buffer micro benchmark
in several ways:
- it uses different types of events
- it uses more producers than consumers to keep the
  buffer full
- it makes b.N correspond to the total number of events,
  so the benchmark can analyze allocs per event

Release note: None

Release justification: This change updates a benchmark only.

### **kvevent: implement chunked buffer event queue**
This change implements a simple chunked event queue.
The purpose of this queue is to be used by
kvevent.blockingBuffer in subsequent commits.

Release note: None

Release justification: This change does not affect
any production code. It adds files which are not
called by any packages.

### **kvevent: refactor memory buffer to chunked linked list**
This change refactors kvevent/blocking_buffer.go to use
a chunked linked list instead of a regular linked list to
reduce pointer usage. Note that the underlying sync.Pool,
which is also a linked list, will use fewer pointers because
we pool chunks instead of events.

Release note: None

Release justification: This change significantly
improves performance by reducing pressure on GC,
which in turn improves foreground SQL p99 latency.
GC has been causing severe issues in production
changefeeds. Merging this change in this release is
worth it for its potential to reduce incidents.
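
As a rough sketch of the data structure being described (illustrative only, not the actual kvevent implementation): events live contiguously inside fixed-size chunks, the chunks themselves form the linked list, and drained chunks are recycled through a sync.Pool. Pooling whole chunks means the GC tracks roughly one pointer per chunk rather than one per event, which is where the reduced GC pressure comes from.

```go
package main

import (
	"fmt"
	"sync"
)

const chunkSize = 128

// chunk holds events contiguously; chunks form a linked list, so the GC
// scans one pointer per 128 events instead of one per event.
type chunk struct {
	events     [chunkSize]int // stand-in for the real event type
	head, tail int
	next       *chunk
}

var chunkPool = sync.Pool{New: func() any { return new(chunk) }}

// chunkedQueue is a minimal sketch of the chunked linked-list buffer.
type chunkedQueue struct{ first, last *chunk }

func (q *chunkedQueue) push(e int) {
	if q.last == nil || q.last.tail == chunkSize {
		c := chunkPool.Get().(*chunk)
		*c = chunk{} // reset pooled memory
		if q.last == nil {
			q.first, q.last = c, c
		} else {
			q.last.next = c
			q.last = c
		}
	}
	q.last.events[q.last.tail] = e
	q.last.tail++
}

func (q *chunkedQueue) pop() (int, bool) {
	if q.first == nil || q.first.head == q.first.tail {
		return 0, false
	}
	e := q.first.events[q.first.head]
	q.first.head++
	if q.first.head == chunkSize { // chunk drained: recycle it
		c := q.first
		q.first = c.next
		if q.first == nil {
			q.last = nil
		}
		chunkPool.Put(c)
	}
	return e, true
}

func main() {
	var q chunkedQueue
	for i := 0; i < 300; i++ {
		q.push(i)
	}
	v, _ := q.pop()
	fmt.Println(v) // 0
}
```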

### **Results (micro)**
These are the results of running the microbenchmark.
`./dev bench pkg/ccl/changefeedccl/kvevent --filter=BenchmarkMemBuffer --count=10 --bench-mem  --stream-output  --test-args="--test.benchtime=45s" -- --nocache_test_results --test_verbose_timeout_warnings |& tee bench.txt`
```
name          old time/op    new time/op    delta
MemBuffer-10    1.22µs ± 2%    0.85µs ± 3%  -30.04%  (p=0.000 n=8+10)

name          old alloc/op   new alloc/op   delta
MemBuffer-10     0.00B          0.00B          ~     (all equal)

name          old allocs/op  new allocs/op  delta
MemBuffer-10      0.00           0.00          ~     (all equal)
```
- Memory usage is 0 due to pooling in both implementations. 
- We can achieve a higher throughput with the chunked implementation - about 50-60M events in 45 seconds as opposed to ~40M with the old implementation. 


### **Results (Macro)**
Full results are published [here](https://docs.google.com/document/d/10-Pc_cnkCUhkRFpLyBAId39q1aqsjuqTb2k4ehTmQCY/edit?usp=sharing). In summary:

I analyzed performance by running TPC-C for 30 mins on a 15 node cluster with 10k warehouses. Before starting the workload, I started a changefeed on the `order_line` table (~200GB). I also set the following cluster settings to stress the buffer and pressure GC:
`changefeed.backfill.concurrent_scan_requests = 100;`
`changefeed.memory.per_changefeed_limit = '1073741824';` (~1GB)

Then, I analyzed SQL latency from the Admin UI and GC performance using the output of `GODEBUG=gctrace=1`. These are the outcomes:
- The p99 SQL latency during the workload was reduced from approx. 1.75s -> 0.150s (91%)
- CPU time spent doing GC was reduced from 37.86 mins -> 20.75 mins (45%)
- The p99 spike at the beginning of the workload was reduced from approx. 15s -> 12s (20%)

### Relevant Issues
Addresses: #84582
(for now...)

86426: sql: bump memory limits in a couple of tests r=yuzefovich a=yuzefovich

This commit is similar to dd82d13 where
we bumped the --max-sql-memory limit. We have recently started hitting the
memory limit flakes again (because we do more accounting and have more
internal queries), so this commit bumps up the limits significantly
while still preserving the tests' spirit.

Fixes: #86372.

Release justification: test-only change.

Release note: None

86458: insights: surface suboptimal plans and failed executions r=matthewtodd a=matthewtodd

Fixes #85827.

Release justification: Category 2: Bug fixes and low-risk updates to new
functionality.

Release note: None

86471: vendor: bump Pebble to 406c1dce54c9 r=itsbilal,nicktrav a=jbowens

```
406c1dce *: add simpleLevelIterator, reduce merging levels in external iter
cd7f076e metamorphic: reposition iterator after SetOptions
5f6b4325 db: make RangeKeyMasking.Filter a constructor function
```

In addition to the vendor bump, this commit makes some small code
changes in response to interface changes within Pebble. I kept these
changes minimal for the Pebble bump PR, and we can follow up with
fully adapting call sites to take advantage of the new ability to pass
'levels' of sstables into the sstable iterator.

Release note: None
Release justification: Non-production code changes, low-risk updates to
new functionality.

86481: Makefile: fix logpb generation r=nicktrav a=ajwerner

We needed to pass the flag; without it, the build hits problems if any of
the deps of the file change.

Release justification: fixes the build

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 11, 2023
...subscribed to span configs. Do the same for the store
rebalancer. We applied this treatment for the merge queue back in cockroachdb#78122
since the fallback behavior, if not subscribed, is to use the statically
defined span config for every operation.

- For the replicate queue, this meant obtusely applying a replication
  factor of 3, regardless of configuration. This was possible typically
  post node restart before subscription was initially established. We
  saw this in cockroachdb#98385. It was possible then for us to ignore configured
  voter/non-voter/lease constraints.
- For the split queue, we wouldn't actually compute any split keys if
  unsubscribed, so the missing check was somewhat benign. But we would
  be obtusely applying the default range sizes [128MiB,512MiB], so for
  clusters configured with larger range sizes, this could lead to a
  burst of splitting post node-restart.
- For the MVCC GC queue, it would mean applying the statically
  configured default GC TTL and ignoring any set protected timestamps.
  The latter is best-effort protection but could result in internal
  operations relying on protection (like backups, changefeeds) failing
  informatively. For clusters configured with GC TTL greater than the
  default, post node-restart it could lead to a burst of MVCC GC
  activity and AOST queries failing to find expected data.
- For the store rebalancer, ignoring span configs could result in
  violating lease preferences and voter constraints.

Fixes cockroachdb#98421.
Fixes cockroachdb#98385.

Release note (bug fix): It was previously possible for CockroachDB to
not respect non-default zone configs. This only happened for a short
window after nodes with existing replicas were restarted, and
self-rectified within seconds. This manifested in a few ways:
- If num_replicas was set to something other than 3, we would still
  add or remove replicas to get to 3x replication.
  - If num_voters was set explicitly to get a mix of voting and
    non-voting replicas, it would be ignored. CockroachDB could possibly
    remove non-voting replicas.
- If range_min_bytes or range_max_bytes were changed from 128 MiB and
  512 MiB respectively, we would instead try to size ranges to be within
  [128 MiB, 512MiB]. This could appear as an excess amount of range
  splits or merges, as visible in the Replication Dashboard under "Range
  Operations".
- If gc.ttlseconds was set to something other than 90000 seconds, we
  would still GC data only older than 90000s/25h. If the GC TTL was set
  to something larger than 25h, AOST queries going further back may now
  start failing. For GC TTLs less than the 25h default, clusters would
  observe increased disk usage due to more retained garbage.
- If constraints, lease_preferences or voter_constraints were set, they
  would be ignored. Range data and leases would possibly be moved
  outside where prescribed.
This issue only lasted a few seconds post node-restarts, and any zone
config violations were rectified shortly after.
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 13, 2023
...subscribed to span configs. Do the same for the store
rebalancer. We applied this treatment for the merge queue back in cockroachdb#78122
since the fallback behavior, if not subscribed, is to use the statically
defined span config for every operation.

- For the replicate queue, this meant obtusely applying a replication
  factor of 3, regardless of configuration. This was possible typically
  post node restart before subscription was initially established. We
  saw this in cockroachdb#98385. It was possible then for us to ignore configured
  voter/non-voter/lease constraints.
- For the split queue, we wouldn't actually compute any split keys if
  unsubscribed, so the missing check was somewhat benign. But we would
  be obtusely applying the default range sizes [128MiB,512MiB], so for
  clusters configured with larger range sizes, this could lead to a
  burst of splitting post node-restart.
- For the MVCC GC queue, it would mean applying the statically
  configured default GC TTL and ignoring any set protected timestamps.
  The latter is best-effort protection but could result in internal
  operations relying on protection (like backups, changefeeds) failing
  informatively. For clusters configured with GC TTL greater than the
  default, post node-restart it could lead to a burst of MVCC GC
  activity and AOST queries failing to find expected data.
- For the store rebalancer, ignoring span configs could result in
  violating lease preferences and voter/non-voter constraints.

Fixes cockroachdb#98421.
Fixes cockroachdb#98385.

While here, we also introduce the following non-public cluster settings
to selectively enable/disable KV queues:
- kv.mvcc_gc_queue.enabled
- kv.split_queue.enabled
- kv.replicate_queue.enabled

Release note (bug fix): It was previously possible for CockroachDB to
not respect non-default zone configs. This could only happen for a short
window after nodes with existing replicas were restarted (measured in
seconds), and self-rectified (also within seconds). This manifested in a
few ways:
- If num_replicas was set to something other than 3, we would still
  add or remove replicas to get to 3x replication.
  - If num_voters was set explicitly to get a mix of voting and
    non-voting replicas, it would be ignored. CockroachDB could possibly
    remove non-voting replicas.
- If range_min_bytes or range_max_bytes were changed from 128 MiB and
  512 MiB respectively, we would instead try to size ranges to be within
  [128 MiB, 512MiB]. This could appear as an excess amount of range
  splits or merges, as visible in the Replication Dashboard under "Range
  Operations".
- If gc.ttlseconds was set to something other than 90000 seconds, we
  would still GC data only older than 90000s/25h. If the GC TTL was set
  to something larger than 25h, AOST queries going further back may now
  start failing. For GC TTLs less than the 25h default, clusters would
  observe increased disk usage due to more retained garbage.
- If constraints, lease_preferences or voter_constraints were set, they
  would be ignored. Range data and leases would possibly be moved
  outside where prescribed.
This issue lasted a few seconds post node-restarts, and any zone config
violations were rectified shortly after.
craig bot pushed a commit that referenced this pull request Mar 13, 2023
98261: sql: add crdb_internal views for system statistics tables r=ericharmeling a=ericharmeling

This commit adds two new crdb_internal views:

- crdb_internal.statement_statistics_persisted, which surfaces the system.statement_statistics table
- crdb_internal.transaction_statistics_persisted, which surfaces the system.transaction_statistics table

Example output after a flush:

```
[email protected]:26257/insights> select * from crdb_internal.statement_statistics_persisted limit 3;
-[ RECORD 1 ]
aggregated_ts              | 2023-03-08 23:00:00+00
fingerprint_id             | \x3ab7869b0bc4aa5a
transaction_fingerprint_id | \x95d43bd78dc51d85
plan_hash                  | \x9aec25074eb1e3a0
app_name                   | $ cockroach sql
node_id                    | 1
agg_interval               | 01:00:00
metadata                   | {"db": "insights", "distsql": true, "failed": false, "fullScan": true, "implicitTxn": true, "query": "SELECT * FROM crdb_internal.statement_statistics_persisted", "querySummary": "SELECT * FROM crdb_internal.statement_statis...", "stmtTyp": "TypeDML", "vec": true}
statistics                 | {"execution_statistics": {"cnt": 1, "contentionTime": {"mean": 0, "sqDiff": 0}, "cpuSQLNanos": {"mean": 24667, "sqDiff": 0}, "maxDiskUsage": {"mean": 0, "sqDiff": 0}, "maxMemUsage": {"mean": 2.048E+4, "sqDiff": 0}, "mvccIteratorStats": {"blockBytes": {"mean": 0, "sqDiff": 0}, "blockBytesInCache": {"mean": 0, "sqDiff": 0}, "keyBytes": {"mean": 0, "sqDiff": 0}, "pointCount": {"mean": 0, "sqDiff": 0}, "pointsCoveredByRangeTombstones": {"mean": 0, "sqDiff": 0}, "rangeKeyContainedPoints": {"mean": 0, "sqDiff": 0}, "rangeKeyCount": {"mean": 0, "sqDiff": 0}, "rangeKeySkippedPoints": {"mean": 0, "sqDiff": 0}, "seekCount": {"mean": 1, "sqDiff": 0}, "seekCountInternal": {"mean": 1, "sqDiff": 0}, "stepCount": {"mean": 0, "sqDiff": 0}, "stepCountInternal": {"mean": 0, "sqDiff": 0}, "valueBytes": {"mean": 0, "sqDiff": 0}}, "networkBytes": {"mean": 0, "sqDiff": 0}, "networkMsgs": {"mean": 0, "sqDiff": 0}}, "index_recommendations": [], "statistics": {"bytesRead": {"mean": 0, "sqDiff": 0}, "cnt": 1, "firstAttemptCnt": 1, "idleLat": {"mean": 0, "sqDiff": 0}, "indexes": ["42@1"], "lastErrorCode": "", "lastExecAt": "2023-03-08T23:14:04.614242Z", "latencyInfo": {"max": 0.001212208, "min": 0.001212208, "p50": 0, "p90": 0, "p99": 0}, "maxRetries": 0, "nodes": [1], "numRows": {"mean": 0, "sqDiff": 0}, "ovhLat": {"mean": 0.000007791999999999955, "sqDiff": 0}, "parseLat": {"mean": 0.000016666, "sqDiff": 0}, "planGists": ["AgFUAgD/FwAAAAcYBhg="], "planLat": {"mean": 0.000691666, "sqDiff": 0}, "regions": ["us-east1"], "rowsRead": {"mean": 0, "sqDiff": 0}, "rowsWritten": {"mean": 0, "sqDiff": 0}, "runLat": {"mean": 0.000496084, "sqDiff": 0}, "svcLat": {"mean": 0.001212208, "sqDiff": 0}}}
plan                       | {"Children": [], "Name": ""}
index_recommendations      | {}
indexes_usage              | ["42@1"]
-[ RECORD 2 ]
aggregated_ts              | 2023-03-08 23:00:00+00
fingerprint_id             | \x44c9fdb49be676cf
transaction_fingerprint_id | \xc1efcc0bba0909f8
plan_hash                  | \x780a1ba35976b15d
app_name                   | insights
node_id                    | 1
agg_interval               | 01:00:00
metadata                   | {"db": "insights", "distsql": false, "failed": false, "fullScan": false, "implicitTxn": false, "query": "UPDATE insights_workload_table_0 SET balance = balance + $1 WHERE id = $1", "querySummary": "UPDATE insights_workload_table_0 SET balance = balan... WHERE id = $1", "stmtTyp": "TypeDML", "vec": true}
statistics                 | {"execution_statistics": {"cnt": 28, "contentionTime": {"mean": 0, "sqDiff": 0}, "cpuSQLNanos": {"mean": 402538.75, "sqDiff": 1160598792985.25}, "maxDiskUsage": {"mean": 0, "sqDiff": 0}, "maxMemUsage": {"mean": 4.096E+4, "sqDiff": 0}, "mvccIteratorStats": {"blockBytes": {"mean": 31570.321428571428, "sqDiff": 20932497128.107143}, "blockBytesInCache": {"mean": 0, "sqDiff": 0}, "keyBytes": {"mean": 0, "sqDiff": 0}, "pointCount": {"mean": 6.857142857142857, "sqDiff": 435.42857142857133}, "pointsCoveredByRangeTombstones": {"mean": 0, "sqDiff": 0}, "rangeKeyContainedPoints": {"mean": 0, "sqDiff": 0}, "rangeKeyCount": {"mean": 0, "sqDiff": 0}, "rangeKeySkippedPoints": {"mean": 0, "sqDiff": 0}, "seekCount": {"mean": 2, "sqDiff": 0}, "seekCountInternal": {"mean": 2, "sqDiff": 0}, "stepCount": {"mean": 0, "sqDiff": 0}, "stepCountInternal": {"mean": 4.857142857142857, "sqDiff": 435.42857142857133}, "valueBytes": {"mean": 360.32142857142856, "sqDiff": 756476.107142857}}, "networkBytes": {"mean": 0, "sqDiff": 0}, "networkMsgs": {"mean": 0, "sqDiff": 0}}, "index_recommendations": [], "statistics": {"bytesRead": {"mean": 159.04887361588396, "sqDiff": 3909.7441771668127}, "cnt": 2619, "firstAttemptCnt": 2619, "idleLat": {"mean": 0.021495726165330273, "sqDiff": 36.39774900003032}, "indexes": ["106@1"], "lastErrorCode": "", "lastExecAt": "2023-03-08T23:31:03.079093Z", "latencyInfo": {"max": 1.724660916, "min": 0.0001765, "p50": 0.000757916, "p90": 0.00191375, "p99": 0.004730417}, "maxRetries": 0, "nodes": [1], "numRows": {"mean": 1, "sqDiff": 0}, "ovhLat": {"mean": 0.0000018584035891561339, "sqDiff": 3.132932109484058E-7}, "parseLat": {"mean": 0, "sqDiff": 0}, "planGists": ["AgHUAQIADwIAAAcKBQoh1AEAAA=="], "planLat": {"mean": 0.0002562748900343638, "sqDiff": 0.0002118085353898781}, "regions": ["us-east1"], "rowsRead": {"mean": 1, "sqDiff": 0}, "rowsWritten": {"mean": 1, "sqDiff": 0}, "runLat": {"mean": 0.0024048477613592997, "sqDiff": 4.850230671161608}, "svcLat": {"mean": 0.0026629810549828195, "sqDiff": 4.852464499918359}}}
plan                       | {"Children": [], "Name": ""}
index_recommendations      | {}
indexes_usage              | ["106@1"]
-[ RECORD 3 ]
aggregated_ts              | 2023-03-08 23:00:00+00
fingerprint_id             | \x54202c7b75a5ba87
transaction_fingerprint_id | \x0000000000000000
plan_hash                  | \xbee0e52ec8c08bdd
app_name                   | $$ $ cockroach demo
node_id                    | 1
agg_interval               | 01:00:00
metadata                   | {"db": "insights", "distsql": false, "failed": false, "fullScan": false, "implicitTxn": false, "query": "INSERT INTO system.jobs(id, created, status, payload, progress, claim_session_id, claim_instance_id, job_type) VALUES ($1, $1, __more1_10__)", "querySummary": "INSERT INTO system.jobs(id, created, st...)", "stmtTyp": "TypeDML", "vec": true}
statistics                 | {"execution_statistics": {"cnt": 1, "contentionTime": {"mean": 0, "sqDiff": 0}, "cpuSQLNanos": {"mean": 300625, "sqDiff": 0}, "maxDiskUsage": {"mean": 0, "sqDiff": 0}, "maxMemUsage": {"mean": 1.024E+4, "sqDiff": 0}, "mvccIteratorStats": {"blockBytes": {"mean": 0, "sqDiff": 0}, "blockBytesInCache": {"mean": 0, "sqDiff": 0}, "keyBytes": {"mean": 0, "sqDiff": 0}, "pointCount": {"mean": 0, "sqDiff": 0}, "pointsCoveredByRangeTombstones": {"mean": 0, "sqDiff": 0}, "rangeKeyContainedPoints": {"mean": 0, "sqDiff": 0}, "rangeKeyCount": {"mean": 0, "sqDiff": 0}, "rangeKeySkippedPoints": {"mean": 0, "sqDiff": 0}, "seekCount": {"mean": 0, "sqDiff": 0}, "seekCountInternal": {"mean": 0, "sqDiff": 0}, "stepCount": {"mean": 0, "sqDiff": 0}, "stepCountInternal": {"mean": 0, "sqDiff": 0}, "valueBytes": {"mean": 0, "sqDiff": 0}}, "networkBytes": {"mean": 0, "sqDiff": 0}, "networkMsgs": {"mean": 0, "sqDiff": 0}}, "index_recommendations": [], "statistics": {"bytesRead": {"mean": 0, "sqDiff": 0}, "cnt": 1, "firstAttemptCnt": 1, "idleLat": {"mean": 9223372036.854776, "sqDiff": 0}, "indexes": [], "lastErrorCode": "", "lastExecAt": "2023-03-08T23:13:25.132671Z", "latencyInfo": {"max": 0.000589375, "min": 0.000589375, "p50": 0, "p90": 0, "p99": 0}, "maxRetries": 0, "nodes": [1], "numRows": {"mean": 1, "sqDiff": 0}, "ovhLat": {"mean": 0.0000016249999999999988, "sqDiff": 0}, "parseLat": {"mean": 0, "sqDiff": 0}, "planGists": ["AiACHgA="], "planLat": {"mean": 0.000203792, "sqDiff": 0}, "regions": ["us-east1"], "rowsRead": {"mean": 0, "sqDiff": 0}, "rowsWritten": {"mean": 1, "sqDiff": 0}, "runLat": {"mean": 0.000383958, "sqDiff": 0}, "svcLat": {"mean": 0.000589375, "sqDiff": 0}}}
plan                       | {"Children": [], "Name": ""}
index_recommendations      | {}
indexes_usage              | []

Time: 4ms total (execution 3ms / network 1ms)

[email protected]:26257/insights> select * from crdb_internal.transaction_statistics_persisted limit 3;
-[ RECORD 1 ]
aggregated_ts  | 2023-03-08 23:00:00+00
fingerprint_id | \x17d80cf128571d63
app_name       | $ internal-migration-job-mark-job-succeeded
node_id        | 1
agg_interval   | 01:00:00
metadata       | {"stmtFingerprintIDs": ["b8bbb1bdae56aabc"]}
statistics     | {"execution_statistics": {"cnt": 1, "contentionTime": {"mean": 0, "sqDiff": 0}, "cpuSQLNanos": {"mean": 64709, "sqDiff": 0}, "maxDiskUsage": {"mean": 0, "sqDiff": 0}, "maxMemUsage": {"mean": 1.024E+4, "sqDiff": 0}, "mvccIteratorStats": {"blockBytes": {"mean": 0, "sqDiff": 0}, "blockBytesInCache": {"mean": 0, "sqDiff": 0}, "keyBytes": {"mean": 0, "sqDiff": 0}, "pointCount": {"mean": 0, "sqDiff": 0}, "pointsCoveredByRangeTombstones": {"mean": 0, "sqDiff": 0}, "rangeKeyContainedPoints": {"mean": 0, "sqDiff": 0}, "rangeKeyCount": {"mean": 0, "sqDiff": 0}, "rangeKeySkippedPoints": {"mean": 0, "sqDiff": 0}, "seekCount": {"mean": 0, "sqDiff": 0}, "seekCountInternal": {"mean": 0, "sqDiff": 0}, "stepCount": {"mean": 0, "sqDiff": 0}, "stepCountInternal": {"mean": 0, "sqDiff": 0}, "valueBytes": {"mean": 0, "sqDiff": 0}}, "networkBytes": {"mean": 0, "sqDiff": 0}, "networkMsgs": {"mean": 0, "sqDiff": 0}}, "statistics": {"bytesRead": {"mean": 0, "sqDiff": 0}, "cnt": 6, "commitLat": {"mean": 0, "sqDiff": 0}, "idleLat": {"mean": 0, "sqDiff": 0}, "maxRetries": 0, "numRows": {"mean": 1, "sqDiff": 0}, "retryLat": {"mean": 0, "sqDiff": 0}, "rowsRead": {"mean": 0, "sqDiff": 0}, "rowsWritten": {"mean": 1, "sqDiff": 0}, "svcLat": {"mean": 0.00026919450000000006, "sqDiff": 1.7615729685500012E-8}}}
-[ RECORD 2 ]
aggregated_ts  | 2023-03-08 23:00:00+00
fingerprint_id | \x2b024f7e2567a238
app_name       | $ internal-get-job-row
node_id        | 1
agg_interval   | 01:00:00
metadata       | {"stmtFingerprintIDs": ["8461f232a36615e7"]}
statistics     | {"execution_statistics": {"cnt": 1, "contentionTime": {"mean": 0, "sqDiff": 0}, "cpuSQLNanos": {"mean": 50835, "sqDiff": 0}, "maxDiskUsage": {"mean": 0, "sqDiff": 0}, "maxMemUsage": {"mean": 3.072E+4, "sqDiff": 0}, "mvccIteratorStats": {"blockBytes": {"mean": 0, "sqDiff": 0}, "blockBytesInCache": {"mean": 0, "sqDiff": 0}, "keyBytes": {"mean": 0, "sqDiff": 0}, "pointCount": {"mean": 3, "sqDiff": 0}, "pointsCoveredByRangeTombstones": {"mean": 0, "sqDiff": 0}, "rangeKeyContainedPoints": {"mean": 0, "sqDiff": 0}, "rangeKeyCount": {"mean": 0, "sqDiff": 0}, "rangeKeySkippedPoints": {"mean": 0, "sqDiff": 0}, "seekCount": {"mean": 1, "sqDiff": 0}, "seekCountInternal": {"mean": 1, "sqDiff": 0}, "stepCount": {"mean": 3, "sqDiff": 0}, "stepCountInternal": {"mean": 3, "sqDiff": 0}, "valueBytes": {"mean": 186, "sqDiff": 0}}, "networkBytes": {"mean": 0, "sqDiff": 0}, "networkMsgs": {"mean": 0, "sqDiff": 0}}, "statistics": {"bytesRead": {"mean": 284.81818181818176, "sqDiff": 3465.636363636355}, "cnt": 11, "commitLat": {"mean": 0.000003469727272727273, "sqDiff": 4.946789218181818E-11}, "idleLat": {"mean": 0, "sqDiff": 0}, "maxRetries": 0, "numRows": {"mean": 1, "sqDiff": 0}, "retryLat": {"mean": 0, "sqDiff": 0}, "rowsRead": {"mean": 1, "sqDiff": 0}, "rowsWritten": {"mean": 0, "sqDiff": 0}, "svcLat": {"mean": 0.0006771060909090909, "sqDiff": 8.91510436082909E-7}}}
-[ RECORD 3 ]
aggregated_ts  | 2023-03-08 23:00:00+00
fingerprint_id | \x37e130a1df20d299
app_name       | $ internal-create-stats
node_id        | 1
agg_interval   | 01:00:00
metadata       | {"stmtFingerprintIDs": ["98828ded59216546"]}
statistics     | {"execution_statistics": {"cnt": 1, "contentionTime": {"mean": 0, "sqDiff": 0}, "cpuSQLNanos": {"mean": 11875, "sqDiff": 0}, "maxDiskUsage": {"mean": 0, "sqDiff": 0}, "maxMemUsage": {"mean": 1.024E+4, "sqDiff": 0}, "mvccIteratorStats": {"blockBytes": {"mean": 0, "sqDiff": 0}, "blockBytesInCache": {"mean": 0, "sqDiff": 0}, "keyBytes": {"mean": 0, "sqDiff": 0}, "pointCount": {"mean": 0, "sqDiff": 0}, "pointsCoveredByRangeTombstones": {"mean": 0, "sqDiff": 0}, "rangeKeyContainedPoints": {"mean": 0, "sqDiff": 0}, "rangeKeyCount": {"mean": 0, "sqDiff": 0}, "rangeKeySkippedPoints": {"mean": 0, "sqDiff": 0}, "seekCount": {"mean": 0, "sqDiff": 0}, "seekCountInternal": {"mean": 0, "sqDiff": 0}, "stepCount": {"mean": 0, "sqDiff": 0}, "stepCountInternal": {"mean": 0, "sqDiff": 0}, "valueBytes": {"mean": 0, "sqDiff": 0}}, "networkBytes": {"mean": 0, "sqDiff": 0}, "networkMsgs": {"mean": 0, "sqDiff": 0}}, "statistics": {"bytesRead": {"mean": 0, "sqDiff": 0}, "cnt": 1, "commitLat": {"mean": 0.000002291, "sqDiff": 0}, "idleLat": {"mean": 0, "sqDiff": 0}, "maxRetries": 0, "numRows": {"mean": 0, "sqDiff": 0}, "retryLat": {"mean": 0, "sqDiff": 0}, "rowsRead": {"mean": 0, "sqDiff": 0}, "rowsWritten": {"mean": 0, "sqDiff": 0}, "svcLat": {"mean": 0.008680208, "sqDiff": 0}}}

Time: 3ms total (execution 2ms / network 1ms)
```

Epic: none

Release note: Added two views to the crdb_internal catalog: crdb_internal.statement_statistics_persisted, which surfaces data in the persisted system.statement_statistics table, and crdb_internal.transaction_statistics_persisted, which surfaces the system.transaction_statistics table.

98422: kvserver: disable {split,replicate,mvccGC} queues until... r=irfansharif a=irfansharif

...subscribed to span configs. Do the same for the store rebalancer. We applied this treatment for the merge queue back in #78122 since the fallback behavior, if not subscribed, is to use the statically defined span config for every operation.

- For the replicate queue, this meant obtusely applying a replication factor of 3, regardless of configuration. This was possible typically post node restart before subscription was initially established. We saw this in #98385. It was possible then for us to ignore configured voter/non-voter/lease constraints.
- For the split queue, we wouldn't actually compute any split keys if unsubscribed, so the missing check was somewhat benign. But we would be obtusely applying the default range sizes [128MiB,512MiB], so for clusters configured with larger range sizes, this could lead to a burst of splitting post node-restart.
- For the MVCC GC queue, it would mean applying the statically configured default GC TTL and ignoring any set protected timestamps. The latter is best-effort protection but could result in internal operations relying on protection (like backups, changefeeds) failing informatively. For clusters configured with GC TTL greater than the default, post node-restart it could lead to a burst of MVCC GC activity and AOST queries failing to find expected data.
- For the store rebalancer, ignoring span configs could result in violating lease preferences and voter constraints.

Fixes #98421.
Fixes #98385.

Release note (bug fix): It was previously possible for CockroachDB to not respect non-default zone configs. This only happened for a short window after nodes with existing replicas were restarted, and self-rectified within seconds. This manifested in a few ways:
- If num_replicas was set to something other than 3, we would still add or remove replicas to get to 3x replication.
  - If num_voters was set explicitly to get a mix of voting and non-voting replicas, it would be ignored. CockroachDB could possibly remove non-voting replicas.
- If range_min_bytes or range_max_bytes were changed from 128 MiB and 512 MiB respectively, we would instead try to size ranges to be within [128 MiB, 512MiB]. This could appear as an excess amount of range splits or merges, as visible in the Replication Dashboard under "Range Operations".
- If gc.ttlseconds was set to something other than 90000 seconds, we would still GC data only older than 90000s/25h. If the GC TTL was set to something larger than 25h, AOST queries going further back may now start failing. For GC TTLs less than the 25h default, clusters would observe increased disk usage due to more retained garbage.
- If constraints, lease_preferences or voter_constraints were set, they would be ignored. Range data and leases would possibly be moved outside where prescribed. This issue only lasted a few seconds post node-restarts, and any zone config violations were rectified shortly after.
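
Sketched abstractly (hypothetical shape, not the actual per-queue code), the generalized gate combines each queue's cluster setting, such as the kv.split_queue.enabled and kv.replicate_queue.enabled settings mentioned above, with the span-config subscription check:

```go
package main

import "fmt"

// queueGate sketches the generalized check: a KV queue only processes
// replicas once the store is subscribed to span configs, and each queue
// can also be toggled via its own non-public cluster setting.
type queueGate struct {
	name           string
	settingEnabled func() bool // reads the per-queue cluster setting
	subscribed     func() bool // kvsubscriber received its initial snapshot
}

func (g queueGate) shouldProcess() bool {
	return g.settingEnabled() && g.subscribed()
}

func main() {
	subscribed := false
	for _, q := range []queueGate{
		{"split", func() bool { return true }, func() bool { return subscribed }},
		{"replicate", func() bool { return true }, func() bool { return subscribed }},
		{"mvccGC", func() bool { return true }, func() bool { return subscribed }},
	} {
		fmt.Println(q.name, q.shouldProcess()) // all false pre-subscription
	}
}
```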

98468: sql: add closest-instance physical planning r=dt a=dt

This changes physical planning, specifically how the SQL instance for a
given KV node ID is resolved, to be more generalized w.r.t. different
locality tier taxonomies.

Previously this function had a special case that checked for, and only
for, a specific locality tier with the key "region" and if it was
found, picked a random instance from the subset of instances where their
value for that matched the value for the KV node.

Matching on and only on the "region" tier is both too specific and not
specific enough: it is "too specific" in that it requires a tier with
the key "region" to be used and to match, and is "not specific enough"
in that it simultaneously ignores more specific locality tiers that
would indicate closer matches (e.g. subregion, AZ, data-center or rack).

Instead, this change generalizes this function to identify the subset of
instances that have the "closest match" in localities to the KV node and
pick one of them, where closest match is defined as the longest matching
prefix of locality tiers. In a simple, single-tier locality taxonomy
using the key "region" this should yield the same behavior as the
previous implementation, as all instances with a matching "region" will
have the same longest matching prefix (at length 1), however this more
general approach should better handle other locality taxonomies that use
more tiers and/or tiers with names other than "region".

Currently this change only applies to physical planning for secondary
tenants until physical planning is unified for system and secondary
tenants.

Release note: none.
Epic: CRDB-16910
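
To illustrate the longest-matching-prefix idea (illustrative types only, not the actual physical-planning code): instances are ranked by how many leading locality tiers they share with the KV node, and the planner picks among the best-ranked subset.

```go
package main

import "fmt"

// tier is a locality key=value pair, e.g. region=us-east1, az=b.
type tier struct{ key, value string }

// matchLen returns the length of the longest common prefix of two
// locality tier lists: more shared tiers means a "closer" instance.
func matchLen(a, b []tier) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// closestInstances generalizes the old region-only special case: it keeps
// the subset of instances whose localities share the longest prefix with
// the KV node's locality.
func closestInstances(node []tier, instances [][]tier) []int {
	best, out := 0, []int(nil)
	for i, inst := range instances {
		switch m := matchLen(node, inst); {
		case m > best:
			best, out = m, []int{i}
		case m == best && m > 0:
			out = append(out, i)
		}
	}
	return out
}

func main() {
	node := []tier{{"region", "us-east1"}, {"az", "b"}}
	instances := [][]tier{
		{{"region", "us-east1"}, {"az", "a"}}, // matches region only
		{{"region", "us-east1"}, {"az", "b"}}, // matches region and az
		{{"region", "eu-west1"}},              // no match
	}
	fmt.Println(closestInstances(node, instances)) // [1]
}
```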


98471: changefeedccl: fix kafka messagetoolarge test failure r=samiskin a=samiskin

Fixes: #93847

This fixes the following bug in the TestChangefeedKafkaMessageTooLarge test setup:
1. The feed starts sending messages, randomly triggering a MessageTooLarge error causing a retry with a smaller batch size
2. Eventually, while the retrying process is still ongoing, all 2000 rows are successfully received by the mock kafka sink, causing assertPayloads to complete, causing the test to closeFeed and run CANCEL on the changefeed.
3. The retrying process gets stuck in sendMessage: it can't send the message to feedCh, which has been closed since the changefeed is trying to close, but it also can't exit on the mock sink's tg.done, since that only closes after the feed fully closes, which in turn requires the retrying process to end.

Release note: None
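
The fix for that kind of shutdown deadlock is the classic select-on-done pattern: the retrying goroutine must be able to abandon the send once teardown begins. A hedged sketch with illustrative names (not the test's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// sendMessage selects on both the feed channel and a done channel that
// closes during teardown, so a retried send can never block forever on
// a feed that no longer has a reader.
func sendMessage(feedCh chan<- string, done <-chan struct{}, msg string) error {
	select {
	case feedCh <- msg:
		return nil
	case <-done:
		return errors.New("sink shutting down; dropping retried message")
	}
}

func main() {
	feedCh := make(chan string) // unbuffered: no reader after cancellation
	done := make(chan struct{})
	close(done) // simulate the changefeed being cancelled
	fmt.Println(sendMessage(feedCh, done, "row"))
}
```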

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Shiranka Miskin <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 16, 2023
...subscribed to span configs. Do the same for the store
rebalancer. We applied this treatment for the merge queue back in cockroachdb#78122
since the fallback behavior, if not subscribed, is to use the statically
defined span config for every operation.

- For the replicate queue, this meant obtusely applying a replication
  factor of 3, regardless of configuration. This was possible typically
  post node restart before subscription was initially established. We
  saw this in cockroachdb#98385. It was possible then for us to ignore configured
  voter/non-voter/lease constraints.
- For the split queue, we wouldn't actually compute any split keys if
  unsubscribed, so the missing check was somewhat benign. But we would
  be obtusely applying the default range sizes [128MiB,512MiB], so for
  clusters configured with larger range sizes, this could lead to a
  burst of splitting post node-restart.
- For the MVCC GC queue, it would mean applying the statically
  configured default GC TTL and ignoring any set protected timestamps.
  The latter is best-effort protection but could result in internal
  operations relying on protection (like backups, changefeeds) failing
  informatively. For clusters configured with GC TTL greater than the
  default, post node-restart it could lead to a burst of MVCC GC
  activity and AOST queries failing to find expected data.
- For the store rebalancer, ignoring span configs could result in
  violating lease preferences and voter/non-voter constraints.

Fixes cockroachdb#98421.
Fixes cockroachdb#98385.

While here, we also introduce the following non-public cluster settings
to selectively enable/disable KV queues:
- kv.mvcc_gc_queue.enabled
- kv.split_queue.enabled
- kv.replicate_queue.enabled

Release note (bug fix): It was previously possible for CockroachDB to
not respect non-default zone configs. This could only happen for a short
window after nodes with existing replicas were restarted (measured in
seconds), and self-rectified (also within seconds). This manifested in a
few ways:
- If num_replicas was set to something other than 3, we would still
  add or remove replicas to get to 3x replication.
  - If num_voters was set explicitly to get a mix of voting and
    non-voting replicas, it would be ignored. CockroachDB could possibly
    remove non-voting replicas.
- If range_min_bytes or range_max_bytes were changed from 128 MiB and
  512 MiB respectively, we would instead try to size ranges to be within
  [128 MiB, 512MiB]. This could appear as an excess amount of range
  splits or merges, as visible in the Replication Dashboard under "Range
  Operations".
- If gc.ttlseconds was set to something other than 90000 seconds, we
  would still GC data only older than 90000s/25h. If the GC TTL was set
  to something larger than 25h, AOST queries going further back may now
  start failing. For GC TTLs less than the 25h default, clusters would
  observe increased disk usage due to more retained garbage.
- If constraints, lease_preferences or voter_constraints were set, they
  would be ignored. Range data and leases would possibly be moved
  outside where prescribed.
This issue lasted a few seconds post node-restarts, and any zone config
violations were rectified shortly after.