
kvstreamer: improve the performance #82159

Closed
Tracked by #54680
yuzefovich opened this issue May 31, 2022 · 7 comments
Assignees: yuzefovich
Labels: C-performance (Perf of queries or internals. Solution not expected to change functional behavior.), T-sql-queries (SQL Queries Team)

Comments


yuzefovich commented May 31, 2022

Currently, the streamer is slower than the old code path in cases where the requests were already parallelized (index joins, as well as lookup joins when the lookup columns form a key). In an index join benchmark I saw about a 20-25% regression at some point. We should improve the performance; my goal is to bring the gap down to about 10%.

Getting to parity is not really feasible since the streamer has to perform additional work and tracking compared with the old code path. However, parity against the 22.1 release might be possible - that is the aspirational goal.

Jira issue: CRDB-16229

@yuzefovich yuzefovich added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label May 31, 2022
@yuzefovich yuzefovich self-assigned this May 31, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label May 31, 2022
@ajwerner (Contributor) commented:

It seems to me that one way the streamer could do better than the code that preceded it is by accumulating more than one batch of input and, in parallel, choosing to send to ranges where it expects a large amount of data. To some extent, I guess this is covered by:

```go
// Currently, enqueuing new requests while there are still requests in progress
// from the previous invocation is prohibited.
// TODO(yuzefovich): lift this restriction and introduce the pipelining.
```

@yuzefovich (Member, Author) commented:

Yeah, I think pipelining of those things essentially implements the idea you're describing. I filed #82163 to track this explicitly. I'm not sure it'll be implemented in the 22.2 timeframe though.

craig bot pushed a commit that referenced this issue Jun 4, 2022
82305: colfetcher: increase input batch size limit when using the Streamer r=yuzefovich a=yuzefovich

This commit bumps up the input batch size limit in the vectorized index
join from 4MiB to 8MiB when the streamer API is used. The old number was
copy-pasted from the row-by-row join reader, and the new number is
derived from running TPCH queries 4, 5, 6, 10, 12, 14, 15, 16 using
`tpchvec/bench`. I did a quick run of the same queries with the
row-by-row engine used by default, and the default of 4MiB there seemed
reasonable, so I didn't change that one.

With the non-streamer code path we didn't want to go higher than 4MiB
since it showed diminishing returns while exposing the cluster to higher
instability (since the non-streamer code path doesn't use the memory
limits in the KV layer). The streamer does use memory limits for
BatchRequests, so it should be safe to increase this limit.

Additionally, this commit makes it so that 1/8th (rather than 1/4th) of
the workmem limit is reserved for the output batch of the cFetcher. With
the default value of 64MiB, this translates to 8MiB, which is large
enough.
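
As a rough illustration of that sizing (hypothetical constant and function names, not the actual colfetcher code):

```go
package main

import "fmt"

const (
	rowByRowInputBatchLimit = 4 << 20 // 4MiB, the pre-existing join reader limit
	streamerInputBatchLimit = 8 << 20 // 8MiB, acceptable since the streamer enforces KV-side memory limits
)

// outputBatchReservation reserves 1/8th (previously 1/4th) of workmem for the
// cFetcher's output batch.
func outputBatchReservation(workmem int64) int64 {
	return workmem / 8
}

func main() {
	workmem := int64(64 << 20) // 64MiB default
	fmt.Println(streamerInputBatchLimit, outputBatchReservation(workmem)) // 8388608 8388608
}
```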

Addresses: #82159.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jun 6, 2022
82422: kvstreamer: clean up Result struct r=yuzefovich a=yuzefovich

This commit cleans up the `Result` struct in order to reduce its memory
size, bringing it down into the 48-byte size class.
```
name                             old time/op    new time/op    delta
IndexJoin/Cockroach-24             6.29ms ± 1%    6.22ms ± 2%   -1.06%  (p=0.024 n=9+9)
IndexJoin/MultinodeCockroach-24    7.99ms ± 1%    7.93ms ± 2%     ~     (p=0.165 n=10+10)

name                             old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24             1.64MB ± 1%    1.48MB ± 0%   -9.25%  (p=0.000 n=9+10)
IndexJoin/MultinodeCockroach-24    2.37MB ± 1%    2.20MB ± 1%   -7.06%  (p=0.000 n=10+10)

name                             old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24              8.15k ± 1%     7.15k ± 1%  -12.28%  (p=0.000 n=8+10)
IndexJoin/MultinodeCockroach-24     12.7k ± 1%     11.7k ± 1%   -8.18%  (p=0.000 n=10+10)
```

The main change of this commit is the removal of the concept of "enqueue
keys" from the Streamer API in favor of relying on `Result.Position`.
When requests are unique, a single `Result` can satisfy only a single
enqueue key; however, for non-unique requests a single `Result` can
satisfy multiple requests and, thus, can have multiple enqueue keys. At
the moment, only unique requests are supported though. Once non-unique
requests are supported too, we'll need to figure out how to handle those
(maybe we'll return a `Result` `N` times if it satisfies `N` original
requests with different values for `Position`).

Also, support for multiple "enqueue keys" was initially envisioned for
the case of `multiSpanGenerator` in the lookup joins (i.e. multi-equality
lookup joins); however, I believe we should push that complexity out of
the streamer (into `TxnKVStreamer`), which is what this commit does.

Other changes done in this commit:
- unexport `ScanResp.Complete` field since this is currently only
used within the `kvstreamer` package
- reorder all existing fields so that the footprint of the struct is
minimized (in particular, `scanComplete` field is moved to the bottom
and `ScanResp` anonymous struct is removed)
- make `subRequestIdx` `int32` rather than `int`. This value is bounded
by the number of ranges in the cluster, so max int32 is more than
sufficient (see the sketch below).
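
A toy sketch of the size-class effect from the last two bullets (made-up fields, not the actual `Result` definition):

```go
package main

import (
	"fmt"
	"unsafe"
)

// resultBefore has the bool first, which forces padding on a 64-bit platform.
type resultBefore struct {
	scanComplete  bool  // 1 byte + 7 bytes of padding before the next field
	memoryTok     int64 // 8 bytes
	subRequestIdx int   // 8 bytes
	position      int32 // 4 bytes + 4 bytes of trailing padding
}

// resultAfter reorders the fields and narrows subRequestIdx to int32.
type resultAfter struct {
	memoryTok     int64 // 8 bytes
	position      int32 // 4 bytes
	subRequestIdx int32 // 4 bytes: bounded by the number of ranges, so int32 suffices
	scanComplete  bool  // 1 byte + 7 bytes of trailing padding
}

func main() {
	fmt.Println(unsafe.Sizeof(resultBefore{}), unsafe.Sizeof(resultAfter{})) // 32 24
}
```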

Addresses: #82159.
Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jun 10, 2022
81565: roachtest: benchmark node decommission r=AlexTalks a=AlexTalks

roachtest: benchmark node decommission

While previously some roachtests existed for the purposes of
testing the decommission process, we have not had any benchmarks to
track how long it takes to decommission a node, making it difficult to
reason about how to understand what makes decommission so slow. This
change adds benchmarks for node decommission under a number of
configurations, including variable numbers of nodes/cpus, TPCC
warehouses, and with admission control enabled vs. disabled.

Some initial runs of the test have shown the following averages:
```
decommissionBench/nodes=4/cpu=16/warehouses=1000: 16m14s
decommissionBench/nodes=4/cpu=16/warehouses=1000/no-admission: 15m48s
decommissionBench/nodes=4/cpu=16/warehouses=1000/while-down: 20m36s
decommissionBench/nodes=8/cpu=16/warehouses=3000: 18m30s
```

Release note: None

82382: kvstreamer: optimize singleRangeBatch.Less r=yuzefovich a=yuzefovich

**bench: add benchmarks for lookup joins**

This commit adds benchmarks for lookup joins, both when the equality
columns are and are not a key, and both with and without maintaining
ordering.

Release note: None

**kvstreamer: optimize singleRangeBatch.Less**

This commit optimizes the `singleRangeBatch.Less` method, which is used
when sorting the requests inside of these objects in the OutOfOrder mode
(needed to get the low-level Pebble speedups), by storing the start keys
explicitly instead of performing a couple of function calls on each
`Less` invocation (see the sketch below).
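
A minimal sketch of the optimization (hypothetical types and fields, not the actual kvstreamer code): the start keys are materialized once before sorting, so `Less` reduces to a single `bytes.Compare`:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type request struct{ span [2][]byte }

// startKey stands in for the function calls the old Less performed on every comparison.
func (r request) startKey() []byte { return r.span[0] }

type singleRangeBatch struct {
	reqs []request
	// reqsKeys[i] caches reqs[i].startKey(); populated once before sorting.
	reqsKeys [][]byte
}

func (r *singleRangeBatch) Len() int { return len(r.reqs) }
func (r *singleRangeBatch) Swap(i, j int) {
	r.reqs[i], r.reqs[j] = r.reqs[j], r.reqs[i]
	r.reqsKeys[i], r.reqsKeys[j] = r.reqsKeys[j], r.reqsKeys[i]
}
func (r *singleRangeBatch) Less(i, j int) bool {
	return bytes.Compare(r.reqsKeys[i], r.reqsKeys[j]) < 0
}

func main() {
	b := &singleRangeBatch{reqs: []request{
		{span: [2][]byte{[]byte("c"), nil}},
		{span: [2][]byte{[]byte("a"), nil}},
	}}
	for _, req := range b.reqs {
		b.reqsKeys = append(b.reqsKeys, req.startKey())
	}
	sort.Sort(b)
	fmt.Printf("%s\n", b.reqsKeys) // [a c]
}
```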

```
name                             old time/op    new time/op    delta
IndexJoin/Cockroach-24             6.30ms ± 1%    5.78ms ± 1%  -8.31%  (p=0.000 n=10+10)
IndexJoin/MultinodeCockroach-24    8.01ms ± 1%    7.51ms ± 1%  -6.28%  (p=0.000 n=10+10)

name                             old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24             1.55MB ± 0%    1.57MB ± 0%  +0.98%  (p=0.000 n=9+10)
IndexJoin/MultinodeCockroach-24    2.28MB ± 2%    2.30MB ± 1%    ~     (p=0.400 n=10+9)

name                             old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24              8.16k ± 1%     8.13k ± 1%    ~     (p=0.160 n=10+10)
IndexJoin/MultinodeCockroach-24     12.7k ± 1%     12.6k ± 0%    ~     (p=0.128 n=10+9)
```
```
name                                                    old time/op    new time/op    delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             6.89ms ± 1%    6.43ms ± 1%  -6.65%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    8.03ms ± 1%    7.48ms ± 2%  -6.92%  (p=0.000 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         9.21ms ± 3%    8.82ms ± 5%  -4.23%  (p=0.007 n=10+10)
LookupJoinNoOrdering/MultinodeCockroach-24                11.9ms ± 3%    11.5ms ± 3%  -3.36%  (p=0.002 n=9+10)

name                                                    old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             1.81MB ± 1%    1.84MB ± 0%  +1.23%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    2.50MB ± 2%    2.54MB ± 1%  +1.76%  (p=0.004 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         1.89MB ± 0%    1.91MB ± 1%  +1.09%  (p=0.000 n=9+9)
LookupJoinNoOrdering/MultinodeCockroach-24                2.37MB ± 2%    2.42MB ± 4%  +1.85%  (p=0.010 n=10+9)

name                                                    old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24              10.8k ± 0%     10.8k ± 1%    ~     (p=0.615 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24     15.1k ± 1%     15.0k ± 0%    ~     (p=0.101 n=10+10)
LookupJoinNoOrdering/Cockroach-24                          13.3k ± 1%     13.3k ± 1%    ~     (p=0.549 n=10+9)
LookupJoinNoOrdering/MultinodeCockroach-24                 17.3k ± 1%     17.3k ± 1%    ~     (p=0.460 n=10+8)
```

Addresses: #82159

Release note: None

82740: build: remove crdb-protobuf-client node_modules with ui-maintainer-clean r=maryliag,rickystewart a=sjbarag

Since version 33 [1], `dev ui clean --all` removes the
pkg/ui/workspaces/db-console/src/js/node_modules tree. Remove that tree
with `make ui-maintainer-clean` to keep parity between the two build
systems.

[1] 2e9e7a5 (dev: bump to version 33, 2022-05-27)

Release note: None

82744: ui: update cluster-ui to v22.2.0-prerelease-2 r=maryliag a=maryliag

Update cluster-ui to the latest published version.

Release note: None

82748: ci: skip Docker test in CI r=ZhouXing19 a=rickystewart

This has been flaky for a while, skipping until we have more information
about what's going on here.

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Sean Barag <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
craig bot pushed a commit that referenced this issue Jun 23, 2022
83166: roachtest: add streamer config to tpchvec r=yuzefovich a=yuzefovich

**roachtest: disable merge queue in some TPCH tests**

This commit disables the merge queue during most of the TPCH tests. Some
of these tests are performance oriented, so we want to keep as many
things constant as possible. Also, having more ranges around gives us
better testing coverage of the distributed query execution.

Release note: None

**roachtest: add streamer config to tpchvec**

This commit refactors the `tpchvec` roachtest to introduce a new
`streamer` config which runs all TPCH queries with the streamer ON and
OFF by default. This is used to make it easier to track the performance
of the streamer as well as to catch regressions in performance in the
future (since the test will fail if OFF config is significantly faster
than ON config). At the moment, the test will fail if the streamer ON
config is slower by at least 3x than the OFF config, but over time
I plan to reducing that threshold.

Informs: #82159.

Release note: None

**roachtest: refactor tpchvec a bit**

This commit refactors `tpchvec` roachtest so that queries run in the
query-major order rather than the config-major order. Previously, we
would perform the cluster setup, run all queries on that setup, then
perform the setup for the second test config, run all queries again,
and then analyze the results. However, I believe for perf-oriented
tests it's better to run each query on all configs right away (so
that the chance of range movement is relatively low), and this
commit makes such a change. This required the removal of the
`perf_no_stats` test config (which probably wasn't adding much value).

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>

yuzefovich commented Jun 23, 2022

Here is the comparison of the streamer OFF vs streamer ON (on master with #82384, #82387, #82865, #83010, #83472 cherry-picked):

```
(streamer OFF / streamer ON / delta)
Q1: 3.66s	3.69s	0.75%
Q2: 0.27s	0.24s	-10.66%
Q3: 2.54s	2.04s	-19.65%
Q4: 2.36s	1.43s	-39.43%
Q5: 3.38s	2.44s	-27.71%
Q6: 9.40s	8.18s	-12.98%
Q7: 8.71s	6.43s	-26.13%
Q8: 1.37s	0.95s	-30.46%
Q9: 7.88s	5.22s	-33.73%
Q10: 2.56s	1.64s	-36.15%
Q11: 0.75s	0.66s	-11.59%
Q12: 10.21s	8.35s	-18.22%
Q13: 1.09s	1.08s	-0.63%
Q14: 0.78s	0.68s	-11.80%
Q15: 6.64s	6.54s	-1.43%
Q16: 1.10s	0.94s	-14.58%
Q17: 0.57s	0.59s	3.15%
Q18: 6.07s	6.06s	-0.20%
Q19: 3.79s	3.78s	-0.19%
Q20: 17.11s	14.27s	-16.62%
Q21: 10.19s	6.00s	-41.08%
Q22: 0.74s	0.66s	-10.61%
```

craig bot pushed a commit that referenced this issue Jun 29, 2022
83472: kvstreamer: improve the avg response size heuristic r=yuzefovich a=yuzefovich

This commit improves the heuristic we use for estimating the average
response size. Previously, we used a simple average; now we multiply
the average by 1.5.

Having this multiple is good for a couple of reasons:
- this allows us to fulfill requests that are slightly larger than the
current average. For example, imagine that we're processing three
requests of sizes 100B, 110B, and 120B sequentially, one at a time.
Without the multiple, after the first request, our estimate would be 100B
so the second request would come back empty (with ResumeNextBytes=110),
so we'd have to re-issue the second request. At that point the average is
105B, so the third request would again come back empty and need to be
re-issued with larger TargetBytes. Having the multiple allows us to
handle such a scenario without any requests coming back empty. In
particular, TPCH Q17 has a similar setup.
- this allows us to slowly grow the TargetBytes parameter over time when
requests can be returned partially multiple times (i.e. Scan requests
spanning multiple rows). Consider a case when a single Scan request has
to return 1MB worth of data, but each row is only 100B. With the initial
estimate of 1KB, every request would always come back with exactly 10
rows, and the avg response size would always stay at 1KB. We'd end up
issuing 1000 such requests. Having a multiple here allows us to grow
the estimate over time, reducing the total number of requests needed.

This multiple seems to fix the remaining perf regression on Q17 when
comparing against the streamer OFF config.

This commit also introduces a cluster setting that controls this
multiple. The value of 1.5 was chosen using `tpchvec/bench` and this
setting.
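
A minimal sketch of the heuristic (hypothetical names; the multiple and the 1KiB initial estimate mirror the values discussed above):

```go
package main

import "fmt"

const avgResponseSizeMultiple = 1.5 // hypothetical stand-in for the cluster setting

type avgResponseEstimator struct {
	totalBytes   float64
	numResponses float64
}

func (e *avgResponseEstimator) update(bytes float64) {
	e.totalBytes += bytes
	e.numResponses++
}

// targetBytes is what would be used as the TargetBytes limit of the next request.
func (e *avgResponseEstimator) targetBytes(initialEstimate float64) float64 {
	if e.numResponses == 0 {
		return initialEstimate
	}
	return e.totalBytes / e.numResponses * avgResponseSizeMultiple
}

func main() {
	var e avgResponseEstimator
	fmt.Println(e.targetBytes(1024)) // 1024: the 1KiB initial estimate
	for _, sz := range []float64{100, 110, 120} {
		e.update(sz)
	}
	// 165 = avg(110) * 1.5, large enough to fulfill the 120B request without a retry.
	fmt.Println(e.targetBytes(1024))
}
```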

Additionally, I introduced a similar cluster setting for the initial avg
response size estimate (currently hard-coded at 1KiB) and used
`tpchvec/bench`, which showed that the 1KiB value is pretty good. It was
also the value mentioned in the RFC, so I decided to remove the
corresponding setting.

Addresses: #82159.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jul 1, 2022
83197: kvserver: bump the in-memory GC threshold as a pre-apply side effect r=aayushshah15 a=aayushshah15

This commit changes where the in-memory GC threshold on a Replica is
bumped during the application of a GC request.

Previously, the in-memory GC threshold was bumped as a post-apply side
effect. Additionally, GC requests do not acquire latches at the
timestamp that they are garbage collecting, and thus, readers need to
take additional care to ensure that results aren't served off of a
partial state.

Readers today rely on the invariant that the in-memory GC threshold is
bumped before the actual garbage collection. Today this holds because
the mvccGCQueue issues GC requests in 2 phases: the first simply
bumps the in-memory GC threshold, and the second performs the
actual garbage collection. If the in-memory GC threshold were bumped in
the pre-apply phase of command application, this usage quirk wouldn't
need to exist. That's what this commit does.

Relates to #55293

Release note: None



83659: kvstreamer: optimize the results buffers a bit r=yuzefovich a=yuzefovich

**kvstreamer: make InOrder-mode structs implement heap logic on their own**

This commit refactors the `inOrderRequestsProvider` as well as the
`inOrderResultsBuffer` to make them maintain the min heap over the
requests and results, respectively, on their own. This allows us to
avoid allocations for `interface{}` objects that occur when using
`heap.Interface`. The code was copied, with minor adjustments, from the
standard library.
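
A minimal sketch of such a specialized heap (hypothetical `result` type, not the actual buffer code); the sift operations work directly on a typed slice, so nothing is boxed into `interface{}`:

```go
package main

import "fmt"

type result struct{ position int }

type resultsHeap []result

func (h resultsHeap) less(i, j int) bool { return h[i].position < h[j].position }

func (h *resultsHeap) push(r result) {
	*h = append(*h, r)
	// Sift the new element up to its place.
	i := len(*h) - 1
	for i > 0 {
		parent := (i - 1) / 2
		if !h.less(i, parent) {
			break
		}
		(*h)[i], (*h)[parent] = (*h)[parent], (*h)[i]
		i = parent
	}
}

func (h *resultsHeap) pop() result {
	old := *h
	n := len(old)
	old[0], old[n-1] = old[n-1], old[0]
	r := old[n-1]
	*h = old[:n-1]
	// Sift the element now at the root down to its place.
	i := 0
	for {
		l, smallest := 2*i+1, i
		if l < n-1 && h.less(l, smallest) {
			smallest = l
		}
		if rgt := l + 1; rgt < n-1 && h.less(rgt, smallest) {
			smallest = rgt
		}
		if smallest == i {
			break
		}
		(*h)[i], (*h)[smallest] = (*h)[smallest], (*h)[i]
		i = smallest
	}
	return r
}

func main() {
	var h resultsHeap
	for _, p := range []int{3, 1, 2} {
		h.push(result{position: p})
	}
	fmt.Println(h.pop(), h.pop(), h.pop()) // {1} {2} {3}
}
```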
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.83ms ± 3%    8.58ms ± 3%   -2.82%  (p=0.002 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.3ms ± 1%    10.1ms ± 2%   -1.86%  (p=0.009 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.7ms ± 5%    10.5ms ± 3%     ~     (p=0.123 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                14.0ms ± 5%    13.6ms ± 5%   -3.16%  (p=0.043 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.77MB ± 2%    1.59MB ± 1%   -9.94%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.57MB ± 3%    2.38MB ± 1%   -7.57%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.67MB ± 0%    1.50MB ± 1%  -10.08%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                2.53MB ± 2%    2.34MB ± 2%   -7.78%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              12.5k ± 1%     10.4k ± 1%  -16.54%  (p=0.000 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     17.6k ± 0%     15.5k ± 0%  -11.64%  (p=0.000 n=8+10)
LookupJoinOrdering/Cockroach-24                          14.4k ± 1%     12.4k ± 1%  -14.17%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                 20.6k ± 1%     18.5k ± 1%  -10.19%  (p=0.000 n=10+9)
```

Addresses: #82159.

Release note: None

**kvstreamer: refactor the OutOfOrder requests provider a bit**

This commit refactors the `requestsProvider` interface, renaming a couple
of methods - `s/firstLocked/nextLocked/` and
`s/removeFirstLocked/removeNextLocked/`. The idea is that for the
InOrder mode "first" == "next", but for the OutOfOrder mode we can pick
an arbitrary request that is not necessarily "first". This commit then
makes the OutOfOrder requests provider pick the last request in its
queue, which should reduce memory usage in case resume requests are
added later.

Previously, we would slice "next" requests off the front and append to
the end, and it could trigger reallocation of the underlying slice. Now,
we'll be updating only the "tail" of the queue, and since a single
request can result in at most one resume request, the initial slice
provided in `enqueue` will definitely have enough capacity.
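
A minimal sketch of the tail-picking idea (hypothetical types, not the actual requests provider):

```go
package main

import "fmt"

type singleRangeBatch struct{ id int }

type outOfOrderRequestsProvider struct {
	requests []singleRangeBatch
}

// nextLocked returns the request to be processed next; for the OutOfOrder
// provider any request will do, so pick the last one.
func (p *outOfOrderRequestsProvider) nextLocked() singleRangeBatch {
	return p.requests[len(p.requests)-1]
}

func (p *outOfOrderRequestsProvider) removeNextLocked() {
	p.requests = p.requests[:len(p.requests)-1]
}

// addLocked adds a resume request; since each in-flight request produces at
// most one resume request, re-appending never exceeds the original capacity.
func (p *outOfOrderRequestsProvider) addLocked(r singleRangeBatch) {
	p.requests = append(p.requests, r)
}

func main() {
	p := &outOfOrderRequestsProvider{requests: []singleRangeBatch{{1}, {2}, {3}}}
	next := p.nextLocked()
	p.removeNextLocked()
	p.addLocked(singleRangeBatch{id: next.id + 10}) // pretend it needed a resume request
	fmt.Println(p.requests)                         // [{1} {2} {13}]
}
```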

Release note: None

**kvstreamer: reduce allocations in the InOrder results buffer**

This commit improves the InOrder results buffer by reusing the same
scratch space for returning the results on `get()` calls, as well as by
reducing the size of the `inOrderBufferedResult` struct from the 80-byte
size class to the 64-byte size class.

It also cleans up a couple of comments and clarifies the `GetResults`
method a bit.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.47ms ± 3%    8.54ms ± 2%    ~     (p=0.182 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.2ms ± 2%    10.1ms ± 2%    ~     (p=0.280 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.4ms ± 3%    10.3ms ± 4%    ~     (p=0.393 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                13.8ms ± 5%    13.7ms ± 5%    ~     (p=0.739 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.59MB ± 2%    1.49MB ± 1%  -6.20%  (p=0.000 n=10+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.36MB ± 2%    2.27MB ± 2%  -3.49%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.51MB ± 1%    1.42MB ± 1%  -6.27%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                2.37MB ± 6%    2.26MB ± 2%  -4.77%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              10.4k ± 1%     10.3k ± 1%    ~     (p=0.055 n=8+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     15.5k ± 0%     15.5k ± 1%  -0.34%  (p=0.037 n=10+10)
LookupJoinOrdering/Cockroach-24                          12.4k ± 1%     12.3k ± 1%  -0.98%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                 18.5k ± 1%     18.5k ± 2%    ~     (p=0.743 n=8+9)
```

Addresses: #82160.

Release note: None

83705: teamcity: `grep` for pebble version in `go.mod` rather than `DEPS.bzl` r=nicktrav,jbowens a=rickystewart

The `grep` can fail in the case that there is no more recent version of
`pebble` than the one that is vendored in-tree. This fixes that case.

Release note: None

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
craig bot pushed a commit that referenced this issue Jul 6, 2022
82865: kvcoord: optimize batch truncation loop r=yuzefovich a=yuzefovich

**kvcoord: truncate BatchRequest to range boundaries serially**

This commit refactors the DistSender's loop of iterating over the range
descriptors so that the truncation of the BatchRequest happens
serially. This incurs a minor performance hit when the requests are sent
in parallel, but it makes it possible to apply the optimizations to this
iteration in the following commits.

Release note: None

**kvcoord: merge truncate_test into batch_test**

This commit moves `TestTruncate` as well as `BenchmarkTruncate` into
`batch_test.go` file in the preparation for the refactor done in the
following commit. The code itself wasn't changed at all.

Release note: None

**kvcoord: introduce batch truncation helper**

This commit introduces a batch truncation helper that encompasses the
logic of truncating requests to the boundaries of a single range as well
as returning the next key to seek the range iterator to. The helper is
now used both in the DistSender and in the Streamer. No modification to
the actual logic of the `Truncate`, `Next`, or `prev` functions has been
made other than incorporating the return of the next seek key into the
`Truncate` function itself. This is needed since the following commit
tightly couples the truncation process with the next seek key
determination in order to optimize it.

The helper can be configured with a knob indicating whether `Truncate`
needs to return requests in the original order. This behavior is
necessitated by BatchRequests that contain writes, since in several
spots we rely on ordering assumptions (e.g. increasing values of
`Sequence`).

The following adjustments were made to the tests:
- `BenchmarkTruncate` has been renamed to `BenchmarkTruncateLegacy`
- `TestTruncate` has been refactored to exercise the new and the old
code-paths
- `TestBatchPrevNext` has been refactored to run through the new code
path; a few test cases have also been adjusted slightly.

This commit also introduces some unit tests for the new code path when
it runs in a loop over multiple ranges as well as a corresponding
benchmark.

Release note: None

**kvcoord: optimize batch truncation loop**

This commit optimizes the batch truncation loop for the case when the
requests only use global keys.

The optimized approach sorts the requests according to their start keys
(with the Ascending scan direction) or in the reverse order of their end
keys (with the Descending scan direction).  Then on each `Truncate` call
it looks only at a subset of the requests (that haven't been fully
processed yet and don't start after the current range), allowing us to
avoid many unnecessary key comparisons. Please see a comment on the new
`truncateAsc` and `truncateDesc` functions for more details and examples.

The optimized approach actually has a worse time complexity in the
absolute worst case (when each request is a range-spanning one and
actually spans all of the ranges against which the requests are
truncated) - because of the need to sort the requests upfront - but in
practice, it is much faster, especially with point requests.
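
A minimal sketch of the ascending strategy (hypothetical: point requests modeled as plain string keys rather than the actual request spans):

```go
package main

import (
	"fmt"
	"sort"
)

type truncateHelper struct {
	keys     []string // start keys of point requests, sorted ascending
	startIdx int      // requests before this index were fully processed
}

func newTruncateHelper(keys []string) *truncateHelper {
	sort.Strings(keys)
	return &truncateHelper{keys: keys}
}

// truncate returns the requests that fall into [rangeStart, rangeEnd).
func (h *truncateHelper) truncate(rangeStart, rangeEnd string) []string {
	var out []string
	i := h.startIdx
	// Stop at the first request that starts at or beyond the range's end key;
	// requests before startIdx were already handed out for previous ranges.
	for ; i < len(h.keys) && h.keys[i] < rangeEnd; i++ {
		if h.keys[i] >= rangeStart {
			out = append(out, h.keys[i])
		}
	}
	h.startIdx = i
	return out
}

func main() {
	h := newTruncateHelper([]string{"d", "a", "f", "b"})
	fmt.Println(h.truncate("a", "c")) // [a b]
	fmt.Println(h.truncate("c", "e")) // [d]
	fmt.Println(h.truncate("e", "z")) // [f]
}
```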

```
name                                                               old time/op    new time/op    delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                       59.8µs ± 1%    33.9µs ± 3%   -43.38%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                      83.0µs ± 4%    69.7µs ± 7%   -15.98%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                       865µs ± 1%      62µs ± 1%   -92.84%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                     1.07ms ± 5%    0.46ms ± 8%   -56.95%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                     7.09ms ± 0%    5.99ms ± 1%   -15.56%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                    9.22ms ± 0%   10.52ms ± 1%   +14.08%  (p=0.000 n=9+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                     108ms ± 0%       6ms ± 2%   -94.50%  (p=0.000 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                    129ms ± 1%      71ms ± 1%   -45.06%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24         60.2µs ± 1%    36.0µs ± 2%   -40.14%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24        82.4µs ± 8%    72.1µs ± 4%   -12.51%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24         862µs ± 1%      72µs ± 1%   -91.65%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24       1.06ms ± 4%    0.49ms ± 8%   -53.39%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24       7.24ms ± 0%    6.46ms ± 1%   -10.74%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24      10.1ms ± 2%     9.8ms ± 2%    -2.36%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24       107ms ± 0%       7ms ± 0%   -93.57%  (p=0.000 n=8+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24      122ms ± 0%      80ms ± 1%   -34.32%  (p=0.000 n=9+9)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                      78.9µs ± 1%    36.4µs ± 3%   -53.81%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                     79.4µs ± 4%    52.3µs ± 5%   -34.14%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                     1.16ms ± 1%    0.07ms ± 1%   -94.39%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                    1.01ms ± 4%    0.46ms ± 5%   -54.56%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                    9.42ms ± 0%    6.26ms ± 1%   -33.52%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                   8.41ms ± 1%    9.22ms ± 1%    +9.54%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                    145ms ± 0%       6ms ± 1%   -95.63%  (p=0.000 n=9+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                   125ms ± 1%      67ms ± 1%   -46.31%  (p=0.000 n=9+9)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24        77.8µs ± 1%    39.6µs ± 2%   -49.10%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24       74.0µs ± 3%    63.0µs ± 6%   -14.92%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24       1.16ms ± 1%    0.08ms ± 1%   -93.47%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24      1.04ms ± 5%    0.47ms ± 7%   -54.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24      9.50ms ± 0%    6.73ms ± 1%   -29.21%  (p=0.000 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24     8.88ms ± 1%   13.24ms ± 1%   +49.04%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24      146ms ± 0%       7ms ± 1%   -94.98%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24     125ms ± 1%      74ms ± 1%   -40.75%  (p=0.000 n=10+9)

name                                                               old alloc/op   new alloc/op   delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                       7.58kB ± 0%   21.00kB ±11%  +176.84%  (p=0.000 n=7+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                      39.8kB ± 6%    49.1kB ±15%   +23.46%  (p=0.000 n=9+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                      6.48kB ± 5%   18.25kB ± 2%  +181.79%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                      428kB ±20%     368kB ±13%   -13.85%  (p=0.003 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                     1.60MB ± 0%    2.91MB ± 0%   +82.49%  (p=0.000 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                    5.24MB ± 0%    5.89MB ± 0%   +12.41%  (p=0.000 n=10+8)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                    1.15MB ± 4%    2.41MB ± 1%  +110.09%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                   69.8MB ± 1%    64.6MB ± 1%    -7.55%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24         9.77kB ±22%   21.98kB ± 4%  +125.07%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24        38.9kB ±23%    49.9kB ± 2%   +28.28%  (p=0.000 n=10+8)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24        6.56kB ± 4%   20.20kB ± 3%  +208.11%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24        407kB ±15%     372kB ±13%    -8.68%  (p=0.043 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24       1.65MB ± 0%    3.62MB ± 0%  +118.55%  (p=0.000 n=8+8)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24      6.60MB ± 2%    5.65MB ± 1%   -14.38%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24      1.10MB ± 5%    2.77MB ± 1%  +152.58%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24     60.5MB ± 1%    67.9MB ± 1%   +12.19%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                      17.2kB ±10%    21.5kB ± 1%   +24.91%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                     35.5kB ±12%    29.1kB ± 4%   -17.83%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                      138kB ± 1%      20kB ± 3%   -85.34%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                     344kB ±15%     363kB ±10%    +5.50%  (p=0.035 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                    2.78MB ± 0%    3.35MB ± 0%   +20.24%  (p=0.000 n=8+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                   4.42MB ± 6%    5.29MB ± 0%   +19.67%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                   17.9MB ± 0%     2.7MB ± 2%   -85.21%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                  65.3MB ± 0%    61.0MB ± 1%    -6.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24        15.9kB ± 3%    26.7kB ± 1%   +67.87%  (p=0.000 n=10+9)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24       29.4kB ± 6%    41.6kB ± 5%   +41.50%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24        138kB ± 0%      23kB ± 3%   -83.61%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24       390kB ±19%     350kB ±11%   -10.16%  (p=0.015 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24      2.69MB ± 4%    3.51MB ± 1%   +30.22%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24     4.89MB ± 1%    8.19MB ± 0%   +67.68%  (p=0.000 n=10+9)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24     17.9MB ± 0%     3.0MB ± 2%   -83.34%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24    65.4MB ± 1%    62.9MB ± 1%    -3.81%  (p=0.000 n=10+10)

name                                                               old allocs/op  new allocs/op  delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                         50.0 ± 0%      52.4 ±10%    +4.80%  (p=0.017 n=8+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                         569 ±12%       557 ±15%      ~     (p=0.617 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                         207 ± 4%       210 ± 5%      ~     (p=0.380 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                      6.97k ±13%     6.02k ± 9%   -13.64%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                        126 ± 0%       122 ± 0%    -3.17%  (p=0.002 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                     51.9k ± 1%     42.8k ± 1%   -17.59%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                     1.12k ± 1%     1.12k ± 1%    +0.43%  (p=0.027 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                     786k ± 1%      714k ± 1%    -9.13%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24           41.2 ± 3%      58.0 ± 3%   +40.78%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24           574 ±18%       532 ±11%      ~     (p=0.143 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24           205 ± 2%       234 ± 3%   +14.26%  (p=0.000 n=9+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24        6.70k ± 9%     6.00k ± 9%   -10.40%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24          127 ± 0%       125 ± 0%    -1.57%  (p=0.001 n=8+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24       63.7k ± 1%     27.8k ± 2%   -56.34%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24       1.14k ± 0%     1.14k ± 1%      ~     (p=0.515 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24       696k ± 1%      752k ± 1%    +7.97%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                         554 ± 1%       169 ± 2%   -69.52%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                        519 ± 9%       268 ± 9%   -48.32%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                      8.38k ± 0%     0.33k ± 3%   -96.06%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                     5.90k ±10%     5.89k ± 5%      ~     (p=0.796 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                     65.7k ± 0%     16.5k ± 0%   -74.87%  (p=0.002 n=8+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                    39.5k ± 1%     27.8k ± 2%   -29.54%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                    1.05M ± 0%     0.02M ± 0%   -98.33%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                    741k ± 0%      679k ± 1%    -8.32%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24           559 ± 0%       182 ± 2%   -67.42%  (p=0.000 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24          438 ± 8%       404 ±13%    -7.76%  (p=0.014 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24        8.39k ± 0%     0.36k ± 5%   -95.75%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24       6.38k ±11%     5.56k ± 9%   -12.85%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24       65.7k ± 0%     16.5k ± 0%   -74.84%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24      46.8k ± 1%     67.6k ± 1%   +44.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24      1.05M ± 0%     0.02M ± 0%   -98.33%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24      739k ± 1%      694k ± 1%    -6.08%  (p=0.000 n=10+10)
```

The truncation loops for the optimized strategy are very similar in both
directions, so I tried to extract the differences out into an interface.
However, this showed a non-trivial slowdown and an increase in
allocations, so I chose to keep some duplicated code to get the best
performance. Here is a snippet of the comparison when the interface was
prototyped:

```
name                                                 old time/op    new time/op    delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24        36.9µs ± 3%    44.5µs ± 3%  +20.55%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24       74.8µs ± 5%    88.8µs ± 3%  +18.73%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24       64.9µs ± 1%    78.3µs ± 1%  +20.72%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       471µs ± 8%     682µs ±13%  +44.73%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24      6.34ms ± 1%    7.39ms ± 0%  +16.47%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24     11.2ms ± 1%    12.4ms ± 1%  +10.36%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24     6.40ms ± 2%    7.39ms ± 1%  +15.47%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24    70.9ms ± 1%   102.0ms ± 2%  +43.87%  (p=0.000 n=9+9)

name                                                 old alloc/op   new alloc/op   delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24        22.2kB ± 9%    30.4kB ± 0%  +36.55%  (p=0.000 n=10+7)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24       52.2kB ± 5%    67.6kB ± 4%  +29.47%  (p=0.000 n=8+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24       20.0kB ± 2%    32.2kB ± 1%  +60.86%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       372kB ±13%     600kB ±10%  +61.29%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24      3.24MB ± 0%    4.45MB ± 0%  +37.42%  (p=0.000 n=8+7)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24     6.61MB ± 0%    7.86MB ± 0%  +18.90%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24     2.75MB ± 2%    3.74MB ± 1%  +36.03%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24    65.7MB ± 1%    97.2MB ± 1%  +47.95%  (p=0.000 n=10+10)

name                                                 old allocs/op  new allocs/op  delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24           177 ± 2%       314 ± 0%  +77.40%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24          597 ± 8%       847 ± 8%  +41.89%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24          329 ± 3%       531 ± 2%  +61.40%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       6.02k ± 9%     9.56k ± 6%  +58.80%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24       16.5k ± 0%     32.9k ± 0%  +99.17%  (p=0.000 n=8+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24      53.5k ± 1%     73.1k ± 1%  +36.69%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24      17.5k ± 0%     33.9k ± 0%  +94.02%  (p=0.000 n=6+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24      727k ± 1%     1194k ± 1%  +64.31%  (p=0.000 n=10+10)
```

An additional knob is introduced to the batch truncation helper to
indicate whether the helper can take ownership of the passed-in requests
slice and reorder it as it pleases. This is the case for the Streamer,
but the DistSender relies on the order of requests not being modified,
so the helper makes a copy of the slice.

Some tests needed an adjustment since now we process requests not
necessarily in the original order, so the population of ResumeSpans
might be different.

Fixes: #68536.
Addresses: #82159.

Release note: None

83358: sql: initialize sql instance during instance provider start r=ajwerner a=rharding6373

Before this change, there was a race condition where the instance
provider and the instance reader would start before the instance
provider created a SQL instance, potentially causing the reader to not
cache the instance before initialization was complete. This is
a problem in multi-tenant environments, where we may not be able to plan
queries if the reader does not know of any SQL instances.

This change moves sql instance initialization into the instance
provider's `Start()` function before starting the reader, so that the
instance is guaranteed to be visible to the reader on its first
rangefeed scan of the `system.sql_instances` table.

Fixes: #82706
Fixes: #81807
Fixes: #81567

Release note: None

83683: kvstreamer: perform more memory accounting r=yuzefovich a=yuzefovich

**kvstreamer: perform more memory accounting**

This commit performs memory accounting for more of the internal state of
the streamer. In particular, it adds accounting for the overhead of
`Result`s in the results buffer (previously, we would only account for
the size of the Get or the Scan response but not for the `Result` struct
itself). It also adds accounting for all slices that are `O(N)` in size
where `N` is the number of requests.

Addresses: #82160.

Release note: None

**kvstreamer: account for the overhead of each request**

Previously, we didn't account for some of the overhead of Get and Scan
requests, and this commit fixes that.

Here is the list of everything that needs to be accounted for with
a single Get request:
- (if referenced by `[]roachpb.RequestUnion`) the overhead of the
`roachpb.RequestUnion` object
- the overhead of `roachpb.RequestUnion_Get` object
- the overhead of `roachpb.GetRequest` object
- the footprint of the key inside of `roachpb.GetRequest`.

Previously, we only accounted for the first and the fourth items, and
now we account for all of them. Additionally, we used the auto-generated
`Size` method for the fourth item, which I believe represents the size
of the serialized protobuf message (possibly compressed - I'm not sure),
but we're interested in the capacities of the underlying slices (see the
sketch below).
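
A minimal, self-contained sketch of the accounting principle (hypothetical types standing in for the roachpb definitions): fixed struct overheads plus the capacity, not the length, of the key's backing slice:

```go
package main

import (
	"fmt"
	"unsafe"
)

type getRequest struct {
	Key []byte // the only variable-size part of the request
}

type requestUnion struct {
	get *getRequest
}

// getRequestFootprint mirrors the accounting described above: fixed struct
// overheads plus the capacity of the underlying key slice.
func getRequestFootprint(r requestUnion) int64 {
	return int64(unsafe.Sizeof(r)) + // the union wrapper
		int64(unsafe.Sizeof(*r.get)) + // the request struct itself
		int64(cap(r.get.Key)) // the key's backing array
}

func main() {
	key := make([]byte, 10, 64) // capacity, not length, is what we pay for
	r := requestUnion{get: &getRequest{Key: key}}
	fmt.Println("accounted bytes:", getRequestFootprint(r))
}
```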

Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: rharding6373 <[email protected]>

yuzefovich commented Jul 6, 2022

Here is the comparison of the streamer OFF vs the streamer ON on micro-benchmarks (with #84607):

```
name                                                    old time/op    new time/op    delta
IndexJoin/Cockroach-24                                    5.26ms ± 1%    5.69ms ± 1%   +8.04%  (p=0.000 n=10+10)
IndexJoin/MultinodeCockroach-24                           6.96ms ± 1%    7.36ms ± 1%   +5.76%  (p=0.000 n=10+9)
IndexJoinColumnFamilies/Cockroach-24                      7.92ms ± 3%    8.27ms ± 3%   +4.42%  (p=0.000 n=10+10)
IndexJoinColumnFamilies/MultinodeCockroach-24             11.3ms ± 5%    11.8ms ± 7%   +5.10%  (p=0.003 n=10+10)
IndexJoinOrdering/Cockroach-24                            6.38ms ± 2%    6.90ms ± 3%   +8.02%  (p=0.000 n=10+10)
IndexJoinOrdering/MultinodeCockroach-24                   8.85ms ± 4%    9.55ms ± 3%   +7.95%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             5.95ms ± 2%    6.37ms ± 1%   +7.18%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    7.26ms ± 2%    7.68ms ± 1%   +5.76%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               6.07ms ± 1%    6.47ms ± 2%   +6.47%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      7.43ms ± 1%    7.77ms ± 2%   +4.57%  (p=0.000 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         8.38ms ± 3%    8.80ms ± 4%   +5.04%  (p=0.000 n=10+10)
LookupJoinNoOrdering/MultinodeCockroach-24                11.4ms ± 5%    12.0ms ± 4%   +5.76%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                           8.49ms ± 4%    8.91ms ± 3%   +4.94%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                  11.7ms ± 4%    11.9ms ± 5%     ~     (p=0.218 n=10+10)

name                                                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24                                     926kB ± 1%    1198kB ± 1%  +29.27%  (p=0.000 n=10+10)
IndexJoin/MultinodeCockroach-24                           1.60MB ± 3%    1.90MB ± 2%  +18.16%  (p=0.000 n=10+10)
IndexJoinColumnFamilies/Cockroach-24                       974kB ± 1%    1259kB ± 1%  +29.26%  (p=0.000 n=10+9)
IndexJoinColumnFamilies/MultinodeCockroach-24             1.81MB ± 3%    2.13MB ± 1%  +17.22%  (p=0.000 n=10+9)
IndexJoinOrdering/Cockroach-24                             921kB ± 1%    1268kB ± 1%  +37.59%  (p=0.000 n=9+9)
IndexJoinOrdering/MultinodeCockroach-24                   1.61MB ± 2%    1.98MB ± 4%  +22.62%  (p=0.000 n=10+9)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             1.28MB ± 1%    1.55MB ± 1%  +20.96%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    2.01MB ± 3%    2.24MB ± 2%  +10.97%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               1.33MB ± 1%    1.60MB ± 1%  +20.23%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      2.08MB ± 4%    2.25MB ± 3%   +8.32%  (p=0.000 n=9+9)
LookupJoinNoOrdering/Cockroach-24                         1.35MB ± 1%    1.62MB ± 1%  +20.28%  (p=0.000 n=9+10)
LookupJoinNoOrdering/MultinodeCockroach-24                1.97MB ± 3%    2.18MB ± 3%  +10.86%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                           1.38MB ± 1%    1.65MB ± 1%  +19.80%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                  2.02MB ± 3%    2.22MB ± 2%   +9.72%  (p=0.000 n=10+9)

name                                                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24                                     6.11k ± 0%     6.24k ± 1%   +2.00%  (p=0.000 n=7+10)
IndexJoin/MultinodeCockroach-24                            10.8k ± 2%     10.9k ± 2%     ~     (p=0.225 n=10+10)
IndexJoinColumnFamilies/Cockroach-24                       9.91k ± 1%    10.98k ± 1%  +10.72%  (p=0.000 n=10+9)
IndexJoinColumnFamilies/MultinodeCockroach-24              15.5k ± 1%     16.6k ± 1%   +6.93%  (p=0.000 n=10+9)
IndexJoinOrdering/Cockroach-24                             4.18k ± 2%     4.26k ± 2%   +1.79%  (p=0.008 n=9+9)
IndexJoinOrdering/MultinodeCockroach-24                    9.65k ± 2%     9.90k ± 2%   +2.52%  (p=0.003 n=10+8)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24              8.85k ± 2%     8.96k ± 1%   +1.28%  (p=0.005 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24     13.0k ± 1%     13.2k ± 1%   +1.36%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24                9.89k ± 1%    10.02k ± 3%   +1.29%  (p=0.001 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24       14.1k ± 1%     14.3k ± 1%   +0.88%  (p=0.002 n=8+10)
LookupJoinNoOrdering/Cockroach-24                          11.1k ± 1%     11.3k ± 1%   +2.16%  (p=0.000 n=9+10)
LookupJoinNoOrdering/MultinodeCockroach-24                 15.7k ± 1%     16.0k ± 1%   +1.87%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                            12.2k ± 1%     12.4k ± 1%   +1.86%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                   16.8k ± 0%     17.1k ± 1%   +1.55%  (p=0.000 n=9+9)
```

I think I'm pretty satisfied given that most regressions are under 10% and only "lookup join with ordering" cases are above (those are the least frequent ones). I might optimize it further, but it doesn't seem necessary.

craig bot pushed a commit that referenced this issue Jul 7, 2022
83547: stmtdiagnostics: remove the usage of gossip r=yuzefovich a=yuzefovich

This commit removes the usage of gossip in the stmt diagnostics feature
since it is really an optimization (i.e. not required for the feature to
work) and is in the way of the architecture unification (i.e. the
multi-tenancy effort).

The gossip was previously used to let all nodes in the cluster know
about a new stmt diagnostics request or about the cancellation of an
existing one (in fact, the gossip propagation of the latter was broken
anyway since we forgot to register the corresponding callback). We now
rely on the polling mechanism on each node to read from the system
table to populate the registry of current requests. This polling takes
place every `sql.stmt_diagnostics.poll_interval` seconds (10 by default).
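
A minimal sketch of such a polling loop (hypothetical function names, not the actual stmtdiagnostics code):

```go
package main

import (
	"fmt"
	"time"
)

// pollRequests periodically re-reads the outstanding diagnostics requests from
// the system table instead of relying on gossip notifications.
func pollRequests(pollInterval time.Duration, readOutstanding func() []int, stop <-chan struct{}) {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Println("outstanding requests:", readOutstanding())
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go pollRequests(50*time.Millisecond, func() []int { return []int{1} }, stop)
	time.Sleep(120 * time.Millisecond)
	close(stop)
}
```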

Additionally, this commit changes the class of that setting from
TenantWritable to TenantReadOnly because:
1. we don't charge the user for the polling, and
2. we want to prevent abuse where a tenant could set the polling interval
to a very low value while incurring no costs themselves (we recently
excluded the polling from the tenant cost).

Follow-up work remains to bring back the optimization of propagating new
requests quickly using a rangefeed on the system table.

Epic: CRDB-16702
Informs: #47893

Release note: None

83896: ui/cluster-ui: filter out closed sessions from active exec pages  r=xinhaoz a=xinhaoz

Previously, it was possible for the active transactions page
to show txns from closed sessions. The sessions API was
recently updated to return closed sessions, and it is
possible for the active_txn field in a closed session to be
populated. This commit filters out the closed sessions when
retrieving active transactions.

Release note (bug fix): active transactions page no longer
shows transactions from closed sessions

83903: ui: persist stmt view  selection in sql activity page r=xinhaoz a=xinhaoz

Previously, the selection of viewing historical or active
executions for statements and transactions tabs in the SQL
activity page did not persist on tab change. This commit
persists the selection between tab changes in the SQL
activity page.

Release note (ui change): In the SQL Activity Page, the
selection to view historical or active executions will
persist between tabs.


https://www.loom.com/share/6990f8e273f64ddcaf971e85f0a8ef89

83926: kvstreamer: reuse the truncation helper and introduce a fast path for a single range case r=yuzefovich a=yuzefovich

**kvstreamer: introduce a single range fast path for request truncation**

This commit introduces a fast path to avoid the usage of the batch
truncation helper when all requests are contained within a single range.
Some modifications needed to be made to the `txnKVStreamer` so that it
didn't nil out the requests slice - we now delay that until right before
the next call to `Enqueue`.
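
A minimal sketch of the fast path (hypothetical types; real requests are spans, reduced here to single sorted keys):

```go
package main

import "fmt"

type rangeDesc struct{ startKey, endKey string }

func (d rangeDesc) contains(k string) bool { return k >= d.startKey && k < d.endKey }

// singleRangeFastPath reports whether truncation can be skipped entirely:
// if the first and the last enqueued key live on the same range, the whole
// batch can be sent as-is. Keys are assumed to be sorted.
func singleRangeFastPath(sortedKeys []string, d rangeDesc) bool {
	if len(sortedKeys) == 0 {
		return false
	}
	return d.contains(sortedKeys[0]) && d.contains(sortedKeys[len(sortedKeys)-1])
}

func main() {
	d := rangeDesc{startKey: "a", endKey: "m"}
	fmt.Println(singleRangeFastPath([]string{"b", "c", "k"}, d)) // true
	fmt.Println(singleRangeFastPath([]string{"b", "c", "z"}, d)) // false
}
```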

```
name                                  old time/op    new time/op    delta
IndexJoin/Cockroach-24                  6.21ms ± 1%    5.96ms ± 2%  -4.08%  (p=0.000 n=8+10)
IndexJoinColumnFamilies/Cockroach-24    8.97ms ± 4%    8.79ms ± 7%    ~     (p=0.190 n=10+10)

name                                  old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24                  1.39MB ± 1%    1.27MB ± 1%  -7.97%  (p=0.000 n=10+10)
IndexJoinColumnFamilies/Cockroach-24    1.46MB ± 1%    1.34MB ± 0%  -8.04%  (p=0.000 n=9+7)

name                                  old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24                   7.20k ± 1%     7.16k ± 1%  -0.61%  (p=0.022 n=10+10)
IndexJoinColumnFamilies/Cockroach-24     12.0k ± 1%     11.9k ± 0%  -0.83%  (p=0.000 n=9+8)
```

Addresses: #82159.

Release note: None

**kvcoord: refactor the truncation helper for reuse**

This commit refactors the batch truncation helper so that it can be
reused for multiple batches of requests. In particular, that ability is
now utilized by the streamer. Additionally, since the streamer now holds
on to the same truncation helper for possibly a long time, this commit
adds the memory accounting for the internal state of the helper.

Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
@yuzefovich (Member, Author) commented:

Here is the comparison of the streamer OFF vs the streamer ON on TPC-DS queries:

```
(streamer OFF / streamer ON / delta)
Q2: 18.35s	18.33s	-0.12%
Q3: 0.30s	0.30s	0.17%
Q4: 102.24s	103.02s	0.76%
Q6: 41.57s	40.92s	-1.56%
Q7: 16.90s	16.83s	-0.43%
Q8: 1.52s	1.40s	-7.88%
Q9: 106.30s	101.23s	-4.77%
Q10: 5.91s	5.50s	-6.90%
Q11: 55.64s	56.27s	1.13%
Q12: 2.09s	1.76s	-16.01%
Q13: 11.16s	10.38s	-7.05%
Q15: 4.28s	4.37s	2.05%
Q16: 7.71s	7.83s	1.62%
Q17: 4.05s	3.24s	-19.80%
Q19: 0.79s	0.75s	-5.48%
Q20: 3.49s	2.80s	-19.66%
Q21: 1.72s	1.68s	-2.33%
Q23: 123.38s	123.72s	0.27%
Q24: 0.25s	0.25s	-0.39%
Q25: 3.01s	2.64s	-12.37%
Q26: 14.39s	14.87s	3.35%
Q28: 24.52s	24.22s	-1.22%
Q29: 1.89s	1.71s	-9.39%
Q30: 8.08s	8.42s	4.28%
Q31: 22.17s	22.58s	1.83%
Q32: 0.43s	0.42s	-1.87%
Q33: 8.64s	8.21s	-5.00%
Q34: 6.77s	6.30s	-6.93%
Q35: 6.30s	5.81s	-7.78%
Q37: 0.43s	0.31s	-27.55%
Q38: 6.12s	5.60s	-8.50%
Q39: 97.80s	97.88s	0.08%
Q40: 2.06s	1.84s	-10.61%
Q41: 0.34s	0.34s	1.78%
Q42: 0.57s	0.51s	-10.99%
Q43: 6.92s	6.43s	-7.10%
Q44: 19.25s	17.75s	-7.78%
Q45: 4.25s	4.27s	0.52%
Q46: 10.64s	9.84s	-7.53%
Q47: 16.34s	16.24s	-0.61%
Q48: 9.97s	9.14s	-8.39%
Q49: 1.31s	1.27s	-2.60%
Q50: 1.09s	1.08s	-1.33%
Q51: 12.99s	12.35s	-4.93%
Q52: 0.57s	0.52s	-10.09%
Q53: 2.55s	2.40s	-5.84%
Q54: 6.08s	5.85s	-3.82%
Q55: 0.47s	0.44s	-6.98%
Q56: 2.70s	2.26s	-16.22%
Q57: 8.25s	8.17s	-1.05%
Q58: 13.98s	13.03s	-6.78%
Q59: 20.74s	20.68s	-0.32%
Q60: 20.70s	19.81s	-4.29%
Q61: 0.38s	0.44s	16.95%
Q62: 2.79s	2.78s	-0.11%
Q63: 2.42s	2.22s	-8.30%
Q65: 9.41s	8.68s	-7.81%
Q66: 6.29s	6.33s	0.64%
Q68: 10.05s	9.19s	-8.50%
Q69: 5.67s	5.25s	-7.41%
Q71: 9.67s	9.30s	-3.86%
Q72: 14.07s	13.05s	-7.27%
Q73: 6.35s	5.87s	-7.62%
Q74: 22.65s	22.63s	-0.07%
Q75: 12.22s	8.71s	-28.69%
Q76: 7.29s	6.95s	-4.67%
Q78: 24.66s	23.56s	-4.46%
Q79: 9.97s	9.35s	-6.23%
Q81: 29.62s	30.18s	1.89%
Q82: 1.48s	1.43s	-3.41%
Q83: 1.48s	1.54s	3.96%
Q84: 0.98s	0.99s	1.02%
Q85: 5.87s	4.58s	-21.92%
Q87: 6.06s	5.74s	-5.36%
Q88: 28.15s	23.56s	-16.32%
Q89: 1.95s	1.72s	-12.15%
Q90: 2.26s	2.25s	-0.27%
Q91: 1.02s	1.00s	-1.67%
Q92: 0.37s	0.35s	-4.21%
Q93: 0.23s	0.24s	7.33%
Q94: 4.87s	4.88s	0.11%
Q95: 66.77s	68.29s	2.27%
Q96: 5.51s	5.16s	-6.28%
Q97: 6.40s	5.93s	-7.35%
Q98: 6.75s	6.26s	-7.15%
Q99: 4.43s	4.54s	2.46%
```

I think the only remaining thing to check is the performance of TPC-E.

@yuzefovich (Member, Author) commented:

Looks like there is a minor performance regression on TPC-E queries in P50 and P90 (average over 10 runs):

```
(streamer OFF / streamer ON / delta)
P50
BrokerVolume:		26.12ms 26.52ms		1.55%
CustomerPosition:	7.24ms 7.63ms		5.42%
DataMaintenance:	6.79ms 6.87ms		1.12%
MarketFeed:		81.47ms 82.50ms		1.27%
MarketWatch:		4.71ms 5.33ms		12.97%
SecurityDetail:		6.38ms 7.27ms		13.87%
TradeLookup:		9.40ms 9.62ms		2.39%
TradeOrder:		10.71ms 11.22ms		4.75%
TradeResult:		29.87ms 30.38ms		1.68%
TradeStatus:		12.14ms 12.56ms		3.46%
TradeUpdate:		15.73ms 16.17ms		2.74%

P90
BrokerVolume:		39.60ms 37.84ms		-4.44%
CustomerPosition:	10.93ms 10.91ms		-0.21%
DataMaintenance:	17.68ms 19.86ms		12.31%
MarketFeed:		120.99ms 116.72ms	-3.53%
MarketWatch:		7.47ms 8.21ms		10.03%
SecurityDetail:		8.51ms 9.43ms		10.77%
TradeLookup:		14.07ms 14.35ms		2.03%
TradeOrder:		15.34ms 15.25ms		-0.56%
TradeResult:		43.40ms 44.12ms		1.66%
TradeStatus:		17.22ms 16.84ms		-2.21%
TradeUpdate:		24.64ms 23.87ms		-3.12%

P99
BrokerVolume:		73.41ms 69.36ms		-5.52%
CustomerPosition:	24.01ms 23.05ms		-4.02%
DataMaintenance:	33.88ms 44.29ms		30.75%
MarketFeed:		179.57ms 176.76ms	-1.56%
MarketWatch:		13.57ms 14.72ms		8.43%
SecurityDetail:		17.62ms 18.92ms		7.37%
TradeLookup:		29.32ms 29.42ms		0.35%
TradeOrder:		29.02ms 27.91ms		-3.82%
TradeResult:		84.11ms 84.81ms		0.82%
TradeStatus:		49.44ms 46.01ms		-6.94%
TradeUpdate:		44.49ms 41.97ms		-3.12%
```

@yuzefovich (Member, Author) commented:

Alright, I think this issue is done - the usage of the streamer can offer significant speedups on queries processing a lot of data and has regressions on the order of 10% when processing small amounts of data.

craig bot pushed a commit that referenced this issue Jul 18, 2022
83265: kvserver/gc: remove range tombstones during GC r=erikgrinaker a=aliher1911


The first commit adds support for range tombstones when removing point keys. A range tombstone will have the same effect as a point tombstone within the MVCC key history.

The second commit adds support for the removal of range tombstones below the GC threshold.


84493: sql: add parsing support for SHOW CREATE FUNCTION statement r=mgartner a=mgartner

This commit adds a `SHOW CREATE FUNCTION` statement to the SQL grammar.
This statement is not yet implemented and executing it results in an
error.

Release note: None

84504: execinfra: allow tenants to disable the streamer r=yuzefovich a=yuzefovich

Previously, we marked the setting that controls whether the streamer is
used as `TenantReadOnly` since we were not sure whether the streamer fit
well into the tenant cost model. Recently we revamped the cost model so
that it can now correctly predict the usage of the hardware resources by
the streamer, so at this point it seems safe to mark the setting
`TenantWritable`.

Informs: #82167

Release note: None

84515: ui: fix sorting of explain plans r=maryliag a=maryliag

Previously, sorting of the plans on the Explain Plans tab of the
Statement Details page wasn't working. This commit adds the missing
code required to sort that table.

Fixes #84079

https://www.loom.com/share/0f0ed0e1a8d04fc88def3b2460d617e6

Release note (bug fix): Sorting on the plans table inside the
Statement Details page is now properly working.

84519: clusterversion: remove some older versions r=yuzefovich a=yuzefovich

Release note: None

84601: rowexec: use OutOfOrder mode of streamer for lookup joins with ordering r=yuzefovich a=yuzefovich

Currently, the join reader always restores the required order for lookup
joins on its own, since all looked-up rows are buffered before any output
row is emitted. This observation allows us to use the OutOfOrder mode of
the streamer in such scenarios, so this commit makes that change.
Previously, we would effectively maintain the order twice - both in the
streamer and in the join reader - and the former is redundant. This will
change in the future, but for now we can use the more efficient mode
(see the sketch below).
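
A minimal sketch of the mode choice (hypothetical helper, not the actual rowexec code):

```go
package main

import "fmt"

type mode int

const (
	outOfOrder mode = iota
	inOrder
)

// streamerMode picks InOrder only when ordering is required and the join
// reader does not restore it itself; otherwise maintaining the order in the
// streamer would be redundant work.
func streamerMode(maintainOrdering, joinReaderRestoresOrder bool) mode {
	if maintainOrdering && !joinReaderRestoresOrder {
		return inOrder
	}
	return outOfOrder
}

func main() {
	fmt.Println(streamerMode(true, true))  // 0 (outOfOrder): the new behavior for ordered lookup joins
	fmt.Println(streamerMode(true, false)) // 1 (inOrder)
}
```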

```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             6.64ms ± 1%    6.48ms ± 1%  -2.34%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    7.89ms ± 1%    7.75ms ± 1%  -1.80%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         9.01ms ± 3%    8.88ms ± 4%    ~     (p=0.218 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                12.1ms ± 4%    12.0ms ± 3%    ~     (p=0.393 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.68MB ± 1%    1.60MB ± 1%  -4.93%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.37MB ± 2%    2.29MB ± 2%  -3.11%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.75MB ± 1%    1.66MB ± 1%  -5.01%  (p=0.000 n=10+9)
LookupJoinOrdering/MultinodeCockroach-24                2.36MB ± 1%    2.25MB ± 1%  -4.68%  (p=0.000 n=8+10)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              10.0k ± 1%     10.0k ± 1%    ~     (p=0.278 n=10+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     14.3k ± 1%     14.3k ± 1%    ~     (p=0.470 n=10+10)
LookupJoinOrdering/Cockroach-24                          12.4k ± 1%     12.5k ± 1%    ~     (p=0.780 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                 17.1k ± 1%     17.0k ± 1%    ~     (p=0.494 n=10+10)
```

Addresses: #82159.

Release note: None

84607: bench: add a benchmark of index join with ordering r=yuzefovich a=yuzefovich

Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023