kvstreamer: improve the stability #82160

Closed
Tracked by #54680
yuzefovich opened this issue May 31, 2022 · 3 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. sync-me T-sql-queries SQL Queries Team

Comments

@yuzefovich
Member

yuzefovich commented May 31, 2022

Currently, the usage of the streamer library seems to make things less stable memory-wise than before, which goes against the goal of the library. We should achieve a higher number in the tpch_concurrency roachtest with the streamer enabled. I think the root cause is that the streamer creates many more objects in general (increasing pressure on the GC) and that large slices are not yet accounted for.

We should:

  • remove allocations as much as possible
  • make sure everything is accounted for in the memory accounting
  • consider making Result an interface so that, depending on Hints and Mode, we could have smaller-size implementations (a rough sketch follows below).
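
For the last item, here is a minimal Go sketch of what mode-specific `Result` implementations could look like. All names and fields below are hypothetical and only illustrate the idea; they are not the actual `kvstreamer` types.

```go
package main

import "fmt"

// Result as an interface with mode-specific implementations, so that
// single-row Get results can use a smaller struct than Scan results.
// Everything here is illustrative only.
type Result interface {
	// Position identifies which enqueued request this result satisfies.
	Position() int
}

// getResult backs results of Get requests (small, fixed size).
type getResult struct {
	pos   int
	value []byte
}

func (r getResult) Position() int { return r.pos }

// scanResult backs results of Scan requests, which may carry per-range
// completion state and larger payloads.
type scanResult struct {
	pos          int
	rows         [][]byte
	scanComplete bool
}

func (r scanResult) Position() int { return r.pos }

func main() {
	results := []Result{
		getResult{pos: 0, value: []byte("v")},
		scanResult{pos: 1, rows: [][]byte{[]byte("a")}, scanComplete: true},
	}
	for _, r := range results {
		fmt.Printf("result for request %d is a %T\n", r.Position(), r)
	}
}
```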

Jira issue: CRDB-16320

@yuzefovich yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 31, 2022
@yuzefovich yuzefovich self-assigned this May 31, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label May 31, 2022
craig bot pushed a commit that referenced this issue Jun 2, 2022
81830: kv: only scan intents span for QueryIntent request r=nvanbenschoten a=andrewbaptist

Change QueryIntent to only read from the lock table. Previously, the request required merging the read from the MVCC iterator with the lock table. This should improve performance.

Also, add unit testing for QueryIntent

Fixes #69716

Release note: None

81981: sql: add logictest config that disables locality optimized search r=rharding6373 a=rharding6373

This PR increases test coverage by disabling locality optimized search
for some tests. Locality optimized search is enabled by default.

Fixes: #73633

Release note: None

82327: roachtest: add no-streamer config to tpch_concurrency r=yuzefovich a=yuzefovich

This commit adds another configuration to `tpch_concurrency` roachtest
which disables the streamer. My goal is to compare the concurrency
numbers with the streamer on and off while I'm working on stabilizing
the streamer.

Informs: #82160.

Release note: None

82330: workload/tpch: augment Q15 with streamID for concurrency r=yuzefovich a=yuzefovich

According to the TPC-H spec, Q15 can be augmented with the streamID
identifier so that Q15 can run with concurrency, and this commit
implements that. This ability is now used by the `tpch_concurrency`
roachtest in order to eliminate the retries that were occurring when
trying to create and drop the view named `revenue0`.

Fixes: #79870.

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jun 6, 2022
82422: kvstreamer: clean up Result struct r=yuzefovich a=yuzefovich

This commit cleans up the `Result` struct in order to reduce its memory
size, bringing it down to the 48-byte size class.
```
name                             old time/op    new time/op    delta
IndexJoin/Cockroach-24             6.29ms ± 1%    6.22ms ± 2%   -1.06%  (p=0.024 n=9+9)
IndexJoin/MultinodeCockroach-24    7.99ms ± 1%    7.93ms ± 2%     ~     (p=0.165 n=10+10)

name                             old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24             1.64MB ± 1%    1.48MB ± 0%   -9.25%  (p=0.000 n=9+10)
IndexJoin/MultinodeCockroach-24    2.37MB ± 1%    2.20MB ± 1%   -7.06%  (p=0.000 n=10+10)

name                             old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24              8.15k ± 1%     7.15k ± 1%  -12.28%  (p=0.000 n=8+10)
IndexJoin/MultinodeCockroach-24     12.7k ± 1%     11.7k ± 1%   -8.18%  (p=0.000 n=10+10)
```

The main change of this commit is the removal of the concept of "enqueue
keys" from the Streamer API in favor of relying on `Result.Position`.
When requests are unique, a single `Result` can only satisfy a single
enqueue key; however, for non-unique requests a single `Result` can
satisfy multiple requests and, thus, can have multiple enqueue keys. At
the moment, only unique requests are supported. Once non-unique
requests are supported too, we'll need to figure out how to handle
those (maybe we'll return a `Result` `N` times if it satisfies `N`
original requests with different values for `Position`).
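
To make the `Result.Position` contract concrete, here is a small illustrative Go sketch (the struct and its fields are stand-ins, not the actual kvstreamer API): `Position` is simply the index of the originally enqueued request that the result satisfies, which is what replaces the "enqueue keys".

```go
package main

import "fmt"

// Hypothetical sketch: Position is the index of the originally enqueued
// request that the Result satisfies.
type Result struct {
	Position int    // index into the enqueued requests slice
	Value    []byte // payload for that request
}

func main() {
	requests := []string{"a", "b", "c"} // stand-ins for enqueued KV requests
	// Results may arrive out of order; Position ties each one back to its
	// original request.
	results := []Result{
		{Position: 2, Value: []byte("vc")},
		{Position: 0, Value: []byte("va")},
		{Position: 1, Value: []byte("vb")},
	}
	output := make([][]byte, len(requests))
	for _, r := range results {
		output[r.Position] = r.Value
	}
	fmt.Printf("%q\n", output)
}
```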

Also, support for multiple "enqueue keys" was initially envisioned for
the case of `multiSpanGenerator` in lookup joins (i.e. multi-equality
lookup joins); however, I believe we should push that complexity out of
the streamer (into `TxnKVStreamer`), which is what this commit does.

Other changes done in this commit:
- unexport `ScanResp.Complete` field since this is currently only
used within the `kvstreamer` package
- reorder all existing fields so that the footprint of the struct is
minimized (in particular, `scanComplete` field is moved to the bottom
and `ScanResp` anonymous struct is removed)
- make `subRequestIdx` `int32` rather than `int`. This value is bounded
by the number of ranges in the cluster, so max int32 is more than
sufficient.

Addresses: #82159.
Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@yuzefovich
Member Author

With 10 runs of the tpch_concurrency test (on master with #82384, #82387, #82865, #83010 cherry-picked), I got 78.5 on average with the streamer OFF and 84.9 with the streamer ON.

craig bot pushed a commit that referenced this issue Jun 28, 2022
78085: storage: add `MVCCStats` for range keys r=jbowens a=erikgrinaker

**storage: export some MVCC key encoding functions**

Release note: None

**roachpb: add `Key.Prevish()` to find a previous key**

This patch adds `Key.Prevish()`, which returns a previous key in
lexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.

It is impossible to find the exact immediate predecessor of a key,
because it can have an infinite number of `0xff` bytes at the end, so
this returns the nearest previous key right-padded with `0xff` up to the
given length. It is still possible for an infinite number of keys to
exist between `Key` and `Key.Prevish()`, as keys have unbounded length.
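
A simplified sketch of the idea, assuming a hypothetical `prevish` helper over plain byte slices (this is not the actual `Key.Prevish` implementation):

```go
package main

import (
	"bytes"
	"fmt"
)

// prevish illustrates the idea: return a key that sorts before `key`,
// right-padded with 0xff up to `length` bytes, since in general the exact
// immediate predecessor does not exist (it could always append more 0xff).
func prevish(key []byte, length int) []byte {
	if len(key) == 0 {
		return nil // nothing sorts before the empty key
	}
	prev := append([]byte(nil), key...)
	if prev[len(prev)-1] == 0x00 {
		// "foo\x00" -> "foo": dropping a trailing zero byte yields the exact
		// immediate predecessor, so no padding is needed.
		return prev[:len(prev)-1]
	}
	// Otherwise decrement the last byte and right-pad with 0xff, which yields
	// a strictly smaller key close to the original.
	prev[len(prev)-1]--
	for len(prev) < length {
		prev = append(prev, 0xff)
	}
	return prev
}

func main() {
	k := []byte("b")
	p := prevish(k, 4)
	fmt.Printf("%q < %q: %v\n", p, k, bytes.Compare(p, k) < 0) // "a\xff\xff\xff" < "b": true
}
```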

Release note: None

**storage: add `MVCCStats` for range keys**

This patch adds `MVCCStats` tracking for range keys. Four new fields are
added to `MVCCStats`:

* `RangeKeyCount`: the number of (fragmented) range keys, not counting
  historical versions.

* `RangeKeyBytes`: the logical encoded byte size of all range keys.
  The latest version contributes the encoded key bounds, and all
  versions contribute encoded timestamps. Unlike point keys, which for
  historical reasons use a fixed-size timestamp contribution, this uses
  the actual variable-length timestamp size.

* `RangeValCount`: the number of (fragmented) range key versions.

* `RangeValBytes`: the encoded size of all range key values across
  all versions. The same value can be stored in multiple range keys
  due to fragmentation, which will be counted separately. Even though
  all range keys are currently MVCC range tombstones with no value, the
  `MVCCValueHeader` contribution can be non-zero due to e.g. a local
  timestamp.

`ComputeStatsForRange()` has been extended to calculate the above
quantities, and additionally account for range tombstones themselves in
`GCBytesAge` along with their effect on point keys. All relevant call
sites have been updated to surface range keys for the MVCC iterators
passed to `ComputeStatsForRange()`.

Most MVCC operations have been updated to correctly account for MVCC
range tombstones, e.g. during point key writes and intent resolution.
KV APIs are not yet updated; this will be addressed later.

Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.

Touches #70412.

Release note: None

82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich

This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same
slice of `RequestUnion` objects between different fetches. It is now
extremely easy to do given the recent refactor. We do perform memory
accounting for this slice (against a memory account bound to an
unlimited memory monitor). Additionally, a similar optimization is
applied to how resume requests are populated by the Streamer.

Addresses: #82160.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jun 28, 2022
82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich

This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same
slice of `RequestUnion` objects between different fetches. It is now
extremely easy to do given the recent refactor. We do perform memory
accounting for this slice (against a memory account bound to an
unlimited memory monitor). Additionally, a similar optimization is
applied to how resume requests are populated by the Streamer.

Addresses: #82160.

Release note: None

83480: vendor: bump Pebble to 85bb1c759894 r=jbowens a=nicktrav

```
85bb1c75 rangekey: create `rangekey` package
0d272ec1 db: fix bug in deletion estimate computations
```

Release note: None.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
craig bot pushed a commit that referenced this issue Jun 30, 2022
82387: kvstreamer: remove temporary allocations for []Result r=yuzefovich a=yuzefovich

**kvstreamer: refactor the loop of processing the batch response**

This commit refactors the code which processes the batch response in
order to separate out two different concerns:
- processing non-empty responses in order to create `Result`s
- processing incomplete responses to populate the resume request.

Now each of these "concerns" is handled in a separate loop making it
easier to reason about, especially so in the following commit.

This commit also extracts out multiple return arguments of a function
into a struct as well as updates some of the comments and moves some of
the error-checking to an earlier stage.

Release note: None

**kvstreamer: remove temporary allocations for []Result**

Previously, the worker goroutine would accumulate all `Result`s it
could create based on the KV response into a slice, and then the
slice would be passed into the results buffer. At that point, the
slice would be discarded since the results buffer would copy all
`Result`s into its internal state.

This commit refactors the streamer as well as the results buffer to
avoid this temporary allocation of `[]Result`. The main idea is that
`Result`s are now passed one-by-one into the results buffer. The worker
goroutine now acquires the results buffer's mutex, processes the KV
responses one at a time, and whenever a `Result` is created, it is added
into the results buffer right away. However, in order to prevent the
results buffer from eagerly returning a single `Result` on a
`GetResults` call, the streamer's user goroutine won't be woken up
until a newly introduced `doneAddingLocked` method is called by the
worker goroutine.

Some care needs to be taken to prevent deadlocks with all of the
mutexes. Now, since we're finalizing the results one at a time, we might
need to hold the streamer's mutex (so that we can set `scanComplete`
correctly), and that mutex must be acquired before the results
buffer's mutex.

This change shows a modest improvement on the microbenchmarks but is
a lot more important on analytical, TPCH-like queries, where this
`[]Result` is one of the largest sources of garbage.
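
A minimal sketch of this hand-off, assuming hypothetical types and method names that mirror the description above (`addLocked`, `doneAddingLocked`); this is not the actual results buffer implementation:

```go
package main

import (
	"fmt"
	"sync"
)

type Result struct{ Position int }

// resultsBuffer sketches "add one-by-one, signal once": the worker adds
// Results under the buffer's mutex, but the consumer blocked in getResults
// is only woken up by doneAddingLocked, so it never sees a partial batch.
type resultsBuffer struct {
	mu      sync.Mutex
	cond    *sync.Cond
	results []Result
	ready   bool
}

func newResultsBuffer() *resultsBuffer {
	b := &resultsBuffer{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

func (b *resultsBuffer) Lock()   { b.mu.Lock() }
func (b *resultsBuffer) Unlock() { b.mu.Unlock() }

// addLocked appends a single Result; the mutex must already be held.
func (b *resultsBuffer) addLocked(r Result) { b.results = append(b.results, r) }

// doneAddingLocked wakes up the consumer once the whole KV response has
// been processed; the mutex must already be held.
func (b *resultsBuffer) doneAddingLocked() {
	b.ready = true
	b.cond.Signal()
}

// getResults blocks until doneAddingLocked has been called, then drains.
func (b *resultsBuffer) getResults() []Result {
	b.mu.Lock()
	defer b.mu.Unlock()
	for !b.ready {
		b.cond.Wait()
	}
	res := b.results
	b.results, b.ready = nil, false
	return res
}

func main() {
	b := newResultsBuffer()
	go func() {
		b.Lock()
		for pos := 0; pos < 3; pos++ {
			b.addLocked(Result{Position: pos}) // one Result per response entry
		}
		b.doneAddingLocked()
		b.Unlock()
	}()
	fmt.Println(len(b.getResults()), "results")
}
```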

```
name                                                    old time/op    new time/op    delta
IndexJoin/Cockroach-24                                    5.98ms ± 1%    5.95ms ± 1%    ~     (p=0.079 n=9+10)
IndexJoin/MultinodeCockroach-24                           7.55ms ± 1%    7.59ms ± 1%  +0.47%  (p=0.015 n=8+9)
IndexJoinColumnFamilies/Cockroach-24                      8.68ms ± 3%    8.56ms ± 2%    ~     (p=0.133 n=9+10)
IndexJoinColumnFamilies/MultinodeCockroach-24             11.8ms ± 5%    11.7ms ± 3%    ~     (p=0.315 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             6.67ms ± 1%    6.69ms ± 1%    ~     (p=0.315 n=10+9)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    7.87ms ± 1%    7.92ms ± 1%  +0.73%  (p=0.015 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               9.30ms ± 2%    9.31ms ± 4%    ~     (p=0.796 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      10.9ms ± 4%    10.9ms ± 2%    ~     (p=0.971 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         8.99ms ± 1%    9.03ms ± 4%    ~     (p=0.549 n=9+10)
LookupJoinNoOrdering/MultinodeCockroach-24                12.1ms ± 4%    11.9ms ± 6%    ~     (p=0.143 n=10+10)
LookupJoinOrdering/Cockroach-24                           10.9ms ± 3%    10.8ms ± 3%    ~     (p=0.243 n=10+9)
LookupJoinOrdering/MultinodeCockroach-24                  14.2ms ± 5%    13.9ms ± 3%    ~     (p=0.113 n=10+9)

name                                                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24                                    1.36MB ± 1%    1.31MB ± 0%  -3.61%  (p=0.000 n=10+9)
IndexJoin/MultinodeCockroach-24                           2.07MB ± 2%    2.04MB ± 3%    ~     (p=0.063 n=10+10)
IndexJoinColumnFamilies/Cockroach-24                      1.43MB ± 1%    1.38MB ± 0%  -3.56%  (p=0.000 n=9+9)
IndexJoinColumnFamilies/MultinodeCockroach-24             2.27MB ± 1%    2.22MB ± 2%  -2.09%  (p=0.000 n=8+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             1.71MB ± 0%    1.67MB ± 0%  -2.70%  (p=0.000 n=9+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    2.43MB ± 5%    2.35MB ± 1%  -3.31%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               1.72MB ± 1%    1.62MB ± 1%  -6.20%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      2.39MB ± 2%    2.30MB ± 3%  -3.53%  (p=0.000 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         1.79MB ± 1%    1.74MB ± 1%  -2.80%  (p=0.000 n=10+9)
LookupJoinNoOrdering/MultinodeCockroach-24                2.35MB ± 3%    2.32MB ± 2%    ~     (p=0.079 n=10+9)
LookupJoinOrdering/Cockroach-24                           1.63MB ± 1%    1.53MB ± 1%  -5.77%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                  2.30MB ± 4%    2.23MB ± 2%  -3.41%  (p=0.002 n=9+8)

name                                                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24                                     7.15k ± 1%     7.16k ± 1%    ~     (p=0.888 n=10+9)
IndexJoin/MultinodeCockroach-24                            11.9k ± 2%     11.9k ± 2%    ~     (p=0.968 n=10+9)
IndexJoinColumnFamilies/Cockroach-24                       11.9k ± 0%     11.9k ± 0%    ~     (p=0.075 n=9+10)
IndexJoinColumnFamilies/MultinodeCockroach-24              17.6k ± 1%     17.5k ± 1%    ~     (p=0.566 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24              9.86k ± 1%     9.88k ± 1%    ~     (p=0.150 n=9+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24     14.1k ± 0%     14.1k ± 1%    ~     (p=0.055 n=8+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24                12.6k ± 1%     12.5k ± 1%  -0.77%  (p=0.005 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24       17.2k ± 1%     17.0k ± 0%  -0.88%  (p=0.000 n=10+8)
LookupJoinNoOrdering/Cockroach-24                          12.3k ± 1%     12.3k ± 1%    ~     (p=0.929 n=10+10)
LookupJoinNoOrdering/MultinodeCockroach-24                 16.8k ± 1%     16.8k ± 1%    ~     (p=0.968 n=9+10)
LookupJoinOrdering/Cockroach-24                            14.5k ± 1%     14.5k ± 1%    ~     (p=0.271 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                   19.4k ± 1%     19.3k ± 1%    ~     (p=0.056 n=9+8)
```

Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jul 1, 2022
83197: kvserver: bump the in-memory GC threshold as a pre-apply side effect r=aayushshah15 a=aayushshah15

This commit changes where the in-memory GC threshold on a Replica is
bumped during the application of a GC request.

Previously, the in-memory GC threshold was bumped as a post-apply side
effect. Additionally, GC requests do not acquire latches at the
timestamp that they are garbage collecting, and thus, readers need to
take additional care to ensure that results aren't served off of a
partial state.

Readers today rely on the invariant that the in-memory GC threshold is
bumped before the actual garbage collection. This today is true because
the mvccGCQueue issues GC requests in 2 phases: the first that simply
bumps the in-memory GC threshold, and the second one that performs the
actual garbage collection. If the in-memory GC threshold were bumped in
the pre-apply phase of command application, this usage quirk wouldn't
need to exist. That's what this commit does.

Relates to #55293

Release note: None



83659: kvstreamer: optimize the results buffers a bit r=yuzefovich a=yuzefovich

**kvstreamer: make InOrder-mode structs implement heap logic on their own**

This commit refactors the `inOrderRequestsProvider` as well as the
`inOrderResultsBuffer` to make them maintain the min heap over the
requests and results, respectively, on their own. This allows us to
avoid allocations for `interface{}` objects that occur when using
`heap.Interface`. The code was copied, with minor adjustments, from the
standard library.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.83ms ± 3%    8.58ms ± 3%   -2.82%  (p=0.002 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.3ms ± 1%    10.1ms ± 2%   -1.86%  (p=0.009 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.7ms ± 5%    10.5ms ± 3%     ~     (p=0.123 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                14.0ms ± 5%    13.6ms ± 5%   -3.16%  (p=0.043 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.77MB ± 2%    1.59MB ± 1%   -9.94%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.57MB ± 3%    2.38MB ± 1%   -7.57%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.67MB ± 0%    1.50MB ± 1%  -10.08%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                2.53MB ± 2%    2.34MB ± 2%   -7.78%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              12.5k ± 1%     10.4k ± 1%  -16.54%  (p=0.000 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     17.6k ± 0%     15.5k ± 0%  -11.64%  (p=0.000 n=8+10)
LookupJoinOrdering/Cockroach-24                          14.4k ± 1%     12.4k ± 1%  -14.17%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                 20.6k ± 1%     18.5k ± 1%  -10.19%  (p=0.000 n=10+9)
```
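
An illustrative sketch of the specialization: instead of going through `container/heap` (which boxes elements into `interface{}` and allocates), the buffer keeps a concrete slice and implements the sift operations directly. The element type and the ordering key below are made up:

```go
package main

import "fmt"

type result struct{ position int }

// minHeap keeps results ordered by position without heap.Interface boxing.
type minHeap []result

func (h *minHeap) push(r result) {
	*h = append(*h, r)
	// Sift up.
	i := len(*h) - 1
	for i > 0 {
		parent := (i - 1) / 2
		if (*h)[parent].position <= (*h)[i].position {
			break
		}
		(*h)[parent], (*h)[i] = (*h)[i], (*h)[parent]
		i = parent
	}
}

func (h *minHeap) pop() result {
	old := *h
	min := old[0]
	n := len(old) - 1
	old[0] = old[n]
	*h = old[:n]
	// Sift down.
	i := 0
	for {
		left, right, smallest := 2*i+1, 2*i+2, i
		if left < n && (*h)[left].position < (*h)[smallest].position {
			smallest = left
		}
		if right < n && (*h)[right].position < (*h)[smallest].position {
			smallest = right
		}
		if smallest == i {
			break
		}
		(*h)[i], (*h)[smallest] = (*h)[smallest], (*h)[i]
		i = smallest
	}
	return min
}

func main() {
	var h minHeap
	for _, p := range []int{3, 1, 2} {
		h.push(result{position: p})
	}
	for len(h) > 0 {
		fmt.Println(h.pop().position) // 1, 2, 3
	}
}
```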

Addresses: #82159.

Release note: None

**kvstreamer: refactor the OutOfOrder requests provider a bit**

This commit refactors the `requestsProvider` interface, renaming a
couple of methods: `s/firstLocked/nextLocked/` and
`s/removeFirstLocked/removeNextLocked/`. The idea is that for the
InOrder mode "first" == "next", but for the OutOfOrder mode we can pick
an arbitrary request that is not necessarily "first". This commit then
makes the OutOfOrder requests provider pick the last request in the
queue, which should reduce memory usage in case resume requests are
added later.

Previously, we would slice "next" requests off the front and append to
the end, which could trigger reallocation of the underlying slice. Now
we'll only update the "tail" of the queue, and since a single request
can result in at most one resume request, the initial slice provided in
`enqueue` will definitely have enough capacity.

Release note: None

**kvstreamer: reduce allocations in the InOrder results buffer**

This commit improves the InOrder results buffer by reusing the same
scratch space for returning the results on `get()` calls as well as by
reducing the size of `inOrderBufferedResult` struct from the 80 bytes
size class to the 64 bytes size class.

It also cleans up a couple of comments and clarifies `GetResults` method
a bit.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.47ms ± 3%    8.54ms ± 2%    ~     (p=0.182 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.2ms ± 2%    10.1ms ± 2%    ~     (p=0.280 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.4ms ± 3%    10.3ms ± 4%    ~     (p=0.393 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                13.8ms ± 5%    13.7ms ± 5%    ~     (p=0.739 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.59MB ± 2%    1.49MB ± 1%  -6.20%  (p=0.000 n=10+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.36MB ± 2%    2.27MB ± 2%  -3.49%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.51MB ± 1%    1.42MB ± 1%  -6.27%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                2.37MB ± 6%    2.26MB ± 2%  -4.77%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              10.4k ± 1%     10.3k ± 1%    ~     (p=0.055 n=8+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     15.5k ± 0%     15.5k ± 1%  -0.34%  (p=0.037 n=10+10)
LookupJoinOrdering/Cockroach-24                          12.4k ± 1%     12.3k ± 1%  -0.98%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                 18.5k ± 1%     18.5k ± 2%    ~     (p=0.743 n=8+9)
```

Addresses: #82160.

Release note: None

83705: teamcity: `grep` for pebble version in `go.mod` rather than `DEPS.bzl` r=nicktrav,jbowens a=rickystewart

The `grep` can fail in the case that there is no more recent version of
`pebble` than the one that is vendored in-tree. This fixes that case.

Release note: None

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
craig bot pushed a commit that referenced this issue Jul 6, 2022
82865: kvcoord: optimize batch truncation loop r=yuzefovich a=yuzefovich

**kvcoord: truncate BatchRequest to range boundaries serially**

This commit refactors the DistSender's loop of iterating over the range
descriptors so that the truncation of the BatchRequest happens
serially. This incurs a minor performance hit when the requests are sent
in parallel, but it makes it possible to apply the optimizations to this
iteration in the following commits.

Release note: None

**kvcoord: merge truncate_test into batch_test**

This commit moves `TestTruncate` as well as `BenchmarkTruncate` into
`batch_test.go` file in the preparation for the refactor done in the
following commit. The code itself wasn't changed at all.

Release note: None

**kvcoord: introduce batch truncation helper**

This commit introduces a batch truncation helper that encompasses the
logic of truncating requests to the boundaries of a single range as
well as returning the next key to seek the range iterator to. The
helper is now used both in the DistSender and in the Streamer. No
modification to the actual logic of the `Truncate` or `Next`/`prev`
functions has been made other than incorporating the return of the next
seek key into the `Truncate` function itself. This is needed since the
following commit will tightly couple the truncation process with the
next seek key determination in order to optimize it.

The helper can be configured with a knob indicating whether `Truncate`
needs to return requests in the original order. This behavior is
necessary for BatchRequests that contain writes, since in several spots
we rely on ordering assumptions (e.g. increasing values of `Sequence`).
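
Conceptually, the helper does something like the following sketch, using plain string spans for keys and handling only the ascending direction; this is a simplified illustration, not the kvcoord implementation:

```go
package main

import "fmt"

type span struct{ start, end string }

// truncate returns the portions of reqs overlapping [rangeStart, rangeEnd)
// together with the earliest key past this range that any request still
// needs, i.e. the next seek key for an ascending scan ("" means done).
func truncate(reqs []span, rangeStart, rangeEnd string) (truncated []span, seekKey string) {
	for _, r := range reqs {
		if r.end > rangeEnd {
			// Part of the request lies beyond this range: remember the
			// earliest key past rangeEnd that is still needed.
			next := r.start
			if next < rangeEnd {
				next = rangeEnd
			}
			if seekKey == "" || next < seekKey {
				seekKey = next
			}
		}
		s, e := r.start, r.end
		if s < rangeStart {
			s = rangeStart
		}
		if e > rangeEnd {
			e = rangeEnd
		}
		if s < e {
			truncated = append(truncated, span{s, e})
		}
	}
	return truncated, seekKey
}

func main() {
	reqs := []span{{"a", "c"}, {"b", "f"}}
	got, seek := truncate(reqs, "a", "d") // range [a, d)
	fmt.Println(got, "next seek key:", seek)
}
```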

The following adjustments were made to the tests:
- `BenchmarkTruncate` has been renamed to `BenchmarkTruncateLegacy`
- `TestTruncate` has been refactored to exercise the new and the old
code-paths
- `TestBatchPrevNext` has been refactored to run through the new code
path, also a few test cases have been adjusted slightly.

This commit also introduces some unit tests for the new code path when
it runs in a loop over multiple ranges as well as a corresponding
benchmark.

Release note: None

**kvcoord: optimize batch truncation loop**

This commit optimizes the batch truncation loop for the case when the
requests only use global keys.

The optimized approach sorts the requests according to their start keys
(with the Ascending scan direction) or in the reverse order of their
end keys (with the Descending scan direction). Then, on each `Truncate`
call, it looks only at the subset of requests that haven't been fully
processed yet and don't start after the current range, allowing us to
avoid many unnecessary key comparisons. Please see the comment on the
new `truncateAsc` and `truncateDesc` functions for more details and
examples.

The optimized approach actually has worse time complexity in the
absolute worst case (when each request is a range-spanning one and
actually spans all of the ranges against which the requests are
truncated), because of the need to sort the requests upfront, but in
practice it is much faster, especially with point requests.
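
An illustrative sketch of the sorted, ascending-direction variant (names and types are made up; the real `truncateAsc` is more involved): requests are sorted by start key once, and each call only walks the not-yet-finished requests whose start key precedes the current range's end.

```go
package main

import (
	"fmt"
	"sort"
)

type span struct{ start, end string }

type truncHelper struct {
	reqs     []span
	startIdx int // requests before this index are fully processed
}

func newTruncHelper(reqs []span) *truncHelper {
	sorted := append([]span(nil), reqs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })
	return &truncHelper{reqs: sorted}
}

// truncateAsc clips the relevant requests to [rangeStart, rangeEnd) and
// advances startIdx past the leading requests that end within this range.
func (h *truncHelper) truncateAsc(rangeStart, rangeEnd string) []span {
	var out []span
	i := h.startIdx
	for ; i < len(h.reqs) && h.reqs[i].start < rangeEnd; i++ {
		s, e := h.reqs[i].start, h.reqs[i].end
		if s < rangeStart {
			s = rangeStart
		}
		if e > rangeEnd {
			e = rangeEnd
		}
		if s < e {
			out = append(out, span{s, e})
		}
	}
	for h.startIdx < i && h.reqs[h.startIdx].end <= rangeEnd {
		h.startIdx++
	}
	return out
}

func main() {
	h := newTruncHelper([]span{{"e", "f"}, {"a", "b"}, {"c", "g"}})
	fmt.Println(h.truncateAsc("a", "d")) // [{a b} {c d}]
	fmt.Println(h.truncateAsc("d", "z")) // [{d g} {e f}]
}
```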

```
name                                                               old time/op    new time/op    delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                       59.8µs ± 1%    33.9µs ± 3%   -43.38%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                      83.0µs ± 4%    69.7µs ± 7%   -15.98%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                       865µs ± 1%      62µs ± 1%   -92.84%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                     1.07ms ± 5%    0.46ms ± 8%   -56.95%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                     7.09ms ± 0%    5.99ms ± 1%   -15.56%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                    9.22ms ± 0%   10.52ms ± 1%   +14.08%  (p=0.000 n=9+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                     108ms ± 0%       6ms ± 2%   -94.50%  (p=0.000 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                    129ms ± 1%      71ms ± 1%   -45.06%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24         60.2µs ± 1%    36.0µs ± 2%   -40.14%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24        82.4µs ± 8%    72.1µs ± 4%   -12.51%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24         862µs ± 1%      72µs ± 1%   -91.65%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24       1.06ms ± 4%    0.49ms ± 8%   -53.39%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24       7.24ms ± 0%    6.46ms ± 1%   -10.74%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24      10.1ms ± 2%     9.8ms ± 2%    -2.36%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24       107ms ± 0%       7ms ± 0%   -93.57%  (p=0.000 n=8+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24      122ms ± 0%      80ms ± 1%   -34.32%  (p=0.000 n=9+9)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                      78.9µs ± 1%    36.4µs ± 3%   -53.81%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                     79.4µs ± 4%    52.3µs ± 5%   -34.14%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                     1.16ms ± 1%    0.07ms ± 1%   -94.39%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                    1.01ms ± 4%    0.46ms ± 5%   -54.56%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                    9.42ms ± 0%    6.26ms ± 1%   -33.52%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                   8.41ms ± 1%    9.22ms ± 1%    +9.54%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                    145ms ± 0%       6ms ± 1%   -95.63%  (p=0.000 n=9+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                   125ms ± 1%      67ms ± 1%   -46.31%  (p=0.000 n=9+9)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24        77.8µs ± 1%    39.6µs ± 2%   -49.10%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24       74.0µs ± 3%    63.0µs ± 6%   -14.92%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24       1.16ms ± 1%    0.08ms ± 1%   -93.47%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24      1.04ms ± 5%    0.47ms ± 7%   -54.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24      9.50ms ± 0%    6.73ms ± 1%   -29.21%  (p=0.000 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24     8.88ms ± 1%   13.24ms ± 1%   +49.04%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24      146ms ± 0%       7ms ± 1%   -94.98%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24     125ms ± 1%      74ms ± 1%   -40.75%  (p=0.000 n=10+9)

name                                                               old alloc/op   new alloc/op   delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                       7.58kB ± 0%   21.00kB ±11%  +176.84%  (p=0.000 n=7+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                      39.8kB ± 6%    49.1kB ±15%   +23.46%  (p=0.000 n=9+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                      6.48kB ± 5%   18.25kB ± 2%  +181.79%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                      428kB ±20%     368kB ±13%   -13.85%  (p=0.003 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                     1.60MB ± 0%    2.91MB ± 0%   +82.49%  (p=0.000 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                    5.24MB ± 0%    5.89MB ± 0%   +12.41%  (p=0.000 n=10+8)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                    1.15MB ± 4%    2.41MB ± 1%  +110.09%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                   69.8MB ± 1%    64.6MB ± 1%    -7.55%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24         9.77kB ±22%   21.98kB ± 4%  +125.07%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24        38.9kB ±23%    49.9kB ± 2%   +28.28%  (p=0.000 n=10+8)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24        6.56kB ± 4%   20.20kB ± 3%  +208.11%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24        407kB ±15%     372kB ±13%    -8.68%  (p=0.043 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24       1.65MB ± 0%    3.62MB ± 0%  +118.55%  (p=0.000 n=8+8)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24      6.60MB ± 2%    5.65MB ± 1%   -14.38%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24      1.10MB ± 5%    2.77MB ± 1%  +152.58%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24     60.5MB ± 1%    67.9MB ± 1%   +12.19%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                      17.2kB ±10%    21.5kB ± 1%   +24.91%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                     35.5kB ±12%    29.1kB ± 4%   -17.83%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                      138kB ± 1%      20kB ± 3%   -85.34%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                     344kB ±15%     363kB ±10%    +5.50%  (p=0.035 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                    2.78MB ± 0%    3.35MB ± 0%   +20.24%  (p=0.000 n=8+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                   4.42MB ± 6%    5.29MB ± 0%   +19.67%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                   17.9MB ± 0%     2.7MB ± 2%   -85.21%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                  65.3MB ± 0%    61.0MB ± 1%    -6.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24        15.9kB ± 3%    26.7kB ± 1%   +67.87%  (p=0.000 n=10+9)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24       29.4kB ± 6%    41.6kB ± 5%   +41.50%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24        138kB ± 0%      23kB ± 3%   -83.61%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24       390kB ±19%     350kB ±11%   -10.16%  (p=0.015 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24      2.69MB ± 4%    3.51MB ± 1%   +30.22%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24     4.89MB ± 1%    8.19MB ± 0%   +67.68%  (p=0.000 n=10+9)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24     17.9MB ± 0%     3.0MB ± 2%   -83.34%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24    65.4MB ± 1%    62.9MB ± 1%    -3.81%  (p=0.000 n=10+10)

name                                                               old allocs/op  new allocs/op  delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                         50.0 ± 0%      52.4 ±10%    +4.80%  (p=0.017 n=8+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                         569 ±12%       557 ±15%      ~     (p=0.617 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                         207 ± 4%       210 ± 5%      ~     (p=0.380 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                      6.97k ±13%     6.02k ± 9%   -13.64%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                        126 ± 0%       122 ± 0%    -3.17%  (p=0.002 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                     51.9k ± 1%     42.8k ± 1%   -17.59%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                     1.12k ± 1%     1.12k ± 1%    +0.43%  (p=0.027 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                     786k ± 1%      714k ± 1%    -9.13%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24           41.2 ± 3%      58.0 ± 3%   +40.78%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24           574 ±18%       532 ±11%      ~     (p=0.143 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24           205 ± 2%       234 ± 3%   +14.26%  (p=0.000 n=9+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24        6.70k ± 9%     6.00k ± 9%   -10.40%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24          127 ± 0%       125 ± 0%    -1.57%  (p=0.001 n=8+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24       63.7k ± 1%     27.8k ± 2%   -56.34%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24       1.14k ± 0%     1.14k ± 1%      ~     (p=0.515 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24       696k ± 1%      752k ± 1%    +7.97%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                         554 ± 1%       169 ± 2%   -69.52%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                        519 ± 9%       268 ± 9%   -48.32%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                      8.38k ± 0%     0.33k ± 3%   -96.06%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                     5.90k ±10%     5.89k ± 5%      ~     (p=0.796 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                     65.7k ± 0%     16.5k ± 0%   -74.87%  (p=0.002 n=8+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                    39.5k ± 1%     27.8k ± 2%   -29.54%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                    1.05M ± 0%     0.02M ± 0%   -98.33%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                    741k ± 0%      679k ± 1%    -8.32%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24           559 ± 0%       182 ± 2%   -67.42%  (p=0.000 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24          438 ± 8%       404 ±13%    -7.76%  (p=0.014 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24        8.39k ± 0%     0.36k ± 5%   -95.75%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24       6.38k ±11%     5.56k ± 9%   -12.85%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24       65.7k ± 0%     16.5k ± 0%   -74.84%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24      46.8k ± 1%     67.6k ± 1%   +44.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24      1.05M ± 0%     0.02M ± 0%   -98.33%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24      739k ± 1%      694k ± 1%    -6.08%  (p=0.000 n=10+10)
```

The truncation loops for the optimized strategy are very similar in both
directions, so I tried to extract the differences out into an interface.
However, this showed a non-trivial slowdown and an increase in
allocations, so I chose to have some duplicated code to get the best
performance.
Here is a snippet of the comparison when the interface was prototyped:

```
name                                                 old time/op    new time/op    delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24        36.9µs ± 3%    44.5µs ± 3%  +20.55%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24       74.8µs ± 5%    88.8µs ± 3%  +18.73%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24       64.9µs ± 1%    78.3µs ± 1%  +20.72%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       471µs ± 8%     682µs ±13%  +44.73%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24      6.34ms ± 1%    7.39ms ± 0%  +16.47%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24     11.2ms ± 1%    12.4ms ± 1%  +10.36%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24     6.40ms ± 2%    7.39ms ± 1%  +15.47%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24    70.9ms ± 1%   102.0ms ± 2%  +43.87%  (p=0.000 n=9+9)

name                                                 old alloc/op   new alloc/op   delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24        22.2kB ± 9%    30.4kB ± 0%  +36.55%  (p=0.000 n=10+7)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24       52.2kB ± 5%    67.6kB ± 4%  +29.47%  (p=0.000 n=8+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24       20.0kB ± 2%    32.2kB ± 1%  +60.86%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       372kB ±13%     600kB ±10%  +61.29%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24      3.24MB ± 0%    4.45MB ± 0%  +37.42%  (p=0.000 n=8+7)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24     6.61MB ± 0%    7.86MB ± 0%  +18.90%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24     2.75MB ± 2%    3.74MB ± 1%  +36.03%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24    65.7MB ± 1%    97.2MB ± 1%  +47.95%  (p=0.000 n=10+10)

name                                                 old allocs/op  new allocs/op  delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24           177 ± 2%       314 ± 0%  +77.40%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24          597 ± 8%       847 ± 8%  +41.89%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24          329 ± 3%       531 ± 2%  +61.40%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       6.02k ± 9%     9.56k ± 6%  +58.80%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24       16.5k ± 0%     32.9k ± 0%  +99.17%  (p=0.000 n=8+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24      53.5k ± 1%     73.1k ± 1%  +36.69%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24      17.5k ± 0%     33.9k ± 0%  +94.02%  (p=0.000 n=6+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24      727k ± 1%     1194k ± 1%  +64.31%  (p=0.000 n=10+10)
```

An additional knob is introduced to the batch truncation helper to
indicate whether the helper can take ownership of the passed-in requests
slice and reorder it as it pleases. This is the case for the Streamer,
but the DistSender relies on the order of requests not being modified,
so the helper makes a copy of the slice.

Some tests needed adjustment since requests are no longer necessarily
processed in the original order, so the population of ResumeSpans might
be different.

Fixes: #68536.
Addresses: #82159.

Release note: None

83358: sql: initialize sql instance during instance provider start r=ajwerner a=rharding6373

Before this change, there was a race condition where the instance
provider and the instance reader would start before the instance
provider created a SQL instance, potentially causing the reader to not
cache the instance before initialization was complete. This is
a problem in multi-tenant environments, where we may not be able to plan
queries if the reader does not know of any SQL instances.

This change moves SQL instance initialization into the instance
provider's `Start()` function before starting the reader, so that the
instance is guaranteed to be visible to the reader on its first
rangefeed scan of the `system.sql_instances` table.

Fixes: #82706
Fixes: #81807
Fixes: #81567

Release note: None

83683: kvstreamer: perform more memory accounting r=yuzefovich a=yuzefovich

**kvstreamer: perform more memory accounting**

This commit performs memory accounting for more of the internal state of
the streamer. In particular, it adds accounting for the overhead of
`Result`s in the results buffer (previously, we would only account for
the size of the Get or the Scan response but not for the `Result` struct
itself). It also adds accounting for all slices that are `O(N)` in size
where `N` is the number of requests.
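
The accounting idea can be sketched as follows, with stand-in types for both the `Result` and the memory account (this is illustrative only, not the actual kvstreamer accounting code):

```go
package main

import (
	"fmt"
	"unsafe"
)

type Result struct {
	Position int
	Payload  []byte
}

type memoryAccount struct{ used int64 }

func (a *memoryAccount) grow(n int64) { a.used += n }

const resultOverhead = int64(unsafe.Sizeof(Result{}))

// accountForResult charges both the fixed overhead of the Result struct
// (previously unaccounted for) and the full capacity of its payload.
func accountForResult(acc *memoryAccount, r Result) {
	acc.grow(resultOverhead)
	acc.grow(int64(cap(r.Payload)))
}

// accountForRequestSlices charges any bookkeeping slice that is O(N) in
// the number of enqueued requests.
func accountForRequestSlices(acc *memoryAccount, numRequests int, perElementSize int64) {
	acc.grow(int64(numRequests) * perElementSize)
}

func main() {
	var acc memoryAccount
	accountForResult(&acc, Result{Position: 0, Payload: make([]byte, 0, 128)})
	accountForRequestSlices(&acc, 100, 8)
	fmt.Println("bytes accounted:", acc.used)
}
```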

Addresses: #82160.

Release note: None

**kvstreamer: account for the overhead of each request**

Previously, we didn't account for some of the overhead of Get and Scan
requests, and this commit fixes that.

Here is the list of all things that we need to account for a single Get
request:
- (if referenced by `[]roachpb.RequestUnion`) the overhead of the
`roachpb.RequestUnion` object
- the overhead of `roachpb.RequestUnion_Get` object
- the overhead of `roachpb.GetRequest` object
- the footprint of the key inside of `roachpb.GetRequest`.

Previously, we only accounted for the first and the fourth items, and
now we account for all of them. Additionally, we used the
auto-generated `Size` method for the fourth item, which I believe
represents the size of the serialized protobuf message (possibly
compressed; I'm not sure), but we're interested in the capacities of
the underlying slices.
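
A sketch of the arithmetic, using stand-in structs that only mirror the shape of the wrappers listed above (the real types live in `roachpb`; nothing here is the actual accounting code):

```go
package main

import (
	"fmt"
	"unsafe"
)

// Stand-ins for the layers of wrapping around a single Get request. The
// point is that each layer has its own fixed overhead, and the key is
// charged by the capacity of its backing slice, not by a protobuf Size().
type getRequest struct{ key []byte }
type requestUnionGet struct{ get *getRequest }
type requestUnion struct{ value *requestUnionGet }

func accountedGetSize(key []byte) int64 {
	return int64(unsafe.Sizeof(requestUnion{})) + // 1. wrapper in []RequestUnion
		int64(unsafe.Sizeof(requestUnionGet{})) + // 2. the oneof wrapper
		int64(unsafe.Sizeof(getRequest{})) + // 3. the request itself
		int64(cap(key)) // 4. footprint of the key's backing slice
}

func main() {
	key := make([]byte, 10, 16)
	fmt.Println("bytes to account for one Get:", accountedGetSize(key))
}
```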

Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
craig bot pushed a commit that referenced this issue Jul 7, 2022
83547: stmtdiagnostics: remove the usage of gossip r=yuzefovich a=yuzefovich

This commit removes the usage of gossip in the stmt diagnostics feature
since it is really an optimization (i.e. not required for the feature
to work) and is in the way of the architecture unification (i.e. the
multi-tenancy effort).

Gossip was previously used to let all nodes in the cluster know about
a new stmt diagnostics request or about the cancellation of an existing
one (in fact, the gossip propagation of the latter was broken anyway
since we forgot to register the corresponding callback). We now rely on
the polling mechanism on each node to read from the system table and
populate the registry of current requests. This polling takes place
every `sql.stmt_diagnostics.poll_interval` seconds (10 by default).
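
A minimal sketch of such a polling loop (names are made up; this is not the actual `stmtdiagnostics` registry code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type diagRequest struct{ fingerprint string }

// pollRequests periodically re-reads the outstanding requests from the
// system table instead of being notified via gossip.
func pollRequests(
	ctx context.Context,
	interval time.Duration,
	readFromSystemTable func(context.Context) ([]diagRequest, error),
	apply func([]diagRequest),
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reqs, err := readFromSystemTable(ctx)
			if err != nil {
				continue // transient error; try again on the next tick
			}
			apply(reqs)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	read := func(context.Context) ([]diagRequest, error) {
		return []diagRequest{{fingerprint: "SELECT _"}}, nil
	}
	pollRequests(ctx, 100*time.Millisecond, read, func(rs []diagRequest) {
		fmt.Println("refreshed", len(rs), "requests")
	})
}
```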

Additionally, this commit changes the class of that setting from
TenantWritable to TenantReadOnly because
1. we don't charge the user for the polling
2. to prevent the abuse where a tenant could set the polling interval
to a very low value incurring no costs to themselves (we excluded the
polling from the tenant cost recently).

Follow-up work remains to bring back the optimization of propagating
new requests quickly, this time using a rangefeed on the system table.

Epic: CRDB-16702
Informs: #47893

Release note: None

83896: ui/cluster-ui: filter out closed sessions from active exec pages  r=xinhaoz a=xinhaoz

Previously, it was possible for the active transactions page
to show txns from closed sessions. The sessions API was
recently updated to return closed sessions, and it is
possible for the active_txn field in a closed session to be
populated. This commit filters out the closed sessions when
retrieving active transactions.

Release note (bug fix): active transactions page no longer
shows transactions from closed sessions

83903: ui: persist stmt view selection in sql activity page r=xinhaoz a=xinhaoz

Previously, the selection of viewing historical or active
executions for statements and transactions tabs in the SQL
activity page did not persist on tab change. This commit
persists the selection between tab changes in the SQL
activity page.

Release note (ui change): In the SQL Activity Page, the
selection to view historical or active executions will
persist between tabs.


https://www.loom.com/share/6990f8e273f64ddcaf971e85f0a8ef89

83926: kvstreamer: reuse the truncation helper and introduce a fast path for a single range case r=yuzefovich a=yuzefovich

**kvstreamer: introduce a single range fast path for request truncation**

This commit introduces a fast path to avoid the usage of the batch
truncation helper when all requests are contained within a single range.
Some modifications needed to be made to the `txnKVStreamer` so that it
didn't nil out the requests slice - we now delay that until right before
the next call to `Enqueue`.
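
The fast path boils down to a containment check along these lines (illustrative only, with plain string keys; not the actual kvstreamer code):

```go
package main

import "fmt"

type span struct{ start, end string }

// singleRangeFastPath reports whether the span covering all enqueued
// requests fits within one range, in which case the batch can be sent
// as-is and the truncation machinery is skipped entirely.
func singleRangeFastPath(allRequests span, rangeStart, rangeEnd string) bool {
	return allRequests.start >= rangeStart && allRequests.end <= rangeEnd
}

func main() {
	reqs := span{start: "b", end: "c"}
	if singleRangeFastPath(reqs, "a", "d") {
		fmt.Println("send the batch without truncating")
	}
}
```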

```
name                                 old time/op    new time/op    delta
IndexJoin/Cockroach-24                  6.21ms ± 1%    5.96ms ± 2%  -4.08%  (p=0.000 n=8+10)
IndexJoinColumnFamilies/Cockroach-24    8.97ms ± 4%    8.79ms ± 7%    ~     (p=0.190 n=10+10)

name                                  old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24                  1.39MB ± 1%    1.27MB ± 1%  -7.97%  (p=0.000 n=10+10)
IndexJoinColumnFamilies/Cockroach-24    1.46MB ± 1%    1.34MB ± 0%  -8.04%  (p=0.000 n=9+7)

name                                  old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24                   7.20k ± 1%     7.16k ± 1%  -0.61%  (p=0.022 n=10+10)
IndexJoinColumnFamilies/Cockroach-24     12.0k ± 1%     11.9k ± 0%  -0.83%  (p=0.000 n=9+8)
```

Addresses: #82159.

Release note: None

**kvcoord: refactor the truncation helper for reuse**

This commit refactors the batch truncation helper so that it can be
reused for multiple batches of requests. In particular, that ability is
now utilized by the streamer. Additionally, since the streamer now holds
on to the same truncation helper for possibly a long time, this commit
adds the memory accounting for the internal state of the helper.

Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
@yuzefovich
Member Author

The latest 20 runs of tpch_concurrency showed some regression: 75.65 with the streamer OFF vs 68.9 with the streamer ON. I'm currently looking at the core dumps to see what we're not accounting for. This has already exposed #84229 and #84230.

@yuzefovich
Member Author

I did 10 runs of different tpch_concurrency configs on a binary that included #84230 and #84319 (as well as already merged #84229), and here is what I got (the higher, the better):

  • high_refresh_spans_bytes: 78.5
  • no streamer: 107.6
  • with streamer: 104.

At this point, it seems like the stability of the streamer is roughly on par with the non-streamer config, so I'm marking this issue as "done", and will make sure to push #84230 and #84319 over the finish line. I also might look into the stability aspect during the stability period.

craig bot pushed a commit that referenced this issue Jul 22, 2022
84319: storage: be tighter with allocations when TargetBytes is present r=yuzefovich a=yuzefovich

Previously, while `put`ting the key into the `repr`, we could make an
allocation that was too large given the remaining `TargetBytes` budget.
This is because we exponentially increase the capacities of the
buffers up to 128MiB and because we only account for the length of the
slice even though the whole capacity has a memory footprint.

For example, with 10MiB of `TargetBytes` (which is used by SQL in many
cases) and a ScanResponse that exceeds that limit, we'd allocate
capacities that are powers of two, starting at, say, 256B, and would go
all the way up to 8MiB; however, given that 10MiB limit, we'd only use
2MiB of that last 8MiB `repr` when we encounter the target bytes limit
and stop. Such behavior is kinda ok if the response is marshalled by
gRPC to be sent to another node, but it is quite unfortunate in the
local fast-path cases (meaning the SQL and the KV layers are part of
the same physical machine). In the latter scenario, SQL would only
account for the lengths of the slices while keeping the whole slices
alive for a while, leading to significant unaccounted-for memory. In
the example above, on the order of 6MiB would be unaccounted for;
multiply that by some concurrency, and we have unaccounted memory on
the order of hundreds of MiBs.

The current behavior seems especially bad for the streamer use case
where we issue many requests with the `TargetBytes` set and use
`ScanResponse.NumBytes` field (which tracks the lengths of the slices)
for the memory accounting purposes.

In order to improve here, this commit teaches the `put` method about
the maximum capacity it can use. In the example above, the last slice
would be on the order of 2MiB, making everything happy: we stay very
close to the TargetBytes limit and don't have any wasted space.
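
A sketch of the clamping logic described above (this is not the actual storage code; the function and its parameters are made up):

```go
package main

import "fmt"

// nextBufferCap still grows capacities exponentially, but each growth step
// is clamped to what the remaining TargetBytes budget can possibly use
// (while always leaving room for the key/value being appended).
func nextBufferCap(curCap, needed, remainingTargetBytes int) int {
	newCap := curCap * 2
	if newCap < needed {
		newCap = needed
	}
	const maxCap = 128 << 20 // growth was previously capped only at 128MiB
	if newCap > maxCap {
		newCap = maxCap
	}
	if remainingTargetBytes > needed && newCap > remainingTargetBytes {
		newCap = remainingTargetBytes
	}
	return newCap
}

func main() {
	// With ~2MiB of budget left, plain doubling would have jumped from 4MiB
	// to 8MiB; the clamped version allocates only what can still be used.
	fmt.Println(nextBufferCap(4<<20, 1<<10, 2<<20)) // 2097152 (2MiB), not 8MiB
}
```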

Addresses: #64906.
Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023