
kvstreamer: improve the performance #82159

Closed
Tracked by #54680
yuzefovich opened this issue May 31, 2022 · 7 comments
Assignees: yuzefovich
Labels: C-performance (Perf of queries or internals. Solution not expected to change functional behavior.), T-sql-queries (SQL Queries Team)

Comments


yuzefovich commented May 31, 2022

Currently, the streamer is slower than the old code path in cases where the requests were already parallelized (index joins, as well as lookup joins when the lookup columns form a key). In an index join benchmark I saw about a 20-25% regression at some point. We should improve the performance; my goal is to bring the gap down to about 10%.

Getting to parity is not really feasible since the streamer has to perform additional work and tracking compared with the old code path. However, parity against the 22.1 release might be possible - that is the aspirational goal.

Jira issue: CRDB-16229

@yuzefovich yuzefovich added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label May 31, 2022
@yuzefovich yuzefovich self-assigned this May 31, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label May 31, 2022
@ajwerner (Contributor) commented:

It seems to me that one way the streamer could do better than the code that preceded it is by accumulating more than one batch of input and, in parallel, choosing to send to ranges where it expects a large amount of data. To some extent, I guess this is covered by:

```go
// Currently, enqueuing new requests while there are still requests in progress
// from the previous invocation is prohibited.
// TODO(yuzefovich): lift this restriction and introduce the pipelining.
```

@yuzefovich (Member, Author) commented:

Yeah, I think pipelining of those things essentially implements the idea you're describing. I filed #82163 to track this explicitly. I'm not sure it'll be implemented in the 22.2 timeframe though.

craig bot pushed a commit that referenced this issue Jun 4, 2022
82305: colfetcher: increase input batch size limit when using the Streamer r=yuzefovich a=yuzefovich

This commit bumps up the input batch size limit in the vectorized index
join from 4MiB to 8MiB when the streamer API is used. The old number was
copy-pasted from the row-by-row join reader, and the new number is
derived from running TPCH queries 4, 5, 6, 10, 12, 14, 15, 16 using
`tpchvec/bench`. I did a quick run of the same queries with the
row-by-row engine used by default, and the default of 4MiB there seemed
reasonable, so I didn't change that one.

With the non-streamer code path we didn't want to go higher than 4MiB
since it showed diminishing returns while exposing the cluster to higher
instability (since the non-streamer code path doesn't use the memory
limits in the KV layer). The streamer does use memory limits for
BatchRequests, so it should be safe to increase this limit.

Additionally, this commit makes it so that 1/8th (rather than 1/4th) of
the workmem limit is reserved for the output batch of the cFetcher. With
the default value of 64MiB, this translates to 8MiB, which is large
enough.
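
As a rough illustration of that sizing (hypothetical constant and function names, not the actual colfetcher code):

```go
package main

import "fmt"

const (
	rowByRowInputBatchLimit = 4 << 20 // 4MiB, the pre-existing join reader limit
	streamerInputBatchLimit = 8 << 20 // 8MiB, acceptable since the streamer enforces KV-side memory limits
)

// outputBatchReservation reserves 1/8th (previously 1/4th) of workmem for the
// cFetcher's output batch.
func outputBatchReservation(workmem int64) int64 {
	return workmem / 8
}

func main() {
	workmem := int64(64 << 20) // 64MiB default
	fmt.Println(streamerInputBatchLimit, outputBatchReservation(workmem)) // 8388608 8388608
}
```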

Addresses: #82159.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jun 6, 2022
82422: kvstreamer: clean up Result struct r=yuzefovich a=yuzefovich

This commit cleans up the `Result` struct in order to reduce its memory
size, bringing it down into the 48-byte size class.
```
name                             old time/op    new time/op    delta
IndexJoin/Cockroach-24             6.29ms ± 1%    6.22ms ± 2%   -1.06%  (p=0.024 n=9+9)
IndexJoin/MultinodeCockroach-24    7.99ms ± 1%    7.93ms ± 2%     ~     (p=0.165 n=10+10)

name                             old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24             1.64MB ± 1%    1.48MB ± 0%   -9.25%  (p=0.000 n=9+10)
IndexJoin/MultinodeCockroach-24    2.37MB ± 1%    2.20MB ± 1%   -7.06%  (p=0.000 n=10+10)

name                             old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24              8.15k ± 1%     7.15k ± 1%  -12.28%  (p=0.000 n=8+10)
IndexJoin/MultinodeCockroach-24     12.7k ± 1%     11.7k ± 1%   -8.18%  (p=0.000 n=10+10)
```

The main change of this commit is the removal of the concept of "enqueue
keys" from the Streamer API in favor of relying on `Result.Position`.
When requests are unique, a single `Result` can satisfy only a single
enqueue key; however, for non-unique requests a single `Result` can
satisfy multiple requests and, thus, can have multiple enqueue keys. At
the moment, only unique requests are supported though. Once non-unique
requests are supported too, we'll need to figure out how to handle those
(maybe we'll return a `Result` `N` times if it satisfies `N` original
requests with different values for `Position`).

Also, support for multiple "enqueue keys" was initially envisioned for
the case of `multiSpanGenerator` in the lookup joins (i.e. multi-equality
lookup joins); however, I believe we should push that complexity out of
the streamer (into `TxnKVStreamer`), which is what this commit does.

Other changes done in this commit:
- unexport `ScanResp.Complete` field since this is currently only
used within the `kvstreamer` package
- reorder all existing fields so that the footprint of the struct is
minimized (in particular, `scanComplete` field is moved to the bottom
and `ScanResp` anonymous struct is removed)
- make `subRequestIdx` `int32` rather than `int`. This value is bounded
by the number of ranges in the cluster, so max int32 is more than
sufficient (see the sketch below).
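
A toy sketch of the size-class effect from the last two bullets (made-up fields, not the actual `Result` definition):

```go
package main

import (
	"fmt"
	"unsafe"
)

// resultBefore has the bool first, which forces padding on a 64-bit platform.
type resultBefore struct {
	scanComplete  bool  // 1 byte + 7 bytes of padding before the next field
	memoryTok     int64 // 8 bytes
	subRequestIdx int   // 8 bytes
	position      int32 // 4 bytes + 4 bytes of trailing padding
}

// resultAfter reorders the fields and narrows subRequestIdx to int32.
type resultAfter struct {
	memoryTok     int64 // 8 bytes
	position      int32 // 4 bytes
	subRequestIdx int32 // 4 bytes: bounded by the number of ranges, so int32 suffices
	scanComplete  bool  // 1 byte + 7 bytes of trailing padding
}

func main() {
	fmt.Println(unsafe.Sizeof(resultBefore{}), unsafe.Sizeof(resultAfter{})) // 32 24
}
```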

Addresses: #82159.
Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jun 10, 2022
81565: roachtest: benchmark node decommission r=AlexTalks a=AlexTalks

roachtest: benchmark node decommission

While previously some roachtests existed for the purposes of
testing the decommission process, we have not had any benchmarks to
track how long it takes to decommission a node, making it difficult to
reason about how to understand what makes decommission so slow. This
change adds benchmarks for node decommission under a number of
configurations, including variable numbers of nodes/cpus, TPCC
warehouses, and with admission control enabled vs. disabled.

Some initial runs of the test have shown the following averages:
```
decommissionBench/nodes=4/cpu=16/warehouses=1000: 16m14s
decommissionBench/nodes=4/cpu=16/warehouses=1000/no-admission: 15m48s
decommissionBench/nodes=4/cpu=16/warehouses=1000/while-down: 20m36s
decommissionBench/nodes=8/cpu=16/warehouses=3000: 18m30s
```

Release note: None

82382: kvstreamer: optimize singleRangeBatch.Less r=yuzefovich a=yuzefovich

**bench: add benchmarks for lookup joins**

This commit adds benchmarks for lookup joins, both when the equality
columns are and are not a key, and both with and without maintaining
ordering.

Release note: None

**kvstreamer: optimize singleRangeBatch.Less**

This commit optimizes the `singleRangeBatch.Less` method, which is used
when sorting the requests inside of these objects in the OutOfOrder mode
(needed to get the low-level Pebble speedups), by storing the start keys
explicitly instead of performing a couple of function calls on each
`Less` invocation (see the sketch below).
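
A minimal sketch of the optimization (hypothetical types and fields, not the actual kvstreamer code): the start keys are materialized once before sorting, so `Less` reduces to a single `bytes.Compare`:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type request struct{ span [2][]byte }

// startKey stands in for the function calls the old Less performed on every comparison.
func (r request) startKey() []byte { return r.span[0] }

type singleRangeBatch struct {
	reqs []request
	// reqsKeys[i] caches reqs[i].startKey(); populated once before sorting.
	reqsKeys [][]byte
}

func (r *singleRangeBatch) Len() int { return len(r.reqs) }
func (r *singleRangeBatch) Swap(i, j int) {
	r.reqs[i], r.reqs[j] = r.reqs[j], r.reqs[i]
	r.reqsKeys[i], r.reqsKeys[j] = r.reqsKeys[j], r.reqsKeys[i]
}
func (r *singleRangeBatch) Less(i, j int) bool {
	return bytes.Compare(r.reqsKeys[i], r.reqsKeys[j]) < 0
}

func main() {
	b := &singleRangeBatch{reqs: []request{
		{span: [2][]byte{[]byte("c"), nil}},
		{span: [2][]byte{[]byte("a"), nil}},
	}}
	for _, req := range b.reqs {
		b.reqsKeys = append(b.reqsKeys, req.startKey())
	}
	sort.Sort(b)
	fmt.Printf("%s\n", b.reqsKeys) // [a c]
}
```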

```
name                             old time/op    new time/op    delta
IndexJoin/Cockroach-24             6.30ms ± 1%    5.78ms ± 1%  -8.31%  (p=0.000 n=10+10)
IndexJoin/MultinodeCockroach-24    8.01ms ± 1%    7.51ms ± 1%  -6.28%  (p=0.000 n=10+10)

name                             old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24             1.55MB ± 0%    1.57MB ± 0%  +0.98%  (p=0.000 n=9+10)
IndexJoin/MultinodeCockroach-24    2.28MB ± 2%    2.30MB ± 1%    ~     (p=0.400 n=10+9)

name                             old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24              8.16k ± 1%     8.13k ± 1%    ~     (p=0.160 n=10+10)
IndexJoin/MultinodeCockroach-24     12.7k ± 1%     12.6k ± 0%    ~     (p=0.128 n=10+9)
```
```
name                                                    old time/op    new time/op    delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             6.89ms ± 1%    6.43ms ± 1%  -6.65%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    8.03ms ± 1%    7.48ms ± 2%  -6.92%  (p=0.000 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         9.21ms ± 3%    8.82ms ± 5%  -4.23%  (p=0.007 n=10+10)
LookupJoinNoOrdering/MultinodeCockroach-24                11.9ms ± 3%    11.5ms ± 3%  -3.36%  (p=0.002 n=9+10)

name                                                    old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             1.81MB ± 1%    1.84MB ± 0%  +1.23%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    2.50MB ± 2%    2.54MB ± 1%  +1.76%  (p=0.004 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         1.89MB ± 0%    1.91MB ± 1%  +1.09%  (p=0.000 n=9+9)
LookupJoinNoOrdering/MultinodeCockroach-24                2.37MB ± 2%    2.42MB ± 4%  +1.85%  (p=0.010 n=10+9)

name                                                    old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24              10.8k ± 0%     10.8k ± 1%    ~     (p=0.615 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24     15.1k ± 1%     15.0k ± 0%    ~     (p=0.101 n=10+10)
LookupJoinNoOrdering/Cockroach-24                          13.3k ± 1%     13.3k ± 1%    ~     (p=0.549 n=10+9)
LookupJoinNoOrdering/MultinodeCockroach-24                 17.3k ± 1%     17.3k ± 1%    ~     (p=0.460 n=10+8)
```

Addresses: #82159

Release note: None

82740: build: remove crdb-protobuf-client node_modules with ui-maintainer-clean r=maryliag,rickystewart a=sjbarag

Since version 33 [1], `dev ui clean --all` removes the
pkg/ui/workspaces/db-console/src/js/node_modules tree. Remove that tree
with `make ui-maintainer-clean` to keep parity between the two build
systems.

[1] 2e9e7a5 (dev: bump to version 33, 2022-05-27)

Release note: None

82744: ui: update cluster-ui to v22.2.0-prerelease-2 r=maryliag a=maryliag

Update cluster-ui to the latest published version.

Release note: None

82748: ci: skip Docker test in CI r=ZhouXing19 a=rickystewart

This has been flaky for a while, skipping until we have more information
about what's going on here.

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Sean Barag <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
craig bot pushed a commit that referenced this issue Jun 23, 2022
83166: roachtest: add streamer config to tpchvec r=yuzefovich a=yuzefovich

**roachtest: disable merge queue in some TPCH tests**

This commit disables the merge queue during most of the TPCH tests. Some
of these tests are performance oriented, so we want to keep as many
things constant as possible. Also, having more ranges around gives us
better testing coverage of the distributed query execution.

Release note: None

**roachtest: add streamer config to tpchvec**

This commit refactors the `tpchvec` roachtest to introduce a new
`streamer` config which runs all TPCH queries with the streamer ON and
OFF by default. This is used to make it easier to track the performance
of the streamer as well as to catch regressions in performance in the
future (since the test will fail if OFF config is significantly faster
than ON config). At the moment, the test will fail if the streamer ON
config is slower by at least 3x than the OFF config, but over time
I plan to reducing that threshold.

Informs: #82159.

Release note: None

**roachtest: refactor tpchvec a bit**

This commit refactors `tpchvec` roachtest so that queries run in the
query-major order rather than the config-major order. Previously, we
would perform the cluster setup, run all queries on that setup, then
perform the setup for the second test config, run all queries again,
and then analyze the results. However, I believe for perf-oriented
tests it's better to run each query on all configs right away (so
that the chance of range movement is relatively low), and this
commit makes such a change. This required the removal of the
`perf_no_stats` test config (which probably wasn't adding much value).

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>

yuzefovich commented Jun 23, 2022

Here is the comparison of the streamer OFF vs streamer ON (on master with #82384, #82387, #82865, #83010, #83472 cherry-picked):

```
(streamer OFF / streamer ON / delta)
Q1: 3.66s	3.69s	0.75%
Q2: 0.27s	0.24s	-10.66%
Q3: 2.54s	2.04s	-19.65%
Q4: 2.36s	1.43s	-39.43%
Q5: 3.38s	2.44s	-27.71%
Q6: 9.40s	8.18s	-12.98%
Q7: 8.71s	6.43s	-26.13%
Q8: 1.37s	0.95s	-30.46%
Q9: 7.88s	5.22s	-33.73%
Q10: 2.56s	1.64s	-36.15%
Q11: 0.75s	0.66s	-11.59%
Q12: 10.21s	8.35s	-18.22%
Q13: 1.09s	1.08s	-0.63%
Q14: 0.78s	0.68s	-11.80%
Q15: 6.64s	6.54s	-1.43%
Q16: 1.10s	0.94s	-14.58%
Q17: 0.57s	0.59s	3.15%
Q18: 6.07s	6.06s	-0.20%
Q19: 3.79s	3.78s	-0.19%
Q20: 17.11s	14.27s	-16.62%
Q21: 10.19s	6.00s	-41.08%
Q22: 0.74s	0.66s	-10.61%
```

craig bot pushed a commit that referenced this issue Jun 29, 2022
83472: kvstreamer: improve the avg response size heuristic r=yuzefovich a=yuzefovich

This commit improves the heuristic we use for estimating the average
response size. Previously, we used a simple average; now we multiply
the average by 1.5.

Having this multiple is good for a couple of reasons:
- this allows us to fulfill requests that are slightly larger than the
current average. For example, imagine that we're processing three
requests of sizes 100B, 110B, and 120B sequentially, one at a time.
Without the multiple, after the first request, our estimate would be 100B
so the second request would come back empty (with ResumeNextBytes=110),
so we'd have to re-issue the second request. At that point the average is
105B, so the third request would again come back empty and need to be
re-issued with larger TargetBytes. Having the multiple allows us to
handle such a scenario without any requests coming back empty. In
particular, TPCH Q17 has a similar setup.
- this allows us to slowly grow the TargetBytes parameter over time when
requests can be returned partially multiple times (i.e. Scan requests
spanning multiple rows). Consider a case when a single Scan request has
to return 1MB worth of data, but each row is only 100B. With the initial
estimate of 1KB, every request would always come back with exactly 10
rows, and the avg response size would always stay at 1KB. We'd end up
issuing 1000 such requests. Having a multiple here allows us to grow
the estimate over time, reducing the total number of requests needed.

This multiple seems to fix the remaining perf regression on Q17 when
comparing against the streamer OFF config.

This commit also introduces a cluster setting that controls this
multiple. The value of 1.5 was chosen using `tpchvec/bench` and this
setting.
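
A minimal sketch of the heuristic (hypothetical names; the multiple and the 1KiB initial estimate mirror the values discussed above):

```go
package main

import "fmt"

const avgResponseSizeMultiple = 1.5 // hypothetical stand-in for the cluster setting

type avgResponseEstimator struct {
	totalBytes   float64
	numResponses float64
}

func (e *avgResponseEstimator) update(bytes float64) {
	e.totalBytes += bytes
	e.numResponses++
}

// targetBytes is what would be used as the TargetBytes limit of the next request.
func (e *avgResponseEstimator) targetBytes(initialEstimate float64) float64 {
	if e.numResponses == 0 {
		return initialEstimate
	}
	return e.totalBytes / e.numResponses * avgResponseSizeMultiple
}

func main() {
	var e avgResponseEstimator
	fmt.Println(e.targetBytes(1024)) // 1024: the 1KiB initial estimate
	for _, sz := range []float64{100, 110, 120} {
		e.update(sz)
	}
	// 165 = avg(110) * 1.5, large enough to fulfill the 120B request without a retry.
	fmt.Println(e.targetBytes(1024))
}
```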

Additionally, I introduced a similar cluster setting for the initial avg
response size estimate (currently hard-coded at 1KiB) and used
`tpchvec/bench`, which showed that the 1KiB value is pretty good. It was
also the value mentioned in the RFC, so I decided to remove the
corresponding setting.

Addresses: #82159.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jul 1, 2022
83197: kvserver: bump the in-memory GC threshold as a pre-apply side effect r=aayushshah15 a=aayushshah15

This commit changes where the in-memory GC threshold on a Replica is
bumped during the application of a GC request.

Previously, the in-memory GC threshold was bumped as a post-apply side
effect. Additionally, GC requests do not acquire latches at the
timestamp that they are garbage collecting, and thus, readers need to
take additional care to ensure that results aren't served off of a
partial state.

Readers today rely on the invariant that the in-memory GC threshold is
bumped before the actual garbage collection. Today this holds because
the mvccGCQueue issues GC requests in 2 phases: the first simply
bumps the in-memory GC threshold, and the second performs the
actual garbage collection. If the in-memory GC threshold were bumped in
the pre-apply phase of command application, this usage quirk wouldn't
need to exist. That's what this commit does.

Relates to #55293

Release note: None



83659: kvstreamer: optimize the results buffers a bit r=yuzefovich a=yuzefovich

**kvstreamer: make InOrder-mode structs implement heap logic on their own**

This commit refactors the `inOrderRequestsProvider` as well as the
`inOrderResultsBuffer` to make them maintain the min heap over the
requests and results, respectively, on their own. This allows us to
avoid allocations for `interface{}` objects that occur when using
`heap.Interface`. The code was copied, with minor adjustments, from the
standard library.
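
A minimal sketch of such a specialized heap (hypothetical `result` type, not the actual buffer code); the sift operations work directly on a typed slice, so nothing is boxed into `interface{}`:

```go
package main

import "fmt"

type result struct{ position int }

type resultsHeap []result

func (h resultsHeap) less(i, j int) bool { return h[i].position < h[j].position }

func (h *resultsHeap) push(r result) {
	*h = append(*h, r)
	// Sift the new element up to its place.
	i := len(*h) - 1
	for i > 0 {
		parent := (i - 1) / 2
		if !h.less(i, parent) {
			break
		}
		(*h)[i], (*h)[parent] = (*h)[parent], (*h)[i]
		i = parent
	}
}

func (h *resultsHeap) pop() result {
	old := *h
	n := len(old)
	old[0], old[n-1] = old[n-1], old[0]
	r := old[n-1]
	*h = old[:n-1]
	// Sift the element now at the root down to its place.
	i := 0
	for {
		l, smallest := 2*i+1, i
		if l < n-1 && h.less(l, smallest) {
			smallest = l
		}
		if rgt := l + 1; rgt < n-1 && h.less(rgt, smallest) {
			smallest = rgt
		}
		if smallest == i {
			break
		}
		(*h)[i], (*h)[smallest] = (*h)[smallest], (*h)[i]
		i = smallest
	}
	return r
}

func main() {
	var h resultsHeap
	for _, p := range []int{3, 1, 2} {
		h.push(result{position: p})
	}
	fmt.Println(h.pop(), h.pop(), h.pop()) // {1} {2} {3}
}
```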
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.83ms ± 3%    8.58ms ± 3%   -2.82%  (p=0.002 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.3ms ± 1%    10.1ms ± 2%   -1.86%  (p=0.009 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.7ms ± 5%    10.5ms ± 3%     ~     (p=0.123 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                14.0ms ± 5%    13.6ms ± 5%   -3.16%  (p=0.043 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.77MB ± 2%    1.59MB ± 1%   -9.94%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.57MB ± 3%    2.38MB ± 1%   -7.57%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.67MB ± 0%    1.50MB ± 1%  -10.08%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                2.53MB ± 2%    2.34MB ± 2%   -7.78%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              12.5k ± 1%     10.4k ± 1%  -16.54%  (p=0.000 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     17.6k ± 0%     15.5k ± 0%  -11.64%  (p=0.000 n=8+10)
LookupJoinOrdering/Cockroach-24                          14.4k ± 1%     12.4k ± 1%  -14.17%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                 20.6k ± 1%     18.5k ± 1%  -10.19%  (p=0.000 n=10+9)
```

Addresses: #82159.

Release note: None

**kvstreamer: refactor the OutOfOrder requests provider a bit**

This commit refactors the `requestsProvider` interface, renaming a couple
of methods - `s/firstLocked/nextLocked/` and
`s/removeFirstLocked/removeNextLocked/`. The idea is that for the
InOrder mode "first" == "next", but for the OutOfOrder mode we can pick
an arbitrary request that is not necessarily "first". This commit then
makes the OutOfOrder requests provider pick the last request in its
queue, which should reduce memory usage in case resume requests are
added later.

Previously, we would slice "next" requests off the front and append to
the end, and it could trigger reallocation of the underlying slice. Now,
we'll be updating only the "tail" of the queue, and since a single
request can result in at most one resume request, the initial slice
provided in `enqueue` will definitely have enough capacity.
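
A minimal sketch of the tail-picking idea (hypothetical types, not the actual requests provider):

```go
package main

import "fmt"

type singleRangeBatch struct{ id int }

type outOfOrderRequestsProvider struct {
	requests []singleRangeBatch
}

// nextLocked returns the request to be processed next; for the OutOfOrder
// provider any request will do, so pick the last one.
func (p *outOfOrderRequestsProvider) nextLocked() singleRangeBatch {
	return p.requests[len(p.requests)-1]
}

func (p *outOfOrderRequestsProvider) removeNextLocked() {
	p.requests = p.requests[:len(p.requests)-1]
}

// addLocked adds a resume request; since each in-flight request produces at
// most one resume request, re-appending never exceeds the original capacity.
func (p *outOfOrderRequestsProvider) addLocked(r singleRangeBatch) {
	p.requests = append(p.requests, r)
}

func main() {
	p := &outOfOrderRequestsProvider{requests: []singleRangeBatch{{1}, {2}, {3}}}
	next := p.nextLocked()
	p.removeNextLocked()
	p.addLocked(singleRangeBatch{id: next.id + 10}) // pretend it needed a resume request
	fmt.Println(p.requests)                         // [{1} {2} {13}]
}
```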

Release note: None

**kvstreamer: reduce allocations in the InOrder results buffer**

This commit improves the InOrder results buffer by reusing the same
scratch space for returning the results on `get()` calls, as well as by
reducing the size of the `inOrderBufferedResult` struct from the 80-byte
size class to the 64-byte size class.

It also cleans up a couple of comments and clarifies the `GetResults`
method a bit.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.47ms ± 3%    8.54ms ± 2%    ~     (p=0.182 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.2ms ± 2%    10.1ms ± 2%    ~     (p=0.280 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.4ms ± 3%    10.3ms ± 4%    ~     (p=0.393 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                13.8ms ± 5%    13.7ms ± 5%    ~     (p=0.739 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.59MB ± 2%    1.49MB ± 1%  -6.20%  (p=0.000 n=10+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.36MB ± 2%    2.27MB ± 2%  -3.49%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.51MB ± 1%    1.42MB ± 1%  -6.27%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                2.37MB ± 6%    2.26MB ± 2%  -4.77%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              10.4k ± 1%     10.3k ± 1%    ~     (p=0.055 n=8+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     15.5k ± 0%     15.5k ± 1%  -0.34%  (p=0.037 n=10+10)
LookupJoinOrdering/Cockroach-24                          12.4k ± 1%     12.3k ± 1%  -0.98%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                 18.5k ± 1%     18.5k ± 2%    ~     (p=0.743 n=8+9)
```

Addresses: #82160.

Release note: None

83705: teamcity: `grep` for pebble version in `go.mod` rather than `DEPS.bzl` r=nicktrav,jbowens a=rickystewart

The `grep` can fail in the case that there is no more recent version of
`pebble` than the one that is vendored in-tree. This fixes that case.

Release note: None

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
craig bot pushed a commit that referenced this issue Jul 6, 2022
82865: kvcoord: optimize batch truncation loop r=yuzefovich a=yuzefovich

**kvcoord: truncate BatchRequest to range boundaries serially**

This commit refactors the DistSender's loop of iterating over the range
descriptors so that the truncation of the BatchRequest happens
serially. This incurs a minor performance hit when the requests are sent
in parallel, but it makes it possible to apply the optimizations to this
iteration in the following commits.

Release note: None

**kvcoord: merge truncate_test into batch_test**

This commit moves `TestTruncate` as well as `BenchmarkTruncate` into
`batch_test.go` file in the preparation for the refactor done in the
following commit. The code itself wasn't changed at all.

Release note: None

**kvcoord: introduce batch truncation helper**

This commit introduces a batch truncation helper that encompasses the
logic of truncating requests to the boundaries of a single range as well
as returning the next key to seek the range iterator to. The helper is
now used both in the DistSender and in the Streamer. No modification to
the actual logic of the `Truncate`, `Next`, or `prev` functions has been
made other than incorporating the return of the next seek key into the
`Truncate` function itself. This is needed since the following commit
tightly couples the truncation process with the next seek key
determination in order to optimize it.

The helper can be configured with a knob indicating whether `Truncate`
needs to return requests in the original order. This behavior is
necessitated by BatchRequests that contain writes, since in several
spots we rely on ordering assumptions (e.g. increasing values of
`Sequence`).

The following adjustments were made to the tests:
- `BenchmarkTruncate` has been renamed to `BenchmarkTruncateLegacy`
- `TestTruncate` has been refactored to exercise the new and the old
code-paths
- `TestBatchPrevNext` has been refactored to run through the new code
path; a few test cases have also been adjusted slightly.

This commit also introduces some unit tests for the new code path when
it runs in a loop over multiple ranges as well as a corresponding
benchmark.

Release note: None

**kvcoord: optimize batch truncation loop**

This commit optimizes the batch truncation loop for the case when the
requests only use global keys.

The optimized approach sorts the requests according to their start keys
(with the Ascending scan direction) or in the reverse order of their end
keys (with the Descending scan direction).  Then on each `Truncate` call
it looks only at a subset of the requests (that haven't been fully
processed yet and don't start after the current range), allowing us to
avoid many unnecessary key comparisons. Please see a comment on the new
`truncateAsc` and `truncateDesc` functions for more details and examples.

The optimized approach actually has a worse time complexity in the
absolute worst case (when each request is a range-spanning one and
actually spans all of the ranges against which the requests are
truncated) - because of the need to sort the requests upfront - but in
practice, it is much faster, especially with point requests.
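
A minimal sketch of the ascending strategy (hypothetical: point requests modeled as plain string keys rather than the actual request spans):

```go
package main

import (
	"fmt"
	"sort"
)

type truncateHelper struct {
	keys     []string // start keys of point requests, sorted ascending
	startIdx int      // requests before this index were fully processed
}

func newTruncateHelper(keys []string) *truncateHelper {
	sort.Strings(keys)
	return &truncateHelper{keys: keys}
}

// truncate returns the requests that fall into [rangeStart, rangeEnd).
func (h *truncateHelper) truncate(rangeStart, rangeEnd string) []string {
	var out []string
	i := h.startIdx
	// Stop at the first request that starts at or beyond the range's end key;
	// requests before startIdx were already handed out for previous ranges.
	for ; i < len(h.keys) && h.keys[i] < rangeEnd; i++ {
		if h.keys[i] >= rangeStart {
			out = append(out, h.keys[i])
		}
	}
	h.startIdx = i
	return out
}

func main() {
	h := newTruncateHelper([]string{"d", "a", "f", "b"})
	fmt.Println(h.truncate("a", "c")) // [a b]
	fmt.Println(h.truncate("c", "e")) // [d]
	fmt.Println(h.truncate("e", "z")) // [f]
}
```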

```
name                                                               old time/op    new time/op    delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                       59.8µs ± 1%    33.9µs ± 3%   -43.38%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                      83.0µs ± 4%    69.7µs ± 7%   -15.98%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                       865µs ± 1%      62µs ± 1%   -92.84%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                     1.07ms ± 5%    0.46ms ± 8%   -56.95%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                     7.09ms ± 0%    5.99ms ± 1%   -15.56%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                    9.22ms ± 0%   10.52ms ± 1%   +14.08%  (p=0.000 n=9+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                     108ms ± 0%       6ms ± 2%   -94.50%  (p=0.000 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                    129ms ± 1%      71ms ± 1%   -45.06%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24         60.2µs ± 1%    36.0µs ± 2%   -40.14%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24        82.4µs ± 8%    72.1µs ± 4%   -12.51%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24         862µs ± 1%      72µs ± 1%   -91.65%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24       1.06ms ± 4%    0.49ms ± 8%   -53.39%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24       7.24ms ± 0%    6.46ms ± 1%   -10.74%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24      10.1ms ± 2%     9.8ms ± 2%    -2.36%  (p=0.000 n=10+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24       107ms ± 0%       7ms ± 0%   -93.57%  (p=0.000 n=8+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24      122ms ± 0%      80ms ± 1%   -34.32%  (p=0.000 n=9+9)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                      78.9µs ± 1%    36.4µs ± 3%   -53.81%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                     79.4µs ± 4%    52.3µs ± 5%   -34.14%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                     1.16ms ± 1%    0.07ms ± 1%   -94.39%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                    1.01ms ± 4%    0.46ms ± 5%   -54.56%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                    9.42ms ± 0%    6.26ms ± 1%   -33.52%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                   8.41ms ± 1%    9.22ms ± 1%    +9.54%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                    145ms ± 0%       6ms ± 1%   -95.63%  (p=0.000 n=9+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                   125ms ± 1%      67ms ± 1%   -46.31%  (p=0.000 n=9+9)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24        77.8µs ± 1%    39.6µs ± 2%   -49.10%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24       74.0µs ± 3%    63.0µs ± 6%   -14.92%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24       1.16ms ± 1%    0.08ms ± 1%   -93.47%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24      1.04ms ± 5%    0.47ms ± 7%   -54.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24      9.50ms ± 0%    6.73ms ± 1%   -29.21%  (p=0.000 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24     8.88ms ± 1%   13.24ms ± 1%   +49.04%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24      146ms ± 0%       7ms ± 1%   -94.98%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24     125ms ± 1%      74ms ± 1%   -40.75%  (p=0.000 n=10+9)

name                                                               old alloc/op   new alloc/op   delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                       7.58kB ± 0%   21.00kB ±11%  +176.84%  (p=0.000 n=7+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                      39.8kB ± 6%    49.1kB ±15%   +23.46%  (p=0.000 n=9+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                      6.48kB ± 5%   18.25kB ± 2%  +181.79%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                      428kB ±20%     368kB ±13%   -13.85%  (p=0.003 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                     1.60MB ± 0%    2.91MB ± 0%   +82.49%  (p=0.000 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                    5.24MB ± 0%    5.89MB ± 0%   +12.41%  (p=0.000 n=10+8)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                    1.15MB ± 4%    2.41MB ± 1%  +110.09%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                   69.8MB ± 1%    64.6MB ± 1%    -7.55%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24         9.77kB ±22%   21.98kB ± 4%  +125.07%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24        38.9kB ±23%    49.9kB ± 2%   +28.28%  (p=0.000 n=10+8)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24        6.56kB ± 4%   20.20kB ± 3%  +208.11%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24        407kB ±15%     372kB ±13%    -8.68%  (p=0.043 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24       1.65MB ± 0%    3.62MB ± 0%  +118.55%  (p=0.000 n=8+8)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24      6.60MB ± 2%    5.65MB ± 1%   -14.38%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24      1.10MB ± 5%    2.77MB ± 1%  +152.58%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24     60.5MB ± 1%    67.9MB ± 1%   +12.19%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                      17.2kB ±10%    21.5kB ± 1%   +24.91%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                     35.5kB ±12%    29.1kB ± 4%   -17.83%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                      138kB ± 1%      20kB ± 3%   -85.34%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                     344kB ±15%     363kB ±10%    +5.50%  (p=0.035 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                    2.78MB ± 0%    3.35MB ± 0%   +20.24%  (p=0.000 n=8+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                   4.42MB ± 6%    5.29MB ± 0%   +19.67%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                   17.9MB ± 0%     2.7MB ± 2%   -85.21%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                  65.3MB ± 0%    61.0MB ± 1%    -6.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24        15.9kB ± 3%    26.7kB ± 1%   +67.87%  (p=0.000 n=10+9)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24       29.4kB ± 6%    41.6kB ± 5%   +41.50%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24        138kB ± 0%      23kB ± 3%   -83.61%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24       390kB ±19%     350kB ±11%   -10.16%  (p=0.015 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24      2.69MB ± 4%    3.51MB ± 1%   +30.22%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24     4.89MB ± 1%    8.19MB ± 0%   +67.68%  (p=0.000 n=10+9)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24     17.9MB ± 0%     3.0MB ± 2%   -83.34%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24    65.4MB ± 1%    62.9MB ± 1%    -3.81%  (p=0.000 n=10+10)

name                                                               old allocs/op  new allocs/op  delta
TruncateLoop/asc/reqs=128/ranges=4/type=get-24                         50.0 ± 0%      52.4 ±10%    +4.80%  (p=0.017 n=8+10)
TruncateLoop/asc/reqs=128/ranges=4/type=scan-24                         569 ±12%       557 ±15%      ~     (p=0.617 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=get-24                         207 ± 4%       210 ± 5%      ~     (p=0.380 n=10+10)
TruncateLoop/asc/reqs=128/ranges=64/type=scan-24                      6.97k ±13%     6.02k ± 9%   -13.64%  (p=0.000 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=get-24                        126 ± 0%       122 ± 0%    -3.17%  (p=0.002 n=8+10)
TruncateLoop/asc/reqs=16384/ranges=4/type=scan-24                     51.9k ± 1%     42.8k ± 1%   -17.59%  (p=0.000 n=10+9)
TruncateLoop/asc/reqs=16384/ranges=64/type=get-24                     1.12k ± 1%     1.12k ± 1%    +0.43%  (p=0.027 n=10+10)
TruncateLoop/asc/reqs=16384/ranges=64/type=scan-24                     786k ± 1%      714k ± 1%    -9.13%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=get-24           41.2 ± 3%      58.0 ± 3%   +40.78%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=4/type=scan-24           574 ±18%       532 ±11%      ~     (p=0.143 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=get-24           205 ± 2%       234 ± 3%   +14.26%  (p=0.000 n=9+9)
TruncateLoop/asc/preserveOrder/reqs=128/ranges=64/type=scan-24        6.70k ± 9%     6.00k ± 9%   -10.40%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=get-24          127 ± 0%       125 ± 0%    -1.57%  (p=0.001 n=8+9)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=4/type=scan-24       63.7k ± 1%     27.8k ± 2%   -56.34%  (p=0.000 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=get-24       1.14k ± 0%     1.14k ± 1%      ~     (p=0.515 n=10+10)
TruncateLoop/asc/preserveOrder/reqs=16384/ranges=64/type=scan-24       696k ± 1%      752k ± 1%    +7.97%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=get-24                         554 ± 1%       169 ± 2%   -69.52%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24                        519 ± 9%       268 ± 9%   -48.32%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24                      8.38k ± 0%     0.33k ± 3%   -96.06%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24                     5.90k ±10%     5.89k ± 5%      ~     (p=0.796 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24                     65.7k ± 0%     16.5k ± 0%   -74.87%  (p=0.002 n=8+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24                    39.5k ± 1%     27.8k ± 2%   -29.54%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24                    1.05M ± 0%     0.02M ± 0%   -98.33%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24                    741k ± 0%      679k ± 1%    -8.32%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=get-24           559 ± 0%       182 ± 2%   -67.42%  (p=0.000 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=4/type=scan-24          438 ± 8%       404 ±13%    -7.76%  (p=0.014 n=9+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=get-24        8.39k ± 0%     0.36k ± 5%   -95.75%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=128/ranges=64/type=scan-24       6.38k ±11%     5.56k ± 9%   -12.85%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=get-24       65.7k ± 0%     16.5k ± 0%   -74.84%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=4/type=scan-24      46.8k ± 1%     67.6k ± 1%   +44.65%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=get-24      1.05M ± 0%     0.02M ± 0%   -98.33%  (p=0.000 n=10+10)
TruncateLoop/desc/preserveOrder/reqs=16384/ranges=64/type=scan-24      739k ± 1%      694k ± 1%    -6.08%  (p=0.000 n=10+10)
```

The truncation loops for the optimized strategy are very similar in both
directions, so I tried to extract the differences out into an interface.
However, this showed a non-trivial slowdown and an increase in
allocations, so I chose to keep some duplicated code to get the best
performance. Here is a snippet of the comparison when the interface was
prototyped:

```
name                                                 old time/op    new time/op    delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24        36.9µs ± 3%    44.5µs ± 3%  +20.55%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24       74.8µs ± 5%    88.8µs ± 3%  +18.73%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24       64.9µs ± 1%    78.3µs ± 1%  +20.72%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       471µs ± 8%     682µs ±13%  +44.73%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24      6.34ms ± 1%    7.39ms ± 0%  +16.47%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24     11.2ms ± 1%    12.4ms ± 1%  +10.36%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24     6.40ms ± 2%    7.39ms ± 1%  +15.47%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24    70.9ms ± 1%   102.0ms ± 2%  +43.87%  (p=0.000 n=9+9)

name                                                 old alloc/op   new alloc/op   delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24        22.2kB ± 9%    30.4kB ± 0%  +36.55%  (p=0.000 n=10+7)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24       52.2kB ± 5%    67.6kB ± 4%  +29.47%  (p=0.000 n=8+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24       20.0kB ± 2%    32.2kB ± 1%  +60.86%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       372kB ±13%     600kB ±10%  +61.29%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24      3.24MB ± 0%    4.45MB ± 0%  +37.42%  (p=0.000 n=8+7)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24     6.61MB ± 0%    7.86MB ± 0%  +18.90%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24     2.75MB ± 2%    3.74MB ± 1%  +36.03%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24    65.7MB ± 1%    97.2MB ± 1%  +47.95%  (p=0.000 n=10+10)

name                                                 old allocs/op  new allocs/op  delta
TruncateLoop/desc/reqs=128/ranges=4/type=get-24           177 ± 2%       314 ± 0%  +77.40%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=128/ranges=4/type=scan-24          597 ± 8%       847 ± 8%  +41.89%  (p=0.000 n=9+10)
TruncateLoop/desc/reqs=128/ranges=64/type=get-24          329 ± 3%       531 ± 2%  +61.40%  (p=0.000 n=10+10)
TruncateLoop/desc/reqs=128/ranges=64/type=scan-24       6.02k ± 9%     9.56k ± 6%  +58.80%  (p=0.000 n=10+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=get-24       16.5k ± 0%     32.9k ± 0%  +99.17%  (p=0.000 n=8+8)
TruncateLoop/desc/reqs=16384/ranges=4/type=scan-24      53.5k ± 1%     73.1k ± 1%  +36.69%  (p=0.000 n=10+9)
TruncateLoop/desc/reqs=16384/ranges=64/type=get-24      17.5k ± 0%     33.9k ± 0%  +94.02%  (p=0.000 n=6+10)
TruncateLoop/desc/reqs=16384/ranges=64/type=scan-24      727k ± 1%     1194k ± 1%  +64.31%  (p=0.000 n=10+10)
```

An additional knob is introduced to the batch truncation helper to
indicate whether the helper can take ownership of the passed-in requests
slice and reorder it as it pleases. This is the case for the Streamer,
but the DistSender relies on the order of requests not being modified,
so the helper makes a copy of the slice.

Some tests needed an adjustment since now we process requests not
necessarily in the original order, so the population of ResumeSpans
might be different.

Fixes: #68536.
Addresses: #82159.

Release note: None

83358: sql: initialize sql instance during instance provider start r=ajwerner a=rharding6373

Before this change, there was a race condition where the instance
provider and the instance reader would start before the instance
provider created a SQL instance, potentially causing the reader to not
cache the instance before initialization was complete. This is
a problem in multi-tenant environments, where we may not be able to plan
queries if the reader does not know of any SQL instances.

This change moves sql instance initialization into the instance
provider's `Start()` function before starting the reader, so that the
instance is guaranteed to be visible to the reader on its first
rangefeed scan of the `system.sql_instances` table.

Fixes: #82706
Fixes: #81807
Fixes: #81567

Release note: None

83683: kvstreamer: perform more memory accounting r=yuzefovich a=yuzefovich

**kvstreamer: perform more memory accounting**

This commit performs memory accounting for more of the internal state of
the streamer. In particular, it adds accounting for the overhead of
`Result`s in the results buffer (previously, we would only account for
the size of the Get or the Scan response but not for the `Result` struct
itself). It also adds accounting for all slices that are `O(N)` in size
where `N` is the number of requests.

Addresses: #82160.

Release note: None

**kvstreamer: account for the overhead of each request**

Previously, we didn't account for some of the overhead of Get and Scan
requests, and this commit fixes that.

Here is the list of everything that needs to be accounted for with
a single Get request:
- (if referenced by `[]roachpb.RequestUnion`) the overhead of the
`roachpb.RequestUnion` object
- the overhead of `roachpb.RequestUnion_Get` object
- the overhead of `roachpb.GetRequest` object
- the footprint of the key inside of `roachpb.GetRequest`.

Previously, we only accounted for the first and the fourth items, and
now we account for all of them. Additionally, we used the auto-generated
`Size` method for the fourth item, which I believe represents the size
of the serialized protobuf message (possibly compressed - I'm not sure),
but we're interested in the capacities of the underlying slices (see the
sketch below).
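
A minimal, self-contained sketch of the accounting principle (hypothetical types standing in for the roachpb definitions): fixed struct overheads plus the capacity, not the length, of the key's backing slice:

```go
package main

import (
	"fmt"
	"unsafe"
)

type getRequest struct {
	Key []byte // the only variable-size part of the request
}

type requestUnion struct {
	get *getRequest
}

// getRequestFootprint mirrors the accounting described above: fixed struct
// overheads plus the capacity of the underlying key slice.
func getRequestFootprint(r requestUnion) int64 {
	return int64(unsafe.Sizeof(r)) + // the union wrapper
		int64(unsafe.Sizeof(*r.get)) + // the request struct itself
		int64(cap(r.get.Key)) // the key's backing array
}

func main() {
	key := make([]byte, 10, 64) // capacity, not length, is what we pay for
	r := requestUnion{get: &getRequest{Key: key}}
	fmt.Println("accounted bytes:", getRequestFootprint(r))
}
```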

Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: rharding6373 <[email protected]>

yuzefovich commented Jul 6, 2022

Here is the comparison of the streamer OFF vs the streamer ON on micro-benchmarks (with #84607):

```
name                                                    old time/op    new time/op    delta
IndexJoin/Cockroach-24                                    5.26ms ± 1%    5.69ms ± 1%   +8.04%  (p=0.000 n=10+10)
IndexJoin/MultinodeCockroach-24                           6.96ms ± 1%    7.36ms ± 1%   +5.76%  (p=0.000 n=10+9)
IndexJoinColumnFamilies/Cockroach-24                      7.92ms ± 3%    8.27ms ± 3%   +4.42%  (p=0.000 n=10+10)
IndexJoinColumnFamilies/MultinodeCockroach-24             11.3ms ± 5%    11.8ms ± 7%   +5.10%  (p=0.003 n=10+10)
IndexJoinOrdering/Cockroach-24                            6.38ms ± 2%    6.90ms ± 3%   +8.02%  (p=0.000 n=10+10)
IndexJoinOrdering/MultinodeCockroach-24                   8.85ms ± 4%    9.55ms ± 3%   +7.95%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             5.95ms ± 2%    6.37ms ± 1%   +7.18%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    7.26ms ± 2%    7.68ms ± 1%   +5.76%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               6.07ms ± 1%    6.47ms ± 2%   +6.47%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      7.43ms ± 1%    7.77ms ± 2%   +4.57%  (p=0.000 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         8.38ms ± 3%    8.80ms ± 4%   +5.04%  (p=0.000 n=10+10)
LookupJoinNoOrdering/MultinodeCockroach-24                11.4ms ± 5%    12.0ms ± 4%   +5.76%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                           8.49ms ± 4%    8.91ms ± 3%   +4.94%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                  11.7ms ± 4%    11.9ms ± 5%     ~     (p=0.218 n=10+10)

name                                                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24                                     926kB ± 1%    1198kB ± 1%  +29.27%  (p=0.000 n=10+10)
IndexJoin/MultinodeCockroach-24                           1.60MB ± 3%    1.90MB ± 2%  +18.16%  (p=0.000 n=10+10)
IndexJoinColumnFamilies/Cockroach-24                       974kB ± 1%    1259kB ± 1%  +29.26%  (p=0.000 n=10+9)
IndexJoinColumnFamilies/MultinodeCockroach-24             1.81MB ± 3%    2.13MB ± 1%  +17.22%  (p=0.000 n=10+9)
IndexJoinOrdering/Cockroach-24                             921kB ± 1%    1268kB ± 1%  +37.59%  (p=0.000 n=9+9)
IndexJoinOrdering/MultinodeCockroach-24                   1.61MB ± 2%    1.98MB ± 4%  +22.62%  (p=0.000 n=10+9)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             1.28MB ± 1%    1.55MB ± 1%  +20.96%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    2.01MB ± 3%    2.24MB ± 2%  +10.97%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               1.33MB ± 1%    1.60MB ± 1%  +20.23%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      2.08MB ± 4%    2.25MB ± 3%   +8.32%  (p=0.000 n=9+9)
LookupJoinNoOrdering/Cockroach-24                         1.35MB ± 1%    1.62MB ± 1%  +20.28%  (p=0.000 n=9+10)
LookupJoinNoOrdering/MultinodeCockroach-24                1.97MB ± 3%    2.18MB ± 3%  +10.86%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                           1.38MB ± 1%    1.65MB ± 1%  +19.80%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                  2.02MB ± 3%    2.22MB ± 2%   +9.72%  (p=0.000 n=10+9)

name                                                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24                                     6.11k ± 0%     6.24k ± 1%   +2.00%  (p=0.000 n=7+10)
IndexJoin/MultinodeCockroach-24                            10.8k ± 2%     10.9k ± 2%     ~     (p=0.225 n=10+10)
IndexJoinColumnFamilies/Cockroach-24                       9.91k ± 1%    10.98k ± 1%  +10.72%  (p=0.000 n=10+9)
IndexJoinColumnFamilies/MultinodeCockroach-24              15.5k ± 1%     16.6k ± 1%   +6.93%  (p=0.000 n=10+9)
IndexJoinOrdering/Cockroach-24                             4.18k ± 2%     4.26k ± 2%   +1.79%  (p=0.008 n=9+9)
IndexJoinOrdering/MultinodeCockroach-24                    9.65k ± 2%     9.90k ± 2%   +2.52%  (p=0.003 n=10+8)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24              8.85k ± 2%     8.96k ± 1%   +1.28%  (p=0.005 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24     13.0k ± 1%     13.2k ± 1%   +1.36%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24                9.89k ± 1%    10.02k ± 3%   +1.29%  (p=0.001 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24       14.1k ± 1%     14.3k ± 1%   +0.88%  (p=0.002 n=8+10)
LookupJoinNoOrdering/Cockroach-24                          11.1k ± 1%     11.3k ± 1%   +2.16%  (p=0.000 n=9+10)
LookupJoinNoOrdering/MultinodeCockroach-24                 15.7k ± 1%     16.0k ± 1%   +1.87%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                            12.2k ± 1%     12.4k ± 1%   +1.86%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                   16.8k ± 0%     17.1k ± 1%   +1.55%  (p=0.000 n=9+9)
```

I think I'm pretty satisfied given that most regressions are under 10% and only "lookup join with ordering" cases are above (those are the least frequent ones). I might optimize it further, but it doesn't seem necessary.

craig bot pushed a commit that referenced this issue Jul 7, 2022
83547: stmtdiagnostics: remove the usage of gossip r=yuzefovich a=yuzefovich

This commit removes the usage of gossip in the stmt diagnostics feature
since it is really an optimization (i.e. not required for the feature to
work) and is in the way of the architecture unification (i.e. the
multi-tenancy effort).

The gossip was previously used to let all nodes in the cluster know
about a new stmt diagnostics request or about the cancellation of an
existing one (in fact, the gossip propagation of the latter was broken
anyway since we forgot to register the corresponding callback). We now
rely on the polling mechanism on each node to read from the system
table to populate the registry of current requests. This polling takes
place every `sql.stmt_diagnostics.poll_interval` seconds (10 by default).
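
A minimal sketch of such a polling loop (hypothetical function names, not the actual stmtdiagnostics code):

```go
package main

import (
	"fmt"
	"time"
)

// pollRequests periodically re-reads the outstanding diagnostics requests from
// the system table instead of relying on gossip notifications.
func pollRequests(pollInterval time.Duration, readOutstanding func() []int, stop <-chan struct{}) {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Println("outstanding requests:", readOutstanding())
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go pollRequests(50*time.Millisecond, func() []int { return []int{1} }, stop)
	time.Sleep(120 * time.Millisecond)
	close(stop)
}
```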

Additionally, this commit changes the class of that setting from
TenantWritable to TenantReadOnly because:
1. we don't charge the user for the polling, and
2. we want to prevent abuse where a tenant could set the polling interval
to a very low value while incurring no costs themselves (we recently
excluded the polling from the tenant cost).

Follow-up work remains to bring back the optimization of propagating new
requests quickly using a rangefeed on the system table.

Epic: CRDB-16702
Informs: #47893

Release note: None

83896: ui/cluster-ui: filter out closed sessions from active exec pages  r=xinhaoz a=xinhaoz

Previously, it was possible for the active transactions page
to show txns from closed sessions. The sessions API was
recently updated to return closed sessions, and it is
possible for the active_txn field in a closed session to be
populated. This commit filters out the closed sessions when
retrieving active transactions.

Release note (bug fix): active transactions page no longer
shows transactions from closed sessions

83903: ui: persist stmt view  selection in sql activity page r=xinhaoz a=xinhaoz

Previously, the selection of viewing historical or active
executions for statements and transactions tabs in the SQL
activity page did not persist on tab change. This commit
persists the selection between tab changes in the SQL
activity page.

Release note (ui change): In the SQL Activity Page, the
selection to view historical or active executions will
persist between tabs.


https://www.loom.com/share/6990f8e273f64ddcaf971e85f0a8ef89

83926: kvstreamer: reuse the truncation helper and introduce a fast path for a single range case r=yuzefovich a=yuzefovich

**kvstreamer: introduce a single range fast path for request truncation**

This commit introduces a fast path to avoid the usage of the batch
truncation helper when all requests are contained within a single range.
Some modifications needed to be made to the `txnKVStreamer` so that it
didn't nil out the requests slice - we now delay that until right before
the next call to `Enqueue`.
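
A minimal sketch of the fast path (hypothetical types; real requests are spans, reduced here to single sorted keys):

```go
package main

import "fmt"

type rangeDesc struct{ startKey, endKey string }

func (d rangeDesc) contains(k string) bool { return k >= d.startKey && k < d.endKey }

// singleRangeFastPath reports whether truncation can be skipped entirely:
// if the first and the last enqueued key live on the same range, the whole
// batch can be sent as-is. Keys are assumed to be sorted.
func singleRangeFastPath(sortedKeys []string, d rangeDesc) bool {
	if len(sortedKeys) == 0 {
		return false
	}
	return d.contains(sortedKeys[0]) && d.contains(sortedKeys[len(sortedKeys)-1])
}

func main() {
	d := rangeDesc{startKey: "a", endKey: "m"}
	fmt.Println(singleRangeFastPath([]string{"b", "c", "k"}, d)) // true
	fmt.Println(singleRangeFastPath([]string{"b", "c", "z"}, d)) // false
}
```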

```
name                                  old time/op    new time/op    delta
IndexJoin/Cockroach-24                  6.21ms ± 1%    5.96ms ± 2%  -4.08%  (p=0.000 n=8+10)
IndexJoinColumnFamilies/Cockroach-24    8.97ms ± 4%    8.79ms ± 7%    ~     (p=0.190 n=10+10)

name                                  old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24                  1.39MB ± 1%    1.27MB ± 1%  -7.97%  (p=0.000 n=10+10)
IndexJoinColumnFamilies/Cockroach-24    1.46MB ± 1%    1.34MB ± 0%  -8.04%  (p=0.000 n=9+7)

name                                  old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24                   7.20k ± 1%     7.16k ± 1%  -0.61%  (p=0.022 n=10+10)
IndexJoinColumnFamilies/Cockroach-24     12.0k ± 1%     11.9k ± 0%  -0.83%  (p=0.000 n=9+8)
```

Addresses: #82159.

Release note: None

**kvcoord: refactor the truncation helper for reuse**

This commit refactors the batch truncation helper so that it can be
reused for multiple batches of requests. In particular, that ability is
now utilized by the streamer. Additionally, since the streamer now holds
on to the same truncation helper for possibly a long time, this commit
adds the memory accounting for the internal state of the helper.

Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
@yuzefovich (Member, Author) commented:

Here is the comparison of the streamer OFF vs the streamer ON on TPC-DS queries:

```
(streamer OFF / streamer ON / delta)
Q2: 18.35s	18.33s	-0.12%
Q3: 0.30s	0.30s	0.17%
Q4: 102.24s	103.02s	0.76%
Q6: 41.57s	40.92s	-1.56%
Q7: 16.90s	16.83s	-0.43%
Q8: 1.52s	1.40s	-7.88%
Q9: 106.30s	101.23s	-4.77%
Q10: 5.91s	5.50s	-6.90%
Q11: 55.64s	56.27s	1.13%
Q12: 2.09s	1.76s	-16.01%
Q13: 11.16s	10.38s	-7.05%
Q15: 4.28s	4.37s	2.05%
Q16: 7.71s	7.83s	1.62%
Q17: 4.05s	3.24s	-19.80%
Q19: 0.79s	0.75s	-5.48%
Q20: 3.49s	2.80s	-19.66%
Q21: 1.72s	1.68s	-2.33%
Q23: 123.38s	123.72s	0.27%
Q24: 0.25s	0.25s	-0.39%
Q25: 3.01s	2.64s	-12.37%
Q26: 14.39s	14.87s	3.35%
Q28: 24.52s	24.22s	-1.22%
Q29: 1.89s	1.71s	-9.39%
Q30: 8.08s	8.42s	4.28%
Q31: 22.17s	22.58s	1.83%
Q32: 0.43s	0.42s	-1.87%
Q33: 8.64s	8.21s	-5.00%
Q34: 6.77s	6.30s	-6.93%
Q35: 6.30s	5.81s	-7.78%
Q37: 0.43s	0.31s	-27.55%
Q38: 6.12s	5.60s	-8.50%
Q39: 97.80s	97.88s	0.08%
Q40: 2.06s	1.84s	-10.61%
Q41: 0.34s	0.34s	1.78%
Q42: 0.57s	0.51s	-10.99%
Q43: 6.92s	6.43s	-7.10%
Q44: 19.25s	17.75s	-7.78%
Q45: 4.25s	4.27s	0.52%
Q46: 10.64s	9.84s	-7.53%
Q47: 16.34s	16.24s	-0.61%
Q48: 9.97s	9.14s	-8.39%
Q49: 1.31s	1.27s	-2.60%
Q50: 1.09s	1.08s	-1.33%
Q51: 12.99s	12.35s	-4.93%
Q52: 0.57s	0.52s	-10.09%
Q53: 2.55s	2.40s	-5.84%
Q54: 6.08s	5.85s	-3.82%
Q55: 0.47s	0.44s	-6.98%
Q56: 2.70s	2.26s	-16.22%
Q57: 8.25s	8.17s	-1.05%
Q58: 13.98s	13.03s	-6.78%
Q59: 20.74s	20.68s	-0.32%
Q60: 20.70s	19.81s	-4.29%
Q61: 0.38s	0.44s	16.95%
Q62: 2.79s	2.78s	-0.11%
Q63: 2.42s	2.22s	-8.30%
Q65: 9.41s	8.68s	-7.81%
Q66: 6.29s	6.33s	0.64%
Q68: 10.05s	9.19s	-8.50%
Q69: 5.67s	5.25s	-7.41%
Q71: 9.67s	9.30s	-3.86%
Q72: 14.07s	13.05s	-7.27%
Q73: 6.35s	5.87s	-7.62%
Q74: 22.65s	22.63s	-0.07%
Q75: 12.22s	8.71s	-28.69%
Q76: 7.29s	6.95s	-4.67%
Q78: 24.66s	23.56s	-4.46%
Q79: 9.97s	9.35s	-6.23%
Q81: 29.62s	30.18s	1.89%
Q82: 1.48s	1.43s	-3.41%
Q83: 1.48s	1.54s	3.96%
Q84: 0.98s	0.99s	1.02%
Q85: 5.87s	4.58s	-21.92%
Q87: 6.06s	5.74s	-5.36%
Q88: 28.15s	23.56s	-16.32%
Q89: 1.95s	1.72s	-12.15%
Q90: 2.26s	2.25s	-0.27%
Q91: 1.02s	1.00s	-1.67%
Q92: 0.37s	0.35s	-4.21%
Q93: 0.23s	0.24s	7.33%
Q94: 4.87s	4.88s	0.11%
Q95: 66.77s	68.29s	2.27%
Q96: 5.51s	5.16s	-6.28%
Q97: 6.40s	5.93s	-7.35%
Q98: 6.75s	6.26s	-7.15%
Q99: 4.43s	4.54s	2.46%
```

I think the only remaining thing to check is the performance of TPC-E.

@yuzefovich (Member, Author) commented:

Looks like there is a minor performance regression on TPC-E queries in P50 and P90 (average over 10 runs):

```
(streamer OFF / streamer ON / delta)
P50
BrokerVolume:		26.12ms 26.52ms		1.55%
CustomerPosition:	7.24ms 7.63ms		5.42%
DataMaintenance:	6.79ms 6.87ms		1.12%
MarketFeed:		81.47ms 82.50ms		1.27%
MarketWatch:		4.71ms 5.33ms		12.97%
SecurityDetail:		6.38ms 7.27ms		13.87%
TradeLookup:		9.40ms 9.62ms		2.39%
TradeOrder:		10.71ms 11.22ms		4.75%
TradeResult:		29.87ms 30.38ms		1.68%
TradeStatus:		12.14ms 12.56ms		3.46%
TradeUpdate:		15.73ms 16.17ms		2.74%

P90
BrokerVolume:		39.60ms 37.84ms		-4.44%
CustomerPosition:	10.93ms 10.91ms		-0.21%
DataMaintenance:	17.68ms 19.86ms		12.31%
MarketFeed:		120.99ms 116.72ms	-3.53%
MarketWatch:		7.47ms 8.21ms		10.03%
SecurityDetail:		8.51ms 9.43ms		10.77%
TradeLookup:		14.07ms 14.35ms		2.03%
TradeOrder:		15.34ms 15.25ms		-0.56%
TradeResult:		43.40ms 44.12ms		1.66%
TradeStatus:		17.22ms 16.84ms		-2.21%
TradeUpdate:		24.64ms 23.87ms		-3.12%

P99
BrokerVolume:		73.41ms 69.36ms		-5.52%
CustomerPosition:	24.01ms 23.05ms		-4.02%
DataMaintenance:	33.88ms 44.29ms		30.75%
MarketFeed:		179.57ms 176.76ms	-1.56%
MarketWatch:		13.57ms 14.72ms		8.43%
SecurityDetail:		17.62ms 18.92ms		7.37%
TradeLookup:		29.32ms 29.42ms		0.35%
TradeOrder:		29.02ms 27.91ms		-3.82%
TradeResult:		84.11ms 84.81ms		0.82%
TradeStatus:		49.44ms 46.01ms		-6.94%
TradeUpdate:		44.49ms 41.97ms		-3.12%
```

@yuzefovich (Member, Author) commented:

Alright, I think this issue is done - the usage of the streamer can offer significant speedups on queries processing a lot of data and has regressions on the order of 10% when processing small amounts of data.

craig bot pushed a commit that referenced this issue Jul 18, 2022
83265: kvserver/gc: remove range tombstones during GC r=erikgrinaker a=aliher1911


The first commit adds support for range tombstones when removing point keys. A range tombstone will have the same effect as a point tombstone within the MVCC key history.

The second commit adds support for the removal of range tombstones below the GC threshold.


84493: sql: add parsing support for SHOW CREATE FUNCTION statement r=mgartner a=mgartner

This commit adds a `SHOW CREATE FUNCTION` statement to the SQL grammar.
This statement is not yet implemented and executing it results in an
error.

Release note: None

84504: execinfra: allow tenants to disable the streamer r=yuzefovich a=yuzefovich

Previously, we marked the setting that controls whether the streamer is
used as `TenantReadOnly` since we were not sure whether the streamer fit
well into the tenant cost model. Recently we revamped the cost model so
that it can now correctly predict the usage of the hardware resources by
the streamer, so at this point it seems safe to mark the setting
`TenantWritable`.

Informs: #82167

Release note: None

84515: ui: fix sorting of explain plans r=maryliag a=maryliag

Previously, sorting of the plans on the Explain Plans tab of the
Statement Details page wasn't working. This commit adds the missing
code required to sort that table.

Fixes #84079

https://www.loom.com/share/0f0ed0e1a8d04fc88def3b2460d617e6

Release note (bug fix): Sorting on the plans table inside the
Statement Details page is now properly working.

84519: clusterversion: remove some older versions r=yuzefovich a=yuzefovich

Release note: None

84601: rowexec: use OutOfOrder mode of streamer for lookup joins with ordering r=yuzefovich a=yuzefovich

Currently, the join reader always restores the required order for lookup
joins on its own, since all looked-up rows are buffered before any output
row is emitted. This observation allows us to use the OutOfOrder mode of
the streamer in such scenarios, so this commit makes that change.
Previously, we would effectively maintain the order twice - both in the
streamer and in the join reader - and the former is redundant. This will
change in the future, but for now we can use the more efficient mode
(see the sketch below).
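
A minimal sketch of the mode choice (hypothetical helper, not the actual rowexec code):

```go
package main

import "fmt"

type mode int

const (
	outOfOrder mode = iota
	inOrder
)

// streamerMode picks InOrder only when ordering is required and the join
// reader does not restore it itself; otherwise maintaining the order in the
// streamer would be redundant work.
func streamerMode(maintainOrdering, joinReaderRestoresOrder bool) mode {
	if maintainOrdering && !joinReaderRestoresOrder {
		return inOrder
	}
	return outOfOrder
}

func main() {
	fmt.Println(streamerMode(true, true))  // 0 (outOfOrder): the new behavior for ordered lookup joins
	fmt.Println(streamerMode(true, false)) // 1 (inOrder)
}
```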

```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             6.64ms ± 1%    6.48ms ± 1%  -2.34%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    7.89ms ± 1%    7.75ms ± 1%  -1.80%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         9.01ms ± 3%    8.88ms ± 4%    ~     (p=0.218 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                12.1ms ± 4%    12.0ms ± 3%    ~     (p=0.393 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.68MB ± 1%    1.60MB ± 1%  -4.93%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.37MB ± 2%    2.29MB ± 2%  -3.11%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.75MB ± 1%    1.66MB ± 1%  -5.01%  (p=0.000 n=10+9)
LookupJoinOrdering/MultinodeCockroach-24                2.36MB ± 1%    2.25MB ± 1%  -4.68%  (p=0.000 n=8+10)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              10.0k ± 1%     10.0k ± 1%    ~     (p=0.278 n=10+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     14.3k ± 1%     14.3k ± 1%    ~     (p=0.470 n=10+10)
LookupJoinOrdering/Cockroach-24                          12.4k ± 1%     12.5k ± 1%    ~     (p=0.780 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                 17.1k ± 1%     17.0k ± 1%    ~     (p=0.494 n=10+10)
```

Addresses: #82159.

Release note: None

84607: bench: add a benchmark of index join with ordering r=yuzefovich a=yuzefovich

Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023