colexec: improve planning and execution of many projection operators in a single query #85632

yuzefovich · 2022-08-04T18:44:20Z

Currently, the vectorized engine works by planning a separate operator for each projection (e.g. binary) expression. Obviously, this requires creating a separate object (extending the planning time) but also can incur significant latency increase during the execution if a query has on the order of hundreds of such expressions. The row-by-row engine handles such scenarios better, especially when the number of rows flowing through the operators is low.

We should improve things here, but it doesn't seem simple since it would probably go against assumptions of the vectorized projection operators (working on a single column-at-a-time and having expr-specific generated code). Probably, we should just fall back to using the noop row-by-row processor once the number of expressions to render exceeds some threshold.

Jira issue: CRDB-18350

The text was updated successfully, but these errors were encountered:

84514: sql/schemachanger: implemented ALTER PRIMARY KEY for vanilla case r=postamar a=Xiang-Gu The PR implements `ALTER PRIMARY KEY` under the declarative schema changer framework that handles the simplest, "vanilla" case like ``` CREATE TABLE t (i INT PRIMARY KEY, j INT NOT NULL) ALTER TABLE t ALTER PRIMARY KEY USING COLUMNS (j) ``` This is the first of a series PRs where followup PRs will expand its capabilities to be able to handle more complex cases, including 1. Allow the requested new primary key to be hash-sharded; 2. Consider the case where altering primary key requires us to modify existing secondary indexes(see the legacy schema change about in what cases we should rewrite) 3. Consider the case where the old primary index is on the implicitly created `rowid` column, in which case we also need to drop that column; 5. Consider partitioning and locality (I'm not sure what they are, and why they play a role when `ALTER PRIMARY KEY` but I've seen them in the old schema changer, so I assume we ought to do something about them too here). 6. Support `ALTER PRIMARY KEY` with concurrent schema change statements. E.g. ```ALTER TABLE t ADD COLUMN k INT NOT NULL DEFAULT 30, ALTER PRIMARY KEY USING COLUMNS (j);``` related: #83932 Release note: None 84718: sql: populate query-level stats earlier & add contention to telemetry log r=THardy98 a=THardy98 Addresses: #71328 This change adds contention time (measured in nanoseconds) to the `SampledQuery` telemetry log. To accomodate this change, we needed to collect query-level statistics earlier. Previously, query-level statistics were fetched when we called `Finish` under the `instrumentationHelper`, however this occurred after we had already emitted our query execution logs. Now, we collect query-level stats in `dispatchToExecutionEngine` after we've executed the query. As a tradeoff to collecting query-level stats earlier, we need to fetch the trace twice: - once when populating query-level stats (trace is required to compute query-level stats) at `populateQueryLevelStats` in `dispatchToExecutionEngine` after query execution - once during the instrumentation helper's `Finish` (as we do currently) This allows us to collect query-level stats earlier without omitting any tracing events we record currently. This approach is safer, with the additional overhead of fetching the trace twice only occuring at the tracing sampling rate of 1-2%, which is fairly conservative. The concern with only fetching the trace at query-level stats population was the ommission of a number of events that occur in `commitSQLTransactionInternal` (or any execution paths that don't lead to `dispatchToExecutionEngine`). Release note (sql change): Add `ContentionTime` field to the `SampledQuery` telemetry log. Query-level statistics are collected earlier to facilitate the adding of contention time to query execution logs. The earlier collection of query-level statistics requires the additional overhead of fetching the query's trace twice (instead of previously once). 85280: sql, server: add new system privileges for observability r=Santamaura a=Santamaura This patch introduces 2 new system privileges VIEWDEBUG and VIEWCLUSTERMETADATA. VIEWDEBUG will now be used to gate taking traces and viewing debug endpoints. VIEWCLUSTERMETADATA will now be used to gate the node and range reports. Resolves #83844, #83856, #83857, #83858, #83861 Release note (sql change): add VIEWDEBUG and VIEWCLUSTERMETADATA system privileges. 85458: changefeedccl: add retries to sinkless changefeeds r=jayshrivastava a=jayshrivastava This change updates core/sinkless changefeeds to run in a retry loop, allowing for changefeed restarts in case of transient errors or declarative schema changes. See commit notes for more details. Fixes #85008 85819: kv: use max timestamp during below-Raft scan to gossip liveness r=nvanbenschoten a=nvanbenschoten Related to https://github.com/cockroachlabs/support/issues/1573. This commit switches `MaybeGossipNodeLivenessRaftMuLocked` to evaluate its scan at the maximum timestamp instead of at the local node's HLC time. This ensures that we gossip the most recent liveness record, regardless of what timestamp it is written at. 85822: colbuilder: fall back to row-by-row processor wrapping for many renders r=yuzefovich a=yuzefovich **colbuilder: add a microbenchmark for running many render expressions** This commit adds a microbenchmark of queries with many render expressions. It'll be used in the following commit to tune when we fall back to wrapping a row-by-row processor to handle those renders. Release note: None **colbuilder: fall back to row-by-row processor wrapping for many renders** This commit introduces a mechanism to handle render expressions by wrapping a row-by-row processor into the vectorized flow when 1. the estimated number of rows to go through the renders is relatively small 2. the number of renders is relatively high. The idea is that the vectorized projection operators have higher overhead when many of them are planned AND there is not enough data to amortize the overhead, so to improve the performance in those cases we'll use the row-by-row noop processor. Both of the thresholds are controlled by cluster settings and the defaults were chosen based on a representative microbenchmark. It's worth pointing out that we only have the estimated row count for the scan operators, so the change has limited applicability. ``` RenderPlanning/rows=1/renders=1-24 407µs ± 2% 408µs ± 2% ~ (p=0.684 n=10+10) RenderPlanning/rows=1/renders=8-24 516µs ± 1% 537µs ± 1% +4.05% (p=0.000 n=10+10) RenderPlanning/rows=1/renders=32-24 832µs ± 1% 811µs ± 1% -2.59% (p=0.000 n=10+10) RenderPlanning/rows=1/renders=64-24 1.22ms ± 0% 1.14ms ± 1% -6.62% (p=0.000 n=9+10) RenderPlanning/rows=1/renders=128-24 2.02ms ± 0% 1.80ms ± 1% -11.18% (p=0.000 n=8+9) RenderPlanning/rows=1/renders=512-24 7.75ms ± 1% 5.75ms ± 1% -25.77% (p=0.000 n=10+9) RenderPlanning/rows=1/renders=4096-24 160ms ± 1% 62ms ± 1% -61.51% (p=0.000 n=10+9) RenderPlanning/rows=4/renders=1-24 438µs ± 2% 438µs ± 1% ~ (p=0.853 n=10+10) RenderPlanning/rows=4/renders=8-24 603µs ± 1% 633µs ± 2% +5.06% (p=0.000 n=10+10) RenderPlanning/rows=4/renders=32-24 1.08ms ± 1% 1.08ms ± 1% ~ (p=0.105 n=10+10) RenderPlanning/rows=4/renders=64-24 1.72ms ± 0% 1.62ms ± 0% -5.83% (p=0.000 n=9+9) RenderPlanning/rows=4/renders=128-24 3.01ms ± 1% 2.75ms ± 1% -8.78% (p=0.000 n=10+10) RenderPlanning/rows=4/renders=512-24 11.6ms ± 1% 9.6ms ± 2% -17.58% (p=0.000 n=10+10) RenderPlanning/rows=4/renders=4096-24 192ms ± 2% 91ms ± 2% -52.58% (p=0.000 n=10+10) RenderPlanning/rows=16/renders=1-24 494µs ± 1% 499µs ± 1% +1.03% (p=0.006 n=10+8) RenderPlanning/rows=16/renders=8-24 855µs ± 1% 901µs ± 1% +5.37% (p=0.000 n=10+10) RenderPlanning/rows=16/renders=32-24 2.03ms ± 1% 2.04ms ± 1% ~ (p=0.190 n=10+10) RenderPlanning/rows=16/renders=64-24 3.58ms ± 1% 3.42ms ± 1% -4.56% (p=0.000 n=10+9) RenderPlanning/rows=16/renders=128-24 6.74ms ± 1% 6.31ms ± 1% -6.37% (p=0.000 n=10+10) RenderPlanning/rows=16/renders=512-24 26.9ms ± 1% 24.7ms ± 1% -8.24% (p=0.000 n=9+10) RenderPlanning/rows=16/renders=4096-24 329ms ± 2% 218ms ± 2% -33.66% (p=0.000 n=10+10) RenderPlanning/rows=64/renders=1-24 666µs ± 1% 659µs ± 2% -1.07% (p=0.007 n=10+10) RenderPlanning/rows=64/renders=8-24 1.79ms ± 1% 1.84ms ± 1% +3.01% (p=0.000 n=10+10) RenderPlanning/rows=64/renders=32-24 5.53ms ± 1% 5.79ms ± 2% +4.74% (p=0.000 n=10+10) RenderPlanning/rows=64/renders=64-24 10.8ms ± 1% 11.0ms ± 1% +1.87% (p=0.000 n=10+9) RenderPlanning/rows=64/renders=128-24 21.2ms ± 1% 21.7ms ± 1% +2.71% (p=0.000 n=10+10) RenderPlanning/rows=64/renders=512-24 83.6ms ± 0% 84.9ms ± 0% +1.47% (p=0.000 n=10+7) RenderPlanning/rows=64/renders=4096-24 824ms ± 1% 751ms ± 2% -8.88% (p=0.000 n=10+10) RenderPlanning/rows=128/renders=1-24 853µs ± 1% 851µs ± 1% ~ (p=0.481 n=10+10) RenderPlanning/rows=128/renders=8-24 2.98ms ± 1% 3.11ms ± 1% +4.32% (p=0.000 n=10+10) RenderPlanning/rows=128/renders=32-24 10.4ms ± 1% 10.9ms ± 1% +5.44% (p=0.000 n=10+10) RenderPlanning/rows=128/renders=64-24 20.1ms ± 1% 21.3ms ± 1% +5.99% (p=0.000 n=10+10) RenderPlanning/rows=128/renders=128-24 39.7ms ± 1% 42.1ms ± 2% +5.98% (p=0.000 n=10+10) RenderPlanning/rows=128/renders=512-24 160ms ± 1% 168ms ± 2% +5.13% (p=0.000 n=9+10) RenderPlanning/rows=128/renders=4096-24 1.44s ± 1% 1.48s ± 2% +3.15% (p=0.000 n=9+10) RenderPlanning/rows=256/renders=1-24 1.22ms ± 1% 1.21ms ± 1% -1.01% (p=0.000 n=10+10) RenderPlanning/rows=256/renders=8-24 5.22ms ± 0% 5.19ms ± 1% -0.54% (p=0.011 n=8+9) RenderPlanning/rows=256/renders=32-24 19.9ms ± 1% 20.0ms ± 1% ~ (p=0.182 n=9+10) RenderPlanning/rows=256/renders=64-24 39.0ms ± 0% 38.9ms ± 0% -0.33% (p=0.023 n=10+10) RenderPlanning/rows=256/renders=128-24 76.8ms ± 1% 76.7ms ± 1% ~ (p=0.739 n=10+10) RenderPlanning/rows=256/renders=512-24 316ms ± 1% 319ms ± 1% +1.15% (p=0.001 n=9+10) RenderPlanning/rows=256/renders=4096-24 2.75s ± 1% 2.73s ± 1% -0.64% (p=0.002 n=8+9) ``` Fixes: #85632. Release note: None Co-authored-by: Xiang Gu <[email protected]> Co-authored-by: Thomas Hardy <[email protected]> Co-authored-by: Santamaura <[email protected]> Co-authored-by: Jayant Shrivastava <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>

yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 4, 2022

blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 4, 2022

yuzefovich self-assigned this Aug 8, 2022

yuzefovich mentioned this issue Aug 9, 2022

colbuilder: fall back to row-by-row processor wrapping for many renders #85822

Merged

craig bot closed this as completed in #85822 Aug 9, 2022

This was referenced Aug 9, 2022

release-22.1: colbuilder: fall back to row-by-row processor wrapping for many renders #85856

Merged

release-21.2: colbuilder: fall back to row-by-row processor wrapping for many renders #85860

Merged

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

colexec: improve planning and execution of many projection operators in a single query #85632

colexec: improve planning and execution of many projection operators in a single query #85632

yuzefovich commented Aug 4, 2022 •

edited

Loading

colexec: improve planning and execution of many projection operators in a single query #85632

colexec: improve planning and execution of many projection operators in a single query #85632

Comments

yuzefovich commented Aug 4, 2022 • edited Loading

yuzefovich commented Aug 4, 2022 •

edited

Loading