Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: improve functional dependencies performance for equalities #83963

Closed
DrewKimball opened this issue Jul 7, 2022 · 1 comment · Fixed by #137571
Closed

opt: improve functional dependencies performance for equalities #83963

DrewKimball opened this issue Jul 7, 2022 · 1 comment · Fixed by #137571
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@DrewKimball
Copy link
Collaborator

DrewKimball commented Jul 7, 2022

Describe the problem

Currently, sets of equivalent columns are represented inefficiently in functional dependencies. Consider a set of three equivalent columns (a, b, c). A FuncDepSet would represent this equivalence as (a=b,c), (b=a,c), (c=a,b), which would require 6 opt.ColSet structs to be allocated. In general, the number of opt.ColSets needed is 2n where n is the number of columns in the equivalence. This can cause a large number of allocations when there are a large number of expressions in the memo with many equivalent columns, especially once the columns exceed the size of the small set.

To Reproduce

Profiling TPCH Q2 shows many allocations resulting from handling equivalencies. Additionally, customers have run into this issue causing long planning time.

Expected behavior

Each equivalence group should be represented by a single opt.ColSet rather than 2n of them.

@DrewKimball DrewKimball added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 7, 2022
@DrewKimball DrewKimball changed the title opt opt: improve functional dependencies performance for equalities Jul 7, 2022
@DrewKimball DrewKimball self-assigned this Jul 7, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jul 7, 2022
craig bot pushed a commit that referenced this issue Jul 29, 2022
…85329

84975: storage: add `MVCCRangeKeyStack` for range keys r=nicktrav,jbowens a=erikgrinaker

**storage: add `MVCCRangeKeyStack` for range keys**

This patch adds `MVCCRangeKeyStack` and `MVCCRangeKeyVersion`, a new
range key representation that will be returned by `SimpleMVCCIterator`.
It is more compact, for efficiency, and comes with a set of convenience
methods to simplify common range key processing.

Resolves #83895.

Release note: None
  
**storage: return `MVCCRangeKeyStack` from `SimpleMVCCIterator`**

This patch changes `SimpleMVCCIterator.RangeKeys()` to return
`MVCCRangeKeyStack` instead of `[]MVCCRangeKeyValue`. Callers have not
been migrated to properly make use of this -- instead, they call
`AsRangeKeyValues()` and construct and use the old data structure.

The MVCC range tombstones tech note is also updated to reflect this.

Release note: None
  
**storage: migrate MVCC code to `MVCCRangeKeyStack`**

Release note: None
  
***: migrate higher-level code to `MVCCRangeKeyStack`**

Release note: None
  
**kvserver/gc: partially migrate to `MVCCRangeKeyStack`**

Some parts require invasive changes to MVCC stats helpers. These will
shortly be consolidated with other MVCC stats logic elsewhere, so the
existing logic is retained for now by using `AsRangeKeyValues()`.

Release note: None
  
**storage: remove `FirstRangeKeyAbove()` and `HasRangeKeyBetween()`**

Release note: None

85017: Revert "sql: Add database ID to sampled query log" r=THardy98 a=THardy98

Reverts: #84195
This reverts commit 307817e.

Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure. Protobuf field not reserved as 
no official build was released with these changes yet.

Release note (sql change): Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure.

85024: cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR r=rhu713 a=rhu713

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: #85217, #85216, #85204, #84162

Release note: None


85069: optbuilder: handle unnest returning a tuple r=DrewKimball a=DrewKimball

Currently, the return types of SRFs that return multiple columns are
represented as tuples with labels. The tuple labels are used to decide
whether or not to create a single output column for the SRF, or multiple.
The `unnest` function can return a single column if it has a single argument,
and the type of that column can be a tuple with labels. This could cause the
old logic to mistakenly create multiple output columns for `unnest`, which
could lead to panics down the line and incorrect behavior otherwise.

This commit adds a special case for `unnest` in the `optbuilder` to only expand
tuple return types if there is more than one argument (implying more than one
output column). Other SRFs do not have the same problem because they either
always return the same number of columns, cannot return tuples, or both.

Fixes #58438

Release note (bug fix): Fixed a bug existing since release 20.1 that could
cause a panic in rare cases when the unnest function was used with a
tuple return type.

85100: opt: perf improvements for large queries r=DrewKimball a=DrewKimball

**opt: add bench test for slow queries**

This commit adds two slow-planning queries pulled from #64793 to be used
in benchmarking the optimizer. In addition, the `ReorderJoinsLimit` has been
set to the default 8 for benchmarking tests.

**opt: add struct for tracking column equivalence sets**

Previously, the `JoinOrderBuilder` would construct a `FuncDepSet` from
scratch on each call to `addJoins` in order to eliminate redundant join
filters. This led to unnecessary large allocations because `addJoins` is
called an exponential number of times in query size.

This commit adds a struct `EquivSet` that efficiently stores equivalence
relations as `ColSets` in a slice. Rather than being constructed on each
call to `addJoins`, a `Reset` method is called that maintains slice memory.

In the future, `EquivSet` can be used to handle equivalencies within `FuncDepSet`
structs as well. This well avoid a significant number of allocations in cases with
many equivalent columns, as outlined in #83963.

**opt: avoid usage of FastIntMap in optimizer hot paths**

Previously, `computeHashJoinCost` would use a `FastIntMap` to represent join
equality filters to pass to `computeFiltersCost`. In addition,
`GenerateMergeJoins` used a `FastIntMap` to look up columns among its join
equality columns. This lead to unnecessary allocations since column IDs are
often large enough to exceed the small field of `FastIntMap`.

This commit modifies `computeFiltersCost` to take an anonymous function
that is used to decide whether to skip an equality condition, removing the
need for a mapping between columns.

This commit also refactors `GenerateMergeJoins` to simply perform a linear
scan of its equality columns; this avoids the allocation issue, and should be
fast in practice because the number of equalities will not generally be large.

Release note: None

85146: [backupccl] Use Expr for backup's Detached and Revision History options r=benbardin a=benbardin

This will allow us to set them to null, which will be helpful for ALTER commands.

Release note: None

85234: dev: add rewritable paths for norm tests r=mgartner a=mgartner

Tests in `pkg/sql/opt/norm` are similar to tests in `pkg/sql/opt/xform`
and `pkg/sql/opt/memo` in that they rely on fixtures in
`pkg/sql/opt/testutils/opttester/testfixtures`. This commit adds these
fixtures as rewritable paths for norm tests so that
`./dev test pkg/sql/opt/xform --rewrite` does not fail with errors like:

    open pkg/sql/opt/testutils/opttester/testfixtures/tpcc_schema: operation not permitted

Release note: None

85325: sql: fix explain gist output to show number of scan span constraints r=cucaroach a=cucaroach

If there were span constraints we would always print 1, need to actually
append them to get the count right.

Fixes: #85324

Release note: None


85327: sql: fix udf logic test r=chengxiong-ruan a=chengxiong-ruan

Fixes: #85303

Release note: None

85329: colexec: fix recent concat fix r=yuzefovich a=yuzefovich

The recent fix of the Concat operator in the vectorized engine doesn't
handle the array concatenation correctly and this is now fixed.

Fixes: #85295.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: DrewKimball <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@mgartner mgartner moved this to Backlog (DO NOT ADD NEW ISSUES) in SQL Queries Jul 24, 2023
Copy link

github-actions bot commented Jan 1, 2024

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jan 3, 2024
Previously, `FuncDepSet` was fairly wasteful in how it tracked sets of
equivalent columns: for each column in the equiv group, an FD was maintained
from that column to all other columns in the group. This meant that there
were `2n` `ColSets` for each equiv group (where `n` is the number of columns
in the group).

This patch modifies `FuncDepSet` and its internals to use an `EquivSet`
instead, which keeps a single `ColSet` for each equiv group. This significantly
cuts down on allocations for queries with many columns and equalities, both
because less `ColSets` spill to heap, and because less FDs are added to
the `deps` slice.

Fixes cockroachdb#83963

Release note: None
@mgartner mgartner moved this from Backlog (DO NOT ADD NEW ISSUES) to New Backlog in SQL Queries Apr 23, 2024
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 17, 2024
This commit removes the inline buffer from `EquivGroups`, as well as the
`NewEquivGroups()` that was previously used to set up the buffer. This
simplifies usage, and doesn't appear to affect performance when the
set is used in `JoinOrderBuilder` or in `FuncDepSet`.

Informs cockroachdb#83963

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 17, 2024
This commit adds several new methods along with unit tests to `EquivGroups`
to prepare its use in tracking equivalencies in `FuncDepSet`.

Informs cockroachdb#83963

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 17, 2024
Previously, `FuncDepSet` was fairly wasteful in how it tracked sets of
equivalent columns: for each column in the equiv group, an FD was maintained
from that column to all other columns in the group. This meant that there
were `2n` `ColSets` for each equiv group (where `n` is the number of columns
in the group).

This patch modifies `FuncDepSet` and its internals to use `props.EquivGroups`
instead, which keeps a single `ColSet` for each equiv group. This significantly
cuts down on allocations for queries with many columns and equalities, both
because less `ColSets` spill to heap, and because less FDs are added to
the `deps` slice.

Fixes cockroachdb#83963

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jan 3, 2025
This commit removes the inline buffer from `EquivGroups`, as well as the
`NewEquivGroups()` that was previously used to set up the buffer. This
simplifies usage, and doesn't appear to affect performance when the
set is used in `JoinOrderBuilder` or in `FuncDepSet`.

Informs cockroachdb#83963

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jan 3, 2025
This commit adds several new methods along with unit tests to `EquivGroups`
to prepare its use in tracking equivalencies in `FuncDepSet`.

Informs cockroachdb#83963

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jan 3, 2025
Previously, `FuncDepSet` was fairly wasteful in how it tracked sets of
equivalent columns: for each column in the equiv group, an FD was maintained
from that column to all other columns in the group. This meant that there
were `2n` `ColSets` for each equiv group (where `n` is the number of columns
in the group).

This patch modifies `FuncDepSet` and its internals to use `props.EquivGroups`
instead, which keeps a single `ColSet` for each equiv group. This significantly
cuts down on allocations for queries with many columns and equalities, both
because less `ColSets` spill to heap, and because less FDs are added to
the `deps` slice.

Fixes cockroachdb#83963

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jan 3, 2025
Previously, `FuncDepSet` was fairly wasteful in how it tracked sets of
equivalent columns: for each column in the equiv group, an FD was maintained
from that column to all other columns in the group. This meant that there
were `2n` `ColSets` for each equiv group (where `n` is the number of columns
in the group).

This patch modifies `FuncDepSet` and its internals to use `props.EquivGroups`
instead, which keeps a single `ColSet` for each equiv group. This significantly
cuts down on allocations for queries with many columns and equalities, both
because less `ColSets` spill to heap, and because less FDs are added to
the `deps` slice.

Fixes cockroachdb#83963

Release note: None
craig bot pushed a commit that referenced this issue Jan 4, 2025
137571: opt/props: use EquivGroups in FuncDepSet for tracking equivalences r=DrewKimball a=DrewKimball

#### props: don't leave unmerged equiv groups in EquivSet

This commit fixes a bug in `EquivSet.tryMergeGroups`, which could cause
an invariant (that equiv groups are non-intersecting) to be violated.
This would potentially cause re-ordered joins to have redundant equality
filters, and prevent join filter push-down in rare cases. This commit
also adds verification to the set for test builds, and a unit test for
the `Add` method.

Fixes #137381

Release note: None

#### props: rename EquivSet to EquivGroups

This commit renames `EquivSet` to `EquivGroups` to better reflect the
fact that it tracks _groups_ of equivalent groups, rather than a single
set of equivalent columns.

Epic: None

Release note: None

#### props: remove initial buffer from EquivGroups

This commit removes the inline buffer from `EquivGroups`, as well as the
`NewEquivGroups()` that was previously used to set up the buffer. This
simplifies usage, and doesn't appear to affect performance when the
set is used in `JoinOrderBuilder` or in `FuncDepSet`.

Informs #83963

Release note: None

#### props: add methods to EquivGroups for use in FuncDepSet

This commit adds several new methods along with unit tests to `EquivGroups`
to prepare its use in tracking equivalencies in `FuncDepSet`.

Informs #83963

Release note: None

#### opt/props: use EquivGroups in FuncDepSet for tracking equivalences

Previously, `FuncDepSet` was fairly wasteful in how it tracked sets of
equivalent columns: for each column in the equiv group, an FD was maintained
from that column to all other columns in the group. This meant that there
were `2n` `ColSets` for each equiv group (where `n` is the number of columns
in the group).

This patch modifies `FuncDepSet` and its internals to use `props.EquivGroups`
instead, which keeps a single `ColSet` for each equiv group. This significantly
cuts down on allocations for queries with many columns and equalities, both
because less `ColSets` spill to heap, and because less FDs are added to
the `deps` slice.

Fixes #83963

Release note: None

Co-authored-by: Drew Kimball <[email protected]>
@craig craig bot closed this as completed in 15b7d7e Jan 4, 2025
@github-project-automation github-project-automation bot moved this from Backlog to Done in SQL Queries Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects
Status: Done
1 participant