opt/props: use EquivSet in FuncDepSet for tracking equivalences #117272

Closed

DrewKimball

Previously, FuncDepSet was fairly wasteful in how it tracked sets of equivalent columns: for each column in the equiv group, an FD was maintained from that column to all other columns in the group. This meant that there were 2n ColSets for each equiv group (where n is the number of columns in the group).

This patch modifies FuncDepSet and its internals to use an EquivSet instead, which keeps a single ColSet for each equiv group. This significantly cuts down on allocations for queries with many columns and equalities, both because fewer ColSets spill to the heap and because fewer FDs are added to the deps slice.
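
To make the difference in bookkeeping concrete, here is a minimal sketch. It is not the code from this patch: `colSet`, `fd`, `perColumnFDs`, and `countColSets` are hypothetical stand-ins (a 64-bit mask instead of `opt.ColSet`) used only to count how many sets each scheme keeps for one equiv group.

```go
package main

import "fmt"

// colSet is a hypothetical stand-in for opt.ColSet: a bitmask over
// column IDs 0..63.
type colSet uint64

func makeColSet(cols ...int) colSet {
	var s colSet
	for _, c := range cols {
		s |= 1 << uint(c)
	}
	return s
}

// fd is a simplified functional dependency of the form from --> to.
type fd struct{ from, to colSet }

// perColumnFDs models the old scheme described above: one FD per column
// in the group, from that column to every other column in the group.
func perColumnFDs(group []int) []fd {
	all := makeColSet(group...)
	deps := make([]fd, 0, len(group))
	for _, c := range group {
		from := makeColSet(c)
		deps = append(deps, fd{from: from, to: all &^ from})
	}
	return deps
}

// countColSets returns how many column sets each scheme stores for a
// single equiv group: two per FD in the old scheme, one in the new one.
func countColSets(group []int) (oldCount, newCount int) {
	return 2 * len(perColumnFDs(group)), 1
}

func main() {
	oldCount, newCount := countColSets([]int{1, 2, 3, 4})
	fmt.Println(oldCount, newCount) // prints "8 1"
}
```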

Fixes #83963

Release note: None

DrewKimball requested a review from a team as a code owner January 3, 2024 09:53
DrewKimball requested review from michae2 and removed request for a team January 3, 2024 09:53

blathers-crl bot commented Jan 3, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

This change is Reviewable

@DrewKimball

name                                        old time/op    new time/op    delta
SlowQueries/slow-query-1/reorder-join-0-10    1.01ms ± 0%    0.90ms ± 0%  -10.57%  (p=0.000 n=8+8)
SlowQueries/slow-query-1/reorder-join-8-10    13.7ms ± 0%    11.9ms ± 0%  -12.86%  (p=0.000 n=10+10)
SlowQueries/slow-query-2/reorder-join-0-10    2.08ms ± 0%    1.79ms ± 2%  -14.14%  (p=0.000 n=10+9)
SlowQueries/slow-query-2/reorder-join-8-10     257ms ± 1%     235ms ± 1%   -8.65%  (p=0.000 n=10+10)
SlowQueries/slow-query-3/reorder-join-0-10    42.0ms ± 2%    42.2ms ± 1%     ~     (p=0.079 n=9+10)
SlowQueries/slow-query-3/reorder-join-8-10    49.1ms ± 2%    48.6ms ± 2%   -0.94%  (p=0.035 n=10+10)
SlowQueries/slow-query-4/reorder-join-0-10    5.81ms ± 0%    4.54ms ± 0%  -21.86%  (p=0.000 n=8+9)
SlowQueries/slow-query-4/reorder-join-8-10     295ms ± 0%     243ms ± 1%  -17.48%  (p=0.000 n=10+10)
SlowQueries/slow-query-5/reorder-join-0-10    39.7ms ± 0%    33.9ms ± 0%  -14.70%  (p=0.000 n=10+10)
SlowQueries/slow-query-5/reorder-join-8-10     708ms ± 0%     655ms ± 1%   -7.47%  (p=0.000 n=10+10)
SlowQueries/slow-query-6/reorder-join-0-10    26.8ms ± 0%    14.4ms ± 0%  -46.43%  (p=0.000 n=9+9)
SlowQueries/slow-query-6/reorder-join-8-10     1.00s ± 0%     0.79s ± 1%  -21.33%  (p=0.000 n=9+9)

name                                        old alloc/op   new alloc/op   delta
SlowQueries/slow-query-1/reorder-join-0-10     668kB ± 0%     553kB ± 0%  -17.21%  (p=0.000 n=9+9)
SlowQueries/slow-query-1/reorder-join-8-10    6.74MB ± 0%    6.00MB ± 0%  -11.00%  (p=0.000 n=10+10)
SlowQueries/slow-query-2/reorder-join-0-10     993kB ± 0%     847kB ± 0%  -14.70%  (p=0.000 n=9+9)
SlowQueries/slow-query-2/reorder-join-8-10    61.1MB ± 0%    59.8MB ± 0%   -2.20%  (p=0.000 n=10+10)
SlowQueries/slow-query-3/reorder-join-0-10    41.0MB ± 0%    40.9MB ± 0%   -0.33%  (p=0.000 n=8+8)
SlowQueries/slow-query-3/reorder-join-8-10    46.0MB ± 0%    45.3MB ± 0%   -1.51%  (p=0.000 n=10+9)
SlowQueries/slow-query-4/reorder-join-0-10    4.23MB ± 0%    3.75MB ± 0%  -11.37%  (p=0.000 n=9+10)
SlowQueries/slow-query-4/reorder-join-8-10     127MB ± 0%     112MB ± 0%  -11.36%  (p=0.000 n=10+10)
SlowQueries/slow-query-5/reorder-join-0-10    34.3MB ± 0%    33.2MB ± 0%   -3.17%  (p=0.000 n=10+10)
SlowQueries/slow-query-5/reorder-join-8-10     260MB ± 0%     253MB ± 0%   -2.91%  (p=0.000 n=9+10)
SlowQueries/slow-query-6/reorder-join-0-10    13.0MB ± 0%     7.8MB ± 0%  -39.97%  (p=0.000 n=9+10)
SlowQueries/slow-query-6/reorder-join-8-10     489MB ± 0%     393MB ± 0%  -19.58%  (p=0.000 n=8+9)

name                                        old allocs/op  new allocs/op  delta
SlowQueries/slow-query-1/reorder-join-0-10     3.89k ± 0%     3.27k ± 0%  -15.87%  (p=0.000 n=9+8)
SlowQueries/slow-query-1/reorder-join-8-10     69.0k ± 0%     55.8k ± 0%  -19.10%  (p=0.000 n=10+10)
SlowQueries/slow-query-2/reorder-join-0-10     4.29k ± 0%     4.12k ± 0%   -3.97%  (p=0.000 n=10+9)
SlowQueries/slow-query-2/reorder-join-8-10      491k ± 0%      476k ± 0%   -2.94%  (p=0.000 n=10+10)
SlowQueries/slow-query-3/reorder-join-0-10      332k ± 0%      329k ± 0%   -0.98%  (p=0.000 n=8+8)
SlowQueries/slow-query-3/reorder-join-8-10      378k ± 0%      364k ± 0%   -3.63%  (p=0.000 n=10+9)
SlowQueries/slow-query-4/reorder-join-0-10     43.3k ± 0%     31.9k ± 0%  -26.20%  (p=0.000 n=9+10)
SlowQueries/slow-query-4/reorder-join-8-10     2.77M ± 0%     2.30M ± 0%  -16.85%  (p=0.000 n=10+10)
SlowQueries/slow-query-5/reorder-join-0-10      214k ± 0%      201k ± 0%   -5.64%  (p=0.000 n=10+10)
SlowQueries/slow-query-5/reorder-join-8-10     4.55M ± 0%     4.50M ± 0%   -1.19%  (p=0.000 n=10+10)
SlowQueries/slow-query-6/reorder-join-0-10      152k ± 0%       81k ± 0%  -46.95%  (p=0.000 n=9+10)
SlowQueries/slow-query-6/reorder-join-8-10     8.78M ± 0%     7.23M ± 0%  -17.60%  (p=0.000 n=8+10)

@DrewKimball

TODO: unit tests for the new EquivSet methods + invariants.


@mgartner mgartner left a comment


Very cool! I agree that we should have extensive unit tests for the new methods in EquivSet.

Now that you've got a satisfying benchmark result, and you've mapped out all the functionality EquivSet needs, I recommend trying to split this up into multiple commits. You could add methods to EquivSet in a separate commit (or even multiple commits) along with unit tests, and then a final commit to use EquivSet in FuncDepSet. Even that last commit will be a bit large, and ideally it could be done incrementally, but I don't see any easy way to do that—it may be possible for the two types of equiv sets to live side-by-side and incrementally phase out the old one, but use your best judgement on that.

Reviewed 1 of 58 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @michae2)


pkg/sql/opt/props/equiv_set.go line 21 at r1 (raw file):

// TODO(drewk): incorporate EquivSets into FuncDepSets.
type EquivSet struct {
	buf    [equalityBufferSize]opt.ColSet

Why did you remove buf? Does it add too much complexity or not provide any benefit?


pkg/sql/opt/props/equiv_set.go line 22 at r1 (raw file):

// queries about which columns are equivalent to one another. Equivalence groups
// are always non-empty and disjoint.
type EquivSet struct {

I had a thought that EquivGroups might be a better name. What do you think?


pkg/sql/opt/props/equiv_set.go line 55 at r1 (raw file):

func (eq *EquivSet) AreColsEquiv(left, right opt.ColumnID) bool {
	if buildutil.CrdbTestBuild {
		defer eq.verify()

Is this assertion needed here if this method doesn't mutate the set? I suppose it doesn't hurt to have and it gives us additional coverage.


pkg/sql/opt/props/equiv_set.go line 72 at r1 (raw file):

// Empty returns true if the set stores no equalities.
func (eq *EquivSet) empty() bool {

The external API would be better defined by exporting the methods that are used outside this file (besides test-only ones) and that are safe to use without intimate knowledge of the inner workings of EquivSet. So even though empty, count, and get are only used within the same package, in func_dep.go, naming them Empty, Count, and Get denotes that they are part of the "public" API for EquivSet.


pkg/sql/opt/props/equiv_set.go line 77 at r1 (raw file):

// Count returns the number of equiv groups stored in the set.
func (eq *EquivSet) count() int {

nit: Name this GroupCount()


pkg/sql/opt/props/equiv_set.go line 83 at r1 (raw file):

// get returns the equiv group at the given index. The returned ColSet should be
// considered immutable.
func (eq *EquivSet) get(idx int) opt.ColSet {

nit: Name this Group(idx int)


pkg/sql/opt/props/equiv_set.go line 166 at r1 (raw file):

		if eq.groups[idx].Intersects(eq.groups[i]) {
			eq.groups[idx] = eq.groups[idx].Union(eq.groups[i])
			eq.groups[i] = eq.groups[len(eq.groups)-1]

Could this be broken? After moving the last group, let's call it lg, to position i we'll increment i and effectively skip over lg. We'll never check if lg intersects eq.groups[idx]. That would mean that some groups might not be merged when they could be.
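
The skipped-element problem described in this comment can be reproduced in isolation. The sketch below is an assumption-laden reconstruction, not the code under review: `colSet` is a bitmask stand-in for `opt.ColSet`, and `mergeBuggy`, `mergeFixed`, and `exampleGroups` are hypothetical names. It assumes a forward loop that removes a merged group by swapping in the last one.

```go
package main

import "fmt"

// colSet is a hypothetical bitmask stand-in for opt.ColSet.
type colSet uint64

func makeColSet(cols ...int) colSet {
	var s colSet
	for _, c := range cols {
		s |= 1 << uint(c)
	}
	return s
}

func (s colSet) intersects(o colSet) bool { return s&o != 0 }

// mergeBuggy folds groups that intersect groups[idx] into it, removing a
// merged group by moving the last group into its slot. Because the loop
// still advances i afterwards, the moved group is never examined.
func mergeBuggy(groups []colSet, idx int) []colSet {
	for i := idx + 1; i < len(groups); i++ {
		if groups[idx].intersects(groups[i]) {
			groups[idx] |= groups[i]
			groups[i] = groups[len(groups)-1]
			groups = groups[:len(groups)-1]
		}
	}
	return groups
}

// mergeFixed is the same loop, except that it re-checks slot i after a
// swap-remove instead of advancing past it.
func mergeFixed(groups []colSet, idx int) []colSet {
	for i := idx + 1; i < len(groups); {
		if groups[idx].intersects(groups[i]) {
			groups[idx] |= groups[i]
			groups[i] = groups[len(groups)-1]
			groups = groups[:len(groups)-1]
			continue // slot i now holds a group that has not been checked
		}
		i++
	}
	return groups
}

// exampleGroups returns a fresh slice on each call, because both merge
// functions modify their input in place.
func exampleGroups() []colSet {
	return []colSet{
		makeColSet(1, 2, 3), // the group that just gained columns
		makeColSet(2, 4),
		makeColSet(5),
		makeColSet(3, 6), // last group: moved into slot 1, then skipped
	}
}

func main() {
	fmt.Println(len(mergeBuggy(exampleGroups(), 0))) // prints 3
	fmt.Println(len(mergeFixed(exampleGroups(), 0))) // prints 2
}
```

In the buggy run, the group containing columns 3 and 6 is left unmerged even though it shares column 3 with the first group, so the resulting groups are no longer disjoint.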


pkg/sql/opt/props/equiv_set.go line 229 at r1 (raw file):

// makePartition divides the equiv groups according to the given columns. If an
// equiv group intersects the given ColSet but is not a subset, it is split into
// the intersection and difference with the given ColSet.

I think this deserves more explanation. An example would be helpful. Would the name partitionBy be better?
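
As an illustration of the behavior the doc comment describes, here is a small sketch. It is not the function from the PR: `colSet` is a bitmask stand-in for `opt.ColSet`, and this `partitionBy` is a hypothetical simplification that keeps every resulting piece, including single-column ones, which the real implementation may well handle differently.

```go
package main

import "fmt"

// colSet is a hypothetical bitmask stand-in for opt.ColSet.
type colSet uint64

func makeColSet(cols ...int) colSet {
	var s colSet
	for _, c := range cols {
		s |= 1 << uint(c)
	}
	return s
}

func (s colSet) intersects(o colSet) bool { return s&o != 0 }
func (s colSet) subsetOf(o colSet) bool   { return s&^o == 0 }

// String lists the column IDs in the set, for readable output.
func (s colSet) String() string {
	var ids []int
	for c := 0; c < 64; c++ {
		if s&(1<<uint(c)) != 0 {
			ids = append(ids, c)
		}
	}
	return fmt.Sprint(ids)
}

// partitionBy splits every group that straddles cols into two groups:
// its intersection with cols and its difference from cols. Groups that
// lie entirely inside or entirely outside cols are left unchanged.
func partitionBy(groups []colSet, cols colSet) []colSet {
	// Iterate backwards so that groups appended during the loop are
	// not visited again.
	for i := len(groups) - 1; i >= 0; i-- {
		g := groups[i]
		if g.intersects(cols) && !g.subsetOf(cols) {
			groups[i] = g & cols
			groups = append(groups, g&^cols)
		}
	}
	return groups
}

func main() {
	groups := []colSet{makeColSet(1, 2, 3, 4), makeColSet(5, 6)}
	cols := makeColSet(1, 2, 5, 6, 7)
	// The first group straddles cols and is split into (1,2) and (3,4);
	// the second group is a subset of cols and is kept whole.
	fmt.Println(partitionBy(groups, cols))
}
```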


pkg/sql/opt/props/equiv_set.go line 236 at r1 (raw file):

	for i := len(eq.groups) - 1; i >= 0; i-- {
		if eq.groups[i].Intersects(cols) && !eq.groups[i].SubsetOf(cols) {
			// This group references both sides of the join, so split it.

The term "join" seems out of place here, or is lacking some context.


pkg/sql/opt/props/func_dep.go line 647 at r1 (raw file):

		if !inToSet {
			for j := 0; j < f.equiv.count(); j++ {
				if f.equiv.get(j).Contains(i) {

nit: Adding an EquivSet.Contains(col opt.ColumnID) method might be a nice addition to the API and simplify this a bit.
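
A rough sketch of what such a helper could look like follows. The types are hypothetical stand-ins (`colSet` as a bitmask, `equivSet` with a `groups` slice, `int` column IDs), so this only shows the shape of the suggestion, not the method that was eventually added.

```go
package main

import "fmt"

// colSet is a hypothetical bitmask stand-in for opt.ColSet.
type colSet uint64

func makeColSet(cols ...int) colSet {
	var s colSet
	for _, c := range cols {
		s |= 1 << uint(c)
	}
	return s
}

func (s colSet) contains(col int) bool { return s&(1<<uint(col)) != 0 }

// equivSet is a simplified stand-in for EquivSet: a list of disjoint,
// non-empty equivalence groups.
type equivSet struct {
	groups []colSet
}

// Contains reports whether col belongs to any equivalence group. It
// replaces a caller-side loop over the group count and each group.
func (eq *equivSet) Contains(col int) bool {
	for _, g := range eq.groups {
		if g.contains(col) {
			return true
		}
	}
	return false
}

func main() {
	eq := &equivSet{groups: []colSet{makeColSet(1, 2), makeColSet(5, 6)}}
	fmt.Println(eq.Contains(2), eq.Contains(9)) // prints "true false"
}
```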


pkg/sql/opt/props/func_dep.go line 1685 at r1 (raw file):

			needComma = true
			from := opt.MakeColSet(col)
			fmt.Fprintf(b, "%s==%s", from, group.Difference(from))

nit: this can be part of a Format or String method of EquivSet.


@DrewKimball DrewKimball left a comment


@mgartner I'll respond to the comments here, but open up a new PR elsewhere.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/props/equiv_set.go line 21 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Why did you remove buf? Does it add too much complexity or not provide any benefit?

Yeah, I don't expect much benefit since it's (almost) always embedded in a struct on the heap that gets reused, and it requires us to worry about properly initializing the set. This way, using the zero value doesn't leave any memory unused.


pkg/sql/opt/props/equiv_set.go line 22 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I had a thought that EquivGroups might be a better name. What do you think?

Sure, works for me.


pkg/sql/opt/props/equiv_set.go line 55 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Is this assertion needed here if this method doesn't mutate the set? I suppose it doesn't hurt to have and it gives us additional coverage.

Having the assertion in non-mutating methods helps to ensure that no one is mutating any of the ColSets (not just the EquivGroups). It actually caught a bug or two while I was writing tests.
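
For readers unfamiliar with this kind of check, the sketch below shows one way the two invariants stated in the type's doc comment (groups are non-empty and disjoint) could be verified. It is an assumption, not the PR's `verify`: the types are hypothetical stand-ins, and this version returns an error where a test-build assertion would more likely panic.

```go
package main

import "fmt"

// colSet is a hypothetical bitmask stand-in for opt.ColSet.
type colSet uint64

func makeColSet(cols ...int) colSet {
	var s colSet
	for _, c := range cols {
		s |= 1 << uint(c)
	}
	return s
}

// equivSet is a simplified stand-in for EquivSet.
type equivSet struct {
	groups []colSet
}

// verify checks that every group is non-empty and that no column
// appears in more than one group.
func (eq *equivSet) verify() error {
	var seen colSet
	for i, g := range eq.groups {
		if g == 0 {
			return fmt.Errorf("equiv group %d is empty", i)
		}
		if g&seen != 0 {
			return fmt.Errorf("equiv group %d overlaps an earlier group", i)
		}
		seen |= g
	}
	return nil
}

func main() {
	ok := &equivSet{groups: []colSet{makeColSet(1, 2), makeColSet(3, 4)}}
	bad := &equivSet{groups: []colSet{makeColSet(1, 2), makeColSet(2, 3)}}
	fmt.Println(ok.verify())
	fmt.Println(bad.verify())
}
```

Running a check like this on entry to or exit from read-only methods costs nothing in production builds if it is gated on a test-build flag, and it flags a corrupted set close to where the corruption becomes observable.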


pkg/sql/opt/props/equiv_set.go line 72 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

The external API would be better defined by exporting the methods that are used outside this file (besides test-only ones) and that are safe to use without intimate knowledge of the inner workings of EquivSet. So even though empty, count, and get are only used within the same package, in func_dep.go, naming them Empty, Count, and Get denotes that they are part of the "public" API for EquivSet.

Good idea, done.


pkg/sql/opt/props/equiv_set.go line 77 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: Name this GroupCount()

Done.


pkg/sql/opt/props/equiv_set.go line 83 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: Name this Group(idx int)

Done.


pkg/sql/opt/props/equiv_set.go line 166 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Could this be broken? After moving the last group, let's call it lg, to position i we'll increment i and effectively skip over lg. We'll never check if lg intersects eq.groups[idx]. That would mean that some groups might not be merged when they could be.

You're right, this is broken. I think this means we would potentially have redundant filters in re-ordered joins, and we might fail to push down some join filters. I'll add a commit fixing it that we could backport.

#137558


pkg/sql/opt/props/equiv_set.go line 229 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I think this deserves more explanation. An example would be helpful. Would the name partitionBy be better?

Added an example, and changed the name. LMK if you think anything more would be helpful.


pkg/sql/opt/props/equiv_set.go line 236 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

The term "join" seems out of place here, or is lacking some context.

Good point, I'll remove it.


pkg/sql/opt/props/func_dep.go line 647 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: Adding an EquivSet.Contains(col opt.ColumnID) method might be a nice addition to the API and simplify this a bit.

Nice idea, that does simplify things.


pkg/sql/opt/props/func_dep.go line 1685 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: this can be part of a Format or String method of EquivSet.

There's already a String method for debugging that uses a more condensed format. I'd prefer to keep this logic here, since it's "func-dep only", and doesn't really make sense to use in other contexts.

Development

Successfully merging this pull request may close these issues.

opt: improve functional dependencies performance for equalities
3 participants